
Brandeis University

Division of Graduate Professional Studies

Rabb School of Continuing Studies

Course Syllabus

I. Course Information

1.  Introduction to probability and statistics

2.  RBIF-0103-G1

3.  01/21/2015- 04/25/2015

4.  Distance Learning Course Week: Wednesday through Tuesday

5.  Instructor, contact info: Michael B. Partensky, PhD. Please contact me via email: ,

To avoid delays, please send your mail to both addresses if you want to contact me before 01/21/15. Later, please use the Brandeis address.

6.  Virtual office hours: Sunday, 11 am – 1 pm (EST) [occasional changes are possible]

7.  Document Overview

This syllabus contains all relevant information about the course: its objectives and outcomes, the grading criteria, the texts and other materials of instruction, and the weekly topics, outcomes, descriptions of assignments, and due dates. Consider this your roadmap for the course. Please read through the syllabus carefully and feel free to share any questions that you may have. Please print a copy of this syllabus for reference.

8.  Course Description

·  Purpose and content. The course builds a foundation for “probabilistic thinking”, with applications to real-life problems in bioinformatics, biostatistics and medical statistics, computational biology and biophysics, and data analysis. The topics cover random numbers, discrete and continuous random variables, elements of combinatorics, conditional probability, Bayes' formula, Markov chains, the binomial, Poisson and normal distributions, entropy and information, the Monte Carlo method, the central limit theorem, confidence intervals and hypothesis testing, correlations, nonlinear regression and maximum likelihood. We will also learn some basics of the Mathematica programming language and will use it for computational probabilistic experiments.

·  Prerequisites. Solid knowledge of basic algebra, geometry and trigonometry would be very helpful for your success. If you are not fluent in basic math, please reserve more time for your weekly studies. Some familiarity with introductory calculus (functions, derivatives, integrals) is preferable, but not required. The lectures will provide you with the necessary background in calculus as needed.

·  Catching up with Math. In the first week of the class an introductory math quiz will be offered. It is meant to help you refresh your math background and allocate adequate time and effort for your weekly studies. The quiz will cover the areas of basic and more advanced math directly related to the class. The quiz is required, but it is graded only for completion (the grade is 100 if you took it and 0 otherwise). Based on the outcome, you will be advised to refresh some of the material if necessary. Mathematica (see section 9.3), an excellent educational and research software package used intensively throughout the course, will also help in refreshing your math skills. It is strongly advisable to start practicing Mathematica without delay.

9.  Instruction Materials

9.1.  Semi-Required Texts (mostly for the individual studies)

1. M.R. Spiegel, J.J. Schiller and R.A. Srinivasan, Schaum's Outline of Probability and Statistics, Schaum’s Outline Series, McGraw-Hill, 3rd ed. (2009), ISBN: 9780071544252

2. E. Don, Schaum's Outline of Mathematica, Schaum’s Outline Series, McGraw-Hill, 2nd ed. (2009), ISBN: 9780071608282

3. C.M. Grinstead and J.L. Snell, Introduction to Probability, American Mathematical Society, 2nd ed. (1997), ISBN: 9780821894149 (this book can also be downloaded from the web for free; please send a thank-you note to the authors)

9.2  Recommended Text(s)


4. Bennett, D.J. 1998. Randomness. Harvard University Press, Cambridge, (1999), ISBN: 978-0674107465
Enjoyable supplementary reading. A lot of insights, paradoxes, peculiarities.

5. S. Wolfram, Mathematica (9th edition): the reference source (included in electronic form in the standard Mathematica distribution).

6. W.J. Ewens and G.R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2nd ed. (2005), ISBN-13: 978-0387400822
(will be used only occasionally, but could also be handy in your future study of bioinformatics)

7. R. Durrett, Probability: Theory and Examples (Cambridge Series in Statistical and Probabilistic
Mathematics), CUP (2010) ISBN-13:978-0521765398

8. N.N. Taleb, Fooled by Randomness, Random House, 2nd ed. (2008), ISBN-13: 978-1400067930
[Contains a lot of insights and cute examples]

9. W.W. Hines et al., Probability and Statistics in Engineering, Wiley, 4th ed. (2009), ISBN: 978-0471240877

10. R. Durbin, S.R. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, reprint edition (1999), ISBN: 978-0521629713 (the comment from #6 also applies here)

9.3  Required Software

Mathematica 10. We will be using Mathematica for experiments with randomness. In addition, Mathematica will help you refresh some of the math required for the course. You will be able to purchase a student version of Mathematica 10 (which is fully functional) at a significant discount. To get an additional 15% discount, please enter the promotion code PD1637 at checkout from the Wolfram Web Store at store.wolfram.com. (If asked, please enter my name. This feature is provided to the members of the Wolfram Faculty Program.)
Mathematica is an extremely powerful and elegant tool, and I am sure that some of you will find it very useful in your future work.


9.4 On-line Course Content

This course will be conducted completely online using Brandeis’ LATTE site, available at http://latte.brandeis.edu. The site contains the course syllabus, assignments, our discussion forums, links/resources to course-related professional organizations and sites, and weekly checklists, objectives, outcomes, topic notes, self-tests, and discussion questions. Access information is emailed to enrolled participants before the start of the course.

10.  Overall Course Outcomes

The course is designed to teach the probabilistic way of thinking. It provides a thorough background in the basics of probability theory and statistics (P&S), the major pillars of bioinformatics and biostatistics. We will take a multi-disciplinary approach, drawing examples and ideas from various fields, from statistical physics and computer modeling of proteins to the probabilistic aspects of evolution and biological data analysis. The class will benefit strongly from using Mathematica, an advanced “computer aided thinking tool” which helps in understanding the major concepts of P&S, developing algorithms and running random experiments.

Course Outcome / Assignment / Assessment
1.  Apply the elements of set theory to the analysis of complex events and biological sequences / Lect. 2, 3; HW 2, 3
2.  Use Combinatorics for the analysis of various random selection problems, derivation of major probability distributions and grasping some major combinatorial problems of sequence analysis / Lect. 3, 4; HW 3, 4; Lect. 10; HW 10. In addition, various combinatorial concepts are distributed quite evenly over the course, as one of the foundations of probabilistic thinking
3.  Apply Binomial, Poisson, geometric, hypergeometric, negative binomial, Normal, exponential and other probability distributions to the analysis of probabilities, sampling errors, sequence similarity / Lect. 4, 5, 10, 12; HW 4-6, 10-12
4.  Recognize and analyze phenomena described by conditional probability. Use the Bayes formula to analyze prior probabilities given the outcomes / Lect. 6, 7; HW 6, 7
5.  Apply non-linear regression (NLR) to data modeling; develop Mathematica-based applications of NLR for solving some real-life problems / Lect. 11; HW 11
6.  Apply the concept of Maximum likelihood to the experimental data analysis / Lect. 11; HW 11
7.  Analyze some archetypical paradoxes of probability (‘Monty Hall’, ‘prisoner’s dilemma’, second daughter) for guidance in solving complex real-life statistical problems / Lect. 2, 7; multiple Q&A forum discussions
8.  Apply the measures of central tendency and dispersion (mean, variance, etc.) for the statistical estimates / Lect. 10; HW 10
9.  Analyze and simulate with Mathematica various Markov models as a foundation of the major algorithms of sequence analysis (HMM, BLAST, etc.) / Lect. 7, 8; HW 8
10.  Use the central limit theorem for the analysis of sampling errors and confidence interval
11.  Apply the hypothesis testing technique to the analysis of statistical data / Lect. 12; HW 12; Test preparation problems
12.  Use the relation between entropy and probability, and Boltzmann statistics, as fundamental concepts behind protein dynamics and energetics. Elucidate the relation between entropy, disorder, and information / Lect. 13; Q&A forum discussions
13.  Formulate basic principles underlying the Monte Carlo and Molecular dynamics modeling of molecular biological systems / Lect. 5, 13 (+ Videos of MC simulations)
14.  Analyze and describe some statistical problems of genetics (Hardy-Weinberg law, probabilities of genetically inherited diseases, applications of Bayesian statistics) / Lect. 7; HW 7
15.  Actively participate in the team work: problem solving in groups
16.  Use Mathematica as the programming, visualization and presentation environment / Weeks 2-13. Weeks 1-5: intense introduction to Mathematica; practical applications of Mathematica are evenly distributed between the classes

Upon completion of the course students will be able to

·  Use general principles of P&S in preparation for future work in bioinformatics

-  Use the operational definition of probability to estimate the empiric probabilities for random events and biological sequences

-  Apply the elements of set theory to the analysis of complex events

-  Use Combinatorics for the analysis of various random selection problems, derivation of major probability distributions and grasping some major combinatorial problems of sequence analysis.

-  Apply Binomial, Poisson, Normal, geometric, hypergeometric and negative binomial distributions to the analysis of probabilities, sampling errors, sequence similarity

-  Recognize and analyze phenomena described by conditional probability

-  Use the Bayes’ formula to analyze prior probabilities given the outcomes

-  Apply non-linear regression (NLR) to data modeling; develop Mathematica-based NLR applications for some practical examples

-  Apply the concept of Maximum likelihood to the experimental data analysis.

-  Analyze some archetypical paradoxes of probability (prisoner’s dilemma, Buffon's needle, etc.) for guidance in the analysis of complex real-life statistical problems.

-  Apply the measures of central tendency and dispersion (mean, variance, etc.) for the statistical estimates

-  Analyze and simulate with Mathematica various Markov and random walk models for better understanding of the major algorithms of sequence analysis (HMM, BLAST, etc.)

-  Use the central limit theorem for the analysis of sampling errors and confidence interval

-  Apply the hypothesis testing technique to the analysis of statistical data

-  Use the ROC curve approach in test design

·  Apply probabilistic methods and concepts to the analysis of biological systems on different levels:

-  Use relation between entropy and probability, and Boltzmann statistics as fundamental concepts behind the protein dynamics and energetics

-  Formulate basic principles underlying the Monte Carlo and Molecular dynamics modeling of molecular biological systems.

-  Analyze the probabilistic basis of Mendelian genetics, distribution of alleles, and the Hardy-Weinberg (HW) theorem.

·  Participate in team research work involving numerical statistical analysis and modeling, and communicate its results to colleagues; make presentations on various statistical topics

-  Team work in the class

-  Use Mathematica as the programming, visualization and presentation environment

11.  General Grading Criteria

The course grade will be based on homework (50%), tests (20%), and the student’s activity in class (30%). In addition, students can earn extra credit for various extra activities, for instance by completing the optional assignments offered in most of the lectures or by making short presentations (papers + computer experiments).
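
As a worked illustration of this weighting, assume that each component is scored on a 0 to 100 scale. The symbols H, T and A (the homework, test and class-activity averages) are introduced here for illustration only, and extra credit is left out because the syllabus does not specify how it is added.

$$\text{Course grade} = 0.5\,H + 0.2\,T + 0.3\,A$$

For example, with the hypothetical averages H = 90, T = 80 and A = 100:

$$0.5 \cdot 90 + 0.2 \cdot 80 + 0.3 \cdot 100 = 45 + 16 + 30 = 91$$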

12.  Assignments and Tests: Description, Structure and Grading

13.1 Participation/Attendance. All students are expected to participate regularly. The activities (forum discussions, group activities, reading and homework assignments) should be spread evenly over the week.

13.2 Communication, correspondence. All the emails related to this class will be sent to your Brandeis email account. However, almost everyone has and uses a primary personal account. For this reason it is extremely important to set up forwarding from your Brandeis account to the primary account.
It is your responsibility to make sure that all the messages from the instructor and from the school are received on time. At the beginning of the class I will ask you to send me a confirmation to make sure that everyone is tuned in.

13.3 Home assignments (content, early submission options, and grading).

General
Every week, a homework assignment will be offered. It typically includes a required part and an extra-credit part. The deadline for submission is Tuesday, 11:30 pm. Late assignments are not accepted (graded F). In such cases, a make-up can be offered. However, it is highly recommended to submit on time, because the class is quite intense and working on additional assignments can jeopardize your progress.

All submissions should be made via LATTE.


Submission options.
Usually, you will be able to choose one of two options:

a)  Submitting once (single file submission) for final grading. The only deadline in this case is Tuesday 11:30 pm.

b)  Submitting more than once (multiple file submission). As explained further, this option is also named the “Early submission” (ES), and involves two deadlines [for the first submission (see weekly assignments), and for the final submission, Tuesday, 11:30 pm].


The “Early Submission” (ES) elaborated

ES implies “multiple file submission”, where the originally submitted assignment can be improved and resubmitted.

If you choose this option, you must submit early, usually by 14:00 on the Sunday preceding the class (unless stated otherwise for a particular week). If the original submission is not perfect, it will be returned to you with the initial grade (we designate it G(1)), with the score assigned to each of the problems, and with some questions and hints to help you find and fix the errors. You are then given an opportunity to resubmit and improve your grade.
The initial submission must be complete: you should provide solutions to all the required problems. The first grade G(1) is the starting point, and all the further grades depend on it. At the end, after the resubmission(s), your grade cannot be less than G(1), but you also (except for some rare occasions) cannot get 100% (assuming G(1) was less than 100%).
Each submission numbered n (n = 1, 2, 3, …) is initially graded purely on its quality. We call this the “unbiased” grade G(n). The “real” grade for each submission is defined as
$$G_{\mathrm{real}}(n) = \frac{1}{2}\bigl(G(n) + G(1)\bigr) \qquad (1)$$
For instance, if the first percentage grade is G(1) = 60 and the second grade (first resubmission) is G(2) = 100, then the final grade is (100 + 60)/2 = 80%. This approach should motivate you, in addition to submitting early, to earn as high a starting grade as possible.
The individual problems are graded on a scale from 0 to 1. In each submission the total percentage grade G(n) is obtained as the sum of the scores for the individual problems, divided by the total number of problems, times 100.
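
The grading rules above can be sketched in a few lines of code. This is only an illustration, written in Python rather than Mathematica; the function names `unbiased_grade` and `real_grade` and the sample scores are hypothetical and do not come from the course materials. The floor at G(1) reflects the statement that the final grade cannot be less than G(1); the "rare occasions" on which a resubmission can still earn 100% are not modeled.

```python
def unbiased_grade(problem_scores):
    """Unbiased percentage grade G(n): the sum of the per-problem scores
    (each between 0 and 1) divided by the number of problems, times 100."""
    if not problem_scores:
        raise ValueError("at least one problem score is required")
    if any(score < 0 or score > 1 for score in problem_scores):
        raise ValueError("each problem score must lie between 0 and 1")
    return 100.0 * sum(problem_scores) / len(problem_scores)


def real_grade(g_n, g_1):
    """Real grade for submission n: the average of the unbiased grade G(n)
    and the first grade G(1), as in equation (1).
    Assumption: the result is floored at G(1), following the statement
    that the grade cannot be less than G(1)."""
    averaged = 0.5 * (g_n + g_1)
    return max(averaged, g_1)


# Hypothetical first submission: five problems, two of them half right, one wrong.
g1 = unbiased_grade([1, 1, 0.5, 0.5, 0])   # 60.0
# Hypothetical resubmission with every problem solved.
g2 = unbiased_grade([1, 1, 1, 1, 1])       # 100.0

print(real_grade(g2, g1))                  # prints 80.0
```

For the first submission itself (n = 1) the formula returns G(1) unchanged, so the real and unbiased grades coincide.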