STA541: Categorical Data Analysis II

Required Textbooks:

Stokes, M.E., Davis, C.S., and Koch, G.G. (2012). Categorical Data Analysis Using the SAS System, 3rd edition. Cary, NC: SAS Institute Inc.

GOALS of STA 541: Students completing this course should

  • Be competent in the analysis of categorical outcomes presented in STA 507.
  • Be competent in logistic and categorical regression modeling, model building, and diagnostics.
  • Understand the difference between underdispersion and overdispersion.
  • Be introduced to inflated outcomes.
  • Be introduced to Receiver Operating Characteristic (ROC) curves and Optimal Operating Points (OOP).
  • Be introduced to mixture models.
  • Be competent with PROC LOGISTIC, PROC GENMOD, PROC COUNTREG, PROC FMM, PROC NLMIXED, and PROC GLIMMIX.
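The distinction between underdispersion and overdispersion can be illustrated with a small sketch. It is written in Python purely for illustration (the course itself uses SAS), the data are made up, and the function name `dispersion_index` is hypothetical. Under a Poisson model the variance equals the mean, so a variance-to-mean ratio well above 1 suggests overdispersion and a ratio well below 1 suggests underdispersion. This is a rough descriptive check, not a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean of the counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical count outcomes (not course data).
many_zeros_and_spikes = [0, 0, 1, 0, 7, 0, 12, 0, 2, 0]   # variance far above the mean
very_regular_counts = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]      # variance far below the mean

print("ratio, spiky counts:  ", round(dispersion_index(many_zeros_and_spikes), 2))
print("ratio, regular counts:", round(dispersion_index(very_regular_counts), 2))
```

In a regression setting the analogous quantity is usually computed from model residuals (for example, a Pearson chi-square divided by its degrees of freedom) rather than from the raw counts.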

TECHNOLOGY: Students will be using SAS 9.4. Students should be competent in using SAS prior to taking this course. The following skills are expected:

  • Reading and creating SAS data sets.
  • Using PROC REG, PROC LOGISTIC, and PROC GLM, introduced in STA 512 and STA 507.
  • Saving output files, SAS code, and SAS logs.
  • Installing SAS on a personal computer.
  • Working comfortably in the SAS lab.

Student Learning Objectives:

1. Demonstrated an understanding of count outcomes.

2. Demonstrated the ability to apply the elementary methods of statistical analysis, namely those based on count model ideas, to perform data analysis for the purposes of statistical inference.

3. Demonstrated proficiency in the effective use of computers for research data management and for analysis of data with standard statistical software packages, particularly SAS.

4. Learned to develop and critically assess model fitting for count data.

5. Applied one or more methods of statistical inference to a particular area of interest, particularly the program in the elective concentration.

6. Gained practical experience in statistical consulting and communicating with non-statisticians, culminating with interaction with research workers at a local company as part of the internship practicum.

Course Learning Outcomes: Students will be able to:

1. Determine the correct statistical analysis for a given set of data [SLO1,SLO2, SLO4]

2. Utilize statistical software to analyze linear models and correctly interpret the output. [SLO2, SLO3]

3. Utilize statistical software to perform goodness-of-link assessment for logistic regression models, produce RECEIVER OPERATING CHARACTERISTIC (ROC) CURVES and OPTIMAL OPERATING POINTS (OOPs), and correctly interpret the output. [SLO2, SLO3, SLO4]

4. Utilize statistical software to analyze count outcomes using Poisson regression, Negative Binomial regression, and Generalized Poisson regression models to address under-dispersion and over-dispersion in count data, and correctly interpret the output. [SLO2, SLO3, SLO4]

5. Utilize statistical software to analyze inflated count outcomes through ZERO-INFLATED OUTCOME TECHNIQUES AND ZERO-ALTERED MODELS, and correctly interpret the output. [SLO2, SLO3, SLO4]

6. Utilize statistical software to perform FINITE MIXTURE MODELS to accommodate multimodal outcomes. [SLO2, SLO3,SLO4]

7. Discuss goodness-of-fit techniques for ZERO-ALTERED MODELS, ZERO-INFLATED MODELS, AND FINITE MIXTURE MODELS through VUONG’S TEST. [SLO2, SLO3, SLO4]

8. Describe the EXTENSION OF THESE TECHNIQUES TO REPEATED MEASURES DATA at an introductory level. [SLO2, SLO3, SLO4]

9. Describe methods for CATEGORIZED TIME TO EVENTS at an introductory level. [SLO2, SLO3, SLO4]

10. Communicate the results of these statistical analyses in a concise, simple way that would be understandable to a non-statistician. [SLO2, SLO4, SLO5, SLO6]

EXAMS: The first exam is xxx. The second exam is yyy.

EVALUATION COMPONENTS: Two in-class tests at 25% each (TBA, TBA), the final exam at 30%, and an individual project at 20%.

Evaluation:

Exam 1 (in-class) [CLO1-CLO4, CLO7] 25%

Exam 2 (in-class) [CLO1-CLO7] 25%

Final Exam (in-class) [CLO1-CLO9] 30%

Final Project [CLO9] 20%

ATTENDANCE: Attendance is important and expected. Absence from a test is acceptable for illness/emergency/official University business. Please contact me ASAP by e-mail or phone. Written verification may be required.

DISHONESTY: Any instance of dishonesty will be dealt with according to University policy.

DISABILITIES: We at West Chester University wish to make accommodations for persons with disabilities. Please make your needs known to me and to the Office of Services for Students with Disabilities (3217). Sufficient notice is needed in order to make accommodations possible.

WITHDRAWAL:

TOPICS: We will follow Stokes et al. (2012), covering material not covered in STA 507.

HOMEWORK

All homework will be assigned at the end of each class and drawn from the textbook. Homework is DUE. The subsequent week we will review my final SAS code and the relevant SAS output. I recommend that you have your syntax and output available.

PROJECT

The project is worth 20% of your grade. Grading will be based on an oral presentation. See below for the schedule of oral presentations.
The presentation should include the following sections:

  1. Background. Give a short description of the problem and its significance.
  2. Data. Describe the variables and give the number of cases. Indicate any special characteristics concerning the experimental design.
  3. Model. Explain the statistical model that is the basis for your analysis.
  4. Results. Describe the results of your analysis. You can include short tables and graphical displays here.
  5. Conclusions. State your conclusions in terms of the context of the background information that you provided in the first section. Be concise and avoid technical jargon.

Note that there are 15 minutes between the start of each presentation. This means that you will have 10 or at most 12 minutes for your presentation. All presentations will be done in PowerPoint. POWERPOINT is the ONLY ITEM DUE.

Getting started.

  • DATA FROM THE NIDA COCAINE COLLABORATIVE STUDY, TREATMENT OF DEPRESSION STUDIES, or CLINICAL TRIALS NETWORK will be used.
  • The outcome will be a categorical outcome of one of the types covered in class.
  • Plan your analysis and potential hypotheses.
  • Do a first set of analyses, including basic descriptive statistics with plots and charts as appropriate.
  • Run your basic models, discuss the results, and refine the analysis.
  • Check model assumptions.

Presentations are scheduled for the last week.

ACADEMIC & PERSONAL INTEGRITY

It is the responsibility of each student to adhere to the university’s standards for academic integrity. Violations of academic integrity include any act that violates the rights of another student in academic work, that involves misrepresentation of your own work, or that disrupts the instruction of the course. Other violations include (but are not limited to): cheating on assignments or examinations; plagiarizing, which means copying any part of another’s work and/or using ideas of another and presenting them as one’s own without giving proper credit to the source; selling, purchasing, or exchanging of term papers; falsifying of information; and using your own work from one class to fulfill the assignment for another class without significant modification. Proof of academic misconduct can result in the automatic failure and removal from this course. For questions regarding Academic Integrity, the No-Grade Policy, Sexual Harassment, or the Student Code of Conduct, students are encouraged to refer to the Department Graduate Handbook, the Graduate Catalog, the Ram’s Eye View, and the University website at

STUDENTS WITH DISABILITIES

If you have a disability that requires accommodations under the Americans with Disabilities Act (ADA), please present your letter of accommodations and meet with me as soon as possible so that I can support your success in an informed manner. Accommodations cannot be granted retroactively. If you would like to know more about West Chester University’s Services for Students with Disabilities (OSSD), please visit them at 223 Lawrence Center. The OSSD hours of Operation are Monday – Friday, 8:30 a.m. – 4:30 p.m. Their phone number is 610-436-2564, their fax number is 610-436-2600, their email address is , and their website is at

REPORTING INCIDENTS OF SEXUAL VIOLENCE

West Chester University and its faculty are committed to assuring a safe and productive educational environment for all students. In order to meet this commitment and to comply with Title IX of the Education Amendments of 1972 and guidance from the Office for Civil Rights, the University requires faculty members to report incidents of sexual violence shared by students to the University's Title IX Coordinator, Ms. Lynn Klingensmith. The only exceptions to the faculty member's reporting obligation are when incidents of sexual violence are communicated by a student during a classroom discussion, in a writing assignment for a class, or as part of a University-approved research project. Faculty members are obligated to report sexual violence or any other abuse of a student who was, or is, a child (a person under 18 years of age) when the abuse allegedly occurred to the person designated in the University protection of minors policy. Information regarding the reporting of sexual violence and the resources that are available to victims of sexual violence is set forth at the webpage for the Office of Social Equity at

EMERGENCY PREPAREDNESS

All students are encouraged to sign up for the University’s free WCU ALERT service, which delivers official WCU emergency text messages directly to your cell phone. For more information, visit To report an emergency, call the Department of Public Safety at 610-436-3311.

ELECTRONIC MAIL POLICY

It is expected that faculty, staff, and students activate and maintain regular access to University provided e-mail accounts. Official university communications, including those from your instructor, will be sent through your university e-mail account. You are responsible for accessing that mail to be sure to obtain official University communications. Failure to access will not exempt individuals from the responsibilities associated with this course.

TENTATIVE SCHEDULE:

Wk / Topic / Chapters / CLOs
1 / Review of the Logistic Regression Model; discussion of ROCs and OOPs / 8 / CLO1, CLO2, CLO3
2 / Goodness-of-link issues with Logistic Regression models, with a focus on Logit, Probit, Complementary Log/Log-Binomial model, Complementary Log-Log / 8 / CLO1, CLO2, CLO3
3 / Count Outcomes: Poisson, Negative Binomial, Generalized Poisson (over-dispersion, under-dispersion, offset) / 9, 12 / CLO1, CLO2, CLO4
4 / Zero-inflated outcomes / 12 / CLO2, CLO3, CLO5
5 / Exam 1 / - / CLO1, CLO2, CLO3, CLO4
6 / Zero-altered analysis; goodness-of-fit (Vuong’s test) / 12 / CLO1, CLO2, CLO5, CLO7
7 / Zero-altered analysis; goodness-of-fit, continued (Vuong’s test) / 12 / CLO1, CLO2, CLO5, CLO7
8 / Finite Mixture Models / 12, 14 / CLO1, CLO2, CLO6
9 / Finite Mixture Models, continued / 12, 14 / CLO1, CLO2, CLO6
10 / Exam 2 / - / CLO1, CLO2, CLO4, CLO5, CLO6, CLO7
11 / Introduction to Longitudinal Categorical Data Analysis: Generalized Linear Mixed Model / 15 / CLO8
12 / Generalized Estimating Equations for Count Outcomes and Zero-inflated Outcomes / 15 / CLO1, CLO2, CLO8
13 / Nonlinear mixed models for Count Outcomes / 15 / CLO1, CLO2, CLO8
14 / Categorized Time to Events / 14 / CLO2, CLO9
- / Project / - / CLO10
- / Final Exam (TBA) / - / CLO1-CLO10

Important Dates:

Last day to drop

Last day to withdraw

First Exam –

Second Exam –

Final Exam -

Project –

Bibliography

Atkins, D. C., Baldwin, S. A., Zheng, C., Gallop, R. J., and Neighbors, C. (2013). A Tutorial on Count Regression and Zero-Altered Count Models for Longitudinal Substance Use Data. Psychology of Addictive Behaviors, 27, 166-177.

Gallop, R.J., Rieger, R.H., McClintock, S., and Atkins, D.C. (2013). A model for extreme stacking of data at endpoints of a distribution: Illustration with W-shaped data, Statistical Methodology, 10, 29-45.

Atkins, D.C. and Gallop, R.J. (2007). Rethinking how family researchers model infrequent outcomes: A tutorial on count regression and zero-inflated models. Journal of Family Psychology, 21, 726-735.

Gallop, R.J., Crits-Christoph, P., Muenz, L.R., and Tu., X.M. (2003). Determination and Interpretation of the Optimal Operating Point for ROC Curves derived through Generalized Linear Models. Understanding Statistics, 2, 219-242.

Stokes, M.E., Davis, C.S., and Koch, G.G. (2012). Categorical Data Analysis Using the SAS System, 3rd edition. Cary, NC: SAS Institute Inc.

Cook, R.D., and Weisberg, S. (1989). Regression Diagnostics with Dynamic Graphics, Technometrics, 31, 277-291.

Cook, R.D., and Weisberg, S. (1994). ARES Plots for Generalized Linear Models, Computational Statistics and Data Analysis, 17, 303-315.

Crits-Christoph, P., Connolly, M.B., Gallop, R., Barber, J., Tu, X., Gladis, M., and Siqueland, L. (2001). Early Improvement During Manual-Guided Cognitive and Dynamic Psychotherapies Predicts 16-Week Remission Status, Journal of Psychotherapy Practice and Research, 10, 145-154.

Dreiseitl, S., Ohno-Machado, L., and Binder, M. (2000). Comparing three-class diagnostic tests by three-way ROC analysis, Medical Decision Making, 20, 323-331.

England, W.L. (1988). An Exponential Model Used for Optimal Threshold Selection on ROC Curves, Medical Decision Making, 8, 120-131.

Fisher, L.D., and Van Belle, G. (1993). Biostatistics - A Methodology for the Health Sciences. New York: John Wiley and Sons.

Goetghebeur, E., Eiinev, J., Boelaert, J., and Vander, P.S., (2000). Diagnostic test analysis in search of their gold standard: Latent class analyses with random effects, Statistical Methods in Medical Research, 9, 231-248.

Halpern, E.J., Albert, M., Krieger, A.M., Metz, C.E., and Maidment, A.D. (1996). Comparison of Receiver Operating Characteristic Curves on the Basis of Optimal Operating Point, Academic Radiology, 3, 245-253.

Hanley, J.A. and McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.

Hsieh, F., and Turnbull, B.W. (1996). Nonparametric and Semiparametric Estimation of the Receiver Operating Characteristic Curve, The Annals of Statistics, 24, 25-40.

Hui, S.L., and Walter, S.D. (1980). Estimating the error rates of diagnostic tests, Biometrics, 36, 167-171.

Ishwaran, H., and Gatsonis, C.A. (2000). A general class of hierarchical ordinal regression models with applications to correlated ROC analysis, Canadian Journal of Statistics, 28, 731-750.

Kufera, J. and Mitchell, K. (1999). Application of ROC curves using SAS Software. American Journal of Epidemiology, 149, 98 Supplemental S.

McClish, D.K. (1987). Comparing the Area under more than 2 independent ROC curves, Medical Decision Making, 7, 149-155.

Metz, C.E., Herman, B.A., and Shen, J. (1998). Maximum likelihood estimation of Receiver Operating Characteristic (ROC) Curves from continuously distributed data, Statistics in Medicine, 17, 1033-1053.

Mossman, D. (1999). Three-way ROCs, Medical Decision Making, 19, 78-89.

Peng, C.Y.J., and So, T.S.H. (2002). Logistic Regression Analysis and Reporting: A Primer, Understanding Statistics, 1, 31-70.

Pepe, M.S. (1997). A regression modelling framework for ROC curves in medical diagnostic testing, Biometrika, 84, 595-608.

Pepe, M.S. (1998). Three approaches to regression analysis of receiver operating characteristic curves for continuous test results, Biometrics, 54, 124-135.

Pepe, M.S. (2000). An Interpretation for the ROC Curves and Inferences Using GLM Procedures, Biometrics, 56, 352-359.

Piegorsch, W.W. (1992). Complementary LOG Regression for Generalized Linear Models, American Statistician, 46, 94-99.

Pregibon, D. (1980). Goodness of Link Test for Generalized Linear Models, Applied Statistics, 29, 15-23.

Qu, Y. and Hadgu,A. (1998). A Model for Evaluating Sensitivity and Specificity for Correlated Diagnostic Tests in Efficacy Studies with an Imperfect Reference Test, Journal of the American Statistical Association, 93, 920-928.

Riddle, D.L., and Stratford, P.W. (1999). Interpreting Validity Indexes for Diagnostic Tests: An Illustration Using the Berg Balance Test, Physical Therapy, 79, 939-948.

Sainfort, F. (1991). Evaluation of Medical Technologies: A Generalized ROC Analysis. Medical Decision Making, 11, 208-220.

SAS Institute Inc. (1997). SAS/OR Technical Report: The NLP Procedure, Cary, NC: SAS Institute Inc.

Sahadevan, S., Lim, P.P.J., Tan, N.J.L., and Chan, S.P. (2000). Diagnostic performance of two mental status tests in the older Chinese: Influence of education and age on cut-off values, International Journal of Geriatric Psychiatry, 15, 234-241.

Schafer, H. (1989). Constructing a Cut-Off Point for a Quantitative Diagnostic Test, Statistics in Medicine, 8, 1381-1391.

Schulzer, M. (1994). Diagnostic Tests: A Statistical Review, Muscle and Nerve, 17, 815-819.

Toledano, A.Y., and Gatsonis, C. (1996). Ordinal regression methodology for ROC curves derived from correlated data, Statistics in Medicine, 15, 1807-1826.

Valenstein, P.N. (1990). Evaluating diagnostic tests with imperfect standards, American Journal of Clinical Pathology, 93, 252-258.

Zhou, X.H., Obuchowski, N.A., and McClish, D.K. (2002). Statistical Methods in Diagnostic Medicine, New York: John Wiley and Sons.

Zweig, M.H. and Campbell, G. (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine, Clinical Chemistry, 39, 561-577.

Box, G.E.P., and Cox, D.R. (1964). An analysis of Transformations. Journal of the Royal Statistical Society B, 26, 211-243.

Cheung, Y.B. (2002). Zero-inflated models for regression analysis of count data: A study of growth and development. Statistics in Medicine, 21, 1461–1469.

Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112, 155-159.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Hilbe, J. (2007). Negative binomial regression. New York: Cambridge.

Holtzworth-Munroe, A., Waltz, J., Jacobson, N. S., Monaco, V., Fehrenbach, P., & Gottman, J. M. (1992). Recruiting non-violent men as control subjects for research on marital violence: How easily can it be done? Violence and Victims, 7, 79–88.

Kemp, A. (1997). A Characterization of a Discrete Normal Distribution, Journal of Statistical Planning and Inference, 63, 223-229.

Khoshgoftaar T., Gao K., and Szabo R.M. (2001). An Application of Zero-Inflated Poisson Regression for Software Fault Prediction, 12th International Symposium on Software Reliability Engineering (ISSRE'01), 66.

Lee, M.K., Song, H.H., Kang, S.H., and Ahn, C.W. (2002). The determination of sample sizes in the comparison of two multinomial proportions from ordered categories, Biometrical Journal, 44, 395-409.

Littell, R.C., Milliken, G.A., Stroup, W.W., Wolfinger, R.D., and Schabenberger, O. (2006). SAS System for Mixed Models, 2nd ed., Cary, NC: SAS Institute Inc.

Nelder, J.A., and Wedderburn, R.W.M. (1972). Generalised Linear Models. Journal of the Royal Statistical Society A, 135, 370-384.

Schacht, R.L., Dimidjian, S., George, W.H., and Berns, S.B. (2009). Domestic Violence Assessment Procedures among Couple Therapists, Journal of Marital and Family Therapy, 35, 47-59.

Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.

Yau, K. K. W., Wang, K., & Lee, A. H. (2003). Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical Journal, 45, 437–452.

Yau, K. K. W., Lee, A. H., & Carrivick, P. J. W. (2004). Modeling zero-inflated count series with application to occupational health. Computer Methods and Programs in Biomedicine, 74, 47-52.

Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. (2009). Generalized linear mixed models: A practical guide for ecology and evolution. Trends in Ecology and Evolution, 24(3), 127–135.

Liang, K. Y. & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22. doi:10.1093/biomet/73.1.13

Molenberghs, G. & Verbeke, G. (2005). Models for discrete longitudinal data. New York, NY: Springer.

Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9-25.

Coxe, S., West, S. G., & Aiken, L. S. (2009). The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment, 91, 121-136.


Course Description:

This course extends the material presented in STA 507. We will cover statistical methods for producing Receiver Operating Characteristic (ROC) curves and the Optimal Operating Point from logistic regression. Goodness-of-link and complex modeling issues for count data, such as overdispersion and underdispersion, will be presented. Students will be exposed to techniques for both cross-sectional and longitudinal count data. Techniques to assess goodness of fit for count data will be introduced. Students will be exposed to various programming techniques for fitting such data in SAS using procedures such as PROC GENMOD, PROC COUNTREG, PROC FMM, PROC GLIMMIX, and PROC NLMIXED. Upon completion of this second part of Categorical Data Analysis, students will be comfortable with the analytical techniques for a variety of count outcomes in real-world settings. Proper communication and interpretation of these models is an essential component of the course.
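As a rough illustration of what an ROC curve and an operating point are, the sketch below computes sensitivity and specificity at each candidate cutoff and picks the cutoff that maximizes Youden's J (sensitivity + specificity - 1). It is written in Python for illustration only (the course uses SAS), the scores stand in for hypothetical predicted probabilities from a logistic regression, and the function names are made up. Youden's J is one common criterion and is an assumption here; the optimal operating point as defined in the course readings may weight costs and prevalence differently.

```python
def roc_points(scores, labels):
    """Return (cutoff, sensitivity, specificity) for each distinct score.

    A case is classified positive when its score is >= the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        points.append((cutoff, true_pos / n_pos, true_neg / n_neg))
    return points


def youden_optimal(points):
    """Return the point maximizing Youden's J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)


# Hypothetical predicted probabilities and observed outcomes (1 = event).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

best = youden_optimal(roc_points(scores, labels))
print("cutoff, sensitivity, specificity:", best)
```

Plotting sensitivity against 1 - specificity over all cutoffs gives the ROC curve; the selected point is the location on that curve farthest above the diagonal.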
