Big data for federal agencies: Lab
SURV 699U
syllabus, April 23, 2015
Instructors
Julia Lane,
Frauke Kreuter,
TAs: Joshua Tockle (software) and Christina Jones (data)
Office hours by appointment
Class Structure and Course Concept
The amount of digital data generated as a by-product of activity in society is growing fast: data from satellites, sensors, transactions, administrative processes, social media, and smartphones, for example. This type of data is characterized by high volume, high velocity, and high variety, and is often called big data. The hope is to gain insights from this data in areas such as health and crime prevention, infrastructure planning, and business decisions. Big data is of interest to agencies that produce statistics as an alternative data source, whether to reduce cost, to improve estimates, or to produce estimates in a more timely fashion. On the economic statistics side in particular, this interest is growing rapidly. The changes in the nature of the new types of data, their availability, and the way in which they are collected and disseminated are fundamental. They constitute a paradigm shift for agencies that in the past relied primarily on survey research. However, the data quality frameworks well established in statistics production still hold.
Paired with a specially designed lecture (SURV699Y), this lab session will allow students to apply the techniques they learn through a worked example relevant to the core work of the Federal Statistical Agencies. In addition, students will work in group projects on topics relevant to their individual agencies. These projects are prepared specifically for this course with agency partners. The two core instructors will bridge computer science and the economic and social science applications.
Grading Method (regular, letter). The course will fulfill the requirement of an elective in the JPSM Master's and Certificate Programs.
Credit Option (3.0 credit lab session)
Standard Semester 15 weeks
Prerequisites (familiarity with survey and administrative data; mastery of a statistical software package; completion of an online Python programming class or pre-class bootcamp offered by TAs)
Recommended (MA in statistics, econometrics, or survey methodology)
Learning outcome
Learn how to think about data analysis to solve social problems using and combining large quantities of heterogeneous data from a variety of different sources. Learn how to evaluate which data are appropriate to a given research question and statistical need. Learn the different data quality frameworks and learn how to apply them. Learn the basic computational skills required for data analytics (for text-mining, large-scale data integration and visualization), typically not taught in social science, economics, statistics or survey courses. Learn how to apply statistical and data quality frameworks to big data problems. Identify new approaches to creating and displaying information for Federal Agencies.
Weekly Structure
Friday 9am – noon new material will be introduced (see detailed schedule below)
Friday 1pm – 3pm Lab session on new material
Wednesday 9am – noon Lab session on homework problems and projects
Ongoing video lectures on Python and machine learning
Location (Friday Session)
Summit LL, 601 New Jersey Ave NW
- with Ethernet connection
- Video connection capabilities
- Recording capabilities
- Data on secure server at NYU/CUSP
Wednesday Sessions on Data Projects via Adobe Connect
Grading
Weekly online quizzes (worth 10% total)
Class participation (10% of grade)
Course project and homework (60%) in one of two options
- Option a: Heavier emphasis on projects. Students self-select into groups or choose to do individual projects. Baseline homework is expected to be done every week.
- Option b: Heavier emphasis on homework. Baseline homework is expected to be done every week; additional, more in-depth problems tied to agency problems are assigned.
Class presentations (20% of grade)
Assignment due dates are indicated in the syllabus. Late assignments will not be accepted without prior arrangement with the instructors.
Readings
TBD
Academic Conduct
Clear definitions of the forms of academic misconduct, including cheating and plagiarism, as well as information about disciplinary sanctions for academic misconduct, may be found at the University of Maryland Graduate School web site:
http://www.graduate.umaryland.edu/policies/misconduct.html
Knowledge of these rules is the responsibility of the student and ignorance of them does not excuse misconduct. The student is expected to be familiar with these guidelines before submitting any written work or taking any exams in this course. Lack of familiarity with these rules in no way constitutes an excuse for acts of misconduct. Charges of plagiarism and other forms of academic misconduct will be dealt with very seriously and may result in oral or written reprimands, a lower or failing grade on the assignment, a lower or failing grade for the course, suspension, and/or, in some cases, expulsion from the university.
Class Schedule
9/9 1. Introduction and Motivation
In this first lab session, project groups will be formed and group project ideas discussed. Students will learn how to structure a research question and think about which data are appropriate for that question. Key quality frameworks will be addressed in this discussion.
9/16 2. Database Basics
Students will practice extracting data and bringing data together, using curated data on the course mime. The skills practiced are SQL/MySQL. Database taxonomies will be discussed. Data curation and data documentation will be explained and practiced, as well as the use of GitHub.
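As a rough illustration of what "bringing data together" means in SQL, the sketch below joins two small tables and aggregates the result. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names, column names and values are hypothetical, not the course data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the course MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical tables: institutions and the awards they received.
conn.executescript("""
CREATE TABLE institutions (inst_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE awards (award_id INTEGER PRIMARY KEY, inst_id INTEGER, amount REAL);
INSERT INTO institutions VALUES (1, 'University A'), (2, 'University B');
INSERT INTO awards VALUES (10, 1, 100.0), (11, 1, 250.0), (12, 2, 75.0);
""")

# Join awards to institutions on the shared key, then summarize per institution.
rows = conn.execute("""
SELECT i.name, COUNT(a.award_id), SUM(a.amount)
FROM institutions AS i
JOIN awards AS a ON a.inst_id = i.inst_id
GROUP BY i.name
ORDER BY i.name
""").fetchall()

for name, n_awards, total in rows:
    print(name, n_awards, total)
```

The same SELECT ... JOIN ... GROUP BY pattern carries over to MySQL; only the connection code would differ.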
9/23 3. Visualization
This lab session will focus on visualizations. Students will be introduced to Tableau. They will work on the IPython exercises and discuss geography and the usefulness of displaying data by geography and networks. USPTO patent data will be part of the exercise set. Students will create maps of NSF and NIH spending based on categories, classifications, etc.
9/30 4. Discussion of group projects
10/7 5. Understanding the Uses of Social Media and Using APIs
Students will learn how to use Web APIs. Exercises will include web scraping. Pyalm and the PLOS API will be introduced.
10/14 6. Programming with Big Data
Exercises will include practice problems to be solved with MongoDB, MapReduce and Hadoop. NIH and NSF award data will be used. Students will learn to build datasets, think through a sample big data workflow, and explore record linkage techniques.
10/21 7. Networks
Practice problems include tools like Pajek, R, and RGraph. Students will explore how to visualize networks and identify notable features. The example data from UMETRICS will be used, as well as USPTO patent and publication data.
10/28 8. Data Linkage
Students will practice traditional Fellegi-Sunter and clustering-based record linkage. An IPython notebook is used to introduce name standardization and the use of edit distance and Fellegi-Sunter to match university employees. Data include web-scraped faculty directories that will be matched to patents and NSF/NIH PI data.
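The name standardization, edit distance and Fellegi-Sunter steps named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course notebook: the field names, the m- and u-probabilities, and the distance threshold are all hypothetical values chosen for the example.

```python
import math
import re

def standardize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^a-z\s]", " ", name.lower())
    return " ".join(name.split())

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical (m, u) pairs: m = P(field agrees | true match),
# u = P(field agrees | non-match). Real values would be estimated from data.
M_U = {"last": (0.95, 0.01), "first": (0.90, 0.05)}

def agrees(a, b, max_dist=1):
    """Treat two values as agreeing if they are within max_dist edits."""
    return edit_distance(standardize(a), standardize(b)) <= max_dist

def match_weight(rec_a, rec_b):
    """Sum of Fellegi-Sunter log2 agreement/disagreement weights over fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if agrees(rec_a[field], rec_b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

directory = {"first": "Jon", "last": "Smith"}       # hypothetical scraped record
grant_pi = {"first": "John", "last": "SMITH,"}      # hypothetical PI record
other_pi = {"first": "Maria", "last": "Gonzalez"}   # hypothetical PI record

print(match_weight(directory, grant_pi))   # positive: likely the same person
print(match_weight(directory, other_pi))   # negative: likely different people
```

In a full Fellegi-Sunter procedure, pairs are classified as links, non-links, or candidates for clerical review by comparing this weight to two thresholds.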
11/4 9. Machine Learning
Examples will introduce students to general machine learning topics: training, error, and classifiers. In the discussion section, links to more standard statistical techniques will be made. The lab will introduce Python packages including numpy, scipy, pandas, and scikit-learn. NIH/NSF proposal text data will be used for classification, along with CV classification.
11/11 10. Text Analysis
IPython notebooks will walk the user through text classification exercises using scikit-learn and nltk. Students will apply different classification algorithms to a corpus of web pages.
11/18 Group Presentations
11/25 11. Non-random Samples and Statistical Inference
The lab session will work through a series of case studies and examine their inferential properties by applying a total error framework. Issues of coverage will be discussed. Weight development will be practiced for web-scrape examples from earlier sessions.
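One simple form of weight development is post-stratification, where each sampled unit is weighted by the ratio of its group's population share to its sample share. The sketch below assumes known population shares; the group labels and shares are hypothetical and stand in for whatever benchmark data are available.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Weight each unit by (population share) / (sample share) of its group."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]

# Hypothetical web-scraped sample that over-represents one group.
sample = ["young"] * 6 + ["old"] * 4
shares = {"young": 0.5, "old": 0.5}   # assumed known population shares

weights = poststratification_weights(sample, shares)
print(weights[0], weights[-1])
```

Over-represented groups receive weights below 1 and under-represented groups weights above 1, and the weights sum to the sample size. Such weights correct only for imbalance on the variables used to form the groups.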
12/2 Discussion of Group projects
12/9 13. Privacy and Confidentiality
The lab session will discuss statistical disclosure techniques and apply some simple techniques to the data used in class. Consent requests will be reviewed and discussed. Discussion will also include the current general legal framework and likely changes within the U.S. and Europe.