Syllabus Description COSI 129a

Introduction to Big Data Analysis

Introduction. The amount of data produced across the globe has been increasing and will continue to grow at an accelerating rate for the foreseeable future. At companies across all industries, servers are overflowing with usage logs, message streams, transaction records, sensor data, business operations records and mobile device data. Effectively analyzing these huge collections of data, commonly known as “big data”, can create significant value, which in turn creates strong demand for experts able to carry out such analysis.

Effective big data analysis requires skills in a range of computer science areas, such as data storage and processing, statistical data analysis and computational linguistics, as well as the ability to combine this knowledge in novel ways. COSI 129a will allow students to combine principles from multiple domains (natural language processing, machine learning, distributed systems design, parallel programming) to analyze large volumes of unstructured data.

Course Content. COSI 129a will convey knowledge of the principles and practices underlying the state of the art in big data analysis. Initially, we will introduce big data challenges in the domain of computational linguistics as well as the fundamentals of natural language processing. The course will then review available frameworks (MapReduce/Hadoop) for large-scale data collection, storage and processing, including recent advanced optimizations. It will also investigate scalable statistical machine learning techniques (e.g., clustering, classification, regression) as well as existing scalable machine learning tools (e.g., Mahout). Finally, the course will address how these mechanisms and technologies fit together to tackle natural language processing tasks on massive-scale data sets.

Learning Goals. The learning objectives of this class are to:

●  Introduce computational linguistic techniques and tools for tackling natural language processing problems.

●  Introduce the state of the art in scalable data management and processing, with a particular focus on the MapReduce framework and its Hadoop implementation.

●  Study, practice, and implement statistical models and machine learning algorithms for the purpose of Big Data analytics.

●  Provide students with hands-on experience in analyzing large volumes of unstructured data.

Covered Topics

·  Basics of NLP & available tools for NLP

·  The MapReduce programming framework

·  The Hadoop implementation

·  Scalable statistical analysis & machine learning techniques

·  Data storage, data cleaning, and software practices

Audience. The course is intended for upper-level undergraduate students and graduate students who have a solid background in programming and computer systems organization. Students are required to have taken COSI 12b (Advanced Programming Techniques) or its equivalent.

Prerequisites. COSI 12b or the equivalent

Required Reading. There is no required textbook for the course. The course will rely mostly on published papers and online resources. The instructors will also make available lecture notes/slides on the topics covered in class.

Grading

30% Quizzes (3 x 10%)

70% Programming Assignments (10% for A0 and 20% each for A1-A3)