Syllabus: CS 5xx/5xxG Big Data Analytics. 2/28/2015

Dr. John Neitzke

Office: VH 2248

E-Mail:

Phone: x4529

Web: jneitzke.sites.truman.edu

Office Hours: 8:00 – 8:20 M W F; 10:30 – 12:20 M W F; 1:30 – 2:20 M W; and by appointment

Videoconferencing and telephone visits available during office hours and by appointment

Catalog Description: CS 516 Big Data Analytics.

Prerequisite: Intro to Data Science CS 510.

Exploration of data analysis of very large data sets. Problems of scalability, network failure, and ill-suited data sets. Examination of the capabilities and limitations of available tools.

Texts:
Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Nathan Marz and James Warren, Manning Publications, 2015.

Big Data Made Easy, Michael Frampton, Apress, 2014.

Hadoop: The Definitive Guide, Tom White, O’Reilly, 2015.

Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More, Matthew Russell, O’Reilly, 2013.

Selected articles and online documentation.

This course focuses on the “big” part of big data. Data is only useful if it can be analyzed. A major issue with analysis of massive data sets is that techniques appropriate for smaller data sets may not scale up to permit analysis of large data sets in reasonable time frames.

The principal toolset we will use is Hadoop. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
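The idea of handling failures at the application layer can be illustrated with a very small Python sketch. This is not Hadoop's scheduler; the function and node names are hypothetical, and a `RuntimeError` merely stands in for a machine failure. The sketch only shows the general pattern of re-running a task on another machine when one machine fails.

```python
def run_with_failover(task, workers):
    """Run `task` on the first worker that succeeds.

    `workers` is a list of (name, callable) pairs. A RuntimeError raised
    by a worker is treated as a node failure (an assumption of this sketch).
    """
    failures = []
    for name, worker in workers:
        try:
            return name, worker(task)
        except RuntimeError as exc:
            # Record the failure and move on to the next machine.
            failures.append((name, str(exc)))
    raise RuntimeError(f"task {task!r} failed on every worker: {failures}")


def broken_node(task):
    """Hypothetical worker that always fails."""
    raise RuntimeError("node unreachable")


def healthy_node(task):
    """Hypothetical worker that sums a list of numbers."""
    return sum(task)


if __name__ == "__main__":
    cluster = [("node-1", broken_node), ("node-2", healthy_node)]
    print(run_with_failover([1, 2, 3], cluster))
```

The caller never sees the failure of the first node; it only sees the result from the node that completed the work.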

Within Hadoop, we will examine MapReduce techniques for parallel processing, as well as Pig and Hive, which provide high-level, SQL-like access to unstructured data. We will also look at NoSQL ("Not only SQL") databases, which use alternatives to the tabular format of relational databases. We will examine storage solutions, exemplified by HBase, for their critical features: speed of reads and writes, data consistency, and ability to scale to extreme volumes. We will examine memory-resident databases and streaming technologies, which allow analysis of data in real time, and we will work with the public cloud as a resource for big data analytics. We will also consider the web as a data source, particularly social media.
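As a first look at the MapReduce model, the following Python sketch counts words using separate map, shuffle, and reduce steps. It runs in memory on a single machine and does not use Hadoop; the function names are hypothetical and chosen for illustration. In a real cluster the framework distributes the map and reduce calls across machines and performs the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is only useful if analyzed"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, both steps can be spread across many machines without coordination.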

The underlying goal is for students to be able to design highly scalable systems that can accept, store, and analyze large volumes of unstructured data.

Topics:

  • Overview of Big Data
      • Storing and configuring data
      • Collecting data
      • Processing data
      • Scheduling and workflow
      • Moving data
      • Monitoring data
      • Cluster management
  • Challenges of scalability
  • Apache Hadoop project overview
      • Comparison with other systems
  • MapReduce
      • Hadoop Streaming
      • Hadoop Pipes
  • Hadoop distributed file system
      • Concepts
      • Command-line interface
      • Filesystem interfaces
      • Java interface
      • Data flow
  • Hadoop I/O
      • Data integrity
      • Compression
      • Serialization
      • File-based data structures
  • MapReduce application development
  • MapReduce job run
      • Failures
      • Job scheduling
      • Shuffle and sort
      • Task execution
  • MapReduce types and formats
      • Input formats
      • Output formats
  • MapReduce features
      • Counters
      • Sorting
      • Joins
  • Hadoop clusters
      • Setup
      • Configuration
      • Benchmarking
  • Pig
      • Installing and running
      • Data processing operators
  • Other Hadoop-related projects
      • Hive
      • HBase
  • Mining social media
      • Twitter
      • LinkedIn
      • Facebook
      • Google+
      • Web pages

Student projects will largely use publicly available datasets. Students who have access to proprietary databases may use them for their projects where appropriate, with the written permission of the database owner.

Learning Outcomes: On completion, the student should be aware of the general capabilities of the techniques available, and be able to recognize the appropriate approach and apply an appropriate tool to a data analysis problem on a very large data set. Specifically, the student should be able to use features of Hadoop and other related data analysis tools to analyze very large data sets. The student should be able to analyze web data sources, including social media. The student should be able to recognize the limitations of the tools. The five specific areas are:

  • Big data concepts
  • Hadoop
  • Other tools related to Hadoop
  • Cloud Computing
  • Social Media

Competency Assessment: This is a competency-based course. This means that you work at your own pace and may complete the graded assessments whenever you think you are ready. Students will be evaluated on each of the learning outcome areas by several assessments, including assignments, tests or an identified portion of a test, and the project. A score of 80% (equivalent to a B) or above will signify competence for a given assessment, while a score of 90% (equivalent to an A) or above will signify mastery. Competence in all assessments for a given topic will be required to demonstrate competence for that topic, while competence in all course topics will be required to demonstrate competence for the course. Students scoring lower than 80% in the course will not be deemed to have achieved competency. Students may retake assessments until a grade signifying competency has been achieved. Failure to achieve this mark by the deadline announced at the start of the course will result in a transcripted grade of F.
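For illustration only, the competency rule can be written out as a short Python sketch. The thresholds come from the paragraph above; the function names and the sample scores are hypothetical, and the sketch assumes that each assessment is represented by the score from its most recent attempt. It is not the official grade calculation.

```python
COMPETENCE = 80.0  # percent; signifies competence on an assessment
MASTERY = 90.0     # percent; signifies mastery on an assessment


def assessment_level(score):
    """Classify one assessment score against the two thresholds."""
    if score >= MASTERY:
        return "mastery"
    if score >= COMPETENCE:
        return "competence"
    return "not yet competent"


def topic_competent(scores):
    """A topic is passed only if every assessment in it reaches competence."""
    return bool(scores) and all(score >= COMPETENCE for score in scores)


def course_competent(scores_by_topic):
    """The course is passed only if every topic is passed."""
    return bool(scores_by_topic) and all(
        topic_competent(scores) for scores in scores_by_topic.values()
    )


if __name__ == "__main__":
    # Made-up scores for two of the five outcome areas.
    example = {"Hadoop": [85, 92], "Social Media": [78, 95]}
    for topic, scores in example.items():
        print(topic, [assessment_level(s) for s in scores], topic_competent(scores))
    print("course competent:", course_competent(example))
```

In the made-up example, a single score below 80% in one topic is enough to leave the course incomplete until that assessment is retaken.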

General policies:

  • You will be required to demonstrate the correctness of your assignments and projects, with test data that exercises the full program and fairly discloses its limitations, if any.
  • I will use Blackboard for distribution of material and communications and recording of grades. Since Blackboard will send email only to Truman accounts, you should set up your Truman account to forward if you use another address. I have heard that some spam filters weed out the stuff from Blackboard, so be wary. I consider email an official method of communication and will assume that you receive everything I send.
  • Assignments should be prepared in LaTeX or a suitable word processor. Figures should be prepared with appropriate software.
  • Keep all your assignments until the end of the semester to verify that your recorded grades are correct.

Persons with Disabilities: If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and the Disability Services office (x4478) as soon as possible.

Collaboration Policy

  • Written homework is a means to learning and understanding the material, not an end in itself. I expect and in fact encourage collaboration on homework, but I also expect you to make a good-faith effort to solve problems by yourself first. Collaboration should be at the level of ideas, not the details of a solution: do not read someone else’s answers, and write up your solutions on your own. Do not fool yourself into thinking that copying down someone else’s work will teach you as much as doing the work yourself. Also, please note the names of the people with whom you worked.
  • Programs and projects should be your own work, unless the assignment explicitly allows or requires group work. I don’t object to your sharing ideas or getting help in debugging; I do object to sharing code and solutions.

Academic integrity is the cornerstone of the academic community. I expect academic honor and integrity from my students. Violations of academic integrity are antithetical to the mission of the university and are deeply offensive to faculty and to other students. To put it bluntly, I will not tolerate cheating. All of your work in this class is to be your own. You are not permitted to share code with or receive code from another student, nor from any outside source. You cannot download code and present it as your own work.

  • You are allowed to talk about your assignments with other students currently in this course, but you are expressly prohibited from sharing code. You may not share your code via email, printouts, or any other means. Do not post your code in any online forum. You are not allowed to talk about details of your solutions to your assignments with people other than the current students or the instructor.
  • You must acknowledge all communication with other students when you submit the assignment and cite all sources, including the text and the instructor.
  • Anyone giving or receiving a program or assignment to or from another student will receive a zero for that program and a report will be filed with the chair of the Department of Computer Science. Further penalties are possible.
  • Contact me if you have any questions.