Technology of Privacy

Fall 2017

Technical Data Processing Project

Version 1.2

Due: Friday, November 3, 2017, 5:00 PM

GRADED Assignment

Worth 25% of final grade

The third, and final, graded assignment for this class is much like the problem sets you have completed. Unlike problem sets, however, this assignment will require you to hand in documents with specific answers. Unlike most graded assignments in law school, there are “right” and “wrong” answers for much (although not quite all) of this assignment.

You will be expected to turn in two documents:

1.  A Linux shell script (described below)

2.  An Excel workbook.

The only prompts below that require you to submit something and that will be graded are clearly labeled, in bold, as “Problem N” (e.g. “Problem 1” or “Problem 9”). Throughout this document, additional small questions or instructions are presented to help you understand the assignment. If a question is not labeled “Problem N”, its answer should not be submitted and will not be graded.

You can (but are not obligated to) work on this assignment in groups of no more than three people each. Each person must independently produce and submit his or her own individual shell script and workbook.

The Dataset:

For this assignment, you will be analyzing a dataset available on our class Canvas. This dataset, called alldata.csv, contains several years’ worth of tweets posted to Twitter from people located on or around the Georgetown Law campus.

IMPORTANT: This dataset was shared with me with the strict understanding that it would not be shared broadly or posted publicly. Although all of the tweets in this file were posted publicly on Twitter, it is a good reminder of a core theme of this class that aggregations of public data can sometimes raise new sensitivities or otherwise harm individual privacy. Please do not misuse this data!

About Shell Scripts

Shell scripts can be thought of as very lightweight computer programs. At their most basic, they are simply text files with Linux command line commands—the kind we were exploring earlier in the semester—listed, one per line. When “run,” a shell script will simply march through the file line-by-line, executing each command line in turn. It is like a macro that allows you to repeat a series of command line steps, saving you the burden of having to type them all in each time.
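For instance, a file containing nothing but the following lines is a complete, working shell script (the commands here are just placeholders, not part of the assignment):

```shell
# Contents of a hypothetical script, e.g. myscript.sh.
# Each line is an ordinary command, executed top to bottom.
echo "Starting analysis..."
date
echo "Done."
```

Saving those lines in a file called myscript.sh and running `bash myscript.sh` executes all three commands in order, exactly as if you had typed them at the prompt.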

Part IV of the Shotts book provides an in-depth overview of shell scripts. Because our shell scripts will be rudimentary, you probably need to read only Chapter 24 to learn all you need to know.

Shell scripts are created using Linux programs called “text editors.” Think of a text editor as a simplified word processor. The Shotts book recommends you use a text editor called “vi” (also called “vim”) to create and edit your shell scripts. It provides an introduction to vi in Chapter 12. Although vi is widely used and weirdly beloved, it’s also quite counter-intuitive, in my opinion. I use vi, but I think it would be overkill to inflict it on you by forcing you to learn it.

You might find it easier to work with “pico”, which is more like Microsoft Word in its commands and conventions. You can find lots of help online about how to use pico; one good tutorial is http://www.cs.virginia.edu/diochnos/tips/terminal/pico_tutorial.pdf. You can use either editor (or any other you happen to know) for this assignment.

Your shell script should reside in the same directory as alldata.csv and should operate on or analyze this file. If the problems below ask you to “print to the screen” a particular answer, your shell script should simply run the appropriate command. If a problem asks you to “create a file” your shell script should use the file redirect mark “>” after a Linux command line to create the correct file. In other words, you will be submitting neither the answers nor the printed output themselves. Instead, you’ll be submitting a shell script that can generate those answers or output.
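As a sketch of the difference between the two kinds of tasks (using a made-up stand-in file, since the actual commands you will need are described in the Problems below):

```shell
# Make a tiny stand-in data file (3 lines), for illustration only.
printf 'row one\nrow two\nrow three\n' > standin.csv

# "Print to the screen": just run the command.
wc -l standin.csv

# "Create a file": add the redirect mark ">" and a filename.
# Nothing appears on the screen; the output lands in the file instead.
wc -l standin.csv > linecount.txt
```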

Grading

Even though there are “correct” answers for almost every Problem below, I expect some variability in the way you present those answers, which may require some subjectivity on my part as the grader. Each Problem below is worth three points, meaning the entire assignment is worth 36 points. (Problems with multiple parts are worth three points cumulative.) For each problem, I will use the following rubric:

Points / Meaning
3 / Correct answer, correctly presented or generated.
2 / Minor errors in answer, presentation, or generation.
1 / Major errors in answer, presentation, or generation.
0 / Entirely incorrect or impossible to diagnose what happened.

To grade your shell script, I will literally copy it into a new directory containing only alldata.csv and execute your script. If it prints the correct output to the screen and creates the correct files, you will earn full credit. If it prints an incorrect output, fails to properly generate the correct file, or doesn’t work at all, I will read the script to see if I can find your mistake. In this case, I will deduct some credit.

To grade your Excel workbook, I will open it up and grade your answers by hand.

Setting up your work environment.

The following steps are not worth any points, but they are necessary predicates to complete the rest of the assignment.

1.  Using VirtualBox, start your Linux virtual machine (VM).

2.  Once it has started, use FileZilla to upload alldata.csv to your VM. You may want to create a new folder for this file.

Next, log into your VM. Find alldata.csv and use the “more” or “less” command to view its contents. Spend some time acquainting yourself with the data. In particular, pay attention to the first line of the file, which we call the “header line”. See if you can decode each entry in that line, which will probably require you to glance at some of the data below.

What character is used to separate bits of information on each row? We call this character the “delimiter” and we say that the delimiter separates each row into multiple “fields” of information. What does “csv” in the file extension mean?

Mac and Linux users: it may be possible to use the Terminal program to generate the shell script, meaning you won’t need to use your VM at all. Feel free to do this, if you choose.

Shell Script Tasks

You should name your shell script data-<netID>.sh replacing <netID> with your Net ID. For example, my script would be called data-pko6.sh. Your shell script should do all of the following for full credit:

Problem 1. Print to the screen the number of total rows of data in alldata.csv (for this and all subsequent problems, just print the bare number on a line by itself).

Problem 2. Print to the screen the number of total rows of data in alldata.csv that contain each of the following words, printing each answer on its own row (three rows in total).

2a. Clinton

2b. Trump

2c. Georgetown

Important: Your answers should be case insensitive, meaning they should ignore upper and lower case. In other words, a row should be counted for 2a if it includes “Clinton”, “clinton”, “CLINTON”, “cLiNtOn”, etc.
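To see what case-insensitive matching and counting look like in practice, here is a sketch on a throwaway file (the file and its contents are invented, and this is not necessarily the exact command you will use):

```shell
# Throwaway example file; 'rows.txt' is made up for illustration.
printf 'Clinton wins\ncLiNtOn again\nno match here\n' > rows.txt

# -i makes the match case insensitive; -c counts matching lines
# instead of printing them.
grep -ci clinton rows.txt   # prints 2
```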

Problem 3. Print out just the screen name field of each row (called u_screen_name in the header line), but only for the last ten rows of alldata.csv.

Hint: You’ll need to use two Linux commands separated by a “pipe” (|) character to accomplish this. The first command is “cut”, which is really complicated but powerful. See Shotts, Chapter 20 for more details.
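To illustrate the shape of that pipeline on a toy file (the real alldata.csv has many more fields, and the field number below is invented for the example):

```shell
# Toy stand-in with three comma-separated fields.
cat > toy.csv <<'EOF'
date,u_screen_name,text
2014-01-01,alice,hello
2014-01-02,bob,world
2014-01-03,carol,again
EOF

# cut -d, -f2 keeps only the 2nd comma-separated field of each row;
# tail then keeps only the last two rows of that output.
cut -d, -f2 toy.csv | tail -n 2   # prints: bob, then carol
```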

Problem 4. How many unique screen names (u_screen_name) are present in this dataset? Print only the numeric answer to the screen.

Big Hint: You need to use the following four linux commands separated by pipe characters to accomplish this, in this order: cut, sort, uniq, and wc. sort and uniq are both described in Chapter 20. Why is “sort” necessary for this to work?
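As a sketch of that four-command chain on a toy file (field position invented): uniq only collapses adjacent duplicate lines, which is why the sort step must come first to bring identical names together.

```shell
# Toy stand-in; each line is a screen name (plus the header line).
cat > toy.csv <<'EOF'
u_screen_name
alice
bob
alice
carol
EOF

# sort groups duplicates together so uniq can collapse them;
# wc -l then counts the remaining (unique) lines.
# Note the header line gets counted too.
cut -d, -f1 toy.csv | sort | uniq | wc -l   # prints 4
```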

Problem 5. Create a new file called dates-all.csv containing each unique date appearing in the alldata.csv file in the “date” field as well as the number of times that date appears in the file.

Hint: The workflow is almost identical to Problem 4 with two differences. First, “uniq -c” will prepend each row with the count of how many times that value appears, as the problem requires. Second, rather than “wc”, end your pipeline with a redirect (“>”) followed by dates-all.csv

Hint 2: The first few lines of your file should look like:

6 2014-01-01

52 2014-01-02

73 2014-01-03

...
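A sketch of that workflow on a toy file (the date field position here is invented, and the output filename is changed so as not to suggest this is the graded answer):

```shell
# Toy stand-in: one date field per row.
cat > toy.csv <<'EOF'
2014-01-01,alice
2014-01-01,bob
2014-01-02,carol
EOF

# uniq -c prepends each unique value with its count;
# the redirect sends the result into a file instead of the screen.
cut -d, -f1 toy.csv | sort | uniq -c > dates-toy.csv
```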

Problem 6. Three of the most common words tweeted in the file are “fitness”, “starbucks”, and “happy”. For each of these words, your script should create the same type of count/date file you created in Problem 5, but listing dates only for those rows containing the word anywhere in the row. Call these files, respectively, dates-fitness.csv, dates-starbucks.csv, and dates-happy.csv.

Hints: You need to use the redirect symbol “>” to create each file. Also, be sure to include rows that include those keywords, regardless of case. Reminder: your shell script should create these files each time it is run. In other words, you will be submitting a shell script that can generate these three files, rather than submitting these three files themselves.
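One plausible shape for each of these pipelines, sketched on a toy file (field position and filename invented): filter the rows case-insensitively first, then reuse the count/date workflow from Problem 5.

```shell
# Toy stand-in with a date field and a text field.
cat > toy.csv <<'EOF'
2014-01-01,I love Starbucks
2014-01-01,STARBUCKS run
2014-01-02,coffee elsewhere
EOF

# grep -i keeps only rows containing the keyword, any case;
# the rest is the same count/date pipeline ending in a redirect.
grep -i starbucks toy.csv | cut -d, -f1 | sort | uniq -c > dates-starbucks-toy.csv
```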

Problem 7. Find another word that occurs at least 15 times in the dataset. Avoid short words like “and”, “or”, and “not”. For this word, create the same kind of count/date file you created in Problems 5 and 6, listing dates only for those rows containing your word anywhere in the row. Call this file dates-<word>.csv, where <word> is your word.

Problem 8. Generate a list of all of the unique hashtags that appear in tweets in the dataset. (For the uninitiated, a hashtag is a pound or hash sign (#) followed by a string of letters or numbers without spaces.) Unlike every other problem thus far, you should search only the content of the tweets, which are stored in the 22nd field of each row, the one labeled “text” in row 1. If a tweet contains more than one hashtag, just list the first one. Print the first ten (not first ten unique, just first ten counting duplicates as separate hashtags) that occur in the file to the screen.

Hint: You will need to use regular expressions with grep. Remember that the -E flag is necessary to use regular expressions. Also the -o flag is probably necessary.
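To see what -E and -o do, here is a sketch on a throwaway file. Note that -o prints every match on a line, so restricting the output to only the first hashtag of a multi-hashtag tweet takes an additional step, left to you.

```shell
# Throwaway file of tweet-like text, invented for illustration.
printf 'Check out #Law and #privacy\nno tags here\n#Tech2017 rules\n' > tweets.txt

# -E enables extended regular expressions; -o prints only the
# matched text, one match per output line (non-matching lines vanish).
grep -oE '#[A-Za-z0-9]+' tweets.txt
```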

Problem 9. Based on the answer to Problem 8, list the five most commonly used hashtags in this dataset. Print these to the screen. Assume that two hashtags that are spelled the same but capitalized differently (e.g. #Law, #law, and #LAW) are different hashtags.

Hint: This is sort of a combination of Problem 5 and Problem 8.
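A sketch of the counting half on a throwaway list of hashtags (the extraction half comes from Problem 8):

```shell
# Throwaway list of hashtags, one per line, invented for illustration.
printf '#law\n#law\n#tech\n#law\n#tech\n#privacy\n' > tags.txt

# Count each unique hashtag, then sort the counts numerically in
# reverse (-rn) so the most common come first; head keeps the top few.
sort tags.txt | uniq -c | sort -rn | head -n 5
```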

Moving the data to Excel

The remainder of the assignment will be conducted on your desktop or laptop computer, using Microsoft Excel. You must complete the rest of the problems below inside a single Excel workbook, which you will submit with your shell script.

1.  Using FileZilla again, copy the following files from your VM to your local computer: alldata.csv, dates-all.csv, dates-fitness.csv, dates-starbucks.csv, dates-happy.csv, and the dates-<word>.csv file with the <word> you selected.

2.  You may also want to copy your shell script (data-<netID>.sh) to your local computer at this time, to prepare to submit it for a grade later.

3.  Open Microsoft Excel and create a new, blank workbook called data-<netID>.xlsx. For example, mine would be called data-pko6.xlsx. In this workbook, create two worksheets, one called “answers” and the other called “charts”.

Examining the data in Excel

One benefit of working with csv data is that you can double-click a csv file on your computer and it will load into a spreadsheet in Excel. Even better, Excel will recognize the commas as delimiters and use them to separate fields into columns.

Double click the alldata.csv file. Spend some time looking through the data, comparing what you see to your earlier review of this file in Linux.

The rest of the problems assume a general working knowledge of Excel, which I have been led to believe in class discussion you each already possess. I have also provided links on the class website to tutorial pages describing most of the skills you’ll need to use to complete the following. If after reading these, you still don’t know how to do some of the tasks described below, Google is your friend.

Problem 10. Use Excel’s “sort” functionality to sort various columns of alldata.csv, deriving answers to the following questions:

10a. Which user had posted the most tweets (u_statuses_count) from their account at the time they tweeted? Place that user’s screen name (u_screen_name) in cell A1 of the “answers” worksheet.

10b. Which users had the most and second most followers (u_followers) at the time they tweeted? Place the screen name (u_screen_name) for the user with the most followers in cell A2 of the “answers” worksheet and for the user with the second most followers in cell A3 of “answers”.

10c. Recall that these tweets are geolocated and restricted to people in or near the Georgetown Law campus. Which user was furthest north at the time they tweeted? Place that user’s screen name (u_screen_name) in cell A4 of the “answers” worksheet.

Excel Preprocessing of dates-*.csv files

Double click dates-all.csv and examine the data in Excel. Scroll all the way to the bottom of the data, and notice that the header field “1 created_at” appears last. Delete just this one cell.