DAI Data Cleanup Exercise

Appendix

Data Cleanup:

Data review and cleanup must be done before data analysis. MS Excel has data validation options that, if set up before data entry, can help to eliminate the amount of cleanup necessary.

This exercise will use an example dataset (NOT the actual data, but an example subset) adapted from the Leech Lake Ojibwe (thanks to Brandy Toft). The tribe worked over a period of time with different site operators and lab technicians, which resulted in a messy but fascinating dataset.

First, read a great MS article at

http://office.microsoft.com/en-us/excel/HA102218401033.aspx. (The examples in this link use Excel 2007.)

Topics Covered:

1. Initial Datasheet Cleaning (checking links, hidden values, finding duplicates, tracking data so none is lost)

2. Normalization (separating fields into separate fields, using data validation to limit entries to drop-down lists)

3. Documenting Your Clean Data

4. General Tips (macros, values-only, documentation, eliminating hidden characters)

1. Initial Datasheet Cleaning (checking links, hidden values, and duplicates)

1.a: Checking Links

Open the Dioxin 3-30-06 spreadsheet.

This spreadsheet contains links to other spreadsheets that are not available and so those links must be broken. Also, any reference to the outside data in the spreadsheet must be deleted.

You will see this box: (What you see may be different depending on your version of excel.)

Click on Update to see:

Click Edit Links and check to make sure they are referencing the correct sheets (in this example, all links are unknown to you, but in your own work, you may wish to keep the links so you would click Update Values), and click Break Links if that is appropriate. (In this example, it’s appropriate to break all links, but sometimes in the future you might keep some links if you have the other data. You may not be able to break all the links. You can also change the startup prompt if you want to keep the links and update them every time the file is opened, but be careful with automatic updating!)

Next, verify that there are not other links by:

Edit, Find, Options, Look In Formula. Type an opening square bracket (“[“) in the text box and click FindAll. Click on the Formula in the results area to select the cell where that formula is. This should find all the links that are in formulas. Delete or edit each link as appropriate. (Delete all links for this example. There are 5.) Close the dialog box.

Links may also be in names, so go to Insert, Name, Define, and go thru each name to see if it is linked to an external workbook. (There are 4 ghostName links that should be deleted in the Dioxin spreadsheet.)

FYI, links may also be in charts or even text boxes or graphics. The easiest way to find them is to select Edit, GoTo, Special… and select Objects. Right-click on the chart or object and select Edit Object, then click on various parts of the objects (series, title, axes, etc.) and look in the formula bar for what that refers to. Any external links (such as for the Series a, b and c in the GhostChart) are shown in the formula bar. (This is just FYI, you do not have to delete any formulas here.)

1.b: Checking for hidden cells

Next, make sure there are no hidden rows or columns by selecting Edit, GoTo, Special, Visible Cells only, and there will be a white line separating the blue areas of all visible cells, enabling you to see the hidden rows or columns. To unhide a row, column, or cell, select the area around it and go to Format, row column or cell, Unhide. (Column E is hidden in this example.) Alternatively, you can select the entire sheet (with the box in the upper-left corner, and go to Format, Column, Unhide and Format, Row, Unhide.

1.c: Check for duplicates:

Filter for unique records:

1 First, make a copy of the data by selecting all the data and copying it into a new sheet. (Use zoom-select to quickly select contiguous cells of data, by clicking on one corner of the data range, holding down ctrl-shift, and pressing the arrow keys to completely select the range. Name the new sheet “working data” and the original data with a name that you use for your original data. (You may also want to protect this sheet if you will be working with the entire workbook.)

1 Make sure AutoFilter is unchecked (off) (Data, Filter, AutoFilter). Click a cell anywhere in the range or list you want to filter.

1 On the Data menu, point to Filter, and then click Advanced Filter (excel should automatically select the entire range of $A$1:$D$141).

o  To filter the range or list in place, similar to using AutoFilter, click Filter the list, in-place (since you are working with a copy, this is safe.)

1 Select the Unique records only check box.

1 Compare the number of rows in the two tables. (Don’t be fooled by row #s! If rows are in blue then some rows have been filtered out. You can see where there are missing rows by a darker row delineator, in addition to the skipped row #s. The duplicate records were in rows 128-137.)

For the rest of the exercise, use the data without the duplicate records (Copy the 131 rows, including the headers. To copy only visible cells, select the cells shown, Copy, Paste Special, Values only; OR GoTo, Special, visible cells only, Copy, paste. Note, you must select a new location on the spreadsheet to paste the cells. Then select the columns you are no longer using (A-D) and Edit, Delete.)

1.d: Tracking data so none is lost

After you know there are no duplicate rows, you can start cleaning the data. Before editing the data in any way, assign each row in the spreadsheet an ID number. This way, if a row is deleted it is obvious from the gaps in ID numbers. Insert a column to the left of column A and input the number 1 into the cell next to the first row of data (A2). Highlight the cells that need to be numbered, click Edit|Fill|Series, input a step value of 1, and hit OK. In our example, the cells will now be numbered from 1 to 130.

1.e: Reviewing Data

First, use Autofilter to run thru the data and check for odd values. Select the header row, Data, Filter, Autofilter, and scroll thru the values in the drop down menus for each column. In this example, the only “odd values” are those with < in the value field, and with very high values. Don’t delete them—this is just an easy way to identify values that are different than the others.

2. Normalization

2.a Separating multiple headers out of the same column

Next, the date and the site IDs need to be separated into different columns. It looks like the date field is always the left-most 8 characters, but it must be kept in mind that we do not know for sure. Insert three columns to the right of the ID field. First, check if the ID field values always have the same number of characters. In column C, enter “length” as a header and enter in cell C2 the formula =LEN(B2). Copy this down the entire column, then use Autofilter to check the number of characters in the ID field. You can see that long IDs (some are as long as 24 characters) still have the date as the left-most 8 characters, but we still are not sure that this is ALWAYS the case.

Now it is time to parse out the date field. In column D, add the header “date.” Click in cell B2, and select Data, Text to Columns.

Click Fixed Width, Next, then slide the column break where you know you will capture all of the data (first break line at 8 on the scale, second break line at 24):

Click next, and tell Excel that you want the MDY format for date, and the destination for the new date is “=$D$2”:

You can see that it then puts the two parts of the ID into cells in columns D and E. Note that the auto-fill feature of dragging (or dbl-clicking) the fill handle does not work because this is a TEXT copying feature rather than a formula. (So if you auto-fill, all of the entries will be the same as the top entry.)

So, repeat the steps above for the text-to-columns wizard for the entire ID column, by selecting the data in column B and going to Data, Text-to-columns. Verify that it worked by looking at some of the longest ID fields and making sure the dates and site info is all there (note: you will have some fields (i.e. shipping blank) that do not have dates in them at all).

If this text-to-columns wizard is not appropriate for your situation, Excel has a number of functions that let you parse characters directly. For example, the LEFT(), RIGHT(), and MID() functions return a specified number of characters from a target. The difference between them is in where they begin counting, and in which direction. The LEFT() and RIGHT() functions pull the left-most and right-most n characters from a target, respectively. The MID() function starts at a point you specify within a target and returns the next n characters. For example, if we wanted to pull the area code information out of a phone number, we would enter the following formula into a new cell: =LEFT(F2, 3). This formula would then extract the left-most three characters from cell F2.

2.b Assign Unique IDs

Normalization means that all the sites called “C Chatfield” are called exactly that, and not sometimes “C Chat”. Obviously if there are different ways of recording site names, dates, etc. it is almost impossible to analyze the full set of data.

Label Column E (the new ID field) “ID2,” and set Autofilter on for all the headers. Look through the site IDs to see where there are IDs that need to be fixed. This can be done using Find and Replace, but it is important to do this carefully so that it does not further misrepresent the values, or lose information.

First, decide on the full list of site IDs. (Do this either from looking through the list of ID2s using Autofilter, or looking at which are unique as follows: copy all of the site IDs from just column E into a new sheet:

The new sheet is filled with just the list of site ID2s and is labeled ID2_list.

And this list can be filtered for unique fields only using Advanced Filter, and this will give you a starting point for deciding on ID2 names.

The unique ID2s are:

And then the complete list of ID2s that will be used are those shown in yellow:

Note that the ID2 in blue will have to be found in the original list of data, as there is a problem with that record. The date of 2005 in that ID was actually 20005, so find that cell and fix that date and ID in the full data list.

Now, the ID2s can be fixed in the full data list, using DATA VALIDATION. First, set up the list of ID2s that you want to use and copy it somewhere on the same sheet with the full set of data:

Then select the entire column of ID2s that need fixing, and go to Data, Validation, List,

and select those ID2s from the list that you want to use:

Which will give you a dropdown list when you click on any cell in the ID2s that you have set Data Validation for:

Here are two ways to check and correct each cell:

(1) Click on each cell, and select from the dropdown list, but be careful and do not click too fast. (If you have thousands of rows to fix, Find and Replace may be faster, but be sure to find and replace entire cell contents so that you do not end up with repeated ID2s in the same cell.

(2) Or, copy the “A STRG” value and then CTRL-Click selected all the A-variation cells and then pasted the value in that way, and again with the B STRG, C CHAT values. Also, instead of looking for all the values line by line, you can also look for some with the autofilter.

When you are finished, you can verify that all ID2s are normalized by using AutoFilter and looking at the selections in that column to make sure they match your original list:

Change the values in Values column with “lt” or “LT” or “less than” to < (whatever number, and one cell is just “less than”), so that all “less than” values are uniformly presented. This can be done by selecting the cells to change using Autofilter, and manually typing in the new values, or using Find/Replace (this replaces characters in only the visible cells). NOTE: one of the cells (that said just “less than”) did not have a number in it

3. Documenting Your Clean Data

3.a: Properties

When you save your file, consider using the file properties page, accessible via File | Properties. The Summary tab offers a number of useful fields for capturing information, including a large field for general comments. The Custom tab includes a number of specific fields, such as Project, Date Completed, and Checked by. Importantly, this information always travels with the file, so there is no risk associated with multiple pieces of documentation getting separated.