Practical Work No.3

Goal: at theend the practical you will know:

  • To create and interpret charts using MS EXCEL
  • To calculate the Pearson correlation coefficient

1. Histogram

Histograms are among the most important graphs in statistics, because they are rich in information about the distribution of values for a numerical column.

Suppose we want to represent a histogram of age distribution for patients in the CIRRHOSIS file. For this, we proceed as follows:

Open the CIRRHOSIS.XLSfile, if unopened.

  • Scroll down on column D, which recordspatients’ ages, up to cell D237.
  • In this cell, enter the formula = min (d2: d235), and after pressing ENTER, the program will calculate the lowest age.
  • In cell D238, type = max (d2: d235) and after pressing ENTER, the program will calculate the greatest age.
  • In cells C237 and C238 write minimum and maximum value to remember what represent the values calculated by the program. Between 26 and 78 is a gap of 52 years, so we can put in 5 classes of 5 years (if we count from 10 to 10 there will result too few classes).
  • In cell D240 typethe words“Age group” and write down the D241 class boundaries (boundaries between classes): 25, 30, 35 etc.,up to 80.
  • Note that the limits of the classes are located from D241 to D252, because the program will ask about them.

Use Tools  Data Analysisand from the window that appears, choose Histogram. Then press the OK button.

In the window that appears write the values as shown in the figure on the left (without worrying about the $ signs that the program inserted in the text):

•D1: D235, the Input Range.

•D240: D252 for Bin Range

•Check the Labelsbox

•D260, for Output Range

•Check the Chart Outputbox

•Then press OK

The result is the displaying of the table in the figure below, left, and of the histogram (right), which is a small graphic and not very appealing. Modifythis chart is to look more aesthetic.

The first operation is resizing the chart. Drag the corners or the sides,from their middle points, which are marked by small squares of dark color (there are 8 such markers, four corners and four middle points for each side). If these markers are missing, it means that the chart is not selected. In this case, select a chart by clicking on its surface,on a white area. Resize the chart so it will occupy a large area, even if it will cover the table created previouly.

The second step is to modify the font for the text son the chart area. Double click on each text in the chart, including the chart title. Choose a font from the left window, e.g. Arial font, Bold style, size 12.

Every time you make a choice, press the OK button.

The third step is to change the text displayed as Chart or Axes titles.

Click once on the chart title, after selecting it, and write the words“Age Histogram”. Just click the word “Frequency”, and type “Absolute frequency”. Then click on the Legend on the left side and delete it by pressing the Delete key.

If you managed to run these operations, the chart should look like at this time as the figure below.

The fourth step is to adjust the thickness of the bars. Double click on a column, and from the Format Data Series window that appears, select the last option (right) on the menu, Options, and change the Gap Width box, to a figure lower than the one displayed by the program (see the right image above). It is advisable to choose 0 (as it is customary for histograms). Then press the OK button

The fifth step is coloring the bars. Doubleclick on one of the columns, and from Format Data Series window, choose, if not already selected, the Patterns option (first from the left, see the left image above). Choose one of the colors (preferably a more vivid color, blue, red, orange, etc.). Then, click Fill Effects ... From the window that appears (see picture above right), choose as shown, One Color and Vertical. Then press the OK button. You will notice that the chart bars are colored with a gradient (gradient color), based on color chosen. You can repeat this step to exercise and to choose the most beautiful combination. When you change the buttons repeat press Fill Effectswindow.

The sixth and last step is the coloring of the chart area. The rectangle in which the chartaxesand columnsare drawnis called the chart area. The area beyond the edges of this rectangleis called the text area. Both the graphics and text area are colored the same way you colored bars, first by double clicking them, then selecting the appropriate options and colors. It is advisable to choose pastel colors, so that the entire graphic area will not have an overly dark look.

If a histogram is modified for printing black and white is better to choose options for shading and fine shades and not colors. Practice this way of filling, for example, doubleclick the chart and bars, press Fill Effects, then choose Patterns.

The most important thing is the interpretation of the chart; for the histogram, however, thisusually is a very simple problem, because we only have to notice if the chart bars are displayed symmetrically around the center value (symmetric distribution), or there is a tendency of asymmetry, more or less severe (left asymmetry or right asymmetry). It is sometimes possible to have histograms with two peaks - in this case we say that we have a bimodal distribution. For the histogram above, we see that it looks relatively symmetric, so the patients’ ages were distributed almost as a Gaussian curve.

2. Linear trend chart

It is one of the simplestcharts,both as creation and interpretation. It is used when we have a table like the one below, where we describe a parameter’s evolution over a period of time, measured daily or at other intervals.

In the table below, each line records a patient's temperature, measured over two weeks, denoted by T1, T2, ... to T14. On the row 2, patient B1’s temperatures are recorded, B2 is row 3 and B3 is row 4. We can represent all three evolution lines at the same time, and even more patients could be used, but is not recommended to display more than 2-4 lines.

To create a linear chart, the first step is data entry.

Enter three lines of 14 temperatureseach, as shown above, as if you’d measured the temperatures of three patients for 14 days. It is not mandatory to use the exact temperaturesfrom the image above, but it is good to use plausible values.

Select from cell A1 to cell O4. After selecting the cells as described above, either use Insert –Chart or click on the “Chart Wizard” icon on the toolbar.

A new window will appear (the left picture above), which is the first step in collecting information about the task that we want to perform. Notice that the top of the window, it is written that is the first of four steps (Step 1 of 4).

  • Select the Line chart type (selected in the figure at left), and then select the subtype shown on a black background.
  • Finally, click the Next button to proceed to step two. From the window of the second step nothing should be selected or modified, so just press the Next button to proceed to step three.
  • The third step involves more information that you need to input to create thechart. They are divided into sections.
  • In the right picture above, which shows the window for the third step, first use the “Titles” tab. Fill in the text, as follows:

In the first textbox, “Chart Title”,write the text "The temperature evolution of the patients" or something similar - although the text is longer than the box, the program will displayit correctly.

In the second textbox, “Category (X) axis”, write the word “Day”.

In the third textbox, “Value (Y) axis”, write the word “Temperature”.

Then, click on the Gridlines section of the window and select the checkboxes as in the left figure below.

Press the Next button to proceed to the fourth step.

•For the fourth step just press the Finish button.

•The effect is the displaya chart similar to the left picture above.

For processing, do as you did in the histogram. Adjust the size chart, but take care not to exceed the screen limits, reduce the text size, except the title, which is better to remain as is, etc.

A particularly pleasing aspect of this chart, it appears that the following steps:

Put the mouse cursor, even on one of the three lines, and double-click. It appears the left window, you choose as follows: At Color dialog box, choose red or another color hard. The Weight box, choose the thickest model line. On the right, in Style, select circle. The Foreground box, choose the color of brown (to show the right in red) and the Background box, choose yellow, which matches the red and brown. The Size box, choose 9 or 10 or even 11, and the Shadow, leave unchecked. Then click OK. The graph is now a much changed in appearance of lines. Do analogous to other lines, obviously choosing other colors or markers.

3. Column/ bar charts

To do an exercise, do the following: enter data as shown in the nextimage, to create a table that classifies patients by sex and age group.

This kind of vertical bar graph represents the absolute frequency of occurrence of cells of a table (number of patients enrolled in each cell). It has exactly as many rows and columns of bar lines, each table has rows and columns.

Select the table as shown in the figure. (Note, it is very important to make the selection exactly as shown, i.e. not select more or less rows or columns. In the figure the selection isfrom C2 to H4). Then, click the icon for graphics wizard, which displays the window above, left.

In the first step, choose the type of graph Column (graphic columns) subtype which shows that is marked on a black background. Then, clickNext. In the window at step two is not do any option but press the Next button. The effect is the appearance of the window to the third step, which is shown in the picture above, right. Write titles that you see in the figure (note the title is longer than the schedule includes box, but I write to put in evidence at the chart). Then, click Next.

In the window of the fourth step, do no option but just press Finish. The result is the emergence of a schedule they still adjust the size by pulling the edges, reduce the text to look like one of the figures below.

If the chart look does not satisfy you, it can be modified after it is made.

Perform a right mouse click on the blankarea of the chart, where the white color is chosen for the background. From the list that appears (see above, left), choose Chart Type, and the window that you have used it, you choose as you see in the picture, then press the OK button.

4. Scatter plots charts

This chart is used only if we have two columns of values, which are neighbors in the table, and contain only numeric values. It is one of the most informative charts (ie containing a significant amount of information), but his interpretation is more difficult.

To create a scatter chart, first the columns with numerical values should be selected. In the CIRRHOSIS.XLS file, the columns VSH1H and VSH2H recorded values for erythrocyte sedimentation rate per hour or two hours.

First open the CIRRHOSIS table,and then look for the columns that record the erythrocytes sedimentation rates values, which are called VSH1H andVSH2H. Then select the values from both columns. Select the cellsfrom AC1 toAD235, the last recorded data, as shown in figures on the left, where are the beginning (left) and end of selected area (right).

Then, click the Wizard icon in the window for graphics and a pick up as shown above by executing one click XY (Scatter) and the model points without lines you see in the picture is selected. Then press the Next button. In step two just press the Next button. In Step 3, write in the three dialog boxes, chart title and the names of two columns. The title is "The correlation between sedimentation rates per one hour and two hours. In the other two boxes write VSH1H and VSH2H.

Then press the Next button. In step 4 just press the Finish button.

For the chart that appears, adjust the size, reduce the text size and color it as desired, as you have done for thehistogram chart.

It is noted that the values of the two columns are correlated, because the points chart are arranged in a long and narrow cloud (with few exceptions). If the points are placed randomly, the values of the two columns are not correlated.

Regarding the scatter plot chart , you willobserve three different types of information they can give us:

•if the two sets of values are correlated (correlation).

•if the sample on which measurements were made is homogenous or heterogeneous (homogeneity).

•if the values of the two sets are distributed symmetricallyor not (symmetry).

Example:

1. The chart above shows that between serum sodium and serum potassium values in patients with hepato-renal disease there is no correlation, because the points are distributed about by chance. No clear trend can be seen as the points chart be placed in a certain way, except that the natural tendency to be higher density in the center, a trend which is apparent in general, most data in medicine and biology and not only. On the other hand, the trend points to sit in a single cloud, shows homogeneity range of values.

2. The chart above shows that between total bilirubin and direct bilirubin in patients with cirrhosis and liver cancer is a strong correlation, because the data are distributed approximately along a straight line. Also, the sample is homogeneous, as in a chart, which can be seen in the cloud of points that tends not to be divided into two different clouds. On the other hand we see that the cloud is much denser in the lower left. So, total bilirubin values of the series (horizontal), are distributed asymmetrically (more values left = right asymmetry). Similarly, direct bilirubin values of the series (vertical) are denser down to small values, so they are distributed throughout the asymmetric (right asymmetry).

3. In this case, it is noted that between thyroid hormones T4 and T3 values of patients with different forms of tyroidism there is a correlation, the data being distributed all along the line, even if it is discontinuous. Also, it appears that their distribution is grouped: one group with small values for the two hormones, indicating hypotyroidism, and a group with higher values indicating hypertyroidism, with a clear tendency to separate the data into two different clouds. We say in this case that the sample is heterogeneous. If the data forms only one cloud, it is said to be homogeneous.

Note. Charts of this type are the best for observing,at the same time, three types of information:correlation, homogeneity and symmetry.

Some give us more useful information on correlations, others tell us better the homogeneity and symmetry, so there can always be obtained from a single plot for all three types of information as easily. The assessments are always subjective and depend on the user skills.

5. Correlations

To check the correlation between two columns of numeric data, use the Correlation function. For this it is necessary that the two columns to find data in the table side by side. If not, they should be moved to another worksheet, so that they are neighbors.

For now, let us deal with the simple case when we have neighboring columns. For example, in table CIRRHOSIS, the VSH1H and VSH2H(i.e. erythrocyte sedimentation rate per one hour and per two hours) columns are one next to the other – data is recorded in the AC and ADcolumns. Values, for the first column,are fromAC2 to AC235 and from AD2 to AD235for the AD column.

To calculate Pearson's correlation coefficient between two sets of numerical data proceed as follows:

Open CHIRROSIS.XLS, usingFile-> Open. Select the Tools menu. From the list that opens, choose Data Analysis. From the window that opens, select Correlation.

In Input Range dialog box, type AC1: AD235. Check the box Labels First Row. If not checked, check the Output Range option button and type AC240in the appropriate dialog box. Finally, click OK.

The result is the display of a table in three columns and three lines, the correlation coefficient being sought AD242 cell. It is noted that the values of the two columns are closely related because the correlation coefficient is 0.89