QU1 Lecture Notes Module 1
Module 1: Data description and presentation
Statistics:
The science of collecting, organizing, presenting, analyzing, and interpreting numerical data for the purpose of making more effective decisions. A way to turn data into useful information.
· A statistic is a number or fact used to describe an event, outcome, or group of data. Examples:
a) Today’s closing TSX Group Inc. stock price
b) Financial management (ex. capital budgeting, capital structure, working capital management, stock & bond valuation)
c) Accounting – collects, organizes, and provides information about a company’s activities used by internal and external people to make decisions. Managerial accounting gives info for internal use (managers make decisions regarding planning and control – ex. budgets). Financial accounting involves financial statements for external use, which require audits (ex. auditors must select samples of invoices to assess validity of accounts receivable).
d) Your grade point average
e) Sports statistics (ex. GAA for goalies or batting average in baseball)
f) Census data about the Canadian population
g) Survey results (ex. “exit polls” to predict election voting preferences)
h) Outcomes of an experiment or research
Descriptive Statistics (See Modules 1 & 2):
· Organizing, summarizing, and presenting data. Ex. population mean, pie charts, & frequency distributions
Inferential Statistics (See Modules 4 to 8):
· Generalize from a sample to a population. Drawing conclusions about a large body of data (population = all values or objects under consideration) from the analysis of a small subset of that data (sample = a subset of the population).
· Measures calculated from population data are called parameters; measures calculated from sample data are called statistics.
· Why use samples?
1. Population is too large, so data on entire population may not be available (ex. Census)
2. Gathering population data is too time consuming or expensive
3. Collecting data may damage the objects (ex. quality control testing for life of light bulbs)
4. Random, representative samples provide reliable estimates of population parameters
· Sampling error is the difference between the sample statistic and the actual population parameter. It arises by chance (randomly) because not all objects in the population are analyzed.
· Estimates will not always be correct, so reliability is measured by:
a) Confidence level: The proportion of times that the estimate will be correct.
b) Significance level: The proportion of times that the estimate will be wrong.
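The difference between a parameter, a statistic, and the sampling error can be illustrated with a short Python sketch. The population below is made-up data (simulated light bulb lifetimes), and the name `sampling_error` is a hypothetical helper introduced only for this illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: lifetimes (hours) of 1,000 light bulbs (made-up data).
population = [random.gauss(1000, 50) for _ in range(1000)]

def sampling_error(sample, population):
    """Sample statistic minus population parameter (both are means here)."""
    return mean(sample) - mean(population)

# Simple random sample of n = 30 bulbs, drawn without replacement.
sample = random.sample(population, 30)

mu = mean(population)   # parameter: calculated from the whole population
x_bar = mean(sample)    # statistic: calculated from the sample only
error = sampling_error(sample, population)

print(f"Population mean (parameter): {mu:.2f}")
print(f"Sample mean (statistic):     {x_bar:.2f}")
print(f"Sampling error:              {error:.2f}")
```

Drawing a different random sample would give a different statistic and a different sampling error, while the parameter stays fixed. If the "sample" is the entire population, the sampling error is zero.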
Types of Data (levels of measurement):
· Variable: A characteristic of a population or sample. Values are obtained for each observation.
· Qualitative (categorical data): Data results from counting.
1. Nominal: Lowest level, so few calculations can be done. Can only be classified into categories (NO order). The categories must be mutually exclusive (an observation can only appear in one category) and exhaustive (each observation must appear in a category).
2. Ordinal: Categories are ranked, so one is higher than another. Ex. military rank.
· Quantitative (numerical data): Data results from measuring. 3 & 4 are considered the same in CGA notes.
3. Interval: Characteristics of Nominal & Ordinal, but the distance between values is a constant size. Can be measured objectively, but zero point is arbitrary. Ex. temperature or test mark.
4. Ratio: Highest level (most specific and descriptive), so we can legitimately use all operations of math and statistics. All requirements of Interval scale met, but the zero point is meaningful and the ratio of two numbers is meaningful (ex. money).
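The difference between the interval and ratio scales can be checked with a little arithmetic. The sketch below uses made-up values: two temperatures (interval scale, arbitrary zero) and two amounts of money (ratio scale, meaningful zero). The function name `c_to_f` is a hypothetical helper.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval scale: the "ratio" of two temperatures depends on the unit chosen,
# because the zero point is arbitrary.
temp_ratio_c = 20 / 10                   # 20 C vs. 10 C
temp_ratio_f = c_to_f(20) / c_to_f(10)   # the same two temperatures in Fahrenheit

# Ratio scale: the ratio of two amounts of money is the same in any unit,
# because zero dollars and zero cents both mean "no money".
dollars_ratio = 20 / 10                  # $20 vs. $10
cents_ratio = (20 * 100) / (10 * 100)    # the same two amounts in cents

print("Temperature ratio in Celsius:   ", temp_ratio_c)
print("Temperature ratio in Fahrenheit:", temp_ratio_f)
print("Money ratio in dollars:         ", dollars_ratio)
print("Money ratio in cents:           ", cents_ratio)
```

The temperature ratio changes with the unit, so a statement like "20 degrees is twice as hot as 10 degrees" is not meaningful. The money ratio is the same in both units, so "$20 is twice as much as $10" is meaningful.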
Graphing techniques for quantitative data:
· Array: Order all observations from smallest to largest.
· Frequency distribution: Table that groups data into non-overlapping consistent intervals called classes and shows the number of observations in each class.
· Class frequency: The number of observations in each class.
· Stated class limits report the boundary of each class.
· Class midpoint: The center of the class, found by averaging the stated lower limit of the class and the stated lower limit of the next class.
· Modal class: The class with the largest number of observations (unimodal vs. bimodal)
· Number of class intervals (k): the smallest k such that 2^k ≥ n, where n = # of observations
· Class width = (Largest observation – Smallest observation) / Number of classes
· Histogram: Graphical representation of the frequency distribution. Use observed frequency in each class on the Y-axis (vertical) and class widths & limits on the X-axis (horizontal), so that the class frequency is represented by the height of the bar.
· Shapes of histograms:
1. Symmetrical: if cut in half, the 2 sides have identical shapes to the left and right (ex. bell shape)
2. Positively skewed: lower frequencies on right hand side (tail to the right)
3. Negatively skewed: lower frequencies on left (tail to the left)
· A line chart can represent a cumulative frequency distribution graphically. A “less-than” cumulative frequency polygon (ogive) shows the number or percent of observations less than the upper limit of each class (with points at the upper class limits). Similarly, a “more-than” cumulative frequency polygon shows the number or percent of observations greater than the lower limit of each class.
· Relative frequency distribution: % in each class. Divide the # of observations in each class by the total #.
· Stem & leaf display: A combination of graphing and sorting (NOT examinable).
· Simple line charts are useful for showing the trend of a variable over time.
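The steps for building a frequency distribution can be sketched in Python. The 20 test marks below are made-up data, and the function names are hypothetical. The sketch assumes each class includes its lower limit and excludes its upper limit, and that the suggested class width (8.4 here) is rounded up by hand to a convenient value (10, starting at 50).

```python
def suggested_classes(n):
    """Smallest k such that 2**k >= n."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k

def suggested_width(data, k):
    """(Largest observation - smallest observation) / number of classes."""
    return (max(data) - min(data)) / k

def frequency_table(data, start, width, k):
    """Build k classes of equal width; each class is lower <= x < upper."""
    rows = []
    cumulative = 0
    for i in range(k):
        lower = start + i * width
        upper = lower + width          # also the lower limit of the next class
        freq = sum(1 for x in data if lower <= x < upper)
        cumulative += freq             # "less-than" cumulative frequency
        rows.append({
            "lower": lower,
            "upper": upper,
            "midpoint": (lower + upper) / 2,
            "freq": freq,
            "rel": freq / len(data),   # relative frequency
            "cum": cumulative,
        })
    return rows

# Hypothetical data: 20 test marks.
marks = [52, 55, 58, 61, 63, 64, 66, 67, 68, 70,
         71, 72, 73, 75, 77, 79, 82, 85, 88, 94]

k = suggested_classes(len(marks))
print("Suggested number of classes:", k)
print("Suggested class width:", suggested_width(marks, k))

# Assumed convenient choices: start at 50 with a class width of 10.
table = frequency_table(marks, start=50, width=10, k=k)
for row in table:
    print(f"{row['lower']:>3} up to {row['upper']:<3}  midpoint {row['midpoint']:>5}"
          f"  freq {row['freq']:>2}  rel {row['rel']:.2f}  cum {row['cum']:>2}")

modal = max(table, key=lambda row: row["freq"])
print("Modal class:", modal["lower"], "up to", modal["upper"])
```

The `freq` column gives the bar heights for a histogram, and the `cum` column plotted against the upper class limits gives the points of a "less-than" ogive.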
Charts for qualitative data (usually nominal):
· Pie chart: Count the number of observations in each category, then determine the proportion of each category and put into “slices” on the pie (percent each category represents of the total).
· Bar chart: Similar to histograms, except put spaces between the bars. Categories go on X-axis.
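The counting behind a pie chart can be sketched as follows. The payment-method data are made up, `pie_slices` is a hypothetical helper, and the slice angle (percent of 360 degrees) is added here only to show how a proportion translates into a slice.

```python
from collections import Counter

def pie_slices(observations):
    """Count each category, then convert counts to percents and slice angles."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {
        category: {
            "count": count,
            "percent": 100 * count / total,
            "degrees": 360 * count / total,
        }
        for category, count in counts.items()
    }

# Hypothetical nominal data: payment method used by 20 customers.
payments = ["cash"] * 5 + ["debit"] * 10 + ["credit"] * 5

slices = pie_slices(payments)
for category, info in slices.items():
    print(f"{category:<7} count {info['count']:>2}  "
          f"{info['percent']:.0f}%  {info['degrees']:.0f} degrees")
```

The same counts can be used as bar heights in a bar chart, with the categories on the X-axis.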
Graphically describing the relationship between 2 interval variables (bivariate):
· Scatter diagram: For paired observations, plot the independent variable on the X-axis and dependent variable on the Y-axis. Linear and non-linear relationships become visible (ex. positive linear relationship means that one variable increases in step with the other).
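A rough character-grid version of a scatter diagram can be sketched without any plotting library. The paired observations are made up, and `text_scatter` is a hypothetical helper that only approximates positions on a small grid; a real chart would normally be drawn with a spreadsheet or plotting tool.

```python
def text_scatter(xs, ys, width=20, height=8):
    """Return the lines of a rough character-grid scatter diagram."""
    if len(xs) != len(ys):
        raise ValueError("observations must be paired")
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1   # avoid dividing by zero
    y_span = (max(ys) - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 of the grid is the top
    lines = ["|" + "".join(cells) for cells in grid]   # Y-axis on the left
    lines.append("+" + "-" * width)                     # X-axis at the bottom
    return lines

# Hypothetical paired observations: independent variable x, dependent variable y.
x_values = [1, 2, 3, 4, 5, 6]
y_values = [3, 5, 6, 8, 11, 12]

for line in text_scatter(x_values, y_values):
    print(line)
```

With these values the points rise from the lower left to the upper right, which is the pattern of a positive relationship.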
Graphically describing the relationship between 2 nominal variables (bivariate):
· Contingency (cross-classification) table: Lists frequency of each combination of the values of the 2 variables. Then do bar charts of the contingency table for each category of the independent variable.
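A contingency table is a count of each combination of values, which can be sketched as below. The region and brand data are made up, and `contingency_table` is a hypothetical helper that takes a list of paired observations.

```python
from collections import Counter

def contingency_table(pairs):
    """Count each (row value, column value) combination in a list of pairs."""
    counts = Counter(pairs)
    row_values = sorted({r for r, _ in pairs})
    col_values = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in col_values} for r in row_values}

# Hypothetical nominal data: (region, preferred brand) for 7 customers.
observations = [
    ("East", "A"), ("East", "A"), ("East", "B"),
    ("West", "B"), ("West", "B"), ("West", "A"), ("West", "B"),
]

table_2way = contingency_table(observations)
for region, brand_counts in table_2way.items():
    print(region, brand_counts)
```

Each row of the table (one per category of the independent variable, region in this made-up example) supplies the bar heights for one bar chart.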
Graphical excellence:
a) Present large data sets concisely and coherently
b) Ideas and concepts are clearly understood by the reader
c) Encourage viewer to compare two or more variables
d) Substance of data is more important than form of the graph (ex. caption)
e) No distortion of what the data reveals (no graphical deception such as missing or break in scale of axis)
12/02/10