Math 365, Elementary Statistics
Lesson 1: The Language and Terminology
Introduction
| X(Donald Smith) = 3.25, | X(Sam Donaldson) = 3.11, |
| X(Karen Currie) = 3.89, | X(King Who) = 2.13 |
On the other hand, if GENDER is the "characteristic" that we are studying,
then Y = gender of a student is a variable. So, given a student, Y has
a value. For example:
| Y(Donald Smith) = Male, | Y(Sam Donaldson) = Male, |
| Y(Karen Currie) = Female , | Y(King Who) = Male |
If HEIGHT is the characteristic that we are studying, then Z = height of students is a variable.
To give another example, if credit hours completed is the characteristic studied, T = the number of course credit hours completed so far by a student is a variable.
Similarly, given any other characteristic like weight, annual income, annual expenditure, you can construct a variable for this population.
A variable that takes numerical values is called a quantitative variable. So, the variables X, Z, and T above are quantitative variables, while Y is not. A variable that takes non-numerical values is called a qualitative variable. So, the variable Y above is a qualitative variable. We will mostly be concerned with quantitative variables.
We discuss two types of quantitative variables: continuous and discrete variables. A quantitative variable that can assume any numerical value over an interval is called a continuous variable. Since Z above can (hypothetically) assume any value between 0 and 100 inches, Z is a continuous variable. T assumes only integer values and is therefore not a continuous variable.
A quantitative variable is called a discrete variable if there are breaks (gaps) between its successive possible values. A different way to understand a discrete variable is that its possible values can be written down (or counted) in a finite or infinite list; we say that the values of a discrete variable are countable.
If a variable assumes only a finite number of values, then it is also called a finite variable. Otherwise the variable is called an infinite variable. A finite variable is definitely a discrete variable. The variable T above is a discrete variable.
Definition 1. Given a set of data, any numerical value computed from the data using a formula or a rule is called a quantitative measure of the data.
Definition 2. A quantitative measure of population data is called a parameter. In other words, parameters belong to the whole population and are computed (if feasible) from the WHOLE population data. Examples: the average GPA of all KU students, the height of the tallest student in KU, the average income of the entire KU student population.
One way to study a population is to know some of the parameters of the population. Unfortunately, computing such parameters could be expensive or even impossible. Essentially, parameters are unknown and the main game of statistics is to try to estimate parameters on the basis of small samples collected from the population.
Definition 3. A quantitative measure of sample data is called a statistic. So, any numerical value that we compute from a sample is a statistic. We use these statistics to estimate the parameters of the population. For example, the average height computed from a sample is a reasonable estimate for the (parameter) average height of the KU student population. Obviously, we do not expect the value of the statistic to be exactly equal to the parameter value. Hopefully, the error will be small or will exceed our tolerable limit very rarely (say once in 100 trials).
Why do we need a statistic?
Sometimes it will be impossible to know the actual value of a parameter. For example, let μ be the mean length of the life of light bulbs produced by a company. In this case, the company cannot test all the bulbs it produces to find a mean length. So, the best it can do is to test a few bulbs, compute the sample mean length (a statistic) of the life of these bulbs and use it as an estimate for the mean length (parameter μ) of the life for all the bulbs it produces.
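The light bulb situation can be mimicked with a small simulation. The sketch below is only an illustration under made-up assumptions: the population of 10,000 bulb lifetimes is generated artificially (in practice it would be unknown), and the names `population`, `sample`, `x_bar` and the sample size of 50 are hypothetical choices, not part of the lesson.

```python
import random
from statistics import mean

random.seed(365)  # fixed seed so that the run is repeatable

# Hypothetical population: lifetimes (in hours) of 10,000 bulbs.
# In real life the company could not observe all of these values.
population = [random.gauss(1000, 100) for _ in range(10_000)]
mu = mean(population)            # the parameter (normally unknown)

# Test only a few bulbs: a simple random sample of size 50.
sample = random.sample(population, 50)
x_bar = mean(sample)             # the statistic, used to estimate mu

print(f"parameter mu     = {mu:.1f}")
print(f"statistic x_bar  = {x_bar:.1f}")
print(f"estimation error = {x_bar - mu:.1f}")
```

Changing the seed gives a different sample and therefore a different value of the statistic, while the parameter of this artificial population stays the same.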
Definition 4. The data that has not been processed or organized in any form is called raw data. When the data is arranged in an increasing or decreasing order, then it is called an array. The range of the data is the difference between the largest and the smallest value of the data.
range = highest value - lowest value.
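As a small illustration of these three terms, here is a sketch in Python. The five observations in `raw_data` are made up for the purpose of the example.

```python
# Hypothetical raw data: five observations in the order they were collected.
raw_data = [52, 47, 55, 49, 50]

# An array is the same data arranged in increasing (or decreasing) order.
array_increasing = sorted(raw_data)
array_decreasing = sorted(raw_data, reverse=True)

# range = highest value - lowest value
# (named data_range to avoid shadowing Python's built-in range)
data_range = max(raw_data) - min(raw_data)

print(array_increasing)   # [47, 49, 50, 52, 55]
print(array_decreasing)   # [55, 52, 50, 49, 47]
print(data_range)         # 8
```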
In this section we talk about representation of data organized in tabular form. Such a representation is called a frequency distribution. We are mostly concerned with numerical data (i.e., quantitative data), but also consider some non-numerical data (i.e., qualitative data).
Example. (from Khazanie, p. 18) The following is data on the blood group of 36 patients in a hospital:
| O | A | B | O | A | A | A | O | O |
| O | A | O | A | B | O | O | O | AB |
| B | A | A | O | O | A | A | O | AB |
| O | A | A | B | A | O | A | O | O |
We have four types of blood groups, namely, O, A, B, AB. Each of these blood groups may be referred to as a "class." The frequency of a class is defined as the number of data members that belong to that class. For example, the frequency of the class O is 16; the frequency of class A is 14. A table that lists the classes and the corresponding frequency is called the frequency distribution of this qualitative data. Following is the frequency distribution of this data:
| Blood Group | Frequency |
|---|---|
| O | 16 |
| A | 14 |
| B | 4 |
| AB | 2 |
| Total | 36 |
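The tally above can also be done by machine. The following is a minimal sketch using Python's `collections.Counter`; the variable names are my own, and the data is typed in row by row from the example.

```python
from collections import Counter

# Blood groups of the 36 patients, row by row as listed in the example.
blood_groups = (
    "O A B O A A A O O "
    "O A O A B O O O AB "
    "B A A O O A A O AB "
    "O A A B A O A O O"
).split()

# Counter tallies how many data members fall in each class.
frequency = Counter(blood_groups)

print(f"{'Blood Group':<12}{'Frequency':>9}")
for group in ["O", "A", "B", "AB"]:
    print(f"{group:<12}{frequency[group]:>9}")
print(f"{'Total':<12}{sum(frequency.values()):>9}")
```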
For the quantitative data, we consider two types of frequency table. When we are working with a large set of data we group that data into a few classes and construct a "frequency table," which we will discuss later. If the data set is small or if the number of values that appear in the data is small we need not group the data. Instead, we make a list of all the data members and give the corresponding frequency for each data member in a table. The number of times a data member (i.e., value) appears in the data is called the frequency of the data member. A list that presents the data members and the corresponding frequency in a tabular form is called a frequency table or frequency distribution. The relative frequency and percentage frequency of a data member x are defined as follows:
$$\text{relative frequency of } x = \frac{\text{frequency of } x}{\text{total number of data points}}$$

$$\text{percentage frequency of } x = \frac{\text{frequency of } x}{\text{total number of data points}} \cdot 100.$$
The frequency table may also contain the relative and percentage frequency. Since we did not group the data into a few classes, we call this the frequency distribution of the ungrouped data.
Example 1.2.1 To estimate the mean time taken to complete a three-mile drive by a race car, the race car did several time trials, and the following sample of times taken (in seconds) to complete the laps was collected:
| 50 | 48 | 49 | 46 | 54 | 53 | 52 | 51 | 47 | 56 | 52 | 51 |
| 51 | 53 | 50 | 49 | 48 | 54 | 53 | 51 | 52 | 54 | 54 | 53 |
| 55 | 48 | 51 | 50 | 52 | 49 | 51 | 53 | 55 | 54 | 50 |
Note that there are 35 observations here. So we say that the size of the sample (or data) is 35. Also the values present are 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56. Since there are only 11 distinct values present we can make a frequency table for the ungrouped data. The following is the frequency distribution of this ungrouped data:
| Time (in seconds) | Frequency | Relative Frequency | Percentage Frequency |
|---|---|---|---|
| 46 | 1 | 1/35 | 2.86 |
| 47 | 1 | 1/35 | 2.86 |
| 48 | 3 | 3/35 | 8.57 |
| 49 | 3 | 3/35 | 8.57 |
| 50 | 4 | 4/35 | 11.43 |
| 51 | 6 | 6/35 | 17.14 |
| 52 | 4 | 4/35 | 11.43 |
| 53 | 5 | 5/35 | 14.29 |
| 54 | 5 | 5/35 | 14.29 |
| 55 | 2 | 2/35 | 5.71 |
| 56 | 1 | 1/35 | 2.86 |
| Total | 35 | 1 | 100 |
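For readers who prefer to check such a table by computer, here is a sketch that rebuilds it from the raw lap times. It assumes nothing beyond the data of Example 1.2.1; the names `times`, `frequency`, and `table` are illustrative.

```python
from collections import Counter

# Lap times (in seconds) from Example 1.2.1, row by row.
times = [
    50, 48, 49, 46, 54, 53, 52, 51, 47, 56, 52, 51,
    51, 53, 50, 49, 48, 54, 53, 51, 52, 54, 54, 53,
    55, 48, 51, 50, 52, 49, 51, 53, 55, 54, 50,
]

n = len(times)                 # size of the sample
frequency = Counter(times)     # frequency of each distinct value

table = []
for x in sorted(frequency):
    f = frequency[x]
    relative = f"{f}/{n}"                  # frequency / total number of data points
    percentage = round(100 * f / n, 2)     # relative frequency times 100
    table.append((x, f, relative, percentage))

for x, f, relative, percentage in table:
    print(f"{x:>4} {f:>3} {relative:>6} {percentage:>6.2f}")
```

The relative frequency is kept as the unreduced fraction (for example 5/35 rather than 1/7) so that the output matches the layout of the table above.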
When we are working with a large set of data that has too many distinct values, we group the whole set of data into a few class intervals and give the corresponding "frequency" of each class. When the data is presented in this way, the data is called grouped data. The number of data members that fall in a class interval is called the class frequency, and the relative and percentage frequencies are computed by the same formulas as above. A list that gives the various class intervals and the corresponding class frequencies in a tabular form is called a class frequency table or class frequency distribution of the data. The frequency distribution may also include the relative and percentage frequencies.
Grouped Data and Loss of Information
Sometimes it is convenient or necessary to group data into class intervals and construct a class frequency distribution. This is the case when there are too many distinct numbers present in the data—too many even to fit into a simple table on a page for presentation. In such situations, we group the data in a few class intervals. While class frequency distribution is very good for presentation and convenient for other reasons, we lose a lot of information in this process. There is no way we can recover the original data from the class frequency distribution.
Given a set of data, a good question would be, How many class intervals should we have? The answer is that it should not be too few nor should it be too many. If we take too few (say one), then all the information will be lost. On the other hand, if we take too many, we will have the problem of having to work with ungrouped data. (In this course we will always tell you how many classes to take.) Although sometimes it may be necessary to take class intervals of varying width, in this course we only consider classes of equal class width.
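We choose a number L at or just below the lowest data value and a number H at or just above the highest data value, and set R = H - L. Then

$$\text{class width} = w = \frac{R}{\text{number of classes}},$$

and the class intervals are

$$[L, L+w],\ [L+w, L+2w],\ [L+2w, L+3w],\ \ldots,\ [H-w, H].$$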
Since this definition creates an ambiguous situation in which a
data value may fall into two classes, we need a convention to address
this situation.
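One possible convention, and the one adopted in the note to Exercise 1.2.3 later in this lesson, is to count a boundary value in the upper class. The sketch below shows how that rule could be coded; the function name `class_index` is hypothetical, and keeping the top boundary H in the last class is my own assumption, since there is no higher class to put it in.

```python
from bisect import bisect_right

def class_index(value, boundaries):
    """Return the 0-based index of the class interval containing value.

    boundaries is the sorted list [L, L+w, ..., H]. A value equal to an
    interior boundary is placed in the upper class (an assumed convention);
    the top boundary H is kept in the last class.
    """
    if not boundaries[0] <= value <= boundaries[-1]:
        raise ValueError(f"{value} lies outside [{boundaries[0]}, {boundaries[-1]}]")
    last_class = len(boundaries) - 2
    return min(bisect_right(boundaries, value) - 1, last_class)

boundaries = [60, 80, 100, 120, 140, 160]   # illustrative: L = 60, w = 20, H = 160
print(class_index(79, boundaries))    # 0, i.e. [60, 80]
print(class_index(80, boundaries))    # 1, i.e. [80, 100]
print(class_index(160, boundaries))   # 4, i.e. [140, 160]
```

Other conventions, such as shifting every boundary by half a unit, work equally well as long as each data value lands in exactly one class.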
A few more important definitions. The above intervals are called class intervals. The w above is called the class size or width. The lower end of the class is called lower limit and the upper end of the class is called upper limit. The class mark is the midpoint of the class, defined as follows:
$$\text{class mark} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}.$$
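As a quick check of this formula in the notation of the class intervals above, the class mark of the first class [L, L+w] is the point halfway across the class:

$$\text{class mark of } [L, L+w] = \frac{L + (L+w)}{2} = L + \frac{w}{2}.$$

For instance, a class with lower limit 60 and upper limit 80 (so w = 20) has class mark (60 + 80)/2 = 70.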
A class limit is also called a class boundary. I took a slightly different approach when I defined the classes, so that for us class limits and class boundaries are the same. Although all the approaches are essentially the same, many slightly different approaches are possible depending on the situation.
Example 1.2.2 The following is the weight (in ounces), at birth, of a certain number of babies.
| 74 | 105 | 124 | 110 | 119 | 137 | 96 | 110 | 120 | 115 | 140 |
| 65 | 135 | 123 | 129 | 72 | 121 | 117 | 96 | 107 | 80 | 91 |
| 74 | 123 | 124 | 124 | 134 | 78 | 138 | 106 | 130 | 97 | 145 |
| 93 | 133 | 128 | 96 | 126 | 124 | 125 | 127 | 62 | 127 | 92 |
| 95 | 118 | 126 | 94 | 127 | 121 | 117 | 124 | 93 | 135 | 156 |
| 143 | 125 | 120 | 147 | 138 | 72 | 119 | 89 | 81 | 113 | 91 |
| 133 | 127 | 138 | 122 | 110 | 113 | 100 | 115 | 110 | 135 | 141 |
| 97 | 127 | 120 | 110 | 107 | 111 | 126 | 132 | 120 | 108 | 148 |
| 143 | 103 | 92 | 124 | 150 | 86 | 121 | 98 | 74 | 85 | 99 |
We will construct a class frequency table of this data by dividing the whole range of data into class intervals.
Solution: Note that the lowest value is 62 and the highest value is 156. We take L = 60, H = 160, so R = H - L = 100. We made such a choice of L and H precisely so that R = 100 is a "nice" number. Now we decide to have 5 class intervals, and so w = R/5 = 20. According to what I said above, our classes should be: [60, 80], [80, 100], [100, 120], [120, 140], [140, 160]. But if we do so, then there is a risk that some data members (like 80, 100, 120, 140) will fall in two classes. One way to avoid this is to add .5 to all the class boundaries. So, our classes are [60.5, 80.5], [80.5, 100.5], [100.5, 120.5], [120.5, 140.5], [140.5, 160.5].
So the frequency distribution is as follows:
| Classes | Frequency | Relative Frequency | Percentage Frequency |
|---|---|---|---|
| 60.5 - 80.5 | 9 | 9/99 | 9.09 |
| 80.5 - 100.5 | 20 | 20/99 | 20.20 |
| 100.5 - 120.5 | 25 | 25/99 | 25.25 |
| 120.5 - 140.5 | 37 | 37/99 | 37.37 |
| 140.5 - 160.5 | 8 | 8/99 | 8.08 |
| Total | 99 | 1 | 100 |
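The class frequencies in this table can be re-derived from the raw data. The following sketch follows the solution above (L = 60, H = 160, five classes, every boundary shifted by 0.5) and also prints the class mark of each class; the variable names are my own.

```python
# Birth weights (in ounces) from Example 1.2.2, row by row.
weights = [
    74, 105, 124, 110, 119, 137, 96, 110, 120, 115, 140,
    65, 135, 123, 129, 72, 121, 117, 96, 107, 80, 91,
    74, 123, 124, 124, 134, 78, 138, 106, 130, 97, 145,
    93, 133, 128, 96, 126, 124, 125, 127, 62, 127, 92,
    95, 118, 126, 94, 127, 121, 117, 124, 93, 135, 156,
    143, 125, 120, 147, 138, 72, 119, 89, 81, 113, 91,
    133, 127, 138, 122, 110, 113, 100, 115, 110, 135, 141,
    97, 127, 120, 110, 107, 111, 126, 132, 120, 108, 148,
    143, 103, 92, 124, 150, 86, 121, 98, 74, 85, 99,
]

L, H, num_classes = 60, 160, 5
w = (H - L) // num_classes      # class width: R / number of classes = 20

# Shift every boundary by 0.5 so that no whole-number weight sits on a boundary.
boundaries = [L + 0.5 + i * w for i in range(num_classes + 1)]
classes = list(zip(boundaries[:-1], boundaries[1:]))

n = len(weights)
class_frequency = [sum(lo < x < hi for x in weights) for lo, hi in classes]

for (lo, hi), f in zip(classes, class_frequency):
    mark = (lo + hi) / 2        # class mark = midpoint of the class
    print(f"{lo:>5} - {hi:<5}  mark {mark:>5}  f = {f:>2}  {f}/{n}  {100 * f / n:.2f}%")
```

Because the boundaries end in .5 and the weights are whole numbers, the strict inequalities never exclude a data value.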
Another way to represent data is to use pictures and graphs. We see such pictorial representation in newspapers and other sources every day. Pictorial representation is particularly important when you have to represent data to people with limited technical background, like newspaper readers or a governmental or congressional body.
The pie chart is a commonly used pictorial representation of data.
When you do your tax return every year, you find a few pie charts in
the instruction book for form 1040. These charts show what proportion/percentage
of each tax dollar goes for particular expenses. I reproduced the following
pie charts from the 1040 instruction book of 1999.
Among pictorial representations, the most useful in this course is the histogram. The histogram of data is the graphical representation of the frequency distribution of the data, where we plot the variable on the horizontal axis and above each class interval, we erect a bar of the height equal to the frequency of the class. Such a histogram is called a frequency histogram.
If, instead, we erect bars of height equal to the relative frequency, then the graph is called a relative frequency histogram. Similarly, we can construct a percentage frequency histogram.
[Figure: an example of a histogram.]
We have decided to avoid unequal class lengths, which makes our discussion
of the histogram fairly simple.
Remark. Take a look at the Stem and Leaf Diagram discussed in any textbook.
Example 1.3.1. Following is the frequency table of data on height (in inches) of some babies at birth. Sketch the histogram of the following data:
| Height | Frequency |
|---|---|
| 16-17 | 3 |
| 17-18 | 8 |
| 18-19 | 34 |
| 19-20 | 60 |
| 20-21 | 72 |
| 21-22 | 18 |
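A rough sketch of this histogram can be produced even without graphing software. The code below is one way to do it in plain text, under my own assumptions: the bars are drawn sideways (so the variable runs down the page instead of along the horizontal axis), and each `#` stands for two babies. The function name `text_histogram` is hypothetical.

```python
# Frequency table from Example 1.3.1: (class interval, frequency).
height_table = [
    ("16-17", 3), ("17-18", 8), ("18-19", 34),
    ("19-20", 60), ("20-21", 72), ("21-22", 18),
]

def text_histogram(table, per_symbol=2):
    """Return one text bar per class; each '#' stands for per_symbol
    observations, rounded up so that no nonempty class shows an empty bar."""
    lines = []
    for label, f in table:
        bar = "#" * -(-f // per_symbol)     # ceiling division
        lines.append(f"{label:>6} | {bar} {f}")
    return lines

print("\n".join(text_histogram(height_table)))
```

Since all classes have the same width, the bar lengths are proportional to the class frequencies (up to the rounding), which is what a frequency histogram shows.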
For a given value x of a variable, the cumulative frequency of the data, for x, is the number of data members that are less than or equal to x.
Definition. Given a frequency distribution of some data, for a class boundary x, the cumulative frequency is the sum of the frequencies of all the classes that lie at or below x. The cumulative frequency distribution is a table that gives the cumulative frequencies against some x values (for us the class boundaries). We also define cumulative relative frequency and cumulative percentage frequency as follows:
$$\text{cumulative relative frequency of } x = \frac{\text{cumulative frequency of } x}{\text{total number of data points}}$$

$$\text{cumulative percentage frequency of } x = \frac{\text{cumulative frequency of } x}{\text{total number of data points}} \times 100$$
Example 1.3.2 Once again we consider the data on birth weight of babies in Example 1.2.2 that we discussed in the last section. A cumulative frequency distribution can be constructed from the frequency distribution.
Solution: We have seen the frequency distribution before. The following are the cumulative distributions:
| Weight | Cumulative Frequency | Cumulative Relative Frequency | Cumulative Percentage Frequency |
|---|---|---|---|
| 60.5 | 0 | 0 | 0 |
| 80.5 | 9 | 9/99 | 9.09 |
| 100.5 | 29 | 29/99 | 29.29 |
| 120.5 | 54 | 54/99 | 54.55 |
| 140.5 | 91 | 91/99 | 91.92 |
| 160.5 | 99 | 1 | 100 |
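The cumulative column is just a running total of the class frequencies, which the following sketch makes explicit. It starts from the class frequencies of the birth weight data; the variable names are illustrative.

```python
from itertools import accumulate

boundaries = [60.5, 80.5, 100.5, 120.5, 140.5, 160.5]
class_frequency = [9, 20, 25, 37, 8]     # from the class frequency table
n = sum(class_frequency)                 # 99 data points

# Nothing lies at or below the first boundary, so the running total starts at 0.
cumulative = [0] + list(accumulate(class_frequency))

for x, cf in zip(boundaries, cumulative):
    print(f"{x:>6}  {cf:>3}  {cf}/{n}  {100 * cf / n:.2f}")
```

The last cumulative frequency always equals the total number of data points, which is a handy check on the arithmetic.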
Definition. The ogive is a line graph, where we plot the variable on the horizontal axis and the cumulative frequency on the vertical axis. If we plot the cumulative relative frequency on the vertical axis, then the line graph is called the relative frequency ogive.
Because we will be using calculators (TI-83) extensively in this course, let me explain how you enter data in the TI-83.
Use of Calculators (TI-83):
Enter Your Data:
It is not easy to construct a frequency table of a data set unless you are systematic. Traditionally, we used "tally marks" to count the frequency. Now you can use some software programs (e.g., Excel). Let me show you a method, using a calculator (TI-83).
Exercise 1.2.1 To estimate the mean time taken to complete a
three-mile drive by a race car, the race car did several time trials,
and the following sample of times taken (in seconds) to complete the
laps was collected:
| 50 | 48 | 49 | 46 | 54 | 53 | 52 | 51 | 47 | 56 | 52 | 51 |
| 51 | 53 | 50 | 49 | 48 | 54 | 53 | 51 | 52 | 54 | 54 | 53 |
| 55 | 48 | 51 | 50 | 52 | 49 | 51 | 53 | 55 | 54 | 50 |
The following is the frequency distribution of this ungrouped data:
| Time (in seconds) | Frequency | Relative Frequency | Percentage Frequency |
|---|---|---|---|
| 46 | 1 | 1/35 | 2.86 |
| 47 | 1 | 1/35 | 2.86 |
| 48 | 3 | 3/35 | 8.57 |
| 49 | 3 | 3/35 | 8.57 |
| 50 | 4 | 4/35 | 11.43 |
| 51 | 6 | 6/35 | 17.14 |
| 52 | 4 | 4/35 | 11.43 |
| 53 | 5 | 5/35 | 14.29 |
| 54 | 5 | 5/35 | 14.29 |
| 55 | 2 | 2/35 | 5.71 |
| 56 | 1 | 1/35 | 2.86 |
| Total | 35 | 1 | 100 |
Construct a histogram.
Exercise 1.2.2. The following is the weight (in ounces), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.
| 94 | 105 | 124 | 110 | 119 | 137 | 96 | 110 | 120 | 115 | 119 |
| 104 | 135 | 123 | 129 | 72 | 121 | 117 | 96 | 107 | 80 | 80 |
| 96 | 123 | 124 | 124 | 134 | 78 | 138 | 106 | 130 | 97 | 134 |
| 111 | 133 | 128 | 96 | 126 | 124 | 125 | 127 | 62 | 127 | 96 |
| 116 | 118 | 126 | 94 | 127 | 121 | 117 | 124 | 93 | 135 | 112 |
| 120 | 125 | 120 | 147 | 138 | 72 | 119 | 89 | 81 | 113 | 100 |
| 109 | 127 | 138 | 122 | 110 | 113 | 100 | 115 | 110 | 135 | 120 |
| 97 | 127 | 120 | 110 | 107 | 111 | 126 | 132 | 120 | 108 | 148 |
| 133 | 103 | 92 | 124 | 150 | 86 | 121 | 98 |
Construct a class frequency table of this data by dividing the whole range of data into class intervals:
[60.5-70.5], [70.5-80.5], [80.5-90.5], [90.5-100.5], [100.5-110.5], [110.5-120.5], [120.5-130.5], [130.5-140.5], [140.5-150.5]
Exercise 1.2.3. The following are the lengths (in inches), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.
| 18 | 18.5 | 19 | 18.5 | 19 | 21 | 18 | 19 | 20 | 20.5 |
| 19 | 19 | 21.5 | 19.5 | 20 | 17 | 20 | 20 | 19 | 20.5 |
| 18 | 18.5 | 20 | 19.5 | 20.75 | 20 | 21 | 18 | 20.5 | 20 |
| 21 | 19 | 20.5 | 19 | 20 | 19.5 | 17.75 | 20 | 19.5 | 20 |
| 20.5 | 17 | 21 | 18.5 | 20 | 20 | 20 | 18.5 | 19.5 | 19 |
| 18 | 20.5 | 18 | 20 | 19 | 19 | 19.5 | 20 | 20.75 | 21 |
| 17.75 | 19 | 18 | 19 | 20 | 18.5 | 20 | 19 | 21 | 19 |
| 19.5 | 20 | 20 | 19 | 19.5 | 20 | 19.5 | 18.5 | 20.5 | 19.5 |
| 20.25 | 20 | 19.5 | 19.5 | 20 | 20 | 20 | 21 | 20 | 19 |
| 18.5 | 20.5 | 21.5 | 18 | 19.5 | 18 |
Construct a frequency table for this data by dividing the whole range into class intervals:
[16-17], [17-18], [18-19], [19-20], [20-21], [21-22].
Note: If a data member falls on the boundary, count it in the
right/upper class-interval.
Solution
Exercise 1.2.4. The following data represents the number of typos in a sample of 30 books published by some publisher.
| 156 | 159 | 162 | 160 | 156 | 162 |
| 159 | 160 | 156 | 156 | 160 | 162 |
| 156 | 159 | 162 | 156 | 162 | 158 |
| 160 | 158 | 159 | 162 | 158 | 158 |
| 162 | 160 | 159 | 162 | 162 | 160 |
Construct a frequency table (by sorting in your calculator). Also construct
a histogram.
Solution
Exercise 1.2.5. Following is data on the hourly wages (paid only in whole dollars) in an industry.
| 9 | 11 | 8 | 9 | 10 | 11 | 7 | 10 | 12 | 13 |
| 7 | 11 | 8 | 11 | 14 | 9 | 10 | 9 | 11 | 7 |
| 13 | 13 | 14 | 12 | 9 | 8 | 12 | 14 | 15 | 9 |
| 9 | 7 | 12 | 7 | 12 | 7 | 7 | 11 | 13 | 9 |
| 11 | 9 | 9 | 9 | 10 | 14 | 11 | 12 | 14 | 7 |
Construct a frequency table (by sorting in your calculator). Also construct
a histogram.
Solution
Exercise 1.2.6. Following is data on the hourly wages (paid only in whole dollars) of 99 employees in an industry.
| 7 | 11 | 7 | 11 | 10 | 9 | 10 | 10 | 12 | 13 |
| 7 | 8 | 11 | 11 | 14 | 9 | 7 | 9 | 11 | 7 |
| 9 | 13 | 12 | 14 | 7 | 8 | 7 | 14 | 15 | 9 |
| 9 | 7 | 11 | 9 | 12 | 9 | 12 | 11 | 14 | 9 |
| 12 | 13 | 7 | 9 | 10 | 14 | 11 | 12 | 13 | 7 |
| 15 | 15 | 16 | 16 | 15 | 16 | 11 | 7 | 18 | 19 |
| 15 | 16 | 15 | 15 | 16 | 16 | 17 | 16 | 16 | 13 |
| 15 | 15 | 16 | 15 | 16 | 15 | 15 | 17 | 16 | 12 |
| 16 | 15 | 15 | 16 | 15 | 15 | 19 | 8 | 16 | 17 |
| 16 | 16 | 15 | 16 | 16 | 16 | 13 | 12 | 8 |
Construct a frequency table (by sorting in your calculator).