| Satyagopal Mandal |
| Department of Mathematics |
| Office: 624 Snow Hall Phone: 785-864-5180 |
Statistics is a science that develops and formulates techniques in order to make inferences about a large population by studying a small sample.
What is Data?
When information is packaged in numerical form it is called DATA.
According to your text, statistics is the science of dealing with data. This includes collecting, organizing, understanding and interpreting data.
What is the Population?
In statistics we try to understand or make inferences or projections about a group of similar objects. Such a collection of individuals or objects that is under study is called the POPULATION.
Example.1. If we are studying the income distribution of Americans the population is the American population.
Example.2. If we are studying the income distribution of the immigrant American population then the population is the immigrant American population.
Example.3. If we are studying the growth of the fish population in Clinton Lake then the population is the fish population in Clinton Lake.
Example.4. If we are studying the African elephants then the population is the population of African elephants.
(Please, have a look at Example 1-2, page 426).
The N-VALUE: The total number of members in the population under study is called the N-value of the population. If an accurate head count of all the members in the population were possible then we would know the N-value of the population. Often such a head count will not be possible. In that case this N-value will not be known.
(Please, have a look at Example 3-5, page 426).
Census
Article 1, Section 2 of the Constitution of the United States mandates that a national census be conducted every 10 years. By census we mean an official enumeration of the population. The United States is not alone in this: many other countries also conduct a census, often every 10 years. Following are a few comments about the census:
Surveys
A more realistic and economical alternative to a census is to collect data only from a small subgroup and then use this data to make inferences about the whole population. This approach is called a survey, and the subgroup of the population from which the data is collected is called a sample.
The basic idea behind a survey is that if we can find a sample that is "representative" of the whole population (that means it is not biased) then what we need to know about the population can be estimated from the sample.
(Please read more about surveys in your text, pages 429-430.)
Public Opinion Polls
We all know about public opinion polls: the Gallup poll, the Harris poll and more. Please read more about public opinion polls in your text (pages 430-434). In particular, the text discusses how and why the predictions made by various opinion polls in the presidential elections of 1936 (Franklin Roosevelt vs. Alfred Landon) and 1948 (Harry Truman vs. Thomas Dewey) went wrong.
Sampling Methods
Picking a "representative sample" is a real challenge for a statistician. If a statistician picks the sample members by personal judgment, his/her human bias is almost bound to result in a "biased sample". Whatever method we use to pick a sample, the selection of the sample members must be done randomly. That means that mathematics and methods of chance must guide the selection of sample members. A sample picked in such a manner is called a random sample and the method is called random sampling.
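As a small illustration of letting chance, rather than a person, pick the sample, here is a minimal Python sketch. The population of 10,000 ID numbers and the sample size of 50 are made up for the example; they are assumptions, not figures from the text.

```python
import random

# Hypothetical population: ID numbers 1..10000 standing in for individuals.
population = list(range(1, 10001))

# A fixed seed is used only so that the example gives the same result each run.
rng = random.Random(2024)

# sample() draws without replacement, so no member is picked twice and
# every member has the same chance of being selected.
sample = rng.sample(population, 50)

print(len(sample), "members selected; first five:", sample[:5])
```

No person decides who gets into the sample here; the random number generator does.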
Another important concern regarding sampling is the cost of sampling. There are two methods of random sampling that we shall talk about here.
First divide the population into categories, called strata, and randomly select a sample of these strata. The chosen strata are then further divided into categories, called substrata, and a random sample of substrata is selected from each of the chosen strata. The process is repeated a number of times.
(Please read more about stratified sampling in your text, pages 436-437.)
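The layered selection described above can be sketched in Python as follows. This is only an illustration under assumed data: the "states" (strata), "counties" (substrata), the people in them, and the function name multistage_sample are all hypothetical.

```python
import random

# Hypothetical population organized as strata -> substrata -> members.
population = {
    f"state_{s}": {
        f"county_{s}_{c}": [f"person_{s}_{c}_{p}" for p in range(20)]
        for c in range(5)
    }
    for s in range(10)
}

def multistage_sample(pop, n_strata, n_substrata, n_members, rng):
    """Randomly pick strata, then substrata within each, then members within each."""
    chosen = []
    for stratum in rng.sample(sorted(pop), n_strata):
        substrata = pop[stratum]
        for sub in rng.sample(sorted(substrata), n_substrata):
            chosen.extend(rng.sample(substrata[sub], n_members))
    return chosen

rng = random.Random(0)  # fixed seed only for reproducibility
sample = multistage_sample(population, n_strata=3, n_substrata=2, n_members=4, rng=rng)
print(len(sample))  # 3 strata x 2 substrata x 4 members = 24
```

Only the chosen strata and substrata ever need to be visited, which is what keeps the cost of sampling down.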
Sample Size
The sample size for a large population need not be very large. In practice, it is often less than 1500. If you follow CNN polls or others, they normally sample 700-1200.
Sampling: Terminology and Key Concepts
The job of a statistician is to make inferences about a large population on the basis of a (small) sample.
3) Unless the population is small, the actual value of a parameter will never be known. On the other hand, since the samples are small we can always compute the actual values of the statistics.
The game here is to estimate the parameters by appropriate statistics.
Example. Suppose we want to understand the income distribution of the US population and we want to know the average income of the US population.
Here average US income is a parameter.
Since it will be almost impossible to compute the actual value of the average US income, we take a sample (say of size 1500) and compute the average income of the sample members.
This sample average is a statistic.
It is reasonable to use this (statistic) sample average income as an estimate for the (parameter) average US income.
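To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The "population" is synthetic: 200,000 made-up incomes drawn from a lognormal distribution, which is an assumption chosen only for illustration and not real US income data. Because the population is simulated, we can compute the parameter here, which would normally be impossible.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only for reproducibility

# Synthetic population of incomes (made-up numbers, not real data).
population = [rng.lognormvariate(10.5, 0.8) for _ in range(200_000)]

# Parameter: the average over the whole population (normally unknown).
parameter = statistics.fmean(population)

# Statistic: the average over a random sample of size 1500 (always computable).
sample = rng.sample(population, 1500)
statistic = statistics.fmean(sample)

print(f"parameter (population average) = {parameter:.0f}")
print(f"statistic (sample average)     = {statistic:.0f}")
print(f"difference                     = {statistic - parameter:.0f}")
```

The two averages are close but not equal, which is what we expect when a statistic is used to estimate a parameter.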
Sampling Error
A statistic used to estimate a parameter is only an estimate. So, we will not expect the statistic to be exactly equal to the parameter. In the above example, we would not expect the sample average income to be exactly equal to the average US income. The difference between the parameter and the statistic used to estimate it is called the sampling error.
There are two types of sampling errors as follows:
The Capture-recapture Method: estimating N-value
Suppose we want to estimate the number of fish in Clinton Lake. Let N be the number of fish in the lake. We will describe the capture-recapture method of doing this. The method is as follows:
So, we have an estimate
N = mn/k.
Exercise. As part of a project we made two trips to a local lake. The first day we caught m=325 fish and tagged them. On the second day we caught n=525 fish and out of them k=125 were tagged fish. Give an estimate of the total number of fish in the lake.
We have
N = mn/k = 325 x 525 / 125 = 1365.
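The estimate N = mn/k can be wrapped in a small Python function. A minimal sketch follows; the function name capture_recapture_estimate is made up for this example. The reasoning behind the formula is presumably the usual one: if the second catch is representative of the lake, the fraction of tagged fish in it (k/n) should be about the same as the fraction of tagged fish in the lake (m/N), and solving k/n = m/N for N gives mn/k.

```python
def capture_recapture_estimate(m, n, k):
    """Estimate the population size N.

    m: number of fish caught and tagged on the first trip
    n: number of fish caught on the second trip
    k: number of tagged fish among the second catch
    """
    if k <= 0:
        raise ValueError("need at least one tagged fish in the second catch")
    return m * n / k

# Numbers from the exercise above.
print(capture_recapture_estimate(325, 525, 125))  # 1365.0
```

The check on k is there because with no tagged fish in the second catch the formula would divide by zero and no estimate is possible.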
Suggested Problems: Example 6 (page 439), Ex.31a-b (page 450)
Clinical Studies
When a vaccine or a new drug is tested, the statistical methods used are very interesting. I will not go deep into it; please have a look at your text (page 440). The main points are as follows:
The following are some data from the 1954 Salk Polio Vaccine Field Trials. Please see your text (page 441) for more.
Results of the Salk Polio Vaccine Trials
| | Number of Children | Number of reported Polio-cases | Number of reported Paralytic-cases | Number of Fatal-cases |
|---|---|---|---|---|
| Treatment group | 200,785 | 82 | 33 | 0 |
| Control group | 201,229 | 162 | 115 | 4 |
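You can see that the treatment group did better: the two groups were nearly the same size, yet the treatment group had about half as many reported polio cases, far fewer paralytic cases, and no fatal cases.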
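Since the two groups are not exactly the same size, a fairer comparison uses rates rather than raw counts. The following Python sketch converts the counts in the table to cases per 100,000 children; the choice of "per 100,000" and the names used in the code are assumptions made for this illustration.

```python
# Counts taken from the table of Salk vaccine trial results above.
groups = {
    "Treatment": {"children": 200_785, "polio": 82, "paralytic": 33, "fatal": 0},
    "Control": {"children": 201_229, "polio": 162, "paralytic": 115, "fatal": 4},
}

def rate_per_100k(cases, children):
    """Number of cases per 100,000 children in the group."""
    return cases / children * 100_000

# Compute the three rates for each group.
rates = {
    name: {
        kind: rate_per_100k(g[kind], g["children"])
        for kind in ("polio", "paralytic", "fatal")
    }
    for name, g in groups.items()
}

for name, r in rates.items():
    print(name, {kind: round(value, 1) for kind, value in r.items()})
```

Each rate for the treatment group comes out lower than the corresponding rate for the control group.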
Suggested Problems: Look at all the odd-numbered problems among Ex.1-24, page 445.