The Central Limit Theorem
An exploration of the underlying concept of the Central Limit Theorem.
[Directions: Execute the Code Resource section first. Although there will be no immediate output, its definitions are used later in this worksheet.]
0. Code
1. Arbitrary Data Distributions
Any collection of data values can be expressed graphically by drawing one box for each occurrence of a particular data value at its location on the x-axis, stacking the boxes if the same value occurs more than once.
There is a box for each data value. The minimum is 3 and the maximum is 24. The downward-facing green triangle marks the location of the mean. The dashed line and the numbers above it show one standard deviation above the mean, 19.5, and one standard deviation below the mean, 5.89. This is a visual representation of the original data distribution.
Notice that distributions can look quite different from one another. The next distribution is evenly distributed: each data value has a frequency of 1, and every data value in the range is covered.
This next data set is the other extreme: every value is one of just two values, with nothing in between.
Here is another set, which is larger.
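The stacked-box picture described in this section can be approximated in plain text. The following is a minimal Python sketch, not the worksheet's own plotting code. The data values are hypothetical and chosen only for illustration, and the population form of the standard deviation is assumed; the worksheet may use the sample form.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical data values, for illustration only (not the worksheet's data).
data = [3, 5, 5, 8, 8, 8, 12, 15, 15, 24]

def stack_plot(values):
    """Return a text plot: one '#' per occurrence, stacked above each integer value."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    rows = []
    # Build the tallest row first so the stacks rest on the bottom line.
    for level in range(max(counts.values()), 0, -1):
        row = "".join("#" if counts[x] >= level else " " for x in range(lo, hi + 1))
        rows.append(row.rstrip())
    return "\n".join(rows)

m = mean(data)
s = pstdev(data)  # assumption: population standard deviation

print(stack_plot(data))
print(f"min = {min(data)}, max = {max(data)}, mean = {m:.2f}")
print(f"one SD below the mean: {m - s:.2f}, one SD above: {m + s:.2f}")
```

Each column of the output corresponds to one integer between the minimum and the maximum, so gaps in the data show up as blank columns.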
2. Sample from a Distribution
Given a "population" which has the data values of any of the distributions we saw above, we can randomly choose a sample from that population. Each time we do it, we will generally get a different sample. Here is the first data population, and a number of samples of five taken from this population.
Notice that each sample is different. We can also take larger samples: 15, or even 30, which is more than the number of elements in the original data set! This works because each value is drawn with replacement, so the same value can be picked again. It is like rolling a six-sided die 8 times, or 20 times, or 200 times: the sample may be larger than the population.
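Drawing samples that may be larger than the population can be sketched in a few lines of Python. This is an illustrative sketch only: the population values and the helper name `draw_sample` are hypothetical, and drawing with replacement is assumed because a sample of 30 could not otherwise come from a smaller data set.

```python
import random

# Hypothetical population, for illustration only.
population = [3, 5, 5, 8, 8, 8, 12, 15, 15, 24]

def draw_sample(pop, size, rng):
    """Draw `size` values from `pop` with replacement (values may repeat)."""
    return rng.choices(pop, k=size)

rng = random.Random(2002)  # fixed seed so reruns give the same samples

for size in (5, 15, 30):
    print(size, draw_sample(population, size, rng))
```

Removing the seed, or changing it, gives a different set of samples on each run, which is the behaviour the text describes.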
Let's look at a few samples from some other data sets. Here is the evenly distributed set.
And here are some samples from the data set with only two values. Every member of each sample is necessarily one of those same two values, but how many of each value appear still varies from sample to sample.
3. Sample Means
The next step is to take the samples that we pluck from a population and compute the mean of each sample. You are no doubt an expert in computing means by now: add up the numbers and divide by how many numbers there are.
We can also do this in an automated way: snag a sample and compute its mean. These means will differ because the underlying samples differ.
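The "snag a sample and compute its mean" step might look like the following in Python. This is a sketch under assumptions: the population is hypothetical, `sample_mean` is an illustrative helper name, and samples are drawn with replacement.

```python
import random

# Hypothetical population, for illustration only.
population = [3, 5, 5, 8, 8, 8, 12, 15, 15, 24]

def sample_mean(pop, size, rng):
    """Draw one sample of `size` values (with replacement) and return its mean."""
    sample = rng.choices(pop, k=size)
    return sum(sample) / len(sample)  # add up the numbers, divide by how many

rng = random.Random(7)  # fixed seed for repeatable output

means = [sample_mean(population, 5, rng) for _ in range(6)]
print([round(m, 2) for m in means])
```

Every sample mean must lie between the smallest and largest population values, but the individual means differ from one another because the samples do.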
What can we say about the collection of sample means we are creating?
4. Distribution of Sample Means
We started with a population of data values, then took samples from it and computed the mean of each sample. The next step is to look at what happens to this new data set: the set of sample means. Let's start with the evenly distributed data set.
To review: there is a box for each data value. The minimum is 2 and the maximum is 26. The downward-facing green triangle marks the location of the mean. The dashed line and the numbers above it show one standard deviation above the mean and one standard deviation below the mean. This is a visual representation of the original data distribution.
Let's take 10 samples of sample size 4 from this distribution. We'll plot the original data on the bottom as before, but we'll also plot the distribution of sample means above it.
There is a blue box for each sample mean. The mean of these sample means is indicated by the small upward-facing red triangle. The dashed red line and numbers indicate one standard deviation above and below this mean. When we say "one standard deviation" here, we are referring to the distribution of sample means, not the original data.
What do you notice? Here are a few observations:
1. The mean of the sample means is relatively close to the mean of the original distribution. We will see that it tends to get closer the more samples we take and the larger the sample size we use (see below).
2. The standard deviation of the sample means is quite a bit smaller than the standard deviation of the original data.
You may also notice that the range of the sample means is much smaller than the range of the original data. Both are indications that the sample means cluster around the mean much more tightly than the original data does.
Let's see how this changes if we take twice as many samples, 20, of the same size, 4.
Or the same number of samples as before, 10, but now with a sample size of 12 instead of 4.
Hopefully, in both cases you saw "improvements" in the sample mean distribution. With more samples, the mean of the sample means tends to land closer to the mean of the original data. With a larger sample size, the standard deviation of the sample means also shrinks. We'll get even better results if we increase both.
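The comparison above can be reproduced numerically. The sketch below assumes the evenly distributed data set consists of each integer from 2 to 26 exactly once (the text gives only the minimum, the maximum, and a frequency of 1 per value), draws samples with replacement, and uses the population standard deviation. The helper name `sample_means` is illustrative.

```python
import random
from statistics import mean, pstdev

# Assumed evenly distributed data set: every integer from 2 to 26, once each.
population = list(range(2, 27))

def sample_means(pop, num_samples, sample_size, rng):
    """Return the means of `num_samples` samples, each of `sample_size` values."""
    return [mean(rng.choices(pop, k=sample_size)) for _ in range(num_samples)]

rng = random.Random(42)  # fixed seed for repeatable output

print(f"original data: mean = {mean(population):.2f}, sd = {pstdev(population):.2f}")
for num_samples, sample_size in [(10, 4), (20, 4), (10, 12), (20, 12)]:
    ms = sample_means(population, num_samples, sample_size, rng)
    print(f"{num_samples} samples of size {sample_size}: "
          f"mean = {mean(ms):.2f}, sd = {pstdev(ms):.2f}")
```

With only 10 or 20 samples the results are noisy, so a single run may not show a clean progression; repeating the run with different seeds gives a better feel for the trend.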
Another data set: what would happen if we tried the same thing with the data set of only two extreme values?
The samples and sample means look like this:
Let's take 10 samples of size 3, then increase the number of samples and the sample size. We should see a progression. The bottom chart (in green) will not change because it shows the original data distribution. However, we should see the data on the top (in blue) clustering more and more toward the center, with the red lines of the standard deviation moving closer together.
Other data sets ....
5. The Central Limit Theorem
The Central Limit Theorem says that the distribution of sample means is approximately normal once the sample size is reasonably large, no matter what the distribution of the original data is! The approximation improves as the sample size grows.
This is particularly handy when we don't know the true mean (which we did know above). If we can take samples, find their means, and then take the mean of the sample means, this value will be a good approximation to the true mean.
The key to this is the sample size. The standard deviation of the sample means is the standard deviation of the original data divided by the square root of n, the sample size. Thus, as the sample size increases, the standard deviation of the sample means decreases.
Here are some plots which show the original data along with the normal curves for the sample means for n = 5, 10, 15, 20, 25, 30, 35.
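The square-root rule can be checked by simulation. The following Python sketch uses a hypothetical two-value data set (the worksheet's actual values are not reproduced here), draws samples with replacement, and compares the predicted standard deviation of the sample means with the one observed over many samples. The helper names `predicted_sd` and `observed_sd` and the choice of 5000 samples are illustrative assumptions.

```python
import math
import random
from statistics import pstdev

# Hypothetical two-value data set, for illustration only.
population = [2] * 6 + [26] * 6

def predicted_sd(pop, n):
    """Standard deviation of the original data divided by the square root of n."""
    return pstdev(pop) / math.sqrt(n)

def observed_sd(pop, n, num_samples, rng):
    """Standard deviation of the means of `num_samples` samples of size n."""
    means = [sum(rng.choices(pop, k=n)) / n for _ in range(num_samples)]
    return pstdev(means)

rng = random.Random(5)  # fixed seed for repeatable output

for n in (5, 10, 15, 20, 25, 30, 35):
    print(f"n = {n:2d}: predicted = {predicted_sd(population, n):.3f}, "
          f"observed = {observed_sd(population, n, 5000, rng):.3f}")
```

The two columns should agree closely, and both shrink as n grows, even though the original data set is as far from bell-shaped as it can be.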
2002 Waterloo Maple Inc & Gregory Moore, all rights reserved.