Distributions and sampling


One meaning of the term distribute is to spread around, as in divide and share. Some international relief programs distribute food to those who need it. A distribution system is a way or process of distributing something. Many companies have distribution warehouses as part of their distribution system. In statistics, a distribution is the way in which data values are put into buckets in order to summarize the data in a meaningful way.

Distributions can be discrete (i.e., counts) or continuous (i.e., measures).

If there are enough values, the discrete counts can be approximated by a continuous measure.

Statistical distributions have many uses.

Fitting and analyzing sampled data with a distribution
Generating sample data for testing purposes (e.g., privacy considerations)

Generating test data can often involve creating sample text.

Pafnuty Chebyshev, a Russian mathematician, developed a theorem that says that, for any set of data values, the proportion of values that lie within k standard deviations of the mean is at least 1 - 1/k², where k is any constant greater than 1. For example, with k = 2, at least 75% of the values lie within two standard deviations of the mean.

Russian name: Пафну́тий Чебышёв
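Chebyshev's theorem can be checked numerically. The following is a minimal sketch, not part of the original material: the skewed test data (exponential values with mean 10), the seed, and the helper names fraction_within and chebyshev_bound are all assumptions chosen for illustration.

```python
import random
import statistics

def fraction_within(values, k):
    """Return the fraction of values within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation of the data
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

def chebyshev_bound(k):
    """Lower bound from Chebyshev's theorem, valid for k > 1."""
    return 1 - 1 / k**2

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable
# Assumed test data: a deliberately skewed, non-bell-shaped data set.
data = [rng.expovariate(1 / 10) for _ in range(10_000)]

for k in (1.5, 2, 3):
    print(f"k={k}: bound={chebyshev_bound(k):.3f}, "
          f"observed={fraction_within(data, k):.3f}")
```

The observed fraction should never fall below the bound, whatever the shape of the data. For bell-shaped data the observed fraction is usually far above the bound, which is why the bound is considered conservative.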

What is a bell shape? The Liberty Bell is an example of a bell-shaped object. Many frequency distributions are bell-shaped, or can be approximated by a bell-shaped distribution.

For a symmetrical, bell-shaped, frequency distribution, the following are true.

About 68% of the observations will be within one standard deviation, σ, of the mean.
About 95% of the observations will be within two standard deviations of the mean.
About 99.7% of the observations will be within three standard deviations of the mean.
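These percentages can be computed directly for the normal distribution. The sketch below uses the standard identity that the probability of a normal value falling within k standard deviations of the mean is erf(k/√2); the function name normal_within is an assumption for illustration.

```python
import math

def normal_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```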

Six Sigma is a quality control technique based on measuring and reducing the variation in a process.

The central limit theorem describes how the means of random samples are distributed.

The central limit theorem states that when random samples of a given size are taken from a population, the distribution of the sample means can be approximated by a normal distribution. The larger the sample size, the better the approximation. Taking more samples does not change that distribution, but it gives a clearer picture of it.

Simulation techniques can be used to simulate many samples from a given distribution.

What happens if samples are taken from a population that is normally distributed?

Here is the result of sampling from a normal distribution.

The distribution of the sample means appears to be normally distributed.

This is to be expected. Notice that the distribution of the sample means is narrower and taller than the population distribution. Since the heights of all the boxes (in this discrete approximation) sum to 1.0, a narrower distribution must also be taller, and the standard deviation of the sample means must be smaller than the standard deviation of the population.
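How much smaller can be estimated by simulation. The sketch below is illustrative only: the population parameters (mean 50, standard deviation 15), the sample size of 16, the 1000 samples, and the seed are assumptions, since the original chart does not state them. It compares the observed standard deviation of the sample means with the standard result σ/√n.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed (arbitrary) for repeatability

POP_MEAN, POP_SD = 50.0, 15.0        # assumed population parameters
SAMPLE_SIZE, NUM_SAMPLES = 16, 1000  # assumed sampling plan

# Each sample mean is the average of SAMPLE_SIZE normally distributed values.
sample_means = [
    statistics.fmean([rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

sd_of_means = statistics.stdev(sample_means)
expected_sd = POP_SD / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n) = 15 / 4 = 3.75

print(f"mean of sample means: {statistics.fmean(sample_means):.2f}")
print(f"sd of sample means:   {sd_of_means:.2f}")
print(f"sigma / sqrt(n):      {expected_sd:.2f}")
```

With a sample size of 16, the standard deviation of the sample means should come out near one quarter of the population standard deviation.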

Suppose that a real-world process is modeled by a uniform distribution in the range 30.0 to 60.0.

One example of a uniform distribution is a truly random number generation process within the range of interest, here 30.0 to 60.0. Suppose that 1000 samples are to be taken, where each sample consists of 16 values selected at random from the distribution and averaged to get a sample mean. Of course, there are random errors inherent in each sample of 16 values, but the effect of these random errors on the overall picture can be reduced by taking more samples, which is why 1000 samples are taken. Then create a bar chart of the distribution of the means of the 1000 samples. What is the distribution of the sample means?

Here is the result of sampling from a uniform distribution.

The sample means appear to have a bell-shaped distribution. This bell-shaped distribution is called the normal distribution. The central limit theorem states that the distribution of the sample means can be approximated by a normal distribution, even if the population distribution (in this case, the uniform distribution) is not normally distributed. This is the importance of the central limit theorem: the normal distribution has useful applications in practice, even when the underlying population is not normal.
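The uniform experiment described above (1000 samples of 16 values from the range 30.0 to 60.0) can be reproduced as a rough sketch. The seed, the rounding of each mean to the nearest integer for bucketing, and the text-based bar chart are assumptions made for illustration, not the method used to produce the original chart.

```python
import random
import statistics
from collections import Counter

rng = random.Random(7)  # fixed seed (arbitrary) for repeatability

LOW, HIGH = 30.0, 60.0               # range of the uniform distribution
SAMPLE_SIZE, NUM_SAMPLES = 16, 1000  # 1000 samples of 16 values each

def sample_mean():
    """Average SAMPLE_SIZE values drawn uniformly from LOW to HIGH."""
    return statistics.fmean([rng.uniform(LOW, HIGH) for _ in range(SAMPLE_SIZE)])

means = [sample_mean() for _ in range(NUM_SAMPLES)]

# Bucket each mean to the nearest integer and draw a crude text bar chart.
counts = Counter(round(m) for m in means)
for bucket in sorted(counts):
    print(f"{bucket:3d} {'#' * (counts[bucket] // 5)}")

print(f"mean of sample means: {statistics.fmean(means):.2f}")
print(f"sd of sample means:   {statistics.stdev(means):.2f}")
```

The bars should pile up around 45, the middle of the range, and thin out on both sides, even though every individual value was equally likely to fall anywhere between 30.0 and 60.0.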

Consider the following exponential distribution.

What happens if samples are taken from this exponential distribution?

Here is a chart of the results of 2000 samples of size 16 from an exponential distribution.

The distribution of the sample means appears to be normally distributed.
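One way to quantify "appears to be normally distributed" is skewness, which is 0 for a symmetric distribution. The sketch below draws 2000 samples of size 16 from an exponential distribution and compares the skewness of the raw values with that of the sample means. The rate of 1.0, the seed, and the skewness helper are assumptions for illustration; the original does not state the parameters of its exponential distribution.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed (arbitrary) for repeatability

RATE = 1.0                           # assumed rate of the exponential distribution
SAMPLE_SIZE, NUM_SAMPLES = 16, 2000  # 2000 samples of size 16

def skewness(values):
    """Average cubed z-score: 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

samples = [
    [rng.expovariate(RATE) for _ in range(SAMPLE_SIZE)]
    for _ in range(NUM_SAMPLES)
]
raw_values = [v for sample in samples for v in sample]
sample_means = [statistics.fmean(sample) for sample in samples]

raw_skew = skewness(raw_values)
means_skew = skewness(sample_means)
print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The sample means should be far less skewed than the raw values, though not perfectly symmetric at a sample size of 16. The normal approximation improves as the sample size grows.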

Consider the following discrete distribution.

What happens if samples are taken from this discrete distribution?

Here is the chart of the results of 2000 samples of size 32 from the very non-normal discrete distribution.

The distribution of the sample means appears to be normally distributed. The standard deviation of the population would be quite large, as the values are only at the extremes (low and high) of the possible values.

Thus, the standard deviation of the sample means appears to be less than the standard deviation of the population. Suppose that the middle values with 0.0 probability are omitted, so that only the low and high values remain. An intuitive analogy at this point is to compare this distribution with the binomial distribution of a biased coin, where each value drawn is either a "low" or a "high" with fixed probabilities.
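The biased-coin analogy can be sketched with a two-valued distribution. The specific values (0.0 and 10.0), their probabilities (0.7 and 0.3), and the seed are hypothetical stand-ins, since the original discrete distribution is not reproduced here. The sampling plan of 2000 samples of size 32 follows the text.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed (arbitrary) for repeatability

VALUES = [0.0, 10.0]  # hypothetical "low" and "high" values
WEIGHTS = [0.7, 0.3]  # hypothetical probabilities (a biased coin)
SAMPLE_SIZE, NUM_SAMPLES = 32, 2000

# Exact population mean and standard deviation of the two-valued distribution.
pop_mean = sum(v * w for v, w in zip(VALUES, WEIGHTS))
pop_sd = sum(w * (v - pop_mean) ** 2 for v, w in zip(VALUES, WEIGHTS)) ** 0.5

# Each sample is 32 biased coin flips; its mean depends only on the count of "highs".
sample_means = [
    statistics.fmean(rng.choices(VALUES, weights=WEIGHTS, k=SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]
sd_of_means = statistics.stdev(sample_means)

print(f"population sd:      {pop_sd:.3f}")
print(f"sd of sample means: {sd_of_means:.3f}")
print(f"sigma / sqrt(n):    {pop_sd / SAMPLE_SIZE ** 0.5:.3f}")
```

Because the number of "high" values in each sample follows a binomial distribution, the sample mean is just a rescaled binomial count, which is itself approximately normal for a sample of this size.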

As can be seen, the central limit theorem is important in that it shows that normal distributions can be used to model the distribution of the sample means.

Here are some animations of some statistical concepts that are relevant to a data science course.

/QM.XLS/norm-01.xls: Normal distribution

As the mean changes, the curve keeps the same shape (i.e., the same standard deviation) and only its position changes.

In this animation, the mean μ ranges from 40 to 60 while the standard deviation σ is 15.

As the standard deviation σ (sigma) changes, the mean μ (mu) stays the same, but the shape of the curve changes. Since the area under the curve is always 1.0, the curve gets wider and flatter as the standard deviation increases and taller and narrower as the standard deviation decreases.

The area under a probability curve is 1.0.

As the standard deviation σ (sigma) increases, the curve gets wider and shorter.

As the standard deviation σ (sigma) decreases, the curve gets narrower and taller.

As the mean μ (mu) increases, the curve shifts right.

As the mean μ (mu) decreases, the curve shifts left.
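These properties follow from the normal density formula and can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names, the midpoint-rule integration over μ ± 8σ, and the σ values of 5 and 30 are choices made here. The means of 40 and 60 with σ = 15 match the animation's range.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(mu, sigma, steps=20_000):
    """Midpoint-rule estimate of the area between mu - 8*sigma and mu + 8*sigma."""
    lo, hi = mu - 8 * sigma, mu + 8 * sigma
    width = (hi - lo) / steps
    return width * sum(
        normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps)
    )

for mu, sigma in [(40, 15), (60, 15), (50, 5), (50, 30)]:
    peak = normal_pdf(mu, mu, sigma)  # the curve is tallest at its mean
    area = area_under_curve(mu, sigma)
    print(f"mu={mu}, sigma={sigma}: peak={peak:.4f}, area={area:.4f}")
```

Changing μ leaves the peak height unchanged, while the peak height is inversely proportional to σ. The area stays at 1.0 in every case.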

Here is an animation of how the Poisson distribution varies as μ (mu) is changed.

Notice that as μ (mu) increases, the Poisson distribution increasingly resembles the discrete binomial and the continuous normal distributions.
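The resemblance to the normal distribution can be measured. The sketch below compares the Poisson probabilities with a normal curve that has the same mean μ and standard deviation √μ, and reports the largest gap between them. The helper names and the chosen values of μ are assumptions for illustration.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of exactly k events, computed in log space for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def max_gap(mu):
    """Largest absolute difference between the Poisson pmf and its matching normal curve."""
    sigma = math.sqrt(mu)  # a Poisson distribution has variance equal to its mean
    upper = int(mu + 10 * sigma) + 1
    return max(
        abs(poisson_pmf(k, mu) - normal_pdf(k, mu, sigma))
        for k in range(upper + 1)
    )

gaps = {mu: max_gap(mu) for mu in (1, 5, 25, 100)}
for mu, gap in gaps.items():
    print(f"mu={mu:3d}: largest gap = {gap:.4f}")
```

The largest gap should shrink steadily as μ grows, which is the numerical counterpart of what the animation shows.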