
8. Overview and Integration

8.1 Review

  • The main concern of the science of statistics is making inferences about the parameters of a population on the basis of the data collected.
  • Data is frequently stored in the format of a data frame:
    • columns are the measured variables.
    • rows are the observations within the selected sample.
    • data types are: 1. numeric (discrete or continuous). 2. factors.
  • Statistics is geared towards dealing with variability, which is examined in two contexts:
    • descriptive statistics.
    • probability.
  • Descriptive statistics examines the distribution of the data:
    • the reference is the data frame in hand.
    • plots, histograms, box plots.
    • tables: frequency, relative frequency, cumulative relative frequency.
    • numerical summary statistics: mean, median, mode, standard deviation, variance, skewness, kurtosis.
  • In probability:
    • the reference is all data frames that could have been sampled from the population (the sample space of the sampling distribution).
    • the same tools of descriptive statistics are used, but their meaning is different.
    • the relevance of the probabilistic analysis to the data actually sampled is indirect.
    • the given sample is only one realization among all possible realizations in the sample space.
  • In statistical inference, the characteristics of the data are used to extrapolate from the sampled data to the entire population.
  • The simple random sampling model assumes that each subset of a given size from the population has an equal probability of being selected as the sample.
  • Other models describe the distribution of a measurement directly. These models include:
    • Binomial distribution.
    • Poisson distribution.
    • Exponential distribution.
    • Uniform distribution.
    • Normal distribution.
    • and many more.
  • A statistic is a function of sampled data that is used for making statistical inference.
  • The relation between the distribution of a measurement and the distribution of a statistic computed from a sample that is generated from that distribution may be complex.
  • The Central Limit Theorem:
    • provides an approximation of the distribution of the sample mean (the approximation improves as the sample size grows).
    • the expectation of the sample mean equals the expectation of the measurement (the population mean).
    • the variance of the sample mean equals the variance of the measurement (the population variance) divided by the sample size.
    • the distribution of the sample mean may be approximated by a Normal distribution with the same expectation and standard deviation.
    • the Normal approximation is matched to the sampling distribution only through its expectation and variance; other characteristics of the measurement's distribution do not enter.
    • the same approximation applies to statistics that are simple functions of the sample mean, such as:
      • sum of the sample: sum = mean * sample size.
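
The Central Limit Theorem statements above can be checked with a short simulation. This is only a sketch: the Exponential(1/2) measurement, the sample size n = 30, the number of replications N and the seed are illustrative assumptions, not values taken from the text.

```r
set.seed(1)  # assumed seed, only for reproducibility
lam = 1/2    # assumed rate: E(X) = 1/lam = 2, Var(X) = 1/lam^2 = 4
n = 30       # assumed sample size
N = 10^4     # assumed number of simulated samples

# one sample mean per simulated sample
X.bar = replicate(N, mean(rexp(n, lam)))

mean(X.bar)  # should be close to E(X) = 2
var(X.bar)   # should be close to Var(X) / n = 4/30

# simulated 90th percentile versus the Normal approximation
q.sim = quantile(X.bar, 0.9)
q.norm = qnorm(0.9, 1/lam, sqrt(1 / (lam^2 * n)))
c(q.sim, q.norm)
```

The two percentiles are close but not identical, because the Exponential distribution is skewed and n = 30 is a moderate sample size.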

8.3 Examples

1. Stress in college campus

  • statement:
    • A study involving stress is done on a college campus among the students.
    • The stress scores follow a (continuous) Uniform distribution.
    • lowest stress score = 1
    • highest stress score = 5
    • sample of 75 students.
    • find:
      1. The probability that the average stress score for the 75 students is less than 2.
      2. The 90th percentile of the average stress score for the 75 students.
      3. The probability that the total of the 75 stress scores is less than 200.
      4. The 90th percentile for the total stress score for the 75 students.

solution of EX1

  • plan:

    1. find the expectation (EX) and variance (VAR) of the stress score in the population using the Uniform distribution rules.
    2. use the sampling distribution rules to derive the EX and VAR of the sample average from those of the population.
    3. use the Normal distribution for the rest of the questions (according to the Central Limit Theorem).
    4. the Normal distribution has the same expectation and standard deviation as the sampling distribution.
    5. sample mean = sum of sample / sample size, so that sum = mean * sample size.
    6. the 90th percentile of the total is the 90th percentile of the sample mean * sample size.
  • For a uniform distribution:

    • E(X) = (a + b) / 2
    • Var(X) = (b - a)^2 / 12
  • For the sampling distribution of the sample average X.bar:

    • E(X.bar) = E(X) of the population
    • Var(X.bar) = Var(X) of the population / sample size
# X ~ Uniform(1, 5)
a = 1
b = 5
n = 75

mu.bar = (a + b) / 2 # sampling distribution mean = population mean = 3
sigma.bar = sqrt( (b - a)^2 / (12 * n) ) # sampling distribution sd = sqrt(Var(X) / n) = 0.1333

### 1
pnorm(2, mu.bar, sigma.bar) # probability of average score is less than 2

### 2
qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the average stress score

### 3
# total < 200 is the same event as mean < 200/n
pnorm(200/n, mu.bar, sigma.bar) # probability that the total is less than 200

### 4
n * qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the total stress score
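
The Normal approximation used in this solution can be checked by simulation. The sketch below draws samples of 75 continuous Uniform(1, 5) scores and compares the simulated results with the Normal-based ones. The number of replications N and the seed are assumptions made for illustration.

```r
set.seed(1) # assumed seed, only for reproducibility
a = 1
b = 5
n = 75
N = 10^4    # assumed number of simulated samples

mu.bar = (a + b) / 2
sigma.bar = sqrt((b - a)^2 / (12 * n)) # sd of the sample average

# one sample average per simulated sample of 75 students
X.bar = replicate(N, mean(runif(n, a, b)))

sd(X.bar) # should be close to sigma.bar

# 90th percentile of the average: simulation versus Normal approximation
q.sim = unname(quantile(X.bar, 0.9))
q.norm = qnorm(0.9, mu.bar, sigma.bar)
c(q.sim, q.norm)

# P(total < 200): simulation versus Normal approximation
p.sim = mean(n * X.bar < 200)
p.norm = pnorm(200 / n, mu.bar, sigma.bar)
c(p.sim, p.norm)
```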

2. Stress in college campus 2

  • statement:

    • A study involving stress is done on a college campus among the students.
    • The stress scores follow a (discrete) Uniform distribution.
    • lowest stress score = 1
    • highest stress score = 5
    • stress score can only be 1, 2, 3, 4, 5.
    • sample of 75 students.
    • find:
      1. The probability that the average stress score for the 75 students is less than 2.
      2. The 90th percentile of the average stress score for the 75 students.
      3. The probability that the total of the 75 stress scores is less than 200.
      4. The 90th percentile for the total stress score for the 75 students.

solution of EX2

  • plan:
    1. Denote again by X the stress score of a random student.
    2. The sample space of X is the set of all possible stress scores = {1, 2, 3, 4, 5}.
    3. All five scores are equally likely.
    4. Since the probabilities must sum to 1, we get that P(X = x) = ⅕ for all x in the sample space.
    5. E(X) = sum ( element * probability )
    6. Var(X) = sum ( (element - E(X))^2 * probability )
x = 1:5
p = rep(1/5, 5) # [0.2, 0.2, 0.2, 0.2, 0.2]
n = 75

mu.X = sum(x * p) # population mean
var.X =  sum( (x - mu.X)^2 * p ) # population variance

mu.bar = mu.X # sampling distribution mean
sigma.bar = sqrt(var.X / n) # sampling distribution sd (standard deviation)

### 1
pnorm(2, mu.bar, sigma.bar) # probability of average score is less than 2

### 2
qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the average stress score

### 3
pnorm(200/n, mu.bar, sigma.bar) # 0.02061342, probability that the total is less than 200 (mean < 200/75)
pnorm(199.5/n, mu.bar, sigma.bar) # 0.01866821, with continuity correction since the model is discrete

### 4
n * qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the total stress score
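
Because the score is discrete, the distribution of the total can also be computed exactly by repeated convolution, which gives a reference point for the continuity-corrected Normal value. This is a sketch rather than part of the original solution; the variable names pmf, totals and p.exact are introduced here for illustration.

```r
x = 1:5
p = rep(1/5, 5)
n = 75

# pmf of the running total, starting from an empty sum (total = 0 with probability 1)
pmf = 1
for (i in 1:n) {
    new = rep(0, length(pmf) + length(p) - 1)
    for (j in seq_along(p)) {
        # adding a score of j shifts the current pmf by j positions
        idx = j:(j + length(pmf) - 1)
        new[idx] = new[idx] + pmf * p[j]
    }
    pmf = new
}

totals = n:(5 * n) # possible totals: 75, 76, ..., 375

# exact P(total < 200) = P(total <= 199)
p.exact = sum(pmf[totals < 200])

# Normal approximation with continuity correction
p.norm = pnorm(199.5 / n, 3, sqrt(2 / n))
c(p.exact, p.norm)
```

The continuity correction uses 199.5 because the event "total < 200" is the same as "total <= 199" for an integer-valued total.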

3. cellular phone usage

  • statement:
    • A cellular phone company conducts a study of its customers who exceed the time allowance included in their basic cellular phone contract.
    • For users who exceed the basic usage allowance, the excess time follows an Exponential distribution with a mean of 22 minutes.
    • Consider a sample of 80 users.
    • find:
      1. The probability that the average excess time for the 80 users is more than 20 minutes.
      2. The 95th percentile of the average excess time for the 80 users.

solution of EX3

  • plan:
    • Let X be the excess time for customers who exceed the time included in their basic contract.
    • X ~ Exponential(lambda)
    • E(X) = 1 / lambda
    • Var(X) = 1 / lambda^2
    • E(X) = 22 = 1/lambda => lambda = 1/22
# X ~ Exponential(1/22)
lam = 1/22
n = 80

mu.X = 22 # population mean
Var.X = (1/lam^2) # population variance
sigma.X = sqrt(Var.X) # population standard deviation

mu.bar = 1/lam # mean of the sample average
var.bar = 1/(lam^2 * n) # variance of the sample average
sigma.bar = sqrt(var.bar) # sd of the sample average

### 1
1 - pnorm(20, mu.bar, sigma.bar) # P(X.bar > 20)

### 2
qnorm(0.95, mu.bar, sigma.bar) # 95th percentile of the sample average
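
For Exponential measurements the Normal approximation can be compared with an exact answer. The sum of n independent Exponential(lambda) variables follows a Gamma distribution with shape n and rate lambda, so the sample average follows a Gamma distribution with shape n and rate n * lambda. The sketch below uses this fact, which is not part of the original solution, to gauge the quality of the approximation.

```r
lam = 1/22
n = 80

mu.bar = 1 / lam
sigma.bar = sqrt(1 / (lam^2 * n))

# 1: P(X.bar > 20), Normal approximation versus exact Gamma
p.norm = 1 - pnorm(20, mu.bar, sigma.bar)
p.exact = 1 - pgamma(20, shape = n, rate = n * lam)
c(p.norm, p.exact)

# 2: 95th percentile of X.bar, Normal approximation versus exact Gamma
q.norm = qnorm(0.95, mu.bar, sigma.bar)
q.exact = qgamma(0.95, shape = n, rate = n * lam)
c(q.norm, q.exact)
```

Any difference between the two columns reflects the skewness of the Exponential distribution that remains at n = 80.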

4. Beverage company products

  • statement:
    • A beverage company produces cans that are supposed to contain 16 ounces of beverage.
    • Under normal production conditions the expected amount of beverage in each can is 16.0 ounces, with a standard deviation of 0.10 ounces.
    • QA department samples 50 cans from the production during the previous hour and measures the content in each of the cans.
    • If the average content of the 50 cans is below a control threshold then production is stopped and the can filling machine is re-calibrated.
    • QC.csv file is here: http://pluto.huji.ac.il/~msby/StatThink/Datasets/QC.csv
    • find:
      1. the probability that the amount of beverage in a single can is below 15.95 ounces.
      2. the probability that the average amount of beverage in a sample of 50 cans is below 15.95 ounces.
      3. a threshold with the property that the probability of stopping the machine in a given hour is 5%.
      4. load the file QC.csv, which contains measurements for 8 hours. Assuming we apply the threshold from 3, in which hours does the machine need re-calibration?
      5. based on the file QC.csv, which hours contain outliers?

solution for EX4

  • plan:
    1. we have the expectation and the sd of the population, but we don't know the actual distribution of the population (all cans produced by the machine).
    2. we cannot calculate the probability for a single can because we know nothing more about the distribution of the population.
    3. for the sample average, the expectation and sd of the population are enough: by the Central Limit Theorem, its sampling distribution is approximately Normal.
    4. to answer 4, we find the mean for each of the hours and compare it to the threshold from question 3: if the mean is below the threshold, the machine needs re-calibration.

box plot of QC.csv

n = 50

mu.X = 16 # mean of the population (the expectation)
sigma.X = 0.10 # sd of the population
var.X = (0.10)^2 # variance of the population

mu.bar = mu.X # mean of the sampling distribution
var.bar = var.X/n # variance of the sampling distribution
sigma.bar = sqrt(var.bar) # sd of the sampling distribution

### 1
# Impossible since we don't know the actual distribution of the population

### 2
pnorm(15.95, mu.bar, sigma.bar) # P(X.bar <= 15.95) = 0.000203476

### 3
qnorm(0.05, mu.bar, sigma.bar) # 15.97674, the 5th percentile of the sampling distribution

### 4
QC = read.csv('QC.csv')
threshold = 15.97674 # from question 3

hour.means = colMeans(QC) # average content for each hour (one column per hour)
hoursUnderThreshold = hour.means <= threshold

print(hoursUnderThreshold)  # h3 and h8 are TRUE (below the threshold)


### 5
boxplot(QC)
# h4, h6, h7, h8 have outliers (circles in the diagram above)
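
The 5% stopping probability behind the threshold can be checked by simulating many hours of production. The problem does not give the distribution of a single can, so the sketch below assumes, purely for the simulation, that the contents are Normal(16, 0.10). The number of simulated hours N and the seed are also assumptions.

```r
set.seed(1)    # assumed seed, only for reproducibility
n = 50
mu.X = 16
sigma.X = 0.10

# threshold: 5th percentile of the sampling distribution of the average
threshold = qnorm(0.05, mu.X, sigma.X / sqrt(n))

N = 2 * 10^4   # assumed number of simulated hours
# ASSUMPTION: the content of a single can is Normal(mu.X, sigma.X)
hour.means = replicate(N, mean(rnorm(n, mu.X, sigma.X)))

# proportion of hours in which production would be stopped
stop.rate = mean(hour.means < threshold)
stop.rate # should be close to 0.05
```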

5. Uniform(0, b), unknown b

  • statement:
    • A measurement follows the Uniform(0, b), for an unknown value of b.
    • Two statisticians propose two distinct ways to estimate the unknown quantity b with the aid of a sample of size n = 100.
    • Statistician A proposes to use twice the sample average (2 X.bar) as an estimate.
    • Statistician B proposes to use the largest observation instead.
    • In order to choose between the two options they agreed to prefer the statistic that tends to have values that are closer to b.
    • The performance of a statistic is evaluated using the mean square error (MSE), which is defined as the sum of the variance and the squared difference between the expectation and b.
    • MSE = Var(T) + (E(T) − b)^2 where T is the statistic.
    • A smaller mean square error corresponds to a better, more accurate, statistic.
    • find:
      1. assume b=10 compute expectation, variance, and MSE for Statistician A.
      2. assume b=10 compute expectation, variance, and MSE for Statistician B.
      3. assume b=13.7 compute expectation, variance, and MSE for Statistician A.
      4. assume b=13.7 compute expectation, variance, and MSE for Statistician B.
      5. compare the results: which statistic seems preferable?

solution for EX5

  • plan:
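    1. we approximate the sampling distribution of each statistic by simulation, generating N = 10^5 samples (an assumed number of replications), each of size n = 100.
    2. X ~ Uniform(0, 10) for the first 2 questions, and X ~ Uniform(0, 13.7) for the next 2.
N = 10^5 # number of simulated samples (assumed)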
n = 100 # sample size (given)

### 1 and 2
b1 = 10 # given
XA = rep(0, N)
XB = rep(0, N)

for (i in 1:N){
    X.samp = runif(n, 0, b1) # generate samples of size n=100
    XA[i] = 2 * mean(X.samp) # XA: twice the mean
    XB[i] = max(X.samp) # XB: max element in the sample
}

# Statistician A
XA.mean = mean(XA)
XA.var = var(XA)
XA.MSE = XA.var + (XA.mean - 10)^2 # 0.3341725

# Statistician B
XB.mean = mean(XB)
XB.var= var(XB)
XB.MSE = XB.var + (XB.mean - 10)^2 # 0.01924989

### 3 and 4
b2 = 13.7 # given
YA = rep(0, N)
YB = rep(0, N)

for (i in 1:N){
    Y.samp = runif(n, 0, b2) # generate samples of size n=100
    YA[i] = 2 * mean(Y.samp) # YA:  twice the mean
    YB[i] = max(Y.samp) # YB: max element in the sample
}

# Statistician A
YA.mean = mean(YA)
YA.var = var(YA)
YA.MSE = YA.var + (YA.mean - b2)^2 # 0.6264204

# Statistician B
YB.mean = mean(YB)
YB.var = var(YB)
YB.MSE = YB.var + (YB.mean - b2)^2 # about 0.036; theory gives 2 * b2^2 / ((n + 1) * (n + 2)) = 0.03644


#### 5
# The MSE of Statistician B is smaller than the MSE of Statistician A in both cases =>
# Statistician B's estimator (the largest observation) is preferable: it is the more accurate of the two.
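
The simulated values can be cross-checked against closed-form expressions, using standard results for the Uniform(0, b) distribution. For Statistician A, E(2 X.bar) = b and Var(2 X.bar) = 4 * (b^2 / 12) / n, so MSE = b^2 / (3n). For Statistician B, the maximum has expectation n * b / (n + 1) and variance n * b^2 / ((n + 1)^2 * (n + 2)), which gives MSE = 2 * b^2 / ((n + 1) * (n + 2)). The helper names mse.A and mse.B below are introduced here for illustration.

```r
# closed-form MSE of twice the sample average (Statistician A)
mse.A = function(b, n) b^2 / (3 * n)

# closed-form MSE of the largest observation (Statistician B)
mse.B = function(b, n) 2 * b^2 / ((n + 1) * (n + 2))

n = 100
for (b in c(10, 13.7)) {
    cat("b =", b, " MSE A =", mse.A(b, n), " MSE B =", mse.B(b, n), "\n")
}
```

For n = 100, the MSE of the maximum is smaller than that of twice the average for any b, since both expressions scale with b^2.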