8. Overview and Integration
8.1 Review
- The main concern of statistics is making inferences about the parameters of a population on the basis of the collected data.
- Data is frequently stored in the format of a data frame:
- columns are the measured variables.
- rows are the observations within the selected sample.
- data types are: 1. numeric (discrete or continuous). 2. factors.
- Statistics is geared towards dealing with variability, which is examined in two contexts:
- Descriptive statistics.
- probability.
- Descriptive statistics examines the distribution of the data:
- the reference is the data frame in hand.
- plots: histograms, box plots.
- tables: frequency, relative frequency, cumulative relative frequency.
- numerical summary statistics: mean, median, mode, standard deviation, variance, skewness, kurtosis.
- In probability:
- reference is: all data frames that could have been sampled from the population (the sample space of the sampling distribution)
- same tools of descriptive statistics are used, but the meaning is different.
- the relevance of the probabilistic analysis to the data actually sampled is indirect.
- the given sample is only one realization among all the possible realizations in the sample space.
- In statistical inference the characteristics of the data may be used in order to extrapolate from the sampled data to the entire population.
- The simple sampling model assumes that each subset of a given size from the population has an equal probability of being selected as the sample.
- Other models instead describe the distribution of a measurement with a theoretical distribution. These models include:
- Binomial distribution.
- Poisson distribution.
- Exponential distribution.
- Uniform distribution.
- Normal distribution.
- and many more.
- A statistic is a function of sampled data that is used for making statistical inference.
- The relation between the distribution of a measurement and the distribution of a statistic computed from a sample that is generated from that distribution may be complex.
- The Central Limit Theorem:
- provides an approximation of the distribution of the sample mean (the approximation improves as the sample size grows).
- The expectation of the sample mean equals the expectation of the measurement (the population mean).
- The variance of the sample mean equals the variance of the measurement (the population variance) divided by the sample size.
- The distribution of the sample mean may be approximated by a Normal distribution with the same expectation and standard deviation.
- The approximating Normal distribution depends on the distribution of the measurement only through its expectation and variance, not through other characteristics of that distribution.
- The approximation also carries over to quantities that are simple functions of the sample mean, such as:
- the sum of the sample: sum = mean * sample size.
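The Central Limit Theorem statements in this review can be checked by simulation. The following is a minimal sketch, assuming an Exponential(1) measurement, a sample size of 30 and 10^4 simulated samples; these choices, the seed, and the variable names are illustrative assumptions and are not taken from the notes.

```r
set.seed(1)                # assumed seed, only for reproducibility
n = 30                     # assumed sample size
reps = 10^4                # assumed number of simulated samples
# Exponential(1) has expectation 1 and variance 1
X.bar = replicate(reps, mean(rexp(n, rate = 1)))
mean(X.bar)                # should be close to E(X) = 1
var(X.bar)                 # should be close to Var(X)/n = 1/30
# share of sample means within one sd of the expectation,
# next to the Normal value for the same interval
within.one.sd = mean(abs(X.bar - 1) <= sqrt(1/n))
normal.value = pnorm(1) - pnorm(-1)
c(simulated = within.one.sd, normal = normal.value)
# the sum is n times the mean, so it inherits the approximation
X.sum = n * X.bar
mean(X.sum)                # should be close to n * E(X) = 30
```

Even though the Exponential distribution is skewed, the simulated share should land near the Normal value, which is the sense in which only the expectation and variance of the measurement matter.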
8.3 Examples
1. Stress in college campus
- statement:
- A study involving stress is done on a college campus among the students.
- The stress scores follow a (continuous) Uniform distribution.
- lowest stress score = 1
- highest stress score = 5
- sample of 75 students.
- find:
- The probability that the average stress score for the 75 students is less than 2.
- The 90th percentile of the average stress score for the 75 students.
- The probability that the total of the 75 stress scores is less than 200.
- The 90th percentile for the total stress score for the 75 students.
solution of EX1
- plan:
- Find E(X) and Var(X) of the stress score in the population using the Uniform distribution rules.
- Use the sampling distribution rules to obtain the expectation and variance of the sample average from those of the population.
- Use the Normal distribution for the rest of the questions (according to the Central Limit Theorem).
- We use a Normal distribution with the same expectation and standard deviation as the sampling distribution of the sample average.
- sample mean = sum of sample / sample size, so that sum = mean * sample size.
- The 90th percentile of the distribution of the total sum is the 90th percentile of the sample mean * sample size.
- For a uniform distribution:
- E(X) = (a + b) / 2
- Var(X) = (b - a)^2 / 12
- For the sampling distribution of the sample average X.bar:
- E(X.bar) = E(X) of the population
- Var(X.bar) = Var(X) of the population / sample size
# X ~ Uniform(1, 5)
a = 1
b = 5
n = 75
mu.bar = (a + b) / 2 # expectation of the sample average = population mean = 3
sigma.bar = sqrt( (b - a)^2 / (12 * n) ) # sd of the sample average = sqrt(Var(X)/n) = 0.1333
### 1
pnorm(2, mu.bar, sigma.bar) # probability that the average score is less than 2
### 2
qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the average stress score
### 3
# total < 200 is the same event as mean < 200/n
pnorm(200/n, mu.bar, sigma.bar) # probability that the sample mean is less than 200/75
### 4
n * qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the total stress score
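As a sanity check on the Normal approximation in this example, the four quantities can also be estimated by simulating many samples of 75 Uniform(1, 5) scores. This is a rough sketch: the number of replications, the seed, and the names `reps`, `sd.bar` and `X.bar` are assumptions made for illustration.

```r
set.seed(1)                            # assumed seed
a = 1; b = 5; n = 75
reps = 10^4                            # assumed number of simulated samples
sd.bar = sqrt((b - a)^2 / (12 * n))    # sd of the sample average, 0.1333
X.bar = replicate(reps, mean(runif(n, a, b)))
# 1: simulated P(average < 2); the Normal value is practically 0
mean(X.bar < 2)
# 2: simulated vs. Normal 90th percentile of the average (Normal: about 3.171)
c(quantile(X.bar, 0.9), normal = qnorm(0.9, 3, sd.bar))
# 3: simulated vs. Normal P(total < 200) (Normal: about 0.0062)
c(simulated = mean(n * X.bar < 200), normal = pnorm(200/n, 3, sd.bar))
# 4: simulated vs. Normal 90th percentile of the total (Normal: about 237.8)
c(quantile(n * X.bar, 0.9), normal = n * qnorm(0.9, 3, sd.bar))
```

The simulated and Normal values should agree to about two decimal places, which is what the Central Limit Theorem leads us to expect for n = 75.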
2. Stress in college campus 2
- statement:
- A study involving stress is done on a college campus among the students.
- The stress scores follow a (discrete) Uniform distribution.
- lowest stress score = 1
- highest stress score = 5
- stress score can only be 1, 2, 3, 4, 5.
- sample of 75 students.
- find:
- The probability that the average stress score for the 75 students is less than 2.
- The 90th percentile of the average stress score for the 75 students.
- The probability that the total of the 75 stress scores is less than 200.
- The 90th percentile for the total stress score for the 75 students.
solution of EX2
- plan:
- Denote again by X the stress score of a random student.
- The sample space of X is the set of all possible stress scores = {1, 2, 3, 4, 5}.
- All five scores are equally likely; since the probabilities must sum to 1, P(X = x) = ⅕ for every x in the sample space.
- expectation: E(X) = sum ( x * P(X = x) )
- variance: Var(X) = sum ( (x - E(X))^2 * P(X = x) )
x = 1:5
p = rep(1/5, 5) # [0.2, 0.2, 0.2, 0.2, 0.2]
n = 75
mu.X = sum(x * p) # population mean
var.X = sum( (x - mu.X)^2 * p ) # population variance
mu.bar = mu.X # expectation of the sample average
sigma.bar = sqrt(var.X / n) # standard deviation of the sample average
### 1
pnorm(2, mu.bar, sigma.bar) # probability that the average score is less than 2
### 2
qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the average stress score
### 3
pnorm(200/n, mu.bar, sigma.bar) # 0.02061342, probability of sample mean is less than = 200/75,
pnorm(199.5/n, mu.bar, sigma.bar) #0.01866821, with continuity correction since model is discrete.
### 4
n * qnorm(0.9, mu.bar, sigma.bar) # 90th percentile of the total stress score
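Because the score is discrete, the distribution of the total can also be computed exactly by repeatedly convolving the single-score distribution, which gives a way to judge the continuity correction. The sketch below is one possible implementation; the helper `add.score` and the variable names are hypothetical and not part of the original solution.

```r
n = 75
p = rep(1/5, 5)                  # P(X = 1), ..., P(X = 5)
# Given the distribution of the sum of k scores (entry i = P(sum = k + i - 1)),
# return the distribution of the sum of k + 1 scores.
add.score = function(dist.sum, p) {
  out = numeric(length(dist.sum) + length(p) - 1)
  for (j in seq_along(p)) {
    idx = j:(j + length(dist.sum) - 1)
    out[idx] = out[idx] + dist.sum * p[j]
  }
  out
}
dist.sum = p                     # distribution of one score
for (k in 2:n) dist.sum = add.score(dist.sum, p)
totals = n:(5 * n)               # possible totals: 75, ..., 375
p.exact = sum(dist.sum[totals < 200])    # exact P(total < 200)
sigma.bar = sqrt(2 / n)          # sd of the sample average, since Var(X) = 2
p.plain = pnorm(200/n, 3, sigma.bar)     # Normal approximation
p.cc = pnorm(199.5/n, 3, sigma.bar)      # with continuity correction
c(exact = p.exact, plain = p.plain, corrected = p.cc)
```

Comparing the three numbers shows how much of the gap between the plain approximation and the exact probability the correction removes.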
3. Cellular phone usage
- statement:
- A cellular phone company conducts a study of its customers who exceed the time allowance included in their basic cellular phone contract.
- For users who exceed the basic usage allowance, the excess time follows an Exponential distribution with a mean of 22 minutes.
- consider a sample of 80 users.
- find:
- The probability that the average excess time for the 80 users is more than 20 minutes.
- The 95th percentile of the average excess time for the 80 users.
solution of EX3
- plan:
- Let X be the excess time for customers who exceed the time included in their basic contract.
- X ~ Exponential(lambda)
- E(X) = 1 / lambda
- Var(X) = 1 / lambda^2
- E(X) = 22 = 1/lambda => lambda = 1/22
# X ~ Exponential(1/22)
lam = 1/22
n = 80
mu.X = 22 # population mean
Var.X = (1/lam^2) # population variance
sigma.X = sqrt(Var.X) # population standard deviation
mu.bar = mu.X # expectation of the sample average
var.bar = Var.X / n # variance of the sample average
sigma.bar = sqrt(var.bar) # sd of the sample average
### 1
1 - pnorm(20, mu.bar, sigma.bar) # P(X.bar > 20)
### 2
qnorm(0.95, mu.bar, sigma.bar) # 95th percentile of the sample average
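For Exponential measurements the Normal approximation can be compared with an exact answer: the sum of n independent Exponential(lambda) variables follows a Gamma distribution with shape n and rate lambda, so the average follows Gamma(shape = n, rate = n * lambda). The sketch below uses that fact; the names `p.norm`, `p.exact`, `q.norm` and `q.exact` are chosen here for illustration.

```r
lam = 1/22
n = 80
mu.bar = 1/lam                      # expectation of the sample average
sigma.bar = 1/(lam * sqrt(n))       # sd of the sample average
# Normal approximation, as in the solution
p.norm = 1 - pnorm(20, mu.bar, sigma.bar)
q.norm = qnorm(0.95, mu.bar, sigma.bar)
# Exact values from the Gamma distribution of the average
p.exact = 1 - pgamma(20, shape = n, rate = n * lam)
q.exact = qgamma(0.95, shape = n, rate = n * lam)
c(p.norm = p.norm, p.exact = p.exact)
c(q.norm = q.norm, q.exact = q.exact)
```

The small remaining differences reflect the skewness of the Exponential distribution, which the Normal approximation ignores.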
4. Beverage company products
- statement:
- A beverage company produces cans that are supposed to contain 16 ounces of beverage.
- Under normal production conditions the expected amount of beverage in each can is 16.0 ounces, with a standard deviation of 0.10 ounces.
- QA department samples 50 cans from the production during the previous hour and measures the content in each of the cans.
- If the average content of the 50 cans is below a control threshold then production is stopped and the can filling machine is re-calibrated.
- QC.csv file is here: http://pluto.huji.ac.il/~msby/StatThink/Datasets/QC.csv
- find:
- the probability that the amount of beverage in a single can is below 15.95 ounces.
- the probability that the average amount of beverage in a sample of 50 cans is below 15.95 ounces.
- a threshold with the property that the probability of stopping the machine in a given hour is 5%.
- load the file QC.csv, which contains measurements for 8 hours. Assuming we apply the threshold from 3., in which hours does the machine need re-calibration?
- using the same file QC.csv, which hours contain outliers?
solution for EX4
- plan:
- We know the expectation and the sd of the population (all cans produced by the machine), but not its actual distribution.
- Hence we cannot calculate the probability for a single can: the expectation and sd alone do not determine the distribution.
- For the sample average we can apply the Central Limit Theorem and treat its sampling distribution as Normal, with the population expectation and with sd equal to the population sd divided by sqrt(n).
- To answer 4., compute the mean for each hour and compare it to the threshold found in 3.: if the mean is below the threshold, the machine needs re-calibration.
n = 50
mu.X = 16 # mean of the population (the expectation)
sigma.X = 0.10 # sd of the population
var.X = sigma.X^2 # variance of the population
mu.bar = mu.X # expectation of the sample average
var.bar = var.X/n # variance of the sample average
sigma.bar = sqrt(var.bar) # sd of the sample average
### 1
# Impossible since we don't know the actual distribution of the population
### 2
pnorm(15.95, mu.bar, sigma.bar) # P(X.bar <= 15.95) = 0.000203476
### 3
qnorm(0.05, mu.bar, sigma.bar) # 15.97674, the 5th percentile of the sampling distribution
### 4
QC = read.csv('QC.csv') # one column per hour
threshold = qnorm(0.05, mu.bar, sigma.bar)
hour.means = colMeans(QC) # average content per hour
hoursUnderThreshold = hour.means <= threshold
print(hoursUnderThreshold) # h3 and h8 are TRUE (below the threshold)
### 5
boxplot(QC)
# h4, h6, h7, h8 have outliers (circles in the box plot)
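The 5% property of the threshold can be illustrated without the data file by simulating many hours of production. This is only a sketch: it assumes that can contents are Normal, which the exercise does not state, and the number of simulated hours and the seed are likewise assumptions.

```r
set.seed(1)                       # assumed seed
n = 50
mu.X = 16
sigma.X = 0.10
threshold = qnorm(0.05, mu.X, sigma.X / sqrt(n))   # 15.97674
hours = 10^4                      # assumed number of simulated hours
# ASSUMPTION: each can's content is Normal(16, 0.10); not given in the exercise
hour.means = replicate(hours, mean(rnorm(n, mu.X, sigma.X)))
stop.rate = mean(hour.means < threshold)
stop.rate                         # share of hours in which production stops, near 0.05
```

Under normal production conditions roughly one hour in twenty triggers a re-calibration by chance alone, which is the false-alarm rate the threshold was designed for.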
5. Uniform(0, b), unknown b
- statement:
- A measurement follows the Uniform(0, b), for an unknown value of b.
- Two statisticians propose two distinct ways to estimate the unknown quantity b with the aid of a sample of size n = 100.
- Statistician A proposes to use twice the sample average (2 X.bar) as an estimate.
- Statistician B proposes to use the largest observation instead.
- In order to choose between the two options they agreed to prefer the statistic that tends to have values that are closer to b.
- The performance of a statistic is evaluated using the mean square error (MSE), which is defined as the sum of the variance and the squared difference between the expectation and b.
- MSE = Var(T) + (E(T) − b)^2 where T is the statistic.
- A smaller mean square error corresponds to a better, more accurate, statistic.
- find:
- assume b=10 compute expectation, variance, and MSE for Statistician A.
- assume b=10 compute expectation, variance, and MSE for Statistician B.
- assume b=13.7 compute expectation, variance, and MSE for Statistician A.
- assume b=13.7 compute expectation, variance, and MSE for Statistician B.
- compare the results: which statistic seems to be preferable?
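solution for EX5
- plan:
- We approximate the sampling distribution of each statistic by simulation: N = 10^5 samples (an assumed number of replications), each of size n = 100.
- X ~ Uniform(0, 10) for the first 2 questions and X ~ Uniform(0, 13.7) for questions 3 and 4.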
N = 10^5 # number of simulated samples (assumed)
n = 100 # sample size (given)
### 1 and 2
b1 = 10 # given
XA = rep(0, N)
XB = rep(0, N)
for (i in 1:N){
X.samp = runif(n, 0, b1) # generate a sample of size n=100
XA[i] = 2 * mean(X.samp) # XA: twice the mean
XB[i] = max(X.samp) # XB: max element in the sample
}
# Statistician A
XA.mean = mean(XA)
XA.var = var(XA)
XA.MSE = XA.var + (XA.mean - b1)^2 # 0.3341725
# Statistician B
XB.mean = mean(XB)
XB.var = var(XB)
XB.MSE = XB.var + (XB.mean - b1)^2 # 0.01924989
### 3 and 4
b2 = 13.7 # given
YA = rep(0, N)
YB = rep(0, N)
for (i in 1:N){
Y.samp = runif(n, 0, b2) # generate a sample of size n=100
YA[i] = 2 * mean(Y.samp) # YA: twice the mean
YB[i] = max(Y.samp) # YB: max element in the sample
}
# Statistician A (the target is b2, not 10)
YA.mean = mean(YA)
YA.var = var(YA)
YA.MSE = YA.var + (YA.mean - b2)^2 # 0.6264204
# Statistician B (the target is b2, not 10)
YB.mean = mean(YB)
YB.var = var(YB)
YB.MSE = YB.var + (YB.mean - b2)^2 # roughly 0.036, i.e. 2*b2^2/((n+1)*(n+2)); varies between runs
#### 5
# The MSE of Statistician B is smaller than the MSE of Statistician A in both cases =>
# Statistician B's estimator (the largest observation) is preferable: it is the more accurate of the two
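The simulated mean square errors can be cross-checked against closed-form values. For a sample of size n from Uniform(0, b), twice the sample average is unbiased with variance 4 * (b^2 / 12) / n, while the sample maximum has expectation n*b/(n+1) and variance n*b^2 / ((n+1)^2 * (n+2)). The sketch below encodes these standard results; the function names `mse.A` and `mse.B` are hypothetical helpers introduced here.

```r
# MSE of 2 * X.bar: no bias, so the MSE equals the variance b^2 / (3 * n)
mse.A = function(b, n) b^2 / (3 * n)
# MSE of the sample maximum: variance plus squared bias
mse.B = function(b, n) {
  bias = n * b / (n + 1) - b
  variance = n * b^2 / ((n + 1)^2 * (n + 2))
  variance + bias^2
}
n = 100
c(A = mse.A(10, n), B = mse.B(10, n))       # about 0.333 and 0.0194
c(A = mse.A(13.7, n), B = mse.B(13.7, n))   # about 0.626 and 0.0364
```

Both mean square errors scale with b^2, so the ratio between them, and therefore the preference for the largest observation, does not depend on the unknown b.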