4. Inference for numerical data¶

7. Inference for numerical data ¹¶

T distribution ²¶

The t-distribution is useful for modeling the sampling distribution of the sample mean when the population standard deviation is unknown and must be estimated from the sample.
It is a bell-shaped distribution that is symmetric around 0.
It is similar to the normal distribution but has heavier tails and a lower peak.
Observations are more likely to fall in the tails of the t-distribution than the normal distribution.
Observations are more likely to fall beyond 2 standard deviations from the mean in the t-distribution than the normal distribution.
Confidence intervals are wider, that is, more conservative, when using the t-distribution than the normal distribution.
The thicker tails account for the extra uncertainty that comes from estimating the standard error with the sample standard deviation, an estimate that is less reliable for small samples.
The t-distribution has a single parameter, the degrees of freedom (df), which determines the thickness of the tails.
As the degrees of freedom increase, the t-distribution approaches the normal distribution.
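Calculate the T statistic using the formula: \(T = \frac{obs - null}{SE}\), where obs is the observed point estimate and null is the value claimed by the null hypothesis.
The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.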
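
The claims above about the lower peak, the heavier tails, and the approach to the normal distribution can be checked numerically. The following is a minimal sketch in Python using only the standard library; the helper name `t_pdf` and the chosen df values are illustrative assumptions, not part of the source material.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma keeps the normalizing constant stable for large df
    log_coef = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Compare the height at the center (peak) and far out in the tail (x = 3)
for df in (2, 10, 30, 1000):
    print(f"df={df:>4}: peak={t_pdf(0, df):.4f}, density at 3={t_pdf(3, df):.4f}")
print(f"normal : peak={normal.pdf(0):.4f}, density at 3={normal.pdf(3):.4f}")
```

For small df the peak is lower and the density at 3 is higher than for the normal curve; as df grows, both values move toward the normal ones.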

Inference for a mean ³¶

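A confidence interval gives a range of plausible values for the population mean. At a 95% confidence level, about 95% of intervals constructed this way would contain the true mean.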

\[ \begin{aligned} \text{point estimate} &\pm \text{margin of error} \\ \bar{x} &\pm t^*_{df}SE_{\bar{x}} \\ \bar{x} &\pm t^*_{df}\frac{s}{\sqrt{n}} \\ \bar{x} &\pm t^*_{n-1}\frac{s}{\sqrt{n}} \end{aligned} \]

Where:
- s is the sample standard deviation.
- n is the sample size.
- t* is the critical t-score for the chosen confidence level and degrees of freedom.
- df is the degrees of freedom.
- SE is the standard error.
- \(\bar{x}\) is the sample mean.
To find the t* score:
- Calculate the degrees of freedom: \(df = n - 1\).
- Use the t-distribution table to find the t* score:
  - Find the row that corresponds to the degrees of freedom.
  - Find the column that corresponds to the confidence level.
  - The value at the intersection is the t* score.
- Or use the qt() function in R to find the t* score:
  - qt(1 - (1 - confidence)/2, df = n - 1)
  - for a 95% confidence level, qt(0.975, df = n - 1); qt(0.025, df = n - 1) returns the same value with a negative sign.
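
The interval formula above can be sketched in Python as follows. This is an illustration under stated assumptions: the sample values are made up, the function name `t_confidence_interval` is hypothetical, and t* is passed in by hand (2.262 is the 95% critical value for df = 9, as read from a t-table or from qt(0.975, df = 9) in R).

```python
import math
from statistics import mean, stdev


def t_confidence_interval(sample, t_star):
    """Return (lower, upper) for x_bar ± t_star * s / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)        # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)    # standard error of the sample mean
    margin = t_star * se     # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical sample of n = 10 measurements, so df = n - 1 = 9
sample = [4.1, 5.0, 4.6, 5.3, 4.8, 4.4, 5.1, 4.9, 4.7, 5.2]
t_star = 2.262  # assumed: 95% critical value for df = 9

lower, upper = t_confidence_interval(sample, t_star)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```
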
To compute the p-value:
- Calculate the degrees of freedom: \(df = n - 1\).
- Calculate the t statistic using the formula: \(t = \frac{obs - null}{SE}\).
- Use the pt() function in R to find the p-value:
  - pt(abs(t), df, lower.tail = FALSE) * 2
  - the factor of 2 is for a two-tailed test; omit it for a one-tailed test.
- Using the t-distribution table:
  - Find the row that corresponds to the degrees of freedom.
  - Find the two adjacent values in that row that the absolute value of the t statistic falls between.
  - The tail areas in the headers of those two columns bound the p-value; the table gives a range, not an exact value.
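
The p-value steps above can be sketched in Python without any statistics package by integrating the t density numerically. This is a rough illustration, not a replacement for pt(): the helper names, the use of Simpson's rule, and the numbers in the example are all assumptions made for the sketch.

```python
import math


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t: float, df: float, steps: int = 10_000) -> float:
    """Approximate P(T > |t|) with Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    # By symmetry, half of the probability lies above 0
    return 0.5 - area_0_to_b


def two_sided_p_value(obs: float, null: float, se: float, df: int):
    """Return (t statistic, two-sided p-value) for t = (obs - null) / SE."""
    t = (obs - null) / se
    return t, 2 * t_upper_tail(t, df)


# Hypothetical numbers: observed mean 4.81, null value 4.5, SE 0.12, n = 10
t_stat, p_value = two_sided_p_value(obs=4.81, null=4.5, se=0.12, df=9)
print(f"t = {t_stat:.3f}, two-sided p-value = {p_value:.4f}")
```

Taking the absolute value of t before computing the tail area mirrors the abs(t) needed in the pt() call, so a negative t statistic does not produce a p-value above 1.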

Inference for paired data ⁴¶

Data are paired when each observation in one set is linked to exactly one observation in the other set, so the two sets are not independent.
The difference within each pair is calculated, and inference is performed on these differences.
The differences form a new data set from which the mean and standard deviation are calculated.
The null hypothesis states that the population mean difference is 0, that is, there is no difference on average between the two sets of observations.
H0: \(\mu_{diff} = 0\). Ha: \(\mu_{diff} \neq 0\).
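
A minimal Python sketch of the paired procedure: reduce the two linked sets to one set of differences, then compute the one-sample t statistic against a null value of 0. The before/after values and the function name `paired_t_statistic` are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t statistic, df) for H0: mu_diff = 0 using paired differences."""
    if len(first) != len(second):
        raise ValueError("paired data need the same number of observations")
    # One difference per pair; inference uses only these differences
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    t = (mean(diffs) - 0) / se         # null value is 0
    return t, n - 1


# Hypothetical measurements on the same six subjects, before and after
before = [72, 68, 75, 80, 66, 71]
after = [70, 65, 76, 74, 64, 69]

t_stat, df = paired_t_statistic(before, after)
print(f"t = {t_stat:.3f} with df = {df}")
```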

Difference of two independent means ⁵¶

\[ \begin{aligned} \text{point estimate} &\pm \text{margin of error} \\ (\bar{x}_1 - \bar{x}_2) &\pm t^*_{df}SE_{\bar{x}_1 - \bar{x}_2} \\ (\bar{x}_1 - \bar{x}_2) &\pm t^*_{df}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \\ (\bar{x}_1 - \bar{x}_2) &\pm t^*_{\min(n_1 - 1,\ n_2 - 1)} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \end{aligned} \]

So:

\[ SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \\ df = \min(n_1 - 1, n_2 - 1) \]
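
The two-sample interval can be sketched in Python as below, taking df as the smaller of \(n_1 - 1\) and \(n_2 - 1\), the conservative choice. The group values and function names are hypothetical, and t* is supplied by hand (2.571 is the 95% critical value for df = 5 from a t-table).

```python
import math
from statistics import mean, variance


def two_sample_summary(group1, group2):
    """Return (difference of means, standard error, conservative df)."""
    n1, n2 = len(group1), len(group2)
    diff = mean(group1) - mean(group2)
    # variance() is the sample variance s^2 (n - 1 denominator)
    se = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    df = min(n1 - 1, n2 - 1)
    return diff, se, df


def two_sample_ci(group1, group2, t_star):
    """Return (lower, upper) for (x1_bar - x2_bar) ± t_star * SE."""
    diff, se, _ = two_sample_summary(group1, group2)
    return diff - t_star * se, diff + t_star * se


# Hypothetical independent groups: n1 = 8, n2 = 6, so df = min(7, 5) = 5
group1 = [23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7, 24.9]
group2 = [21.5, 22.9, 20.8, 23.4, 22.1, 21.9]

diff, se, df = two_sample_summary(group1, group2)
lower, upper = two_sample_ci(group1, group2, t_star=2.571)  # assumed t* for df = 5
print(f"difference = {diff:.3f}, SE = {se:.3f}, df = {df}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```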

Conditions:
- Independence:
  - Within groups: sampled observations are independent.
    - Random sample/assignment.
    - 10% condition: when sampling without replacement, each sample is less than 10% of its population.
  - Between groups: the two groups are independent.
    - Groups are not paired.
- Sample size/skew:
  - The more skewed the data, the larger the sample size required.

References¶

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics (4th ed.). Open Textbook Library. https://www.biostat.jhsph.edu/~iruczins/teaching/books/2019.openintro.statistics.pdf Chapter 7 - Inference for numerical data: Section 7.1 - One-sample means with the t-distribution (pp. 251-261); Section 7.2 - Paired data (pp. 262-266); Section 7.3 - Difference of two means (pp. 267-277). ↩
Çetinkaya-Rundel, M. (2018a, February 20). 5 1A t distribution [Video]. YouTube. https://youtu.be/uVEj2uBJfq0 ↩
Çetinkaya-Rundel, M. (2018b, February 20). 5 1B Inference for a mean [Video]. YouTube. https://youtu.be/RYVIGj1l4xs ↩
Çetinkaya-Rundel, M. (2018c, February 20). 5 2 Inference for paired data [Video]. YouTube. https://youtu.be/K0QZ9_4w0HU ↩
Çetinkaya-Rundel, M. (2018d, February 20). 5 3 Difference of two independent means [Video]. YouTube. https://youtu.be/emZ24asR2F4 ↩
