4. Inference for numerical data

7. Inference for numerical data 1

T distribution 2

  • The t-distribution is used to model the sampling distribution of the sample mean when the population standard deviation is unknown and is estimated by the sample standard deviation.
  • It is a bell-shaped distribution that is symmetric around 0.
  • It is similar to the normal distribution but has heavier tails and a lower peak.
  • Observations are more likely to fall in the tails of the t-distribution than the normal distribution.
  • Observations are more likely to fall beyond 2 standard deviations from the mean in the t-distribution than the normal distribution.
  • Confidence intervals are wider, i.e., more conservative, when using the t-distribution than the normal distribution.
  • The thicker tails compensate for the extra uncertainty introduced by estimating the standard error from the sample: the less reliable the estimate of the standard error, the thicker the tails.
  • The t-distribution has a parameter called degrees of freedom, which determines the thickness of the tails.
  • As the degrees of freedom increase, the t-distribution approaches the normal distribution.
  • Calculate the T statistic using the formula: \(T = \frac{obs - null}{SE}\).
  • The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
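
The heavier tails and the convergence to the normal distribution can be checked numerically. The following is a minimal R sketch using base R's pt() and pnorm(); the cutoff of 2 and the degrees of freedom listed are arbitrary illustrative choices, not values from the source.

```r
# Two-sided probability of landing more than 2 units from the center
# under the standard normal distribution.
tail_normal <- 2 * pnorm(-2)

# The same probability under t-distributions with increasing
# degrees of freedom (illustrative values only).
dfs <- c(2, 5, 30, 1000)
tail_t <- 2 * pt(-2, df = dfs)

# Side-by-side comparison: the t tail area is larger for every df,
# and the gap shrinks as df grows.
comparison <- data.frame(df = dfs, tail_t = tail_t, tail_normal = tail_normal)
print(comparison)
```

Each row should show a larger tail area for the t-distribution than for the normal, with the two nearly equal at the largest df.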

Inference for a mean 3

  • A confidence interval gives a range of plausible values for the population mean:
\[ \begin{aligned} \text{point estimate} &\pm \text{margin of error}\\ \bar{x} &\pm t^*_{df}SE_{\bar{x}} \\ \bar{x} &\pm t^*_{df}\frac{s}{\sqrt{n}} \\ \bar{x} &\pm t^*_{n-1}\frac{s}{\sqrt{n}} \end{aligned} \]
  • Where:
    • s is the sample standard deviation.
    • n is the sample size.
    • t* is the critical t-score for the chosen confidence level and degrees of freedom.
    • df is the degrees of freedom.
    • SE is the standard error.
    • \(\bar{x}\) is the sample mean.
  • To find the t* score:
    • Calculate the degrees of freedom: \(df = n - 1\).
    • Use the t-distribution table to find the t* score:
      • Find the row that corresponds to the degrees of freedom.
      • Find the column that corresponds to the confidence level.
      • The value at the intersection is the t* score.
    • Or use the qt() function in R to find the t* score:
      • qt(1 - (1 - confidence)/2, df = n - 1)
      • For a 95% confidence level: qt(0.975, df = n - 1). Note that qt(0.025, df = n - 1) returns the same value with a negative sign.
  • To compute the p-value:
    • Calculate the degrees of freedom: \(df = n - 1\).
    • Calculate the t statistic using the formula: \(t = \frac{obs - null}{SE}\).
    • Use the pt() function in R to find the p-value:
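      • pt(abs(t), df, lower.tail = FALSE) * 2
      • The tail area is multiplied by 2 for a two-tailed test; abs() keeps the formula correct when t is negative.
    • Using the t-distribution table:
      • Find the row that corresponds to the degrees of freedom.
      • Locate the two values in that row that the absolute value of the t statistic falls between.
      • The tail areas in the headers of those two columns bound the p-value; the table gives a range, not an exact value.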
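
The confidence interval and p-value steps can be sketched in R as follows. This is an illustration under stated assumptions: the sample `x`, the null value `null_value`, and the 95% confidence level are all hypothetical and are not taken from the source. The last line cross-checks the manual calculation against base R's t.test().

```r
# Hypothetical sample and null value; replace with real data.
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 4.7, 5.5, 5.9, 5.2)
null_value <- 5
confidence <- 0.95

n <- length(x)
df <- n - 1
x_bar <- mean(x)
se <- sd(x) / sqrt(n)          # SE = s / sqrt(n)

# Critical value: upper-tail quantile, so t_star is positive.
t_star <- qt(1 - (1 - confidence) / 2, df = df)
ci <- c(x_bar - t_star * se, x_bar + t_star * se)

# Test statistic and two-sided p-value.
t_stat <- (x_bar - null_value) / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Cross-check against the built-in one-sample t-test.
check <- t.test(x, mu = null_value, conf.level = confidence)
print(ci)
print(p_value)
```

The manual interval and p-value should agree with `check$conf.int` and `check$p.value` up to floating-point error.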

Inference for paired data 4

  • Data are paired when each observation in one set is linked to exactly one observation in the other, i.e., the two sets are not independent.
  • The difference between the two observations is calculated, and it is used to perform inference.
  • The difference is the new data set that is used to calculate the mean and standard deviation.
  • Under the null hypothesis, the population mean difference is 0, i.e., there is no difference on average between the two sets of observations.
  • \(H_0: \mu_{diff} = 0\). \(H_A: \mu_{diff} \neq 0\).
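
Because the differences form a single data set, the paired test reduces to a one-sample t-test on them. The following R sketch illustrates this with hypothetical `before` and `after` scores for six subjects (invented for illustration, not from the source) and compares the manual result with t.test(..., paired = TRUE).

```r
# Hypothetical paired measurements on the same six subjects.
before <- c(72, 65, 80, 58, 90, 77)
after  <- c(75, 70, 78, 64, 93, 82)

# The differences are the new data set used for inference.
diffs <- after - before
n <- length(diffs)
se <- sd(diffs) / sqrt(n)

# Test H0: mu_diff = 0 against HA: mu_diff != 0.
t_stat <- (mean(diffs) - 0) / se
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Cross-check against the built-in paired t-test.
check <- t.test(after, before, paired = TRUE)
print(c(t = t_stat, p = p_value))
```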

Difference of two independent means 5

\[ \begin{aligned} \text{point estimate} &\pm \text{margin of error}\\ (\bar{x}_1 - \bar{x}_2) &\pm t^*_{df}SE_{\bar{x}_1 - \bar{x}_2} \\ (\bar{x}_1 - \bar{x}_2) &\pm t^*_{df}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \\ (\bar{x}_1 - \bar{x}_2) &\pm t^*_{\min(n_1 - 1,\ n_2 - 1)} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \end{aligned} \]

So:

\[ \begin{aligned} SE_{\bar{x}_1 - \bar{x}_2} &= \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \\ df &= \min(n_1 - 1,\ n_2 - 1) \end{aligned} \]
  • Conditions:
    • Independence:
      • Within groups: sampled observations are independent.
        • Random sample/assignment.
        • 10% condition: when sampling without replacement, each sample is less than 10% of its population.
      • Between groups: the two groups are independent.
        • Groups are not paired.
    • Sample size/skew:
      • The more skewed the data, the larger the sample size required.
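
The standard error and the conservative degrees of freedom defined in this section can be computed as in the R sketch below. The two samples `group1` and `group2` are hypothetical and the 95% level is an illustrative choice. Base R's t.test() uses the Welch approximation for df by default rather than the minimum rule, so it is shown only for comparison: its interval is expected to be no wider than the conservative one.

```r
# Hypothetical independent samples; replace with real data.
group1 <- c(12.1, 14.3, 11.8, 15.0, 13.2, 12.7, 14.9, 13.5)
group2 <- c(10.4, 11.9, 12.2, 9.8, 11.1, 10.7)

n1 <- length(group1)
n2 <- length(group2)

# SE of the difference and conservative degrees of freedom.
se <- sqrt(var(group1) / n1 + var(group2) / n2)
df <- min(n1 - 1, n2 - 1)

# 95% confidence interval for mu_1 - mu_2.
t_star <- qt(0.975, df = df)
diff_means <- mean(group1) - mean(group2)
ci <- diff_means + c(-1, 1) * t_star * se

# For comparison: Welch interval from the built-in two-sample t-test.
welch <- t.test(group1, group2)
print(ci)
print(as.numeric(welch$conf.int))
```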

References


  1. Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics (4th ed.). Open Textbook Library. https://www.biostat.jhsph.edu/~iruczins/teaching/books/2019.openintro.statistics.pdf Chapter 7: Inference for numerical data. Section 7.1: One-sample means with the t-distribution (pp. 251-261). Section 7.2: Paired data (pp. 262-266). Section 7.3: Difference of two means (pp. 267-277).

  2. Çetinkaya-Rundel, M. (2018a, February 20). 5 1A t distribution [Video]. YouTube. https://youtu.be/uVEj2uBJfq0 

  3. Çetinkaya-Rundel, M. (2018b, February 20). 5 1B Inference for a mean [Video]. YouTube. https://youtu.be/RYVIGj1l4xs 

  4. Çetinkaya-Rundel, M. (2018c, February 20). 5 2 Inference for paired data [Video]. YouTube. https://youtu.be/K0QZ9_4w0HU 

  5. Çetinkaya-Rundel, M. (2018d, February 20). 5 3 Difference of two independent means [Video]. YouTube. https://youtu.be/emZ24asR2F4