3. Testing for goodness of fit and Independence

Inference for categorical data 1

6.3 Testing for goodness of fit using chi-square

  • The chi-square goodness-of-fit test is a method for assessing a null model when the data are binned. It can be used to:
    • Determine if a sample represents a population.
    • Determine if data follow a specific distribution (e.g., uniform, normal, etc.).
  • The geometric distribution describes the probability of waiting until the \(k\)-th trial to observe the first success.
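
To make the link between a named distribution and binned data concrete, the sketch below computes expected counts under a geometric null model. The success probability, the sample size, and the choice of bins are hypothetical values picked for illustration; they do not come from the source.

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k) = (1 - p)^(k - 1) * p, for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

# Hypothetical setup: success probability 0.5 and 100 observed waiting times.
p = 0.5
n = 100

# Bins: a wait of 1, 2, or 3 trials, plus "4 or more" as a catch-all last bin.
bin_probs = [geometric_pmf(k, p) for k in (1, 2, 3)]
bin_probs.append(1 - sum(bin_probs))  # P(wait >= 4)

# Expected count per bin under the null model = n * bin probability.
expected_counts = [n * prob for prob in bin_probs]
print(expected_counts)  # [50.0, 25.0, 12.5, 12.5]
```

Grouping the long right tail into a single bin is one way to keep every expected count at 5 or more.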

6.4 Testing for independence in two-way tables

  • A one-way table describes counts for each outcome in a single variable.
  • A two-way table describes counts for combinations of outcomes for two variables.
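
As a small illustration of the two kinds of tables, the sketch below tallies a handful of made-up records with `collections.Counter`. The variable names and the records themselves are hypothetical.

```python
from collections import Counter

# Hypothetical records: one (group, response) pair per case.
records = [
    ("treatment", "yes"), ("treatment", "yes"), ("treatment", "no"),
    ("control", "yes"), ("control", "no"), ("control", "no"),
]

# One-way table: counts for each outcome of a single variable (response).
one_way = Counter(response for _, response in records)

# Two-way table: counts for each combination of the two variables.
two_way = Counter(records)

print(dict(one_way))
print(dict(two_way))
```

Each case lands in exactly one cell, so the cell counts of either table add up to the number of records.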

Testing for goodness of fit using chi-square 2

  • We are evaluating how well the observed data fit the expected distribution.
  • We build our hypothesis for testing:
    • Null hypothesis: The observed counts are consistent with the expected counts, thus, they are a good fit (Nothing unusual is going on).
    • Alternative hypothesis: The observed counts are not consistent with the expected counts, thus, they are not a good fit (Something unusual is going on).
  • Steps:
    • Quantify how different the observed counts are from the expected counts (or population proportions).
    • Large deviations from what would be expected based on sampling variation (chance) alone provide evidence for the alternative hypothesis.
  • Conditions for the Chi-square test:
    • Independence: Sampled observations must be independent.
      • Random sample/assignment: The data should be collected using a random method.
      • If sampling without replacement, the sample size should be less than 10% of the population size.
      • Each case only contributes to one cell in the table: The data should be mutually exclusive, and each observation should belong to only one category.
    • Sample size: Each cell should have an expected count of at least 5.
  • The test statistic follows the general form \(\frac{\text{point estimate} - \text{null value}}{SE}\) (for a proportion, \(\frac{\hat{p} - p_0}{SE}\)), which does two things:
    • Identify the difference between the point estimate and the expected value (Observed - Expected). Assuming the null hypothesis is true, the expected value is the null value (\(p_0\)).
    • Standardize the difference by dividing it by the standard error.
  • Chi-square statistic:
    • It applies the same idea to counts: each cell contributes a standardized difference \(\frac{O - E}{\sqrt{E}}\), which is then squared and summed over all cells.
    • It is the sum of the squared differences between the observed and expected counts divided by the expected counts.
    • \(\chi^2 = \sum_{i=1}^{k}{\frac{(O_i-E_i)^2}{E_i}}\) where \(k\) is the number of categories (cells), and \(O_i\) and \(E_i\) are the observed and expected counts in cell \(i\).
    • Squaring makes every term non-negative, so deviations in both directions are treated equally and do not cancel out.
    • Squaring also amplifies unusual deviations.
  • Chi-square distribution:
    • The distribution of the test statistic under the null hypothesis.
    • It is a right-skewed distribution.
    • The shape of the distribution depends on the degrees of freedom.
    • For a goodness-of-fit test, the degrees of freedom equal the number of categories minus one (\(df = k - 1\)).
    • The larger the degrees of freedom, the closer the distribution is to a normal distribution.
    • The statistic is never negative, so the distribution only takes values \(\geq 0\).
    • The p-value is the tail area above the observed test statistic (the upper tail only).
  • Steps:
    • State the hypotheses.
    • Calculate the test statistic.
    • Find the p-value.
    • Make a decision: Reject the null hypothesis if the p-value is less than the significance level.
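
The whole goodness-of-fit procedure above can be sketched with the Python standard library alone. The die-roll counts and the 0.05 significance level are hypothetical, and `chi_square_statistic` and `chi_square_tail_area` are helper names introduced here for illustration. The tail-area helper uses the standard closed-form expressions for integer degrees of freedom; a statistics library would normally take its place.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_tail_area(x, df):
    """Upper-tail area P(X >= x) for a chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1)/(2r-1)!!
    # where chi = sqrt(x) and phi is the standard normal density.
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    phi = math.exp(-half) / math.sqrt(2 * math.pi)
    return math.erfc(math.sqrt(half)) + 2 * phi * total

# Hypothetical data: 60 rolls of a die; null model says all faces are equally likely.
observed = [8, 9, 12, 11, 6, 14]
n = sum(observed)
expected = [n / len(observed)] * len(observed)

# Sample-size condition: every expected count is at least 5.
assert all(e >= 5 for e in expected)

chi2 = chi_square_statistic(observed, expected)   # test statistic
df = len(observed) - 1                            # categories minus one
p_value = chi_square_tail_area(chi2, df)          # upper-tail area

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.3f}, reject H0: {reject_null}")
```

With these made-up counts the statistic is 4.2 on 5 degrees of freedom. The p-value is well above 0.05, so the null hypothesis of a fair die is not rejected.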

Testing for independence in two-way tables 3

  • We are evaluating whether two categorical variables are independent.
  • We build our hypothesis for testing:
    • Null hypothesis: The two variables are independent.
    • Alternative hypothesis: The two variables are dependent.
  • Formula:
    • \(\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}\) where \(r\) is the number of rows and \(c\) is the number of columns.
    • \(E_{ij} = \frac{(\text{row total})(\text{column total})}{\text{table total}}\).
    • The degrees of freedom is \((r-1)(c-1)\) where \(r\) is the number of rows and \(c\) is the number of columns.
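
The formulas above can be sketched for a small two-way table as follows. The 2x3 table of counts is hypothetical, and `expected_counts` and `chi_square_two_way` are helper names introduced here. The p-value line relies on the fact that the chi-square upper-tail area equals \(e^{-x/2}\) when the degrees of freedom are exactly 2, which is the case for a 2x3 table. Tables of other sizes need a general chi-square tail function.

```python
import math

def expected_counts(table):
    """E_ij = (row total) * (column total) / (table total) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1)(c - 1)
    return stat, df, expected

# Hypothetical 2x3 table: rows are two groups, columns are three response levels.
table = [
    [20, 30, 50],
    [30, 30, 40],
]

chi2, df, expected = chi_square_two_way(table)

# Closed form valid only for df == 2 (assumption tied to this 2x3 example).
assert df == 2
p_value = math.exp(-chi2 / 2)

print(expected)  # [[25.0, 30.0, 45.0], [25.0, 30.0, 45.0]]
print(f"chi2 = {chi2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Here the p-value is roughly 0.21, so at a 0.05 significance level these made-up counts do not provide evidence that the two variables are dependent.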

References


  1. Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics - Fourth edition. Open Textbook Library. https://www.biostat.jhsph.edu/~iruczins/teaching/books/2019.openintro.statistics.pdf Read Chapter 6 - Inference for categorical data: Section 6.3 - Testing for goodness of fit using chi-square (pages 229 to 239) and Section 6.4 - Testing for independence in two-way tables (pages 240 to 248).

  2. Çetinkaya-Rundel, M. (2018a, February 20). 6 3 Testing for goodness of fit using chi square [Video]. YouTube. https://youtu.be/Uk36WGxujkc 

  3. Çetinkaya-Rundel, M. (2018b, February 20). 6 4 Homogeneity and independence in two-way tables [Video]. YouTube. https://youtu.be/yjrsfNdja0U