3. Testing for goodness of fit and Independence
Inference for categorical data [1]
6.3 Testing for goodness of fit using chi-square
- The chi-square goodness-of-fit test is a method for assessing a null model when the data are binned. It is used to:
- Determine if a sample represents a population.
- Determine if data follow a specific distribution (e.g., uniform or normal).
- The geometric distribution describes the probability that the first success occurs on the \(k\)th trial.
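To make the geometric model concrete, here is a minimal sketch of how expected counts could be derived from it for a goodness-of-fit test. The function names, the choice of a final "or more" bin, and the values `n=200`, `p=0.5`, and `max_k=4` are assumptions made for illustration; they do not come from the textbook.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1 - p) ** (k - 1) * p


def expected_counts(n: int, p: float, max_k: int) -> list:
    """Expected counts for bins k = 1, ..., max_k - 1 plus a final 'max_k or more' bin."""
    counts = [n * geometric_pmf(k, p) for k in range(1, max_k)]
    # The last bin collects the whole tail: P(X >= max_k) = (1 - p)^(max_k - 1).
    counts.append(n * (1 - p) ** (max_k - 1))
    return counts


# Hypothetical values, for illustration only.
counts = expected_counts(n=200, p=0.5, max_k=4)
print(counts)
```

Pooling the tail into one bin is one way to keep every expected count large enough for the chi-square conditions.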
6.4 Testing for independence in two-way tables
- A one-way table describes counts for each outcome in a single variable.
- A two-way table describes counts for combinations of outcomes for two variables.
Testing for goodness of fit using chi-square [2]
- We are evaluating how well the observed data fit the expected distribution.
- We build our hypotheses for testing:
- Null hypothesis: The observed counts are consistent with the expected counts, thus, they are a good fit (Nothing unusual is going on).
- Alternative hypothesis: The observed counts are not consistent with the expected counts, thus, they are not a good fit (Something unusual is going on).
- Steps:
- Quantify how different the observed counts are from the expected counts (or population proportions).
- Large deviations from what would be expected based on sampling variation (chance) alone provide evidence for the alternative hypothesis.
- Conditions for the Chi-square test:
- Independence: Sampled observations must be independent.
- Random sample/assignment: The data should be collected using a random method.
- If sampling without replacement, the sample size should be less than 10% of the population size.
- Each case only contributes to one cell in the table: The data should be mutually exclusive, and each observation should belong to only one category.
- Sample size: Each cell should have an expected count of at least 5.
- The test statistic is computed using the general formula \(\frac{\hat{p} - p_0}{SE(\hat{p})}\), which does two things:
- Identify the difference between the point estimate and the expected value (Observed - Expected). Assuming the null hypothesis is true, the expected value is the null value (\(p_0\)).
- Standardize the difference by dividing it by the standard error.
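- Chi-square statistic:
- It applies the general test-statistic idea to every cell: each cell's standardized difference is \(\frac{O_i - E_i}{\sqrt{E_i}}\), and the statistic adds up the squares of these values.
- Equivalently, it is the sum of the squared differences between the observed and expected counts, each divided by the expected count.
- \(\chi^2 = \sum_{i=1}^{k}{\frac{(O_i-E_i)^2}{E_i}}\) where \(k\) is the number of categories (cells), and \(O_i\) and \(E_i\) are the observed and expected counts in cell \(i\).
- Squaring makes every term non-negative, so deviations in either direction are treated equally and do not cancel out.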
- Squaring also amplifies unusual deviations.
- Chi-square distribution:
- The distribution of the test statistic under the null hypothesis.
- It is a right-skewed distribution.
- The shape of the distribution depends on the degrees of freedom.
- For a goodness-of-fit test, the degrees of freedom equal the number of categories minus one (\(k - 1\)).
- The larger the degrees of freedom, the closer the distribution is to a normal distribution.
- The distribution takes only non-negative values.
- The p-value is the upper-tail area: the probability, under the null hypothesis, of a value at or above the observed test statistic.
- Steps:
- State the hypotheses.
- Calculate the test statistic.
- Find the p-value.
- Make a decision: Reject the null hypothesis if the p-value is less than the significance level.
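The four steps above can be sketched end to end in plain Python. This is a minimal illustration, not a reference implementation: the die-roll counts are hypothetical, the names `chi2_sf` and `goodness_of_fit` are invented for this example, and the tail area is computed with a closed-form series that holds only for integer degrees of freedom. In practice a library routine such as `scipy.stats.chi2.sf` would normally be used for the tail area.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail area P(X >= x) of a chi-square distribution with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))  # equals 2 * P(Z >= sqrt(x))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return tail + 2 * density * total


def goodness_of_fit(observed, null_proportions, alpha=0.05):
    """Return (statistic, df, p_value, reject) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in null_proportions]
    # Sample-size condition: every expected count should be at least 5.
    if min(expected) < 5:
        raise ValueError("every expected count should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    p_value = chi2_sf(statistic, df)
    return statistic, df, p_value, p_value < alpha


# Hypothetical data: 60 rolls of a die, null model = fair die.
observed = [8, 12, 10, 14, 6, 10]
null_proportions = [1 / 6] * 6
statistic, df, p_value, reject = goodness_of_fit(observed, null_proportions)
print(statistic, df, p_value, reject)
```

With these made-up counts the p-value is well above 0.05, so the null hypothesis of a fair die would not be rejected.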
Testing for independence in two-way tables [3]
- We are evaluating whether two categorical variables are independent.
- We build our hypotheses for testing:
- Null hypothesis: The two variables are independent.
- Alternative hypothesis: The two variables are dependent.
- Formula:
- \(\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}\) where \(r\) is the number of rows and \(c\) is the number of columns.
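- \(E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{\text{table total}}\) is the expected count for the cell in row \(i\) and column \(j\) if the two variables are independent.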
- The degrees of freedom is \((r-1)(c-1)\) where \(r\) is the number of rows and \(c\) is the number of columns.
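The formulas above can be sketched for a small two-way table as follows. The 2x3 table of counts is hypothetical and the function name `independence_statistic` is invented for this example; the sketch computes the expected counts, the chi-square statistic, and the degrees of freedom, and leaves the p-value as the upper-tail area of the chi-square distribution with that many degrees of freedom.

```python
def independence_statistic(table):
    """Return (expected, statistic, df) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # E_ij = (row i total) * (column j total) / (table total)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, statistic, df


# Hypothetical counts: 2 rows (groups) by 3 columns (responses).
table = [
    [20, 30, 50],
    [30, 30, 40],
]
expected, statistic, df = independence_statistic(table)
print(expected)
print(statistic, df)
```

Before interpreting the statistic, the same conditions apply as for goodness of fit, including an expected count of at least 5 in every cell.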
References
1. Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics - Fourth edition. Open Textbook Library. https://www.biostat.jhsph.edu/~iruczins/teaching/books/2019.openintro.statistics.pdf Read Chapter 6, Inference for categorical data: Section 6.3, Testing for goodness of fit using chi-square (pages 229 to 239), and Section 6.4, Testing for independence in two-way tables (pages 240 to 248).
2. Çetinkaya-Rundel, M. (2018a, February 20). 6 3 Testing for goodness of fit using chi square [Video]. YouTube. https://youtu.be/Uk36WGxujkc
3. Çetinkaya-Rundel, M. (2018b, February 20). 6 4 Homogeneity and independence in two-way tables [Video]. YouTube. https://youtu.be/yjrsfNdja0U