DA3. Testing for Independence

Statement

A one-way table describes counts for each outcome in a single variable. A two-way table describes counts for combinations of outcomes for two variables. When we consider a two-way table, we often would like to know if these variables are related in any way.

  1. Think of an experiment in an area that interests you.
  2. Explain why the conditions are met for the chi-square test for homogeneity.
  3. Explain under which condition the null hypothesis (H0) will be rejected and formulate the conclusion in the context of your experiment.

Answer

1. Think of an experiment in an area that interests you

We will work through exercise 6.50, “Coffee and Depression”, on page 248 of the textbook “OpenIntro Statistics” by Diez, Barr, and Çetinkaya-Rundel (2019).

“Researchers conducted a study investigating the relationship between caffeinated coffee consumption and risk of depression in women. They collected data on 50,739 women free of depression symptoms at the start of the study in the year 1996, and these women were followed through 2006. The researchers used questionnaires to collect data on caffeinated coffee consumption, asked each individual about physician-diagnosed depression, and also asked about the use of antidepressants. The table below shows the distribution of incidences of depression by amount of caffeinated coffee consumption” (Diez, Barr, & Çetinkaya-Rundel, 2019, p. 248).

| Depression \ Coffee consumption | <=1 C/W | 2-6 C/W | 1 C/D | 2-3 C/D | >=4 C/D | Total |
|---|---|---|---|---|---|---|
| Yes | 670 | 373 | 905 | 564 | 95 | 2607 |
| No | 11545 | 6244 | 16329 | 11726 | 2288 | 48132 |
| Total | 12215 | 6617 | 17234 | 12290 | 2383 | 50739 |

Where C/W is cups per week, C/D is cups per day.

2. Explain why the conditions are met for the chi-square test for homogeneity

According to Çetinkaya-Rundel (2019), the conditions for the chi-square test for homogeneity are:

  • (1) Independence: Sampled observations must be independent.
    • (1.A) Random sample/assignment: The data should be collected using a random method.
    • (1.B) If sampling without replacement, the sample size should be less than 10% of the population size.
    • (1.C) Each case only contributes to one cell in the table: The data should be mutually exclusive, and each observation should belong to only one category.
  • (2) Sample size: Each cell should have an expected count of at least 5.

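(1.A) is assumed to be met: the excerpt does not describe how the women were selected, so we treat them as a representative sample. (1.B) is met because the sample size (50,739) is far less than 10% of the population of women in the United States. (1.C) is met because each woman falls into exactly one combination of coffee consumption level and depression status. (2) is met because every cell has an expected count of at least 5; even the smallest expected count is well above 100.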

With these conditions satisfied, the test statistic approximately follows a chi-square distribution, and we can use the chi-square test for homogeneity.
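The sample size condition can be checked directly from the observed counts. The following is a minimal sketch in plain Python; the variable names are illustrative and the counts are copied from the study table above.

```python
# Observed counts from the study table
# (rows: depression yes / no; columns: the five coffee consumption levels).
observed = [
    [670, 373, 905, 564, 95],
    [11545, 6244, 16329, 11726, 2288],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count of a cell: (row total * column total) / table total.
smallest_expected = min(
    r * c / grand_total for r in row_totals for c in col_totals
)

print(f"smallest expected count: {smallest_expected:.1f}")
print("condition (2) met:", smallest_expected >= 5)
```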

3. Explain under which condition the null hypothesis (H0) will be rejected and formulate the conclusion in the context of your experiment

Let’s state the hypotheses for the test:

  • H0: Null hypothesis: The two variables (coffee consumption and depression) are independent, that is, the risk of depression is the same for all levels of coffee consumption.
  • H1: Alternative hypothesis: The two variables (coffee consumption and depression) are dependent, that is, the risk of depression is different for at least one level of coffee consumption.

The overall rate of depression in the sample is 2607/50739 = 0.0514. Under H0 this rate applies to every level of coffee consumption, so the expected count for each cell is the row total multiplied by the column total, divided by the table total:

\[E_{ij} = \frac{(\text{row total})(\text{column total})}{\text{table total}}\]

We will rewrite the table with the expected count in parentheses next to each observed count.

| Depression | <=1 C/W | 2-6 C/W | 1 C/D | 2-3 C/D | >=4 C/D | Total |
|---|---|---|---|---|---|---|
| Yes | 670 (627.6) | 373 (340.0) | 905 (885.5) | 564 (631.5) | 95 (122.4) | 2607 |
| No | 11545 (11587.4) | 6244 (6277.0) | 16329 (16348.5) | 11726 (11658.5) | 2288 (2260.6) | 48132 |
| Total | 12215 | 6617 | 17234 | 12290 | 2383 | 50739 |
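As a cross-check on the expected counts, here is a small sketch in plain Python that applies the formula for \(E_{ij}\) to every cell. The function name `expected_counts` is hypothetical, chosen only for this illustration.

```python
def expected_counts(observed):
    """Return the expected counts under independence for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    # E_ij = (row total i) * (column total j) / (table total)
    return [[r * c / table_total for c in col_totals] for r in row_totals]


# Observed counts (rows: depression yes / no; columns: coffee levels).
observed = [
    [670, 373, 905, 564, 95],
    [11545, 6244, 16329, 11726, 2288],
]

expected = expected_counts(observed)
for label, row in zip(["Yes", "No"], expected):
    print(label, [round(e, 1) for e in row])
```

Each row of expected counts sums to the same row total as the observed counts, which is a quick way to catch arithmetic slips.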

Now let’s calculate the chi-square statistic using the formula

\[\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}\]

where \(r\) is the number of rows and \(c\) is the number of columns, and we will put the results in the following table:

| (O-E)^2/E | <=1 C/W | 2-6 C/W | 1 C/D | 2-3 C/D | >=4 C/D |
|---|---|---|---|---|---|
| Yes | 2.86 | 3.21 | 0.43 | 7.21 | 6.15 |
| No | 0.16 | 0.17 | 0.02 | 0.39 | 0.33 |

The number of degrees of freedom is \((r-1)(c-1)\). So:

\[ \begin{align*} \chi^2 &= 2.86 + 3.21 + 0.43 + 7.21 + 6.15 + 0.16 + 0.17 + 0.02 + 0.39 + 0.33 \\ &= 20.93 \\ df &= (2-1)(5-1) = 4 \end{align*} \]
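The same calculation can be sketched in plain Python without rounding the intermediate terms. The helper name `chi_square_statistic` is an assumption made for this example, not something from the textbook.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / table_total  # expected count
            statistic += (o - e) ** 2 / e                     # cell contribution

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Observed counts (rows: depression yes / no; columns: coffee levels).
observed = [
    [670, 373, 905, 564, 95],
    [11545, 6244, 16329, 11726, 2288],
]

statistic, df = chi_square_statistic(observed)
print(f"chi-square = {statistic:.2f}, df = {df}")
```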

The p-value is the area in the upper tail of the chi-square distribution, beyond the observed test statistic. For a chi-square distribution with 4 degrees of freedom and a test statistic of 20.93, the p-value is approximately 0.00033.

We reject the null hypothesis when the p-value is smaller than the chosen significance level; here we use alpha = 0.05. The p-value is below alpha (0.00033 < 0.05), so we reject the null hypothesis. We have enough evidence to conclude that the two variables (coffee consumption and depression) are dependent, that is, the risk of depression is different for at least one level of coffee consumption.
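The p-value can also be computed rather than read from a table. The sketch below uses only the standard library and relies on the closed form of the chi-square upper tail, which holds when the degrees of freedom are even (as they are here, with df = 4). The function name `chi_square_upper_tail` is hypothetical.

```python
import math


def chi_square_upper_tail(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square variable.

    Assumption: df is a positive even integer, so the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!  applies.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = statistic / 2
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series


alpha = 0.05
p_value = chi_square_upper_tail(20.93, 4)

print(f"p-value = {p_value:.5f}")
print("reject H0:", p_value < alpha)
```

For odd degrees of freedom this shortcut does not apply, and a statistics library would be the more practical choice.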

This is an observational study, so it cannot establish causation. It does suggest that higher coffee consumption is associated with a lower risk of depression in women: in the two highest consumption groups (2-3 and >=4 cups per day), the observed number of depression cases is well below the expected count.
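To see the direction of the association, it helps to compare the observed depression rate at each consumption level with the overall rate. The snippet below is an illustrative sketch using the counts from the study table; the variable names are our own.

```python
levels = ["<=1 C/W", "2-6 C/W", "1 C/D", "2-3 C/D", ">=4 C/D"]
depressed = [670, 373, 905, 564, 95]           # "Yes" row of the table
totals = [12215, 6617, 17234, 12290, 2383]     # column totals

overall_rate = sum(depressed) / sum(totals)
rates = {level: d / t for level, d, t in zip(levels, depressed, totals)}

print(f"overall: {overall_rate:.2%}")
for level, rate in rates.items():
    direction = "below" if rate < overall_rate else "above"
    print(f"{level:>8}: {rate:.2%} ({direction} overall)")
```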

References