
2. Hypothesis Testing and Inference for Categorical Data

5.3 Hypothesis Testing for a Proportion 1

5.3.1 Hypothesis Testing Framework

  • Example: We’re interested in understanding how much people know about world health and development. If we take a multiple choice world health question, then we might like to understand if:
    • H0: People never learn these particular topics and their responses are simply equivalent to random guesses.
    • HA: People have knowledge that helps them do better than random guessing, or perhaps, they have false knowledge that leads them to actually do worse than random guessing.
    • These competing ideas are called hypotheses. We call H0 the null hypothesis and HA the alternative hypothesis.
  • The null hypothesis (H0) often represents a skeptical perspective or a claim to be tested.
    • If there is sufficient evidence that supports the claim, we set aside our skepticism and reject the null hypothesis in favor of the alternative.
    • Even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as true.
  • The alternative hypothesis (HA) represents an alternative claim under consideration and is often represented by a range of possible parameter values.
    • Failing to find strong evidence for the alternative hypothesis is not equivalent to accepting the null hypothesis.

5.3.2 Testing Hypotheses Using Confidence Intervals

  • We construct the confidence interval for the parameter of interest and then check whether the null hypothesis value is within the confidence interval.
  • E.g., if the confidence interval is [20, 30] and the null hypothesis value is 25, then we fail to reject the null hypothesis and the alternative hypothesis is not supported.
  • E.g., if the confidence interval is [20, 30] and the null hypothesis value is 35, then we reject the null hypothesis and the alternative hypothesis is supported.
  • If the null value \(p_0\) falls within the range of plausible values given by the confidence interval, we cannot say the null value is implausible. That is, the data do not provide sufficient evidence against the null hypothesis, and we do not reject H0.
  • While we failed to reject H0, that does not necessarily mean the null hypothesis is true. Perhaps there was an actual difference, but we were not able to detect it with the relatively small sample.
  • Double Negatives
    • We might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying it is correct.
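
A minimal sketch of this confidence-interval check in Python, with made-up survey numbers (520 successes out of 1000) and a null value of 0.50; the variable names are our own:

```python
import numpy as np
from scipy import stats

# Hypothetical survey: 520 successes out of 1000 respondents.
n, successes, p0 = 1000, 520, 0.50
p_hat = successes / n

# 95% confidence interval: p_hat +/- z* x SE, with SE based on p_hat.
se = np.sqrt(p_hat * (1 - p_hat) / n)
z_star = stats.norm.ppf(0.975)            # z* for 95% confidence, ~1.96
lower, upper = p_hat - z_star * se, p_hat + z_star * se

# Decision rule: reject H0 only if the null value lies outside the interval.
if lower <= p0 <= upper:
    print(f"p0 = {p0} is inside ({lower:.3f}, {upper:.3f}); fail to reject H0")
else:
    print(f"p0 = {p0} is outside ({lower:.3f}, {upper:.3f}); reject H0")
```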

5.3.3 Decision errors

  • Type 1 error: Rejecting the null hypothesis when it is actually true: null hypothesis is true but we choose the alternative hypothesis.
  • Type 2 error: Failing to reject the null hypothesis when the alternative is actually true: null hypothesis is false but we choose the null hypothesis.
  • Example: Court System: In a US court, the defendant is either innocent (H0) or guilty (HA).
    • Type 1 error: Convicting an innocent person. The null hypothesis is true (the person is innocent), but we choose the alternative hypothesis (the person is guilty).
    • Type 2 error: Letting a guilty person go free. The null hypothesis is false (the person is guilty), but we choose the null hypothesis (the person is innocent).
    • To lower the Type 1 Error rate, we might raise our standard for conviction from “beyond a reasonable doubt” to “beyond a conceivable doubt” so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.
    • To lower the Type 2 Error rate, we want to convict more guilty people. We could lower the standards for conviction from “beyond a reasonable doubt” to “beyond a little doubt”. Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.
  • If we reduce how often we make one type of error, we generally make more of the other type.
  • We do not want to incorrectly reject H0 more than 5% of the time. This corresponds to a significance level of 0.05.

5.3.4 Formal testing using p-values

  • The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative hypothesis.
  • Statistical hypothesis testing typically uses the p-value method rather than making a decision based on confidence intervals.
  • P-value:
    • The probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true.
    • Note: the p-value itself is not the probability of a Type 1 Error; the significance level α, against which we compare the p-value, is the Type 1 Error rate when H0 is true.
  • When evaluating hypotheses for proportions using the p-value method, we will slightly modify how we check the success-failure condition and compute the standard error for the single proportion case.
    • We use the null value (\(p_0\)) to evaluate the success-failure condition and compute the standard error.
    • So we are assuming the null hypothesis when checking the success-failure condition and computing the standard error; thus, we are working with the null distribution.
    • We check if the observed value \(\hat{p}\) is outside of the confidence interval for the null distribution (aka, in the tail of the null distribution).
    • We compute the p-value by:
      • 1- Find the z-score using the formula: \(z = \frac{\hat{p} - p_0}{SE}\).
      • 2- Find area under the curve using the z-score in one tail of the normal distribution.
      • 3- Multiply the area by 2 (for both tails) to get the p-value.
    • The p-value represents the probability of observing such an extreme sample proportion by chance, if the null hypothesis were true (see the code sketch after this list).
  • When we evaluate a hypothesis test:
    • We compare the p-value to the significance level, which is usually α = 0.05.
    • If p-value \(\lt\) α, we reject the null hypothesis: the data provide convincing evidence for the alternative hypothesis.
    • If p-value \(\gt\) α, we fail to reject the null hypothesis: the data do not provide convincing evidence against the null hypothesis (which is not the same as accepting it as true).
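
A minimal sketch of the two-sided p-value computation described above, assuming the normal approximation holds; the helper name and the example numbers are our own:

```python
from scipy import stats

def two_sided_p_value(p_hat, p0, n):
    """Two-sided p-value for H0: p = p0 under the normal approximation."""
    se = (p0 * (1 - p0) / n) ** 0.5     # SE uses the null value p0
    z = (p_hat - p0) / se               # z-score in the null distribution
    return 2 * stats.norm.sf(abs(z))    # one tail area, doubled

# Hypothetical example: p_hat = 0.53, p0 = 0.50, n = 800 gives ~0.09,
# so at alpha = 0.05 we would fail to reject H0.
print(two_sided_p_value(0.53, 0.50, 800))
```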

How to solve a hypothesis testing problem for a proportion?

  • Example: A simple random sample of 1028 US adults in March 2013 show that 56% support nuclear arms reduction. Does this provide convincing evidence that a majority of Americans supported nuclear arms reduction at the 5% significance level?
    • The null hypothesis is that 50% of Americans support nuclear arms reduction (H0: p = 0.50); the alternative is that more than 50% do. The steps below follow the two-sided procedure from 5.3.4 and double the tail area, which gives a conservative p-value; a strictly one-sided test (Section 5.3.7) would use the single tail area.
    • Verify that the CLT applies under the null hypothesis: the observations are independent (simple random sample) and the success-failure condition holds using \(p_0\) (here \(np_0 = n(1-p_0) = 514 \ge 10\)).
    • Compute the standard error using the null value: \(SE = \sqrt{\frac{p_0(1-p_0)}{n}} \approx 0.016\).
    • Find the z-score for the observed value in the null distribution using the formula: \(z = \frac{\hat{p} - p_0}{SE}\).
    • Find the area under the curve from the z-score to the right tail of the normal distribution.
    • Double the area to get the p-value.
    • Compare the p-value (area under the curve) to the significance level (α = 0.05).
    • If p-value \(\gt\) α, we fail to reject the null hypothesis: the data do not provide convincing evidence for the alternative (this does not mean we accept H0 as true).
    • If p-value \(\lt\) α, we reject the null hypothesis: the data provide convincing evidence for the alternative hypothesis.
  • In summary, the steps are:
    • Prepare. Identify the parameter of interest, list the hypotheses, identify the significance level, and identify \(\hat{p}\) and n.
    • Check. Verify conditions to ensure \(\hat{p}\) is nearly normal under H0. For one-proportion hypothesis tests, use the null value to check the success-failure condition.
    • Calculate. If the conditions hold, compute the standard error (again using \(p_0\)), compute the z-score, and identify the p-value.
    • Conclude. Evaluate the hypothesis test by comparing the p-value to α, and provide a conclusion in the context of the problem.
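
Putting the steps together for the nuclear arms example (n = 1028, \(\hat{p}\) = 0.56, \(p_0\) = 0.50), a sketch in Python:

```python
from scipy import stats

# Prepare: parameter p, H0: p = 0.50, alpha = 0.05, p_hat = 0.56, n = 1028.
n, p_hat, p0, alpha = 1028, 0.56, 0.50, 0.05

# Check: success-failure condition using the null value p0.
assert n * p0 >= 10 and n * (1 - p0) >= 10

# Calculate: SE under H0, z-score, and p-value.
se = (p0 * (1 - p0) / n) ** 0.5       # ~0.0156
z = (p_hat - p0) / se                 # ~3.85
p_value = 2 * stats.norm.sf(abs(z))   # two-sided, ~0.0001
# A strictly one-sided test (HA: p > 0.50) would use stats.norm.sf(z),
# half this value; the conclusion is the same either way.

# Conclude: compare the p-value to alpha.
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```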

5.3.5 Choosing a significance level

  • Usually, we use a significance level of 0.05.
  • If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g. 0.01).
    • Under this scenario we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence favoring HA before we would reject H0.
    • Lowering the significance level (α) raises the bar for rejecting the null hypothesis, so we fail to reject the null hypothesis more often.
  • If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we might choose a higher significance level (e.g. 0.10).
    • Here we want to be cautious about failing to reject H0 when the alternative hypothesis is actually true.
    • Collecting extra data (aka, raising the sample size) can help reduce the Type 2 Error rate without changing the significance level or the Type 1 Error rate.
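
To illustrate the last point, here is a rough sketch that approximates the Type 2 Error rate of a two-sided one-proportion test at a few sample sizes; the true proportion of 0.55 is an assumed value for illustration, and the helper name is our own:

```python
from scipy import stats

def power(p_true, p0, n, alpha=0.05):
    """Approximate power of a two-sided one-proportion z-test.

    The rejection region is set by the null distribution (SE uses p0);
    power is the chance p_hat lands in that region when p = p_true.
    """
    se0 = (p0 * (1 - p0) / n) ** 0.5            # SE under H0
    se1 = (p_true * (1 - p_true) / n) ** 0.5    # SE under the truth
    z_star = stats.norm.ppf(1 - alpha / 2)
    hi, lo = p0 + z_star * se0, p0 - z_star * se0
    return stats.norm.sf(hi, p_true, se1) + stats.norm.cdf(lo, p_true, se1)

# Larger samples cut the Type 2 Error rate (1 - power) at fixed alpha.
for n in (100, 400, 1600):
    print(n, f"Type 2 Error rate: {1 - power(0.55, 0.50, n):.2f}")
```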

5.3.6 Statistical significance versus practical significance

  • When the sample size becomes larger, point estimates become more precise, and any real difference between the true parameter and the null value becomes easier to detect. Even a very small difference would likely be detected if we took a large enough sample.
  • Sometimes researchers will take such large samples that even the slightest difference is detected, even differences where there is no practical value.
  • In such cases, we still say the difference is statistically significant, but it is not practically significant.
  • For example, an online experiment might identify that placing additional ads on a movie review website statistically significantly increases viewership of a TV show by 0.001%, but this increase might not have any practical value.

5.3.7 One-sided hypothesis tests

  • So far we’ve only considered what are called two-sided hypothesis tests, where we care about detecting whether p is either above or below some null value p0.
  • One-sided hypothesis test:
    • We only care about one direction. The alternative hypothesis will take the shape of either \(p > p_0\) or \(p < p_0\).
    • We compute the p-value for the one-sided test in the direction specified by the alternative hypothesis, thus we do not double the area under the curve.
    • If we don’t have to double the tail area to get the p-value, then the p-value is smaller and the level of evidence required to identify an interesting finding in the direction of the alternative hypothesis goes down.
    • The problem is that any interesting findings in the opposite direction must be disregarded.
    • It creates a risk of overlooking data supporting the opposite conclusion.
  • When might a one-sided test be appropriate to use? Very rarely.
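
For completeness, a sketch of the one-sided p-value, where the tail area is not doubled; the `direction` parameter is our own convention:

```python
from scipy import stats

def one_sided_p_value(p_hat, p0, n, direction=">"):
    """One-sided p-value for H0: p = p0; the tail area is not doubled.

    direction ">" tests HA: p > p0, "<" tests HA: p < p0.
    """
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / se
    return stats.norm.sf(z) if direction == ">" else stats.norm.cdf(z)

# Half the two-sided p-value from the earlier example (same data).
print(one_sided_p_value(0.56, 0.50, 1028, ">"))
```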

6.1 Inference for a Single Proportion 1

6.1.1 Identifying when the sample proportion is nearly normal

  • The sampling distribution for \(\hat{p}\) based on a sample of size n from a population with a true proportion p is nearly normal when:
    • 1- The sample’s observations are independent, e.g. are from a simple random sample.
    • 2- We expect to see at least 10 successes and 10 failures in the sample, i.e. np ≥ 10 and n(1 - p) ≥ 10. This is called the success-failure condition.
    • When these conditions are met, then the sampling distribution of \(\hat{p}\) is nearly normal with mean p and standard error \(SE=\sqrt{\frac{p(1-p)}{n}}\).
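
A small sketch of checking the success-failure condition and computing this standard error (independence has to be judged from the study design, not from code; the function name is our own):

```python
import numpy as np

def check_and_se(p, n):
    """Check the success-failure condition and return the SE of p_hat."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("success-failure condition not met")
    return np.sqrt(p * (1 - p) / n)

print(check_and_se(0.5, 1028))   # ~0.0156
```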

6.1.2 Confidence intervals for a proportion

  • A confidence interval provides a range of plausible values for the parameter p, and when \(\hat{p}\) can be modeled using a normal distribution, the confidence interval for p takes the form \(\hat{p} \pm z^*SE\).
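
A sketch of this interval for an arbitrary confidence level, where \(z^*\) comes from the normal quantile function; the numbers reuse the earlier example and the helper name is our own:

```python
from scipy import stats

def proportion_ci(p_hat, n, confidence=0.95):
    # z* is the normal quantile leaving (1 - confidence)/2 in each tail.
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95%
    se = (p_hat * (1 - p_hat) / n) ** 0.5               # SE uses p_hat here
    return p_hat - z_star * se, p_hat + z_star * se

print(proportion_ci(0.56, 1028))   # roughly (0.530, 0.590)
```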

6.1.3 Hypothesis testing for a proportion

  • For the hypothesis test, we find the area under the curve in the tail of the null distribution and double it to get the p-value.
  • If p-value \(\gt\) \(\alpha\) (usually 0.05), we fail to reject the null hypothesis.
  • If p-value \(\lt\) \(\alpha\), we reject the null hypothesis in favor of the alternative.

6.1.4 When one or more conditions aren’t met

  • When the success-failure condition isn’t met for a hypothesis test, we can simulate the null distribution of \(\hat{p}\) using the null value, \(p_0\).
  • For a confidence interval when the success-failure condition isn’t met, we can use what’s called the Clopper-Pearson interval.
  • The independence condition is a more nuanced requirement. When it isn’t met, it is important to understand how and why it isn’t met.
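
A sketch of the simulation approach with made-up small-sample numbers: draw many samples of size n assuming H0 is true, and use the simulated sample proportions as the null distribution:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical small study: 3 successes in 15 trials, null value p0 = 0.10.
# n*p0 = 1.5 < 10, so the normal approximation is off; simulate instead.
n, successes, p0 = 15, 3, 0.10
p_hat = successes / n

# Simulate the null distribution of p_hat.
sims = rng.binomial(n, p0, size=100_000) / n

# One-sided p-value: fraction of simulations at least as extreme as p_hat.
p_value = np.mean(sims >= p_hat)
print(f"simulated p-value: {p_value:.3f}")   # ~0.18
```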

6.1.5 Choosing a sample size when estimating a proportion

  • Choose a sample size large enough that the margin of error, the part we add and subtract from the point estimate in a confidence interval (\(z^*SE\)), is sufficiently small that the sample is useful.
    • The margin of error is at its largest when p = 0.5, so we use this value to calculate the worst-case value for n.
  • Example: A university newspaper is conducting a survey to determine what fraction of students support a $200 per year increase in fees to pay for a new football stadium. How big of a sample is required to ensure the margin of error is smaller than 0.04 using a 95% confidence level?
    • The margin of error is \(z^*SE = z^*\sqrt{\frac{p(1-p)}{n}}\).
    • Let’s solve the equation \(z^*\sqrt{\frac{p(1-p)}{n}} < 0.04\).
    • We have n and p as unknowns, so we need to make an educated guess for p.
    • We choose previous studies to estimate p, and if we can’t find any, we use p = 0.5.
    • Since the margin of error is largest when p = 0.5, solving \(1.96\sqrt{\frac{0.5(1-0.5)}{n}} < 0.04\) gives \(n > \left(\frac{1.96}{0.04}\right)^2 \times 0.25 \approx 600.25\), so the answer is n = 601.
    • This means that we need to survey at least 601 students to ensure the margin of error is smaller than 0.04 using a 95% confidence level.
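
The same calculation as a small helper (the function name is ours); it reproduces the n = 601 result:

```python
import math
from scipy import stats

def min_sample_size(margin, confidence=0.95, p=0.5):
    """Smallest n with z* * sqrt(p(1-p)/n) < margin (worst case p = 0.5)."""
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.floor((z_star / margin) ** 2 * p * (1 - p)) + 1

print(min_sample_size(0.04))   # 601, matching the example
```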

6.2 Difference of Two Proportions 1

6.2.1 Sampling distribution of the difference of two proportions

  • Conditions for the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) to be normal:
    • 1- Independence, extended. The data are independent within and between the two groups. Generally this is satisfied if the data come from two independent random samples or if the data come from a randomized experiment.
    • 2- Success-failure condition. The success-failure condition holds for both groups, where we check successes and failures in each group separately.
  • When these conditions are satisfied, the standard error of \(\hat{p}_1 - \hat{p}_2\) is
\[ SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

6.2.2 Confidence intervals for p1 − p2

  • We can apply the generic confidence interval formula for a difference of two proportions, where we use \(\hat{p}_1 - \hat{p}_2\) as the point estimate and substitute the SE formula:
\[ (\hat{p}_1 - \hat{p}_2) \pm z^*SE \]
  • Example: Interpreting a confidence interval of (-0.027, 0.287) for the effect of blood thinners on survival in patients treated with CPR (p. 218):
    • We are 90% confident that blood thinners have an impact of -2.7 to +28.7 percentage points on the survival rate of patients who are like those in the study.
    • Because 0% is contained in the interval, we do not have enough information to say whether blood thinners help or harm heart attack patients who have been admitted after they have undergone CPR.
  • Example: Interpreting a confidence interval of (-0.0071, -0.0015) in the fish oil vs. placebo experiment (p. 218):
    • We are 95% confident that fish oil decreases heart attacks by 0.15 to 0.71 percentage points (off of a baseline of about 1.55%) over a 5-year period for subjects who are similar to those in the study.
    • Because the interval is entirely below 0, the data provide strong evidence that fish oil supplements reduce heart attacks in patients like those in the study.
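
A sketch that recreates the CPR interval quoted above, assuming counts of 14/40 survivors with blood thinners versus 11/50 without (these counts are inferred, since they reproduce the (-0.027, 0.287) interval; the helper name is ours):

```python
import numpy as np
from scipy import stats

def diff_ci(x1, n1, x2, n2, confidence=0.90):
    """Normal-approximation CI for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
    d = p1 - p2
    return d - z_star * se, d + z_star * se

print(diff_ci(14, 40, 11, 50))   # roughly (-0.027, 0.287)
```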

6.2.3 Hypothesis tests for the difference of two proportions

  • Pooled proportion:
    • When the null hypothesis is that \(p_1 = p_2\), we use the pooled proportion to check the success-failure condition and to compute the standard error.
    • \(\hat{p}_{pooled} = \frac{\text{number of successes}}{\text{number of cases}} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}\), where \(n_1\) and \(n_2\) are the number of observations in each group and \(\hat{p}_1 n_1\), \(\hat{p}_2 n_2\) are the numbers of successes.
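
A sketch of the pooled two-proportion z-test with made-up counts (the function name is ours):

```python
import numpy as np
from scipy import stats

def two_prop_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2, using the pooled proportion."""
    p_pool = (x1 + x2) / (n1 + n2)          # pooled proportion
    # Success-failure check in each group uses the pooled proportion.
    for n in (n1, n2):
        assert n * p_pool >= 10 and n * (1 - p_pool) >= 10
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical counts: 120/400 vs 90/400 successes.
print(two_prop_test(120, 400, 90, 400))   # z ~ 2.41, p-value ~ 0.016
```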

6.2.4 More on 2-proportion hypothesis tests

  • When we conduct a 2-proportion hypothesis test, usually H0 is \(p_1 - p_2 = 0\). However, there are rare situations where we want to check for some difference in \(p_1\) and \(p_2\) that is some value other than 0. For example, maybe we care about checking a null hypothesis where \(p_1 - p_2 = 0.1\). In contexts like these, we generally use \(\hat{p}_1\) and \(\hat{p}_2\) to check the success-failure condition and compute the standard error.

6.2.5 Examining the standard error formula

\[ \begin{align*} SE_{\hat{p}_1 - \hat{p}_2} &= \sqrt{SE_{\hat{p}_1}^2 + SE_{\hat{p}_2}^2} \\ &= \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \end{align*} \]

Hypothesis Testing - Solving Problems with Proportions 2

Hypothesis Testing with Two Proportions 3

Inference for a Single Proportion 4

References


  1. Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). OpenIntro Statistics (4th ed.). Open Textbook Library. https://www.biostat.jhsph.edu/~iruczins/teaching/books/2019.openintro.statistics.pdf. Read Chapter 5 (Foundations for Inference), Section 5.3, Hypothesis testing for a proportion, pages 189-205; and Chapter 6 (Inference for categorical data), Section 6.1, Inference for a single proportion, pages 208-217, and Section 6.2, Difference of two proportions, pages 217-228.

  2. The Organic Chemistry Tutor. (2019a, October 28). Hypothesis testing - solving problems with proportions [Video]. YouTube. https://youtu.be/76VruarGn2Q 

  3. The Organic Chemistry Tutor. (2019b, November 15). Hypothesis testing with two proportions [Video]. YouTube. https://youtu.be/pCbNUnZ98oE 

  4. Introduction to Statistics at SLCC. (2021, June 18). Chapter 3.5 - lesson ⅓ - inference for a single proportion [Video]. YouTube. https://youtu.be/VB8kttv9hoY