IB Mathematics AI – Topic 4
Statistics & Probability: Hypothesis Testing
Introduction to Hypothesis Testing
Null and Alternative Hypotheses
Definition: Hypothesis testing is a statistical method used to make decisions about population parameters based on sample data. It involves testing a claim by comparing observed data against what we would expect under a specific assumption.
The Two Hypotheses:
1. Null Hypothesis (\(H_0\)):
- The "status quo" or default assumption
- Assumes no effect, no difference, or no relationship
- What we assume to be true until evidence proves otherwise
- Always stated with equality (in IB exams, \(H_0\) is written with \(=\), e.g. \(H_0: \mu = \mu_0\))
2. Alternative Hypothesis (\(H_1\) or \(H_a\)):
- The claim we're testing or what we suspect is true
- Suggests there IS an effect, difference, or relationship
- What we accept if we reject \(H_0\)
- Contains \(\neq\), \(<\), or \(>\)
Types of Tests:
- Two-tailed test: \(H_1: \mu \neq \mu_0\) (testing if parameter is different)
- One-tailed test (right): \(H_1: \mu > \mu_0\) (testing if parameter is greater)
- One-tailed test (left): \(H_1: \mu < \mu_0\) (testing if parameter is less)
Significance Level (\(\alpha\))
Definition: The significance level is the probability of rejecting the null hypothesis when it is actually true (Type I error).
Common Significance Levels:
- \(\alpha = 0.05\) (5%) – most common in IB exams
- \(\alpha = 0.01\) (1%) – more stringent
- \(\alpha = 0.10\) (10%) – less stringent
Interpretation:
At 5% significance level, we're willing to accept a 5% chance of incorrectly rejecting a true null hypothesis.
p-value
Definition: The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
Decision Rule:
- If p-value \(< \alpha\): Reject \(H_0\) (evidence against null hypothesis)
- If p-value \(\geq \alpha\): Do not reject \(H_0\) (insufficient evidence)
Interpretation:
- Small p-value (< 0.05): Strong evidence against \(H_0\)
- Large p-value (≥ 0.05): Weak or no evidence against \(H_0\)
- The smaller the p-value, the stronger the evidence against \(H_0\)
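The decision rule above can be sketched as a tiny function (a minimal illustration; the function name `decide` is our own, not from any library):

```python
# Minimal sketch of the p-value decision rule.
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"
```

Note that the boundary case \(p = \alpha\) falls on the "do not reject" side, matching the rule above.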
⚠️ Common Pitfalls & Tips:
- "Reject \(H_0\)" does NOT mean "\(H_0\) is false" – only that there's sufficient evidence against it
- "Do not reject \(H_0\)" does NOT mean "\(H_0\) is true" – only insufficient evidence against it
- Always compare p-value with \(\alpha\), not with 0.5 or any other value
- State hypotheses BEFORE conducting the test
- Write conclusions in context of the problem
Chi-squared Test for Independence (\(\chi^2\))
Purpose & Conditions
Purpose: Tests whether two categorical variables are independent or associated. Used with contingency tables (two-way tables).
Hypotheses:
- \(H_0\): The two variables are independent
- \(H_1\): The two variables are not independent (they are associated)
Test Statistic Formula:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
where O = observed frequency, E = expected frequency
Expected Frequency Formula:
\[ E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}} \]
Degrees of Freedom:
\[ df = (r - 1)(c - 1) \]
where r = number of rows, c = number of columns
Conditions:
- Data must be counts/frequencies (not percentages or proportions)
- All expected frequencies should be at least 5
- Observations must be independent
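The three formulas above (expected frequencies, the \(\chi^2\) statistic, and degrees of freedom) can be implemented directly in a few lines of pure Python; this is only a sketch for understanding, since in the exam the GDC does this work:

```python
# Sketch: chi-squared test quantities for a contingency table,
# implemented straight from the formulas above (pure Python).
def expected_table(observed):
    """E = (row total)(column total) / grand total, for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """chi^2 = sum of (O - E)^2 / E over all cells."""
    expected = expected_table(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

def degrees_of_freedom(observed):
    """df = (rows - 1)(columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)
```

For the table in Worked Example 1, `expected_table([[35, 45, 10], [45, 35, 30]])` gives the male row as `[36.0, 36.0, 18.0]`, matching the hand calculation shown there.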
⚠️ Common Pitfalls & Tips:
- Use your GDC's \(\chi^2\)-test function – manual calculation is tedious
- Always use OBSERVED values from the table, not expected
- Expected values should be calculated by GDC or using formula above
- GDC will give you both \(\chi^2\) statistic and p-value
- Remember: df = (rows - 1) × (columns - 1), NOT rows × columns
📝 Worked Example 1: Chi-squared Test for Independence
Question: A researcher wants to determine if there is an association between gender and preferred type of exercise. A survey of 200 people produced the following results:
| | Cardio | Strength | Yoga | Total |
|---|---|---|---|---|
| Male | 35 | 45 | 10 | 90 |
| Female | 45 | 35 | 30 | 110 |
| Total | 80 | 80 | 40 | 200 |
Test at the 5% significance level whether gender and exercise preference are independent.
(a) Write the null and alternative hypotheses.
(b) State the degrees of freedom.
(c) Calculate the \(\chi^2\) statistic and p-value.
(d) State your conclusion.
Solution:
(a) Hypotheses:
\(H_0\): Gender and exercise preference are independent
\(H_1\): Gender and exercise preference are not independent (they are associated)
(b) Degrees of Freedom:
\[ df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 1 \times 2 = 2 \]
(c) Chi-squared Statistic:
Using GDC:
1. Enter observed values into matrix [A]
2. STAT → TESTS → \(\chi^2\)-Test
3. Input: Observed [A], Expected [B]
4. Calculate
Manual calculation for understanding:
Expected frequency for Male-Cardio:
\[ E = \frac{90 \times 80}{200} = \frac{7200}{200} = 36 \]
Expected frequency for Male-Strength:
\[ E = \frac{90 \times 80}{200} = 36 \]
Expected frequency for Male-Yoga:
\[ E = \frac{90 \times 40}{200} = 18 \]
Expected frequencies for Female:
\[ E_{\text{Cardio}} = \frac{110 \times 80}{200} = 44, \quad E_{\text{Strength}} = \frac{110 \times 80}{200} = 44, \quad E_{\text{Yoga}} = \frac{110 \times 40}{200} = 22 \]
GDC Output:
\(\chi^2 = 10.6\) (3 s.f.)
p-value = 0.00498 (3 s.f.)
(d) Conclusion:
Since p-value (0.00498) < \(\alpha\) (0.05), we reject \(H_0\).
Conclusion: At the 5% significance level, there is sufficient evidence to suggest that gender and exercise preference are not independent. In other words, there is an association between gender and type of exercise preferred.
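As a cross-check (not required in the exam), scipy reproduces the GDC output. This assumes scipy is installed; `chi2_contingency` takes the observed frequency table and returns the statistic, p-value, degrees of freedom, and expected frequencies in one call:

```python
# Cross-check of Worked Example 1 using scipy (assumes scipy is installed).
from scipy.stats import chi2_contingency

observed = [[35, 45, 10],   # Male:   Cardio, Strength, Yoga
            [45, 35, 30]]   # Female: Cardio, Strength, Yoga

# Returns the chi-squared statistic, p-value, df, and expected-frequency table.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
```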
Chi-squared Goodness of Fit Test
Purpose & Application
Purpose: Tests whether observed frequencies match expected frequencies according to a specific distribution or theoretical model.
Hypotheses:
- \(H_0\): The observed data fits the expected distribution
- \(H_1\): The observed data does not fit the expected distribution
Test Statistic:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
Degrees of Freedom:
\[ df = n - 1 \]
where n = number of categories
Common Applications:
- Testing if a die is fair
- Testing if data follows uniform distribution
- Testing genetic ratios
- Testing if survey responses match expected proportions
⚠️ Common Pitfalls & Tips:
- Expected frequencies must be calculated based on the hypothesized distribution
- For fair die: each outcome has probability 1/6
- For uniform distribution: all categories have equal expected frequencies
- df = (number of categories - 1), NOT total number of observations
📝 Worked Example 2: Goodness of Fit Test
Question: A die is rolled 120 times and the following results are obtained:
| Number | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed | 15 | 18 | 22 | 25 | 17 | 23 |
Test at the 5% significance level whether the die is fair.
Solution:
Step 1: State hypotheses
\(H_0\): The die is fair (each number has probability \(\frac{1}{6}\))
\(H_1\): The die is not fair
Step 2: Calculate expected frequencies
If the die is fair, each number should appear with equal probability:
\[ E = \frac{120}{6} = 20 \text{ times for each number} \]
Step 3: Degrees of freedom
\[ df = n - 1 = 6 - 1 = 5 \]
Step 4: Calculate \(\chi^2\) statistic
Using GDC: Enter observed and expected values, run \(\chi^2\) GOF test
Manual calculation:
\[ \chi^2 = \frac{(15-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(17-20)^2}{20} + \frac{(23-20)^2}{20} \]
\[ \chi^2 = \frac{25}{20} + \frac{4}{20} + \frac{4}{20} + \frac{25}{20} + \frac{9}{20} + \frac{9}{20} \]
\[ \chi^2 = \frac{76}{20} = 3.8 \]
GDC gives: p-value = 0.579
Step 5: Conclusion
Since p-value (0.579) > \(\alpha\) (0.05), we do not reject \(H_0\).
Conclusion: At the 5% significance level, there is insufficient evidence to suggest the die is not fair. The observed frequencies are consistent with a fair die.
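Worked Example 2 can also be cross-checked in scipy (an optional verification, assuming scipy is installed); `chisquare` with no expected frequencies supplied assumes equal expected counts, which is exactly the fair-die hypothesis:

```python
# Cross-check of Worked Example 2 (assumes scipy is installed).
from scipy.stats import chisquare

observed = [15, 18, 22, 25, 17, 23]
# With no `f_exp` argument, chisquare assumes equal expected frequencies,
# i.e. 120/6 = 20 per face -- the fair-die hypothesis.
result = chisquare(observed)
chi2_stat, p_value = result.statistic, result.pvalue
```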
t-Tests for Population Means
One-Sample t-Test
Purpose: Tests whether a population mean differs from a hypothesized value when the population standard deviation is unknown.
Hypotheses:
- \(H_0: \mu = \mu_0\) (population mean equals hypothesized value)
- \(H_1: \mu \neq \mu_0\) (two-tailed) OR \(\mu > \mu_0\) OR \(\mu < \mu_0\) (one-tailed)
Test Statistic:
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
where \(\bar{x}\) = sample mean, \(s\) = sample standard deviation, \(n\) = sample size
Degrees of Freedom:
\[ df = n - 1 \]
Two-Sample t-Test (Pooled)
Purpose: Tests whether the means of two independent populations are equal.
Hypotheses:
- \(H_0: \mu_1 = \mu_2\) (the two population means are equal)
- \(H_1: \mu_1 \neq \mu_2\) (two-tailed) OR \(\mu_1 > \mu_2\) OR \(\mu_1 < \mu_2\) (one-tailed)
When to Use:
- Comparing means from two independent samples
- Population standard deviations unknown
- Assumes populations are normally distributed
- Assumes equal population variances (pooled)
⚠️ Common Pitfalls & Tips:
- t-test is used when population standard deviation is UNKNOWN
- z-test is used when population standard deviation IS KNOWN
- Always use GDC for t-tests – automatic calculation of p-value
- Choose "pooled" option for two-sample t-test in GDC
- State whether test is one-tailed or two-tailed based on \(H_1\)
📝 Worked Example 3: One-Sample t-Test
Question: A coffee shop claims that their average serving temperature is 70°C. A health inspector suspects the temperature is higher. She measures 15 randomly selected servings and finds:
Sample mean: \(\bar{x} = 73.2\)°C
Sample standard deviation: \(s = 4.5\)°C
Test at the 5% significance level whether the average temperature exceeds 70°C.
Solution:
Step 1: State hypotheses
\(H_0: \mu = 70\)°C (average temperature is 70°C)
\(H_1: \mu > 70\)°C (average temperature exceeds 70°C)
This is a one-tailed test (right-tailed)
Step 2: Given information
\(\bar{x} = 73.2\)°C, \(s = 4.5\)°C, \(n = 15\), \(\mu_0 = 70\)°C, \(\alpha = 0.05\)
Step 3: Calculate test statistic
Using GDC: STAT → TESTS → T-Test
Input: \(\mu_0 = 70\), \(\bar{x} = 73.2\), \(s = 4.5\), \(n = 15\)
Alternative: \(\mu > 70\)
Manual calculation:
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{73.2 - 70}{4.5 / \sqrt{15}} = \frac{3.2}{4.5 / 3.873} = \frac{3.2}{1.162} = 2.754 \]
Degrees of freedom: \(df = 15 - 1 = 14\)
GDC Output:
t-statistic = 2.754
p-value = 0.0078
Step 4: Decision
Since p-value (0.0078) < \(\alpha\) (0.05), we reject \(H_0\).
Step 5: Conclusion
At the 5% significance level, there is sufficient evidence to support the inspector's suspicion that the average serving temperature exceeds 70°C.
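Worked Example 3 only provides summary statistics, so rather than a raw-data t-test we can apply the formula directly and read the one-tailed p-value off the t-distribution (a sketch assuming scipy is installed; variable names are our own):

```python
# Sketch: one-sample t-test from summary statistics (Worked Example 3).
from math import sqrt
from scipy.stats import t as t_dist

x_bar, mu0, s, n = 73.2, 70, 4.5, 15

t_stat = (x_bar - mu0) / (s / sqrt(n))  # t = (x̄ - μ0) / (s / √n)
df = n - 1
p_value = t_dist.sf(t_stat, df)         # right-tailed: P(T > t_stat)
```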
Critical Values & Critical Regions
Definitions & Concepts
Critical Value:
The boundary value(s) that separate the rejection region from the non-rejection region. If the test statistic falls beyond the critical value (into the rejection region), we reject \(H_0\).
Critical Region (Rejection Region):
The set of values for the test statistic that leads to rejection of \(H_0\). The area in the tail(s) of the distribution.
Decision Rules:
- For \(\chi^2\): If \(\chi^2_{\text{calc}} > \chi^2_{\text{crit}}\), reject \(H_0\)
- For t-test: If \(|t_{\text{calc}}| > t_{\text{crit}}\) (two-tailed), reject \(H_0\)
- p-value method: If p-value < \(\alpha\), reject \(H_0\)
Finding Critical Values:
- For \(\chi^2\): Use a \(\chi^2\) table or your GDC's inverse-\(\chi^2\) function (where available)
- For t: Use t-table or invT function on GDC
- IB exams often provide critical values in questions
⚠️ Common Pitfalls & Tips:
- p-value method is generally easier than critical value method
- Both methods give same conclusion
- For two-tailed tests, critical region is in BOTH tails
- \(\chi^2\) tests are ALWAYS right-tailed (only upper critical value)
- If critical value is given in exam, use it; otherwise use p-value
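Critical values come from the inverse CDF, which is what the GDC's invT (and inverse-\(\chi^2\), where available) functions compute. A sketch assuming scipy is installed, using the degrees of freedom from Worked Examples 1 and 3:

```python
# Sketch: finding critical values via the inverse CDF (ppf in scipy).
from scipy.stats import chi2, t

alpha = 0.05

# Chi-squared tests are always right-tailed: all of alpha in the upper tail.
chi2_crit = chi2.ppf(1 - alpha, df=2)

# Two-tailed t-test: alpha/2 in each tail.
t_crit_two = t.ppf(1 - alpha / 2, df=14)
```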
Type I and Type II Errors
Understanding the Errors
Type I Error (\(\alpha\)):
- Definition: Rejecting \(H_0\) when it is actually TRUE
- Probability: \(P(\text{Type I Error}) = \alpha\) (significance level)
- Also called: "False Positive"
- Example: Concluding a drug is effective when it actually isn't
Type II Error (\(\beta\)):
- Definition: Failing to reject \(H_0\) when it is actually FALSE
- Probability: \(P(\text{Type II Error}) = \beta\)
- Also called: "False Negative"
- Example: Concluding a drug is not effective when it actually is
Relationship Summary Table:
| | \(H_0\) is TRUE | \(H_0\) is FALSE |
|---|---|---|
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Correct Decision |
| Do Not Reject \(H_0\) | Correct Decision | Type II Error (\(\beta\)) |
Trade-off:
- Decreasing \(\alpha\) (making test more stringent) increases \(\beta\)
- Decreasing \(\beta\) increases \(\alpha\)
- To decrease both: increase sample size
⚠️ Common Pitfalls & Tips:
- Type I error can ONLY occur when \(H_0\) is true
- Type II error can ONLY occur when \(H_0\) is false
- Significance level \(\alpha\) = P(Type I Error)
- We can control Type I error by choosing \(\alpha\)
- Type II error probability depends on true parameter value and sample size
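The claim P(Type I error) = \(\alpha\) can be checked by simulation: run many t-tests on samples drawn from a population where \(H_0\) really is true and count how often the test rejects. A sketch assuming numpy and scipy are installed; the population parameters are borrowed from Worked Example 3 purely for illustration:

```python
# Simulation sketch: under a true H0, a 5%-level test rejects about 5% of the time.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha, trials = 0.05, 10_000

# Draw `trials` samples of size 15 from a population where H0 is TRUE (mu = 70).
samples = rng.normal(loc=70, scale=4.5, size=(trials, 15))

# Run a two-tailed one-sample t-test on every row at once.
p_values = ttest_1samp(samples, popmean=70, axis=1).pvalue

# Fraction of (wrongly) rejected tests -- should be close to alpha.
type1_rate = float(np.mean(p_values < alpha))
```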
📝 Worked Example 4: Interpreting Errors in Context
Question: A pharmaceutical company tests a new drug and uses hypothesis testing with:
\(H_0\): The drug is not effective
\(H_1\): The drug is effective
(a) Describe a Type I error in this context.
(b) Describe a Type II error in this context.
(c) If the significance level is 1%, what is the probability of a Type I error?
(d) Which error would be more serious? Explain.
Solution:
(a) Type I Error:
A Type I error occurs when we reject \(H_0\) when it is true.
In context: Concluding the drug is effective when it actually is not effective. The company would market an ineffective drug.
(b) Type II Error:
A Type II error occurs when we fail to reject \(H_0\) when it is false.
In context: Concluding the drug is not effective when it actually is effective. The company would abandon a potentially life-saving drug.
(c) Probability of Type I Error:
P(Type I Error) = \(\alpha\) = 0.01 or 1%
There is a 1% chance of concluding the drug is effective when it actually isn't.
(d) More Serious Error:
Arguments could be made for both:
Type I more serious: Marketing an ineffective drug could harm patients who rely on it and waste healthcare resources. There could be legal and ethical consequences.
Type II more serious: Rejecting an effective drug means patients who could benefit won't have access to treatment. In life-threatening conditions, this could cost lives.
Typical answer: In pharmaceutical testing, Type I error is generally considered more serious because releasing an ineffective (or potentially harmful) drug to the public has greater immediate consequences than delaying an effective drug pending further testing.
Hypothesis Testing for Population Proportions
One-Sample z-Test for Proportion
Purpose: Tests a claim about a population proportion when sample size is large.
Hypotheses:
- \(H_0: p = p_0\) (population proportion equals hypothesized value)
- \(H_1: p \neq p_0\) (two-tailed) OR \(p > p_0\) OR \(p < p_0\) (one-tailed)
Test Statistic:
\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \]
where \(\hat{p}\) = sample proportion = \(\frac{x}{n}\), \(x\) = number of successes, \(n\) = sample size
Conditions:
- Sample is random
- \(np_0 \geq 5\) and \(n(1-p_0) \geq 5\) (large sample)
- Population size at least 10 times sample size
Alternative Approach (Binomial):
Can model as \(X \sim B(n, p_0)\) under \(H_0\) and find p-value directly
⚠️ Common Pitfalls & Tips:
- Use \(p_0\) (hypothesized value) in denominator, not \(\hat{p}\)
- For small samples, use exact binomial test instead of z-test
- Sample proportion \(\hat{p}\) must be between 0 and 1
- GDC can perform 1-PropZTest automatically
📝 Worked Example 5: Testing Population Proportion
Question: A company claims that 40% of customers prefer their new product. A survey of 200 randomly selected customers finds that 95 prefer the new product.
Test at the 5% significance level whether the true proportion is different from 40%.
Solution:
Step 1: State hypotheses
\(H_0: p = 0.40\) (40% of customers prefer the new product)
\(H_1: p \neq 0.40\) (proportion is different from 40%)
This is a two-tailed test
Step 2: Check conditions
\(np_0 = 200(0.40) = 80 \geq 5\) ✓
\(n(1-p_0) = 200(0.60) = 120 \geq 5\) ✓
Conditions met for z-test
Step 3: Calculate sample proportion
\[ \hat{p} = \frac{x}{n} = \frac{95}{200} = 0.475 \]
Step 4: Calculate test statistic
Using GDC: STAT → TESTS → 1-PropZTest
Input: \(p_0 = 0.40\), \(x = 95\), \(n = 200\)
Alternative: \(p \neq 0.40\)
Manual calculation:
\[ z = \frac{0.475 - 0.40}{\sqrt{\frac{0.40(0.60)}{200}}} = \frac{0.075}{\sqrt{0.0012}} = \frac{0.075}{0.03464} = 2.165 \]
GDC Output:
z = 2.165
p-value = 0.0304
Step 5: Decision
Since p-value (0.0304) < \(\alpha\) (0.05), we reject \(H_0\).
Step 6: Conclusion
At the 5% significance level, there is sufficient evidence to conclude that the true proportion of customers who prefer the new product is different from 40%. The sample data suggests the proportion is closer to 47.5%.
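Worked Example 5 can be reproduced from the formula, and the exact binomial alternative mentioned earlier can be run alongside it for comparison (a sketch assuming scipy is installed; `binomtest` requires scipy 1.7 or later):

```python
# Sketch: z-test for a proportion (Worked Example 5), plus the exact
# binomial alternative for comparison (assumes scipy >= 1.7).
from math import sqrt
from scipy.stats import norm, binomtest

x, n, p0 = 95, 200, 0.40
p_hat = x / n

# Note: p0 (the hypothesized value) goes in the denominator, not p_hat.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))  # two-tailed

# Exact alternative: model X ~ B(200, 0.4) under H0.
exact_p = binomtest(x, n, p0, alternative='two-sided').pvalue
```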
📊 Quick Reference Summary
Basic Process
- State \(H_0\) and \(H_1\)
- Choose significance level \(\alpha\)
- Calculate test statistic
- Find p-value
- Compare p-value with \(\alpha\)
- State conclusion in context
Decision Rule
- p-value < \(\alpha\): Reject \(H_0\)
- p-value ≥ \(\alpha\): Do not reject \(H_0\)
- Typical \(\alpha\) = 0.05 (5%)
Errors
- Type I: Reject true \(H_0\)
- P(Type I) = \(\alpha\)
- Type II: Fail to reject false \(H_0\)
- P(Type II) = \(\beta\)
Test Selection
- \(\chi^2\): Independence/Fit
- t-test: Means (σ unknown)
- z-test: Proportions/Means (σ known)
🎯 Which Test Should I Use?
| Situation | Test to Use | Key Information |
|---|---|---|
| Testing if two categorical variables are related | \(\chi^2\) Independence | Use contingency table, df = (r-1)(c-1) |
| Testing if data fits expected distribution | \(\chi^2\) Goodness of Fit | Compare O vs E frequencies, df = n-1 |
| Testing one population mean (σ unknown) | One-sample t-test | Use sample mean and s, df = n-1 |
| Comparing two population means | Two-sample t-test | Use pooled option on GDC |
| Testing population proportion | z-test for proportion | Or binomial test for small n |
🖩 Essential GDC Functions for Hypothesis Testing
- \(\chi^2\)-Test: STAT → TESTS → \(\chi^2\)-Test (for independence or GOF)
- One-sample t-test: STAT → TESTS → T-Test (use summary stats or data)
- Two-sample t-test: STAT → TESTS → 2-SampTTest (select pooled option)
- One-proportion z-test: STAT → TESTS → 1-PropZTest
- Finding p-values: Automatically calculated by all test functions
- Remember: Always write "Using GDC" in exam solutions
✍️ IB Exam Strategy for Hypothesis Testing
- Always write hypotheses first – define what \(H_0\) and \(H_1\) represent in context
- State significance level – usually 5% unless specified otherwise
- Show GDC work: Write "Using GDC:" and state which test you used
- Report both statistic and p-value when asked
- Compare p-value with \(\alpha\) explicitly
- State conclusion in two parts:
- Statistical: "We reject/do not reject \(H_0\)"
- Contextual: What this means for the problem
- For Type I/II errors: Always describe in context of the specific problem
- Check assumptions: Mention when conditions are met (e.g., large sample)
- Be precise with language:
- "Sufficient evidence" not "proof"
- "Do not reject" not "accept"
- "Suggests" not "proves"
🚫 Top Mistakes to Avoid
- Confusing hypotheses: \(H_0\) is status quo (equals), \(H_1\) is what we suspect
- Wrong comparison: Compare p-value with \(\alpha\), not with 0.5
- Incorrect language: Never say "accept \(H_0\)" – say "do not reject \(H_0\)"
- Missing context: Always interpret results in terms of the problem
- Switching errors: Type I = reject a true \(H_0\); Type II = fail to reject a false \(H_0\)
- Wrong degrees of freedom: Check formula for each test type
- One-tailed vs two-tailed: Match alternative hypothesis to problem statement
- Forgetting conditions: Check sample size requirements for each test
- Claiming causation: Tests show association, not causation
- Misinterpreting p-value: It's NOT the probability that \(H_0\) is true
🔄 Hypothesis Testing Flowchart
START: What are you testing?
↓ Relationship between two categorical variables?
→ YES: Use \(\chi^2\) Test for Independence
↓ Does data fit expected distribution?
→ YES: Use \(\chi^2\) Goodness of Fit Test
↓ Testing population mean(s)?
→ One sample, σ unknown: Use One-sample t-test
→ Two samples, compare means: Use Two-sample t-test
↓ Testing population proportion?
→ YES: Use z-test for proportion or Binomial test