IB Mathematics AI – Topic 4

Statistics & Probability: Hypothesis Testing

Introduction to Hypothesis Testing

Null and Alternative Hypotheses

Definition: Hypothesis testing is a statistical method used to make decisions about population parameters based on sample data. It involves testing a claim by comparing observed data against what we would expect under a specific assumption.

The Two Hypotheses:

1. Null Hypothesis (\(H_0\)):

  • The "status quo" or default assumption
  • Assumes no effect, no difference, or no relationship
  • What we assume to be true until evidence proves otherwise
  • Always contains an equals sign (\(=\), \(\leq\), or \(\geq\))

2. Alternative Hypothesis (\(H_1\) or \(H_a\)):

  • The claim we're testing or what we suspect is true
  • Suggests there IS an effect, difference, or relationship
  • What we accept if we reject \(H_0\)
  • Contains \(\neq\), \(<\), or \(>\)

Types of Tests:

  • Two-tailed test: \(H_1: \mu \neq \mu_0\) (testing if parameter is different)
  • One-tailed test (right): \(H_1: \mu > \mu_0\) (testing if parameter is greater)
  • One-tailed test (left): \(H_1: \mu < \mu_0\) (testing if parameter is less)

Significance Level (\(\alpha\))

Definition: The significance level is the probability of rejecting the null hypothesis when it is actually true (Type I error).

Common Significance Levels:

  • \(\alpha = 0.05\) (5%) – most common in IB exams
  • \(\alpha = 0.01\) (1%) – more stringent
  • \(\alpha = 0.10\) (10%) – less stringent

Interpretation:

At the 5% significance level, we're willing to accept a 5% chance of incorrectly rejecting a true null hypothesis.

p-value

Definition: The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.

Decision Rule:

  • If p-value \(< \alpha\): Reject \(H_0\) (evidence against null hypothesis)
  • If p-value \(\geq \alpha\): Do not reject \(H_0\) (insufficient evidence)

Interpretation:

  • Small p-value (≤ 0.05): Strong evidence against \(H_0\)
  • Large p-value (> 0.05): Weak or no evidence against \(H_0\)
  • The smaller the p-value, the stronger the evidence against \(H_0\)
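
The decision rule is a single comparison against \(\alpha\). A minimal Python sketch (not needed in the exam, where the GDC reports the p-value directly; the function name is purely illustrative):

```python
def decide(p_value, alpha=0.05):
    """Standard decision rule: reject H0 only when the p-value is below alpha."""
    return "Reject H0" if p_value < alpha else "Do not reject H0"

print(decide(0.0021))  # Reject H0
print(decide(0.579))   # Do not reject H0
```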

⚠️ Common Pitfalls & Tips:

  • "Reject \(H_0\)" does NOT mean "\(H_0\) is false" – only that there's sufficient evidence against it
  • "Do not reject \(H_0\)" does NOT mean "\(H_0\) is true" – only insufficient evidence against it
  • Always compare p-value with \(\alpha\), not with 0.5 or any other value
  • State hypotheses BEFORE conducting the test
  • Write conclusions in context of the problem

Chi-squared Test for Independence (\(\chi^2\))

Purpose & Conditions

Purpose: Tests whether two categorical variables are independent or associated. Used with contingency tables (two-way tables).

Hypotheses:

  • \(H_0\): The two variables are independent
  • \(H_1\): The two variables are not independent (they are associated)

Test Statistic Formula:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where O = observed frequency, E = expected frequency

Expected Frequency Formula:

\[ E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}} \]

Degrees of Freedom:

\[ df = (r - 1)(c - 1) \]

where r = number of rows, c = number of columns

Conditions:

  • Data must be counts/frequencies (not percentages or proportions)
  • All expected frequencies should be at least 5
  • Observations must be independent
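
A short sketch of the expected-frequency formula, applied to the data from Worked Example 1 below (Python/numpy is assumed here purely as an outside-exam check; the GDC computes the expected matrix automatically):

```python
import numpy as np

# Observed contingency table: rows = gender, columns = exercise type
observed = np.array([[35, 45, 10],
                     [45, 35, 30]])

row_totals = observed.sum(axis=1)    # [ 90 110]
col_totals = observed.sum(axis=0)    # [80 80 40]
grand_total = observed.sum()         # 200

# E = (row total x column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)                      # [[36. 36. 18.] [44. 44. 22.]]

chi2_stat = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(round(chi2_stat, 2), df)       # 10.61  2
```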

⚠️ Common Pitfalls & Tips:

  • Use your GDC's \(\chi^2\)-test function – manual calculation is tedious
  • Always use OBSERVED values from the table, not expected
  • Expected values should be calculated by GDC or using formula above
  • GDC will give you both \(\chi^2\) statistic and p-value
  • Remember: df = (rows - 1) × (columns - 1), NOT rows × columns

📝 Worked Example 1: Chi-squared Test for Independence

Question: A researcher wants to determine if there is an association between gender and preferred type of exercise. A survey of 200 people produced the following results:

|        | Cardio | Strength | Yoga | Total |
|--------|--------|----------|------|-------|
| Male   | 35     | 45       | 10   | 90    |
| Female | 45     | 35       | 30   | 110   |
| Total  | 80     | 80       | 40   | 200   |

Test at the 5% significance level whether gender and exercise preference are independent.

(a) Write the null and alternative hypotheses.

(b) State the degrees of freedom.

(c) Calculate the \(\chi^2\) statistic and p-value.

(d) State your conclusion.

Solution:

(a) Hypotheses:

\(H_0\): Gender and exercise preference are independent

\(H_1\): Gender and exercise preference are not independent (they are associated)

(b) Degrees of Freedom:

\[ df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 1 \times 2 = 2 \]

(c) Chi-squared Statistic:

Using GDC:

1. Enter observed values into matrix [A]

2. STAT → TESTS → \(\chi^2\)-Test

3. Input: Observed [A], Expected [B]

4. Calculate

Manual calculation for understanding:

Expected frequency for Male-Cardio:

\[ E = \frac{90 \times 80}{200} = \frac{7200}{200} = 36 \]

Expected frequency for Male-Strength:

\[ E = \frac{90 \times 80}{200} = 36 \]

Expected frequency for Male-Yoga:

\[ E = \frac{90 \times 40}{200} = 18 \]

(Similarly for Female: Cardio \(E = 44\), Strength \(E = 44\), Yoga \(E = 22\).)

GDC Output:

\(\chi^2 \approx 10.6\)

p-value \(\approx 0.00498\)

(d) Conclusion:

Since p-value (0.00498) < \(\alpha\) (0.05), we reject \(H_0\).

Conclusion: At the 5% significance level, there is sufficient evidence to suggest that gender and exercise preference are not independent. In other words, there is an association between gender and type of exercise preferred.
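
As a cross-check outside the exam, scipy reproduces these values from the observed table alone; a sketch assuming scipy is installed (the GDC's \(\chi^2\)-Test is the expected method in IB papers):

```python
from scipy import stats

observed = [[35, 45, 10],
            [45, 35, 30]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(round(chi2, 2), round(p_value, 5), dof)
# 10.61  0.00498  2  -> p-value < 0.05, so reject H0
```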

Chi-squared Goodness of Fit Test

Purpose & Application

Purpose: Tests whether observed frequencies match expected frequencies according to a specific distribution or theoretical model.

Hypotheses:

  • \(H_0\): The observed data fits the expected distribution
  • \(H_1\): The observed data does not fit the expected distribution

Test Statistic:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

Degrees of Freedom:

\[ df = n - 1 \]

where n = number of categories

Common Applications:

  • Testing if a die is fair
  • Testing if data follows uniform distribution
  • Testing genetic ratios
  • Testing if survey responses match expected proportions

⚠️ Common Pitfalls & Tips:

  • Expected frequencies must be calculated based on the hypothesized distribution
  • For fair die: each outcome has probability 1/6
  • For uniform distribution: all categories have equal expected frequencies
  • df = (number of categories - 1), NOT total number of observations

📝 Worked Example 2: Goodness of Fit Test

Question: A die is rolled 120 times and the following results are obtained:

| Number   | 1  | 2  | 3  | 4  | 5  | 6  |
|----------|----|----|----|----|----|----|
| Observed | 15 | 18 | 22 | 25 | 17 | 23 |

Test at the 5% significance level whether the die is fair.

Solution:

Step 1: State hypotheses

\(H_0\): The die is fair (each number has probability \(\frac{1}{6}\))

\(H_1\): The die is not fair

Step 2: Calculate expected frequencies

If the die is fair, each number should appear with equal probability:

\[ E = \frac{120}{6} = 20 \text{ times for each number} \]

Step 3: Degrees of freedom

\[ df = n - 1 = 6 - 1 = 5 \]

Step 4: Calculate \(\chi^2\) statistic

Using GDC: Enter observed and expected values, run \(\chi^2\) GOF test

Manual calculation:

\[ \chi^2 = \frac{(15-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(17-20)^2}{20} + \frac{(23-20)^2}{20} \]

\[ \chi^2 = \frac{25}{20} + \frac{4}{20} + \frac{4}{20} + \frac{25}{20} + \frac{9}{20} + \frac{9}{20} \]

\[ \chi^2 = \frac{76}{20} = 3.8 \]

GDC gives: p-value = 0.579

Step 5: Conclusion

Since p-value (0.579) > \(\alpha\) (0.05), we do not reject \(H_0\).

Conclusion: At the 5% significance level, there is insufficient evidence to suggest the die is not fair. The observed frequencies are consistent with a fair die.
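
The same check in scipy (an outside-exam sketch; stats.chisquare takes observed and expected frequency lists):

```python
from scipy import stats

observed = [15, 18, 22, 25, 17, 23]
expected = [20] * 6                        # fair die: 120 rolls / 6 outcomes

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p_value, 3))   # 3.8  0.579  -> do not reject H0
```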

t-Tests for Population Means

One-Sample t-Test

Purpose: Tests whether a population mean differs from a hypothesized value when the population standard deviation is unknown.

Hypotheses:

  • \(H_0: \mu = \mu_0\) (population mean equals hypothesized value)
  • \(H_1: \mu \neq \mu_0\) (two-tailed) OR \(\mu > \mu_0\) OR \(\mu < \mu_0\) (one-tailed)

Test Statistic:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

where \(\bar{x}\) = sample mean, \(s\) = sample standard deviation, \(n\) = sample size

Degrees of Freedom:

\[ df = n - 1 \]

Two-Sample t-Test (Pooled)

Purpose: Tests whether the means of two independent populations are equal.

Hypotheses:

  • \(H_0: \mu_1 = \mu_2\) (the two population means are equal)
  • \(H_1: \mu_1 \neq \mu_2\) (two-tailed) OR \(\mu_1 > \mu_2\) OR \(\mu_1 < \mu_2\) (one-tailed)

When to Use:

  • Comparing means from two independent samples
  • Population standard deviations unknown
  • Assumes populations are normally distributed
  • Assumes equal population variances (pooled)
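
For reference only, a pooled two-sample t-test can also be run from summary statistics. A scipy sketch with made-up illustrative numbers (the means, standard deviations, and sample sizes below are not from these notes):

```python
from scipy import stats

# Hypothetical summary statistics for two independent samples
result = stats.ttest_ind_from_stats(mean1=52.0, std1=6.0, nobs1=20,
                                     mean2=48.5, std2=5.5, nobs2=25,
                                     equal_var=True)   # pooled variances
print(round(result.statistic, 3), round(result.pvalue, 4))
```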

⚠️ Common Pitfalls & Tips:

  • t-test is used when population standard deviation is UNKNOWN
  • z-test is used when population standard deviation IS KNOWN
  • Always use GDC for t-tests – automatic calculation of p-value
  • Choose "pooled" option for two-sample t-test in GDC
  • State whether test is one-tailed or two-tailed based on \(H_1\)

📝 Worked Example 3: One-Sample t-Test

Question: A coffee shop claims that their average serving temperature is 70°C. A health inspector suspects the temperature is higher. She measures 15 randomly selected servings and finds:

Sample mean: \(\bar{x} = 73.2\)°C

Sample standard deviation: \(s = 4.5\)°C

Test at the 5% significance level whether the average temperature exceeds 70°C.

Solution:

Step 1: State hypotheses

\(H_0: \mu = 70\)°C (average temperature is 70°C)

\(H_1: \mu > 70\)°C (average temperature exceeds 70°C)

This is a one-tailed test (right-tailed)

Step 2: Given information

\(\bar{x} = 73.2\)°C, \(s = 4.5\)°C, \(n = 15\), \(\mu_0 = 70\)°C, \(\alpha = 0.05\)

Step 3: Calculate test statistic

Using GDC: STAT → TESTS → T-Test

Input: \(\mu_0 = 70\), \(\bar{x} = 73.2\), \(s = 4.5\), \(n = 15\)

Alternative: \(\mu > 70\)

Manual calculation:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{73.2 - 70}{4.5 / \sqrt{15}} = \frac{3.2}{4.5 / 3.873} = \frac{3.2}{1.162} = 2.754 \]

Degrees of freedom: \(df = 15 - 1 = 14\)

GDC Output:

t-statistic = 2.754

p-value = 0.0078

Step 4: Decision

Since p-value (0.0078) < \(\alpha\) (0.05), we reject \(H_0\).

Step 5: Conclusion

At the 5% significance level, there is sufficient evidence to support the inspector's suspicion that the average serving temperature exceeds 70°C.
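
A sketch of the same calculation in Python, using the summary statistics from the question (scipy assumed; the GDC's T-Test does this directly):

```python
import math
from scipy import stats

x_bar, s, n, mu0 = 73.2, 4.5, 15, 70

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)       # right-tailed: P(T > t)
print(round(t_stat, 3), round(p_value, 4))   # 2.754  0.0078
```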

Critical Values & Critical Regions

Definitions & Concepts

Critical Value:

The boundary value(s) that separate the rejection region from the non-rejection region. If the test statistic falls beyond the critical value (into the critical region), we reject \(H_0\).

Critical Region (Rejection Region):

The set of values for the test statistic that leads to rejection of \(H_0\). The area in the tail(s) of the distribution.

Decision Rules:

  • For \(\chi^2\): If \(\chi^2_{\text{calc}} > \chi^2_{\text{crit}}\), reject \(H_0\)
  • For t-test: If \(|t_{\text{calc}}| > t_{\text{crit}}\) (two-tailed), reject \(H_0\)
  • p-value method: If p-value < \(\alpha\), reject \(H_0\)

Finding Critical Values:

  • For \(\chi^2\): Use \(\chi^2\) table or invChi function on GDC
  • For t: Use t-table or invT function on GDC
  • IB exams often provide critical values in questions
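
For reference, the GDC's invChi and invT correspond to inverse-CDF calls; a scipy sketch of the two critical values used most often in this topic (the printed values are standard table values):

```python
from scipy import stats

alpha = 0.05

# Chi-squared critical value (right tail), e.g. df = 2
print(round(stats.chi2.ppf(1 - alpha, df=2), 3))     # 5.991

# t critical value for a two-tailed test, e.g. df = 14
print(round(stats.t.ppf(1 - alpha / 2, df=14), 3))   # 2.145
```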

⚠️ Common Pitfalls & Tips:

  • p-value method is generally easier than critical value method
  • Both methods give same conclusion
  • For two-tailed tests, critical region is in BOTH tails
  • \(\chi^2\) tests are ALWAYS right-tailed (only upper critical value)
  • If critical value is given in exam, use it; otherwise use p-value

Type I and Type II Errors

Understanding the Errors

Type I Error (\(\alpha\)):

  • Definition: Rejecting \(H_0\) when it is actually TRUE
  • Probability: \(P(\text{Type I Error}) = \alpha\) (significance level)
  • Also called: "False Positive"
  • Example: Concluding a drug is effective when it actually isn't

Type II Error (\(\beta\)):

  • Definition: Failing to reject \(H_0\) when it is actually FALSE
  • Probability: \(P(\text{Type II Error}) = \beta\)
  • Also called: "False Negative"
  • Example: Concluding a drug is not effective when it actually is

Relationship Summary Table:

|                        | \(H_0\) is TRUE            | \(H_0\) is FALSE           |
|------------------------|----------------------------|----------------------------|
| Reject \(H_0\)         | Type I Error (\(\alpha\))  | Correct Decision           |
| Do Not Reject \(H_0\)  | Correct Decision           | Type II Error (\(\beta\))  |

Trade-off:

  • Decreasing \(\alpha\) (making test more stringent) increases \(\beta\)
  • Decreasing \(\beta\) increases \(\alpha\)
  • To decrease both: increase sample size

⚠️ Common Pitfalls & Tips:

  • Type I error can ONLY occur when \(H_0\) is true
  • Type II error can ONLY occur when \(H_0\) is false
  • Significance level \(\alpha\) = P(Type I Error)
  • We can control Type I error by choosing \(\alpha\)
  • Type II error probability depends on true parameter value and sample size
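
The claim P(Type I Error) = \(\alpha\) can be checked by simulation: generate many samples for which \(H_0\) really is true and count how often the test rejects it. A Monte Carlo sketch (numpy/scipy assumed; the \(\mu\), \(\sigma\) and \(n\) values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, mu0, sigma, n, trials = 0.05, 70, 4.5, 15, 10_000

rejections = 0
for _ in range(trials):
    sample = rng.normal(mu0, sigma, n)            # H0 is true: the mean really is mu0
    t_stat, p_value = stats.ttest_1samp(sample, mu0)
    if p_value < alpha:                           # a rejection here is a Type I error
        rejections += 1

print(rejections / trials)                        # close to 0.05 = alpha
```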

📝 Worked Example 4: Interpreting Errors in Context

Question: A pharmaceutical company tests a new drug and uses hypothesis testing with:

\(H_0\): The drug is not effective

\(H_1\): The drug is effective

(a) Describe a Type I error in this context.

(b) Describe a Type II error in this context.

(c) If the significance level is 1%, what is the probability of a Type I error?

(d) Which error would be more serious? Explain.

Solution:

(a) Type I Error:

A Type I error occurs when we reject \(H_0\) when it is true.

In context: Concluding the drug is effective when it actually is not effective. The company would market an ineffective drug.

(b) Type II Error:

A Type II error occurs when we fail to reject \(H_0\) when it is false.

In context: Concluding the drug is not effective when it actually is effective. The company would abandon a potentially life-saving drug.

(c) Probability of Type I Error:

P(Type I Error) = \(\alpha\) = 0.01 or 1%

There is a 1% chance of concluding the drug is effective when it actually isn't.

(d) More Serious Error:

Arguments could be made for both:

Type I more serious: Marketing an ineffective drug could harm patients who rely on it and waste healthcare resources. There could be legal and ethical consequences.

Type II more serious: Rejecting an effective drug means patients who could benefit won't have access to treatment. In life-threatening conditions, this could cost lives.

Typical answer: In pharmaceutical testing, Type I error is generally considered more serious because releasing an ineffective (or potentially harmful) drug to the public has greater immediate consequences than delaying an effective drug pending further testing.

Hypothesis Testing for Population Proportions

One-Sample z-Test for Proportion

Purpose: Tests a claim about a population proportion when sample size is large.

Hypotheses:

  • \(H_0: p = p_0\) (population proportion equals hypothesized value)
  • \(H_1: p \neq p_0\) (two-tailed) OR \(p > p_0\) OR \(p < p_0\) (one-tailed)

Test Statistic:

\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \]

where \(\hat{p}\) = sample proportion = \(\frac{x}{n}\), \(x\) = number of successes, \(n\) = sample size

Conditions:

  • Sample is random
  • \(np_0 \geq 5\) and \(n(1-p_0) \geq 5\) (large sample)
  • Population size at least 10 times sample size

Alternative Approach (Binomial):

Can model as \(X \sim B(n, p_0)\) under \(H_0\) and find p-value directly

⚠️ Common Pitfalls & Tips:

  • Use \(p_0\) (hypothesized value) in denominator, not \(\hat{p}\)
  • For small samples, use exact binomial test instead of z-test
  • Sample proportion \(\hat{p}\) must be between 0 and 1
  • GDC can perform 1-PropZTest automatically

📝 Worked Example 5: Testing Population Proportion

Question: A company claims that 40% of customers prefer their new product. A survey of 200 randomly selected customers finds that 95 prefer the new product.

Test at the 5% significance level whether the true proportion is different from 40%.

Solution:

Step 1: State hypotheses

\(H_0: p = 0.40\) (40% of customers prefer the new product)

\(H_1: p \neq 0.40\) (proportion is different from 40%)

This is a two-tailed test

Step 2: Check conditions

\(np_0 = 200(0.40) = 80 \geq 5\) ✓

\(n(1-p_0) = 200(0.60) = 120 \geq 5\) ✓

Conditions met for z-test

Step 3: Calculate sample proportion

\[ \hat{p} = \frac{x}{n} = \frac{95}{200} = 0.475 \]

Step 4: Calculate test statistic

Using GDC: STAT → TESTS → 1-PropZTest

Input: \(p_0 = 0.40\), \(x = 95\), \(n = 200\)

Alternative: \(p \neq 0.40\)

Manual calculation:

\[ z = \frac{0.475 - 0.40}{\sqrt{\frac{0.40(0.60)}{200}}} = \frac{0.075}{\sqrt{\frac{0.24}{200}}} = \frac{0.075}{\sqrt{0.0012}} = \frac{0.075}{0.03464} = 2.165 \]

GDC Output:

z = 2.165

p-value = 0.0304

Step 5: Decision

Since p-value (0.0304) < \(\alpha\) (0.05), we reject \(H_0\).

Step 6: Conclusion

At the 5% significance level, there is sufficient evidence to conclude that the true proportion of customers who prefer the new product is different from 40%. The sample data suggests the proportion is closer to 47.5%.
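
A sketch of the same test in Python (scipy assumed; the 1-PropZTest on the GDC is the expected tool in the exam), including the exact binomial alternative mentioned earlier:

```python
import math
from scipy import stats

x, n, p0 = 95, 200, 0.40
p_hat = x / n                                       # 0.475

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z))                 # two-tailed
print(round(z, 3), round(p_value, 4))               # 2.165  0.0304

# Exact binomial test on the same data (close to, but not identical to, the z-test p-value)
print(round(stats.binomtest(x, n, p0, alternative='two-sided').pvalue, 4))
```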

📊 Quick Reference Summary

Basic Process

  1. State \(H_0\) and \(H_1\)
  2. Choose significance level \(\alpha\)
  3. Calculate test statistic
  4. Find p-value
  5. Compare p-value with \(\alpha\)
  6. State conclusion in context

Decision Rule

  • p-value < \(\alpha\): Reject \(H_0\)
  • p-value ≥ \(\alpha\): Do not reject \(H_0\)
  • Typical \(\alpha\) = 0.05 (5%)

Errors

  • Type I: Reject true \(H_0\)
  • P(Type I) = \(\alpha\)
  • Type II: Fail to reject false \(H_0\)
  • P(Type II) = \(\beta\)

Test Selection

  • \(\chi^2\): Independence/Fit
  • t-test: Means (σ unknown)
  • z-test: Proportions/Means (σ known)

🎯 Which Test Should I Use?

| Situation | Test to Use | Key Information |
|-----------|-------------|-----------------|
| Testing if two categorical variables are related | \(\chi^2\) test for independence | Use contingency table, df = (r - 1)(c - 1) |
| Testing if data fits an expected distribution | \(\chi^2\) goodness of fit | Compare O vs E frequencies, df = n - 1 |
| Testing one population mean (σ unknown) | One-sample t-test | Use sample mean and s, df = n - 1 |
| Comparing two population means | Two-sample t-test | Use pooled option on GDC |
| Testing a population proportion | z-test for proportion | Or binomial test for small n |

🖩 Essential GDC Functions for Hypothesis Testing

  • \(\chi^2\)-Test: STAT → TESTS → \(\chi^2\)-Test (for independence or GOF)
  • One-sample t-test: STAT → TESTS → T-Test (use summary stats or data)
  • Two-sample t-test: STAT → TESTS → 2-SampTTest (select pooled option)
  • One-proportion z-test: STAT → TESTS → 1-PropZTest
  • Finding p-values: Automatically calculated by all test functions
  • Remember: Always write "Using GDC" in exam solutions

✍️ IB Exam Strategy for Hypothesis Testing

  1. Always write hypotheses first – define what \(H_0\) and \(H_1\) represent in context
  2. State significance level – usually 5% unless specified otherwise
  3. Show GDC work: Write "Using GDC:" and state which test you used
  4. Report both statistic and p-value when asked
  5. Compare p-value with \(\alpha\) explicitly
  6. State conclusion in two parts:
    • Statistical: "We reject/do not reject \(H_0\)"
    • Contextual: What this means for the problem
  7. For Type I/II errors: Always describe in context of the specific problem
  8. Check assumptions: Mention when conditions are met (e.g., large sample)
  9. Be precise with language:
    • "Sufficient evidence" not "proof"
    • "Do not reject" not "accept"
    • "Suggests" not "proves"

🚫 Top Mistakes to Avoid

  1. Confusing hypotheses: \(H_0\) is status quo (equals), \(H_1\) is what we suspect
  2. Wrong comparison: Compare p-value with \(\alpha\), not with 0.5
  3. Incorrect language: Never say "accept \(H_0\)" – say "do not reject \(H_0\)"
  4. Missing context: Always interpret results in terms of the problem
  5. Switching errors: Type I = reject a true \(H_0\); Type II = fail to reject a false \(H_0\)
  6. Wrong degrees of freedom: Check formula for each test type
  7. One-tailed vs two-tailed: Match alternative hypothesis to problem statement
  8. Forgetting conditions: Check sample size requirements for each test
  9. Claiming causation: Tests show association, not causation
  10. Misinterpreting p-value: It's NOT the probability that \(H_0\) is true

🔄 Hypothesis Testing Flowchart

START: What are you testing?

↓ Relationship between two categorical variables?

→ YES: Use \(\chi^2\) Test for Independence

↓ Does data fit expected distribution?

→ YES: Use \(\chi^2\) Goodness of Fit Test

↓ Testing population mean(s)?

→ One sample, σ unknown: Use One-sample t-test

→ Two samples, compare means: Use Two-sample t-test

↓ Testing population proportion?

→ YES: Use z-test for proportion or Binomial test