IB Mathematics AI – Topic 4

Statistics & Probability: Hypothesis Testing

Introduction to Hypothesis Testing

Null and Alternative Hypotheses

Definition: Hypothesis testing is a statistical method used to make decisions about population parameters based on sample data. It involves testing a claim by comparing observed data against what we would expect under a specific assumption.

The Two Hypotheses:

1. Null Hypothesis (\(H_0\)):

  • The "status quo" or default assumption
  • Assumes no effect, no difference, or no relationship
  • What we assume to be true until evidence proves otherwise
  • Always contains an equals sign (\(=\), \(\leq\), or \(\geq\))

2. Alternative Hypothesis (\(H_1\) or \(H_a\)):

  • The claim we're testing or what we suspect is true
  • Suggests there IS an effect, difference, or relationship
  • What we accept if we reject \(H_0\)
  • Contains \(\neq\), \(<\), or \(>\)

Types of Tests:

  • Two-tailed test: \(H_1: \mu \neq \mu_0\) (testing if parameter is different)
  • One-tailed test (right): \(H_1: \mu > \mu_0\) (testing if parameter is greater)
  • One-tailed test (left): \(H_1: \mu < \mu_0\) (testing if parameter is less)

Significance Level (\(\alpha\))

Definition: The significance level is the probability of rejecting the null hypothesis when it is actually true (Type I error).

Common Significance Levels:

  • \(\alpha = 0.05\) (5%) – most common in IB exams
  • \(\alpha = 0.01\) (1%) – more stringent
  • \(\alpha = 0.10\) (10%) – less stringent

Interpretation:

At the 5% significance level, we're willing to accept a 5% chance of incorrectly rejecting a true null hypothesis.

p-value

Definition: The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.

Decision Rule:

  • If p-value \(< \alpha\): Reject \(H_0\) (evidence against null hypothesis)
  • If p-value \(\geq \alpha\): Do not reject \(H_0\) (insufficient evidence)

Interpretation:

  • Small p-value (≤ 0.05): Strong evidence against \(H_0\)
  • Large p-value (> 0.05): Weak or no evidence against \(H_0\)
  • The smaller the p-value, the stronger the evidence against \(H_0\)
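
The decision rule is a single comparison against \(\alpha\). A minimal Python sketch (not needed in the exam, where the GDC reports the p-value directly; the function name is purely illustrative):

```python
def decide(p_value, alpha=0.05):
    """Standard decision rule: reject H0 only when the p-value is below alpha."""
    return "Reject H0" if p_value < alpha else "Do not reject H0"

print(decide(0.0021))  # Reject H0
print(decide(0.579))   # Do not reject H0
```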

⚠️ Common Pitfalls & Tips:

  • "Reject \(H_0\)" does NOT mean "\(H_0\) is false" – only that there's sufficient evidence against it
  • "Do not reject \(H_0\)" does NOT mean "\(H_0\) is true" – only insufficient evidence against it
  • Always compare p-value with \(\alpha\), not with 0.5 or any other value
  • State hypotheses BEFORE conducting the test
  • Write conclusions in context of the problem

Chi-squared Test for Independence (\(\chi^2\))

Purpose & Conditions

Purpose: Tests whether two categorical variables are independent or associated. Used with contingency tables (two-way tables).

Hypotheses:

  • \(H_0\): The two variables are independent
  • \(H_1\): The two variables are not independent (they are associated)

Test Statistic Formula:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where O = observed frequency, E = expected frequency

Expected Frequency Formula:

\[ E = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}} \]

Degrees of Freedom:

\[ df = (r - 1)(c - 1) \]

where r = number of rows, c = number of columns

Conditions:

  • Data must be counts/frequencies (not percentages or proportions)
  • All expected frequencies should be at least 5
  • Observations must be independent
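
A short sketch of the expected-frequency formula, applied to the data from Worked Example 1 below (Python/numpy is assumed here purely as an outside-exam check; the GDC computes the expected matrix automatically):

```python
import numpy as np

# Observed contingency table: rows = gender, columns = exercise type
observed = np.array([[35, 45, 10],
                     [45, 35, 30]])

row_totals = observed.sum(axis=1)    # [ 90 110]
col_totals = observed.sum(axis=0)    # [80 80 40]
grand_total = observed.sum()         # 200

# E = (row total x column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)                      # [[36. 36. 18.] [44. 44. 22.]]

chi2_stat = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(round(chi2_stat, 2), df)       # 10.61  2
```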

⚠️ Common Pitfalls & Tips:

  • Use your GDC's \(\chi^2\)-test function – manual calculation is tedious
  • Always use OBSERVED values from the table, not expected
  • Expected values should be calculated by GDC or using formula above
  • GDC will give you both \(\chi^2\) statistic and p-value
  • Remember: df = (rows - 1) × (columns - 1), NOT rows × columns

📝 Worked Example 1: Chi-squared Test for Independence

Question: A researcher wants to determine if there is an association between gender and preferred type of exercise. A survey of 200 people produced the following results:

|        | Cardio | Strength | Yoga | Total |
|--------|--------|----------|------|-------|
| Male   | 35     | 45       | 10   | 90    |
| Female | 45     | 35       | 30   | 110   |
| Total  | 80     | 80       | 40   | 200   |

Test at the 5% significance level whether gender and exercise preference are independent.

(a) Write the null and alternative hypotheses.

(b) State the degrees of freedom.

(c) Calculate the \(\chi^2\) statistic and p-value.

(d) State your conclusion.

Solution:

(a) Hypotheses:

\(H_0\): Gender and exercise preference are independent

\(H_1\): Gender and exercise preference are not independent (they are associated)

(b) Degrees of Freedom:

\[ df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 1 \times 2 = 2 \]

(c) Chi-squared Statistic:

Using GDC:

1. Enter observed values into matrix [A]

2. STAT → TESTS → \(\chi^2\)-Test

3. Input: Observed [A], Expected [B]

4. Calculate

Manual calculation for understanding:

Expected frequency for Male-Cardio:

\[ E = \frac{90 \times 80}{200} = \frac{7200}{200} = 36 \]

Expected frequency for Male-Strength:

\[ E = \frac{90 \times 80}{200} = 36 \]

Expected frequency for Male-Yoga:

\[ E = \frac{90 \times 40}{200} = 18 \]

(Similarly for Female: Cardio \(E = 44\), Strength \(E = 44\), Yoga \(E = 22\).)

GDC Output:

\(\chi^2 \approx 10.6\)

p-value \(\approx 0.00498\)

(d) Conclusion:

Since p-value (0.00498) < \(\alpha\) (0.05), we reject \(H_0\).

Conclusion: At the 5% significance level, there is sufficient evidence to suggest that gender and exercise preference are not independent. In other words, there is an association between gender and type of exercise preferred.
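
As a cross-check outside the exam, scipy reproduces these values from the observed table alone; a sketch assuming scipy is installed (the GDC's \(\chi^2\)-Test is the expected method in IB papers):

```python
from scipy import stats

observed = [[35, 45, 10],
            [45, 35, 30]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(round(chi2, 2), round(p_value, 5), dof)
# 10.61  0.00498  2  -> p-value < 0.05, so reject H0
```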

Chi-squared Goodness of Fit Test

Purpose & Application

Purpose: Tests whether observed frequencies match expected frequencies according to a specific distribution or theoretical model.

Hypotheses:

  • \(H_0\): The observed data fits the expected distribution
  • \(H_1\): The observed data does not fit the expected distribution

Test Statistic:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

Degrees of Freedom:

\[ df = n - 1 \]

where n = number of categories

Common Applications:

  • Testing if a die is fair
  • Testing if data follows uniform distribution
  • Testing genetic ratios
  • Testing if survey responses match expected proportions

⚠️ Common Pitfalls & Tips:

  • Expected frequencies must be calculated based on the hypothesized distribution
  • For fair die: each outcome has probability 1/6
  • For uniform distribution: all categories have equal expected frequencies
  • df = (number of categories - 1), NOT total number of observations

📝 Worked Example 2: Goodness of Fit Test

Question: A die is rolled 120 times and the following results are obtained:

| Number   | 1  | 2  | 3  | 4  | 5  | 6  |
|----------|----|----|----|----|----|----|
| Observed | 15 | 18 | 22 | 25 | 17 | 23 |

Test at the 5% significance level whether the die is fair.

Solution:

Step 1: State hypotheses

\(H_0\): The die is fair (each number has probability \(\frac{1}{6}\))

\(H_1\): The die is not fair

Step 2: Calculate expected frequencies

If the die is fair, each number should appear with equal probability:

\[ E = \frac{120}{6} = 20 \text{ times for each number} \]

Step 3: Degrees of freedom

\[ df = n - 1 = 6 - 1 = 5 \]

Step 4: Calculate \(\chi^2\) statistic

Using GDC: Enter observed and expected values, run \(\chi^2\) GOF test

Manual calculation:

\[ \chi^2 = \frac{(15-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(17-20)^2}{20} + \frac{(23-20)^2}{20} \]

\[ \chi^2 = \frac{25}{20} + \frac{4}{20} + \frac{4}{20} + \frac{25}{20} + \frac{9}{20} + \frac{9}{20} \]

\[ \chi^2 = \frac{76}{20} = 3.8 \]

GDC gives: p-value = 0.579

Step 5: Conclusion

Since p-value (0.579) > \(\alpha\) (0.05), we do not reject \(H_0\).

Conclusion: At the 5% significance level, there is insufficient evidence to suggest the die is not fair. The observed frequencies are consistent with a fair die.
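
The same check in scipy (an outside-exam sketch; stats.chisquare takes observed and expected frequency lists):

```python
from scipy import stats

observed = [15, 18, 22, 25, 17, 23]
expected = [20] * 6                        # fair die: 120 rolls / 6 outcomes

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p_value, 3))   # 3.8  0.579  -> do not reject H0
```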

t-Tests for Population Means

One-Sample t-Test

Purpose: Tests whether a population mean differs from a hypothesized value when the population standard deviation is unknown.

Hypotheses:

  • \(H_0: \mu = \mu_0\) (population mean equals hypothesized value)
  • \(H_1: \mu \neq \mu_0\) (two-tailed) OR \(\mu > \mu_0\) OR \(\mu < \mu_0\) (one-tailed)

Test Statistic:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

where \(\bar{x}\) = sample mean, \(s\) = sample standard deviation, \(n\) = sample size

Degrees of Freedom:

\[ df = n - 1 \]

Two-Sample t-Test (Pooled)

Purpose: Tests whether the means of two independent populations are equal.

Hypotheses:

  • \(H_0: \mu_1 = \mu_2\) (the two population means are equal)
  • \(H_1: \mu_1 \neq \mu_2\) (two-tailed) OR \(\mu_1 > \mu_2\) OR \(\mu_1 < \mu_2\) (one-tailed)

When to Use:

  • Comparing means from two independent samples
  • Population standard deviations unknown
  • Assumes populations are normally distributed
  • Assumes equal population variances (pooled)
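
For reference only, a pooled two-sample t-test can also be run from summary statistics. A scipy sketch with made-up illustrative numbers (the means, standard deviations, and sample sizes below are not from these notes):

```python
from scipy import stats

# Hypothetical summary statistics for two independent samples
result = stats.ttest_ind_from_stats(mean1=52.0, std1=6.0, nobs1=20,
                                     mean2=48.5, std2=5.5, nobs2=25,
                                     equal_var=True)   # pooled variances
print(round(result.statistic, 3), round(result.pvalue, 4))
```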

⚠️ Common Pitfalls & Tips:

  • t-test is used when population standard deviation is UNKNOWN
  • z-test is used when population standard deviation IS KNOWN
  • Always use GDC for t-tests – automatic calculation of p-value
  • Choose "pooled" option for two-sample t-test in GDC
  • State whether test is one-tailed or two-tailed based on \(H_1\)

📝 Worked Example 3: One-Sample t-Test

Question: A coffee shop claims that their average serving temperature is 70°C. A health inspector suspects the temperature is higher. She measures 15 randomly selected servings and finds:

Sample mean: \(\bar{x} = 73.2\)°C

Sample standard deviation: \(s = 4.5\)°C

Test at the 5% significance level whether the average temperature exceeds 70°C.

Solution:

Step 1: State hypotheses

\(H_0: \mu = 70\)°C (average temperature is 70°C)

\(H_1: \mu > 70\)°C (average temperature exceeds 70°C)

This is a one-tailed test (right-tailed)

Step 2: Given information

\(\bar{x} = 73.2\)°C, \(s = 4.5\)°C, \(n = 15\), \(\mu_0 = 70\)°C, \(\alpha = 0.05\)

Step 3: Calculate test statistic

Using GDC: STAT → TESTS → T-Test

Input: \(\mu_0 = 70\), \(\bar{x} = 73.2\), \(s = 4.5\), \(n = 15\)

Alternative: \(\mu > 70\)

Manual calculation:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{73.2 - 70}{4.5 / \sqrt{15}} = \frac{3.2}{4.5 / 3.873} = \frac{3.2}{1.162} = 2.754 \]

Degrees of freedom: \(df = 15 - 1 = 14\)

GDC Output:

t-statistic = 2.754

p-value = 0.0078

Step 4: Decision

Since p-value (0.0078) < \(\alpha\) (0.05), we reject \(H_0\).

Step 5: Conclusion

At the 5% significance level, there is sufficient evidence to support the inspector's suspicion that the average serving temperature exceeds 70°C.
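
A sketch of the same calculation in Python, using the summary statistics from the question (scipy assumed; the GDC's T-Test does this directly):

```python
import math
from scipy import stats

x_bar, s, n, mu0 = 73.2, 4.5, 15, 70

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)       # right-tailed: P(T > t)
print(round(t_stat, 3), round(p_value, 4))   # 2.754  0.0078
```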

Critical Values & Critical Regions

Definitions & Concepts

Critical Value:

The boundary value(s) that separate the rejection region from the non-rejection region. If the test statistic falls beyond the critical value (into the critical region), we reject \(H_0\).

Critical Region (Rejection Region):

The set of values for the test statistic that leads to rejection of \(H_0\). The area in the tail(s) of the distribution.

Decision Rules:

  • For \(\chi^2\): If \(\chi^2_{\text{calc}} > \chi^2_{\text{crit}}\), reject \(H_0\)
  • For t-test: If \(|t_{\text{calc}}| > t_{\text{crit}}\) (two-tailed), reject \(H_0\)
  • p-value method: If p-value < \(\alpha\), reject \(H_0\)

Finding Critical Values:

  • For \(\chi^2\): Use \(\chi^2\) table or invChi function on GDC
  • For t: Use t-table or invT function on GDC
  • IB exams often provide critical values in questions
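
For reference, the GDC's invChi and invT correspond to inverse-CDF calls; a scipy sketch of the two critical values used most often in this topic (the printed values are standard table values):

```python
from scipy import stats

alpha = 0.05

# Chi-squared critical value (right tail), e.g. df = 2
print(round(stats.chi2.ppf(1 - alpha, df=2), 3))     # 5.991

# t critical value for a two-tailed test, e.g. df = 14
print(round(stats.t.ppf(1 - alpha / 2, df=14), 3))   # 2.145
```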

⚠️ Common Pitfalls & Tips:

  • p-value method is generally easier than critical value method
  • Both methods give same conclusion
  • For two-tailed tests, critical region is in BOTH tails
  • \(\chi^2\) tests are ALWAYS right-tailed (only upper critical value)
  • If critical value is given in exam, use it; otherwise use p-value

Type I and Type II Errors

Understanding the Errors

Type I Error (\(\alpha\)):

  • Definition: Rejecting \(H_0\) when it is actually TRUE
  • Probability: \(P(\text{Type I Error}) = \alpha\) (significance level)
  • Also called: "False Positive"
  • Example: Concluding a drug is effective when it actually isn't

Type II Error (\(\beta\)):

  • Definition: Failing to reject \(H_0\) when it is actually FALSE
  • Probability: \(P(\text{Type II Error}) = \beta\)
  • Also called: "False Negative"
  • Example: Concluding a drug is not effective when it actually is

Relationship Summary Table:

|                        | \(H_0\) is TRUE            | \(H_0\) is FALSE           |
|------------------------|----------------------------|----------------------------|
| Reject \(H_0\)         | Type I Error (\(\alpha\))  | Correct Decision           |
| Do Not Reject \(H_0\)  | Correct Decision           | Type II Error (\(\beta\))  |

Trade-off:

  • Decreasing \(\alpha\) (making test more stringent) increases \(\beta\)
  • Decreasing \(\beta\) increases \(\alpha\)
  • To decrease both: increase sample size

⚠️ Common Pitfalls & Tips:

  • Type I error can ONLY occur when \(H_0\) is true
  • Type II error can ONLY occur when \(H_0\) is false
  • Significance level \(\alpha\) = P(Type I Error)
  • We can control Type I error by choosing \(\alpha\)
  • Type II error probability depends on true parameter value and sample size
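
The claim P(Type I Error) = \(\alpha\) can be checked by simulation: generate many samples for which \(H_0\) really is true and count how often the test rejects it. A Monte Carlo sketch (numpy/scipy assumed; the \(\mu\), \(\sigma\) and \(n\) values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, mu0, sigma, n, trials = 0.05, 70, 4.5, 15, 10_000

rejections = 0
for _ in range(trials):
    sample = rng.normal(mu0, sigma, n)            # H0 is true: the mean really is mu0
    t_stat, p_value = stats.ttest_1samp(sample, mu0)
    if p_value < alpha:                           # a rejection here is a Type I error
        rejections += 1

print(rejections / trials)                        # close to 0.05 = alpha
```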

📝 Worked Example 4: Interpreting Errors in Context

Question: A pharmaceutical company tests a new drug and uses hypothesis testing with:

\(H_0\): The drug is not effective

\(H_1\): The drug is effective

(a) Describe a Type I error in this context.

(b) Describe a Type II error in this context.

(c) If the significance level is 1%, what is the probability of a Type I error?

(d) Which error would be more serious? Explain.

Solution:

(a) Type I Error:

A Type I error occurs when we reject \(H_0\) when it is true.

In context: Concluding the drug is effective when it actually is not effective. The company would market an ineffective drug.

(b) Type II Error:

A Type II error occurs when we fail to reject \(H_0\) when it is false.

In context: Concluding the drug is not effective when it actually is effective. The company would abandon a potentially life-saving drug.

(c) Probability of Type I Error:

P(Type I Error) = \(\alpha\) = 0.01 or 1%

There is a 1% chance of concluding the drug is effective when it actually isn't.

(d) More Serious Error:

Arguments could be made for both:

Type I more serious: Marketing an ineffective drug could harm patients who rely on it and waste healthcare resources. There could be legal and ethical consequences.

Type II more serious: Rejecting an effective drug means patients who could benefit won't have access to treatment. In life-threatening conditions, this could cost lives.

Typical answer: In pharmaceutical testing, Type I error is generally considered more serious because releasing an ineffective (or potentially harmful) drug to the public has greater immediate consequences than delaying an effective drug pending further testing.

Hypothesis Testing for Population Proportions

One-Sample z-Test for Proportion

Purpose: Tests a claim about a population proportion when sample size is large.

Hypotheses:

  • \(H_0: p = p_0\) (population proportion equals hypothesized value)
  • \(H_1: p \neq p_0\) (two-tailed) OR \(p > p_0\) OR \(p < p_0\) (one-tailed)

Test Statistic:

\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \]

where \(\hat{p}\) = sample proportion = \(\frac{x}{n}\), \(x\) = number of successes, \(n\) = sample size

Conditions:

  • Sample is random
  • \(np_0 \geq 5\) and \(n(1-p_0) \geq 5\) (large sample)
  • Population size at least 10 times sample size

Alternative Approach (Binomial):

Can model as \(X \sim B(n, p_0)\) under \(H_0\) and find p-value directly

⚠️ Common Pitfalls & Tips:

  • Use \(p_0\) (hypothesized value) in denominator, not \(\hat{p}\)
  • For small samples, use exact binomial test instead of z-test
  • Sample proportion \(\hat{p}\) must be between 0 and 1
  • GDC can perform 1-PropZTest automatically

📝 Worked Example 5: Testing Population Proportion

Question: A company claims that 40% of customers prefer their new product. A survey of 200 randomly selected customers finds that 95 prefer the new product.

Test at the 5% significance level whether the true proportion is different from 40%.

Solution:

Step 1: State hypotheses

\(H_0: p = 0.40\) (40% of customers prefer the new product)

\(H_1: p \neq 0.40\) (proportion is different from 40%)

This is a two-tailed test

Step 2: Check conditions

\(np_0 = 200(0.40) = 80 \geq 5\) ✓

\(n(1-p_0) = 200(0.60) = 120 \geq 5\) ✓

Conditions met for z-test

Step 3: Calculate sample proportion

\[ \hat{p} = \frac{x}{n} = \frac{95}{200} = 0.475 \]

Step 4: Calculate test statistic

Using GDC: STAT → TESTS → 1-PropZTest

Input: \(p_0 = 0.40\), \(x = 95\), \(n = 200\)

Alternative: \(p \neq 0.40\)

Manual calculation:

\[ z = \frac{0.475 - 0.40}{\sqrt{\frac{0.40(0.60)}{200}}} = \frac{0.075}{\sqrt{\frac{0.24}{200}}} = \frac{0.075}{\sqrt{0.0012}} = \frac{0.075}{0.03464} = 2.165 \]

GDC Output:

z = 2.165

p-value = 0.0304

Step 5: Decision

Since p-value (0.0304) < \(\alpha\) (0.05), we reject \(H_0\).

Step 6: Conclusion

At the 5% significance level, there is sufficient evidence to conclude that the true proportion of customers who prefer the new product is different from 40%. The sample data suggests the proportion is closer to 47.5%.
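
A sketch of the same test in Python (scipy assumed; the 1-PropZTest on the GDC is the expected tool in the exam), including the exact binomial alternative mentioned earlier:

```python
import math
from scipy import stats

x, n, p0 = 95, 200, 0.40
p_hat = x / n                                       # 0.475

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * stats.norm.sf(abs(z))                 # two-tailed
print(round(z, 3), round(p_value, 4))               # 2.165  0.0304

# Exact binomial test on the same data (close to, but not identical to, the z-test p-value)
print(round(stats.binomtest(x, n, p0, alternative='two-sided').pvalue, 4))
```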

📊 Quick Reference Summary

Basic Process

  1. State \(H_0\) and \(H_1\)
  2. Choose significance level \(\alpha\)
  3. Calculate test statistic
  4. Find p-value
  5. Compare p-value with \(\alpha\)
  6. State conclusion in context

Decision Rule

  • p-value < \(\alpha\): Reject \(H_0\)
  • p-value ≥ \(\alpha\): Do not reject \(H_0\)
  • Typical \(\alpha\) = 0.05 (5%)

Errors

  • Type I: Reject true \(H_0\)
  • P(Type I) = \(\alpha\)
  • Type II: Fail to reject false \(H_0\)
  • P(Type II) = \(\beta\)

Test Selection

  • \(\chi^2\): Independence/Fit
  • t-test: Means (σ unknown)
  • z-test: Proportions/Means (σ known)

🎯 Which Test Should I Use?

| Situation | Test to Use | Key Information |
|-----------|-------------|-----------------|
| Testing if two categorical variables are related | \(\chi^2\) test for independence | Use contingency table, df = (r - 1)(c - 1) |
| Testing if data fits an expected distribution | \(\chi^2\) goodness of fit | Compare O vs E frequencies, df = n - 1 |
| Testing one population mean (σ unknown) | One-sample t-test | Use sample mean and s, df = n - 1 |
| Comparing two population means | Two-sample t-test | Use pooled option on GDC |
| Testing a population proportion | z-test for proportion | Or binomial test for small n |

🖩 Essential GDC Functions for Hypothesis Testing

  • \(\chi^2\)-Test: STAT → TESTS → \(\chi^2\)-Test (for independence or GOF)
  • One-sample t-test: STAT → TESTS → T-Test (use summary stats or data)
  • Two-sample t-test: STAT → TESTS → 2-SampTTest (select pooled option)
  • One-proportion z-test: STAT → TESTS → 1-PropZTest
  • Finding p-values: Automatically calculated by all test functions
  • Remember: Always write "Using GDC" in exam solutions

✍️ IB Exam Strategy for Hypothesis Testing

  1. Always write hypotheses first – define what \(H_0\) and \(H_1\) represent in context
  2. State significance level – usually 5% unless specified otherwise
  3. Show GDC work: Write "Using GDC:" and state which test you used
  4. Report both statistic and p-value when asked
  5. Compare p-value with \(\alpha\) explicitly
  6. State conclusion in two parts:
    • Statistical: "We reject/do not reject \(H_0\)"
    • Contextual: What this means for the problem
  7. For Type I/II errors: Always describe in context of the specific problem
  8. Check assumptions: Mention when conditions are met (e.g., large sample)
  9. Be precise with language:
    • "Sufficient evidence" not "proof"
    • "Do not reject" not "accept"
    • "Suggests" not "proves"

🚫 Top Mistakes to Avoid

  1. Confusing hypotheses: \(H_0\) is status quo (equals), \(H_1\) is what we suspect
  2. Wrong comparison: Compare p-value with \(\alpha\), not with 0.5
  3. Incorrect language: Never say "accept \(H_0\)" – say "do not reject \(H_0\)"
  4. Missing context: Always interpret results in terms of the problem
  5. Switching errors: Type I = reject a true \(H_0\); Type II = fail to reject a false \(H_0\)
  6. Wrong degrees of freedom: Check formula for each test type
  7. One-tailed vs two-tailed: Match alternative hypothesis to problem statement
  8. Forgetting conditions: Check sample size requirements for each test
  9. Claiming causation: Tests show association, not causation
  10. Misinterpreting p-value: It's NOT the probability that \(H_0\) is true

🔄 Hypothesis Testing Flowchart

START: What are you testing?

↓ Relationship between two categorical variables?

→ YES: Use \(\chi^2\) Test for Independence

↓ Does data fit expected distribution?

→ YES: Use \(\chi^2\) Goodness of Fit Test

↓ Testing population mean(s)?

→ One sample, σ unknown: Use One-sample t-test

→ Two samples, compare means: Use Two-sample t-test

↓ Testing population proportion?

→ YES: Use z-test for proportion or Binomial test