IB Mathematics AA – Topic 4: Statistics & Probability
Comprehensive Guide to Statistics
Introduction to Statistics
Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. From medical trials and economic forecasts to sports analytics and quality control, statistical methods help us extract meaningful patterns from numerical information and quantify uncertainty in our conclusions.
Key concepts: Measures of central tendency (mean, median, mode) describe the "center" or typical value of a dataset. Measures of dispersion (range, interquartile range, variance, standard deviation) quantify how spread out the data is. Visual representations like box plots, stem-and-leaf diagrams, and cumulative frequency curves reveal patterns and distributions at a glance.
Why statistics matters: Every scientific study, business decision, and policy evaluation relies on statistical analysis. Understanding variability helps assess reliability—a medicine with consistent effects is more trustworthy than one with wildly varying results. Comparing datasets requires both central tendency and spread to tell the complete story.
In this guide: We'll calculate and interpret the mean, median, and mode; use variance and standard deviation as measures of spread; apply the interquartile range to identify outliers; create and interpret stem-and-leaf plots and box-and-whisker diagrams; and work with cumulative frequency graphs and percentiles. All of these are essential skills for IB exam success.
1. Measures of Central Tendency
The Three Averages
Central Tendency Measures
Mean (Arithmetic Average)
\(\bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\)
Properties:
- Uses all data values
- Affected by outliers (extreme values)
- Most commonly used measure
- Can be any value, not necessarily in the dataset
Median (Middle Value)
Order data from smallest to largest, then find middle value
If \(n\) is odd: median is the middle value at position \(\frac{n+1}{2}\)
If \(n\) is even: median is the average of the two middle values, at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\)
Properties:
- Not affected by outliers (robust)
- Either a data value (odd \(n\)) or the midpoint of two data values (even \(n\))
- Divides dataset into two equal halves
- Better than mean for skewed distributions
Mode (Most Frequent Value)
The value that appears most often in the dataset
Properties:
- Can have no mode (all values appear once)
- Can have multiple modes (bimodal, multimodal)
- Only measure that works for categorical data
- Least affected by extreme values
⚠ Common Pitfalls:
- Forgetting to order data: Must sort data before finding median
- Even number of values: Median is average of two middle values, not either one
- Mean of means: Can't average group means unless groups are equal size
- Choosing wrong measure: Use median for skewed data, mean for symmetric data
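The "mean of means" pitfall is easiest to see with numbers. The sketch below uses two hypothetical groups (the sizes and means are made up for illustration) and compares the naive average of the group means with the correctly weighted overall mean.

```python
# Hypothetical data: two classes of different sizes and their mean scores.
group_sizes = [10, 30]
group_means = [60.0, 80.0]

# Naive "mean of means" treats both groups as if they were the same size.
naive_mean = sum(group_means) / len(group_means)

# The overall mean weights each group mean by the number of values behind it.
total = sum(n * m for n, m in zip(group_sizes, group_means))
overall_mean = total / sum(group_sizes)

print(naive_mean)    # 70.0
print(overall_mean)  # 75.0
```

The two results agree only when the groups are the same size.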
Example 1: Central Tendency
Problem: The test scores of 7 students are: 65, 72, 85, 90, 72, 68, 95
(a) Find the mean
(b) Find the median
(c) Find the mode
Solution:
(a) Mean:
\(\bar{x} = \frac{\sum x}{n} = \frac{65 + 72 + 85 + 90 + 72 + 68 + 95}{7}\)
\(= \frac{547}{7} \approx 78.14\)
Mean ≈ 78.14 (or 78.1 to 3 significant figures)
(b) Median:
Step 1: Order the data
65, 68, 72, 72, 85, 90, 95
Step 2: Find middle position
\(n = 7\) (odd), so position = \(\frac{7+1}{2} = 4\)
The 4th value in ordered list is 72
Median = 72
(c) Mode:
Count frequency of each value:
65 (once), 68 (once), 72 (twice), 85 (once), 90 (once), 95 (once)
72 appears most frequently (2 times)
Mode = 72
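As a quick check of Example 1, the same three measures can be computed with Python's standard `statistics` module. This is only a verification sketch; in the exam you would show the working above or use your GDC.

```python
import statistics

scores = [65, 72, 85, 90, 72, 68, 95]

mean = statistics.mean(scores)
median = statistics.median(scores)     # sorts the data internally
modes = statistics.multimode(scores)   # list of all most-frequent values

print(round(mean, 2))  # 78.14
print(median)          # 72
print(modes)           # [72]
```

`multimode` returns a list, so a bimodal dataset would give two values rather than one.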
2. Measures of Spread (Dispersion)
Range and Interquartile Range
Simple Measures of Spread:
Range
\(\text{Range} = \text{Maximum} - \text{Minimum}\)
Simple but affected by outliers
Interquartile Range (IQR)
\(\text{IQR} = Q_3 - Q_1\)
\(Q_1\): Lower quartile (25th percentile) - median of lower half
\(Q_2\): Median (50th percentile)
\(Q_3\): Upper quartile (75th percentile) - median of upper half
IQR measures spread of middle 50% of data (robust to outliers)
Variance and Standard Deviation
Population Variance
\(\sigma^2 = \frac{\sum(x - \mu)^2}{N}\)
where \(\mu\) is population mean, \(N\) is population size
Sample Variance
\(s^2 = \frac{\sum(x - \bar{x})^2}{n-1}\)
where \(\bar{x}\) is sample mean, \(n\) is sample size (divide by \(n-1\) for unbiased estimate)
Standard Deviation
\(\sigma = \sqrt{\sigma^2}\) (population)
\(s = \sqrt{s^2}\) (sample)
Square root of variance; in same units as original data
Alternative Formula (easier for calculation):
\(\sigma^2 = \frac{\sum x^2}{N} - \mu^2\)
\(s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1}\) (exact; \(\frac{\sum x^2}{n} - \bar{x}^2\) is only a close approximation for large \(n\))
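A short numerical check can make these formulas concrete. The sketch below uses a small hypothetical dataset (not from this guide) to show that the definition and the shortcut give the same population variance, and that the sample variance divides the same sum of squared deviations by \(n - 1\) instead of \(n\).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values chosen for easy arithmetic
n = len(data)
mean = sum(data) / n

# Definition: average squared deviation from the mean.
pop_var_definition = sum((x - mean) ** 2 for x in data) / n

# Shortcut: mean of the squares minus the square of the mean.
pop_var_shortcut = sum(x * x for x in data) / n - mean ** 2

# Sample variance: same sum of squared deviations, divided by n - 1.
sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)

print(pop_var_definition, pop_var_shortcut)  # 4.0 4.0
print(math.sqrt(pop_var_definition))         # 2.0
print(round(sample_var, 3))                  # 4.571

# The standard library gives the same answers.
print(math.isclose(statistics.pvariance(data), pop_var_definition))  # True
print(math.isclose(statistics.variance(data), sample_var))           # True
```

The sample variance is always slightly larger than the population variance for the same data, and the gap shrinks as \(n\) grows.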
Outliers
Outlier Detection Rule:
A value is an outlier if:
\(x < Q_1 - 1.5 \times \text{IQR}\)
or
\(x > Q_3 + 1.5 \times \text{IQR}\)
💡 Spread Tips:
- Variance has squared units; standard deviation has original units
- Larger standard deviation = more spread out data
- IQR is robust (not affected by outliers), range is not
- GDC can calculate all these measures quickly—use for verification
Example 2: Measures of Spread (IB-Style)
Problem: A dataset has values: 12, 15, 18, 20, 22, 25, 30
(a) Find the range
(b) Find \(Q_1\), \(Q_2\), and \(Q_3\)
(c) Find the interquartile range
(d) Determine if there are any outliers
Solution:
(a) Range:
Range = Maximum - Minimum = 30 - 12 = 18
Range = 18
(b) Quartiles:
Data already ordered: 12, 15, 18, 20, 22, 25, 30
\(n = 7\) (odd)
\(Q_2\) (Median): Position \(\frac{7+1}{2} = 4\)
\(Q_2 = 20\)
\(Q_1\): Median of lower half (12, 15, 18)
\(Q_1 = 15\)
\(Q_3\): Median of upper half (22, 25, 30)
\(Q_3 = 25\)
\(Q_1 = 15\), \(Q_2 = 20\), \(Q_3 = 25\)
(c) Interquartile Range:
\(\text{IQR} = Q_3 - Q_1 = 25 - 15 = 10\)
IQR = 10
(d) Outliers:
Lower bound: \(Q_1 - 1.5 \times \text{IQR} = 15 - 1.5(10) = 15 - 15 = 0\)
Upper bound: \(Q_3 + 1.5 \times \text{IQR} = 25 + 1.5(10) = 25 + 15 = 40\)
All values (12 to 30) fall within [0, 40]
No outliers
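The steps in Example 2 can be sketched in Python as a check. The helper `quartiles` is a hypothetical function written for this guide; it uses the median-of-halves method from the solution above (the middle value is left out of both halves when \(n\) is odd). Other software may use a different quartile convention and give slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # skip the middle value when n is odd
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [12, 15, 18, 20, 22, 25, 30]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(max(data) - min(data))     # 18
print(q1, q2, q3)                # 15 20 25
print(iqr)                       # 10
print(lower_fence, upper_fence)  # 0.0 40.0
print(outliers)                  # []
```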
3. Data Representation
Stem-and-Leaf Plots
What is a Stem-and-Leaf Plot?
Displays data by splitting each value into a "stem" (leading digit(s)) and "leaf" (trailing digit)
Example: For data 23, 25, 31, 34, 37, 42
Stem | Leaf
2 | 3 5
3 | 1 4 7
4 | 2
Key: 2|3 means 23
Advantages: Shows actual data values, easy to find median and mode, reveals distribution shape
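To see how the stems and leaves are separated, here is a minimal sketch that rebuilds the plot above. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers with the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return plot lines for non-negative integers, units digit as the leaf."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 23 -> stem 2, leaf 3
        stems[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if empty,
    # so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

for line in stem_and_leaf([23, 25, 31, 34, 37, 42]):
    print(line)
# 2 | 3 5
# 3 | 1 4 7
# 4 | 2
```

Sorting first guarantees the leaves in each row are in increasing order, which is what makes the median easy to count off.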
Box-and-Whisker Plots (Box Plots)
Box Plot Components
A box plot displays five key numbers:
- Minimum: Smallest value (excluding outliers)
- \(Q_1\): Lower quartile (start of box)
- Median (\(Q_2\)): Line inside box
- \(Q_3\): Upper quartile (end of box)
- Maximum: Largest value (excluding outliers)
Box: Represents middle 50% of data (IQR)
Whiskers: Extend to the smallest and largest values that are not outliers (values within 1.5×IQR of the quartiles)
Outliers: Plotted as individual points beyond whiskers
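The sketch below shows how the numbers needed for a box plot could be found when an outlier is present. The dataset is hypothetical, the function name `box_plot_summary` is made up for this example, and the quartiles use the median-of-halves method.

```python
import statistics

def box_plot_summary(values):
    """Return whisker ends, quartiles, median, and outliers for a box plot."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = statistics.median(ordered[:n // 2])
    q3 = statistics.median(ordered[(n + 1) // 2:])
    median = statistics.median(ordered)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "whisker_low": inside[0],     # smallest value that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],   # largest value that is not an outlier
        "outliers": outliers,         # plotted as individual points
    }

summary = box_plot_summary([12, 15, 18, 20, 22, 25, 30, 50])  # hypothetical data
print(summary)
# {'whisker_low': 12, 'q1': 16.5, 'median': 21.0, 'q3': 27.5,
#  'whisker_high': 30, 'outliers': [50]}
```

Here the upper fence is 27.5 + 1.5 × 11 = 44, so 50 is an outlier and the upper whisker stops at 30.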
Cumulative Frequency
Cumulative Frequency Graphs:
Cumulative frequency is the running total of frequencies up to a given value
Using Cumulative Frequency to find:
- Median: Value at \(\frac{n}{2}\)th position
- \(Q_1\): Value at \(\frac{n}{4}\)th position
- \(Q_3\): Value at \(\frac{3n}{4}\)th position
- Percentiles: \(k\)th percentile at \(\frac{kn}{100}\)th position
Cumulative Frequency Curve (Ogive):
- Plot cumulative frequency against upper class boundaries
- Join points with smooth S-shaped curve
- Read off median, quartiles from graph
- Curve always increases (never decreases)
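Reading values off a cumulative frequency graph can be imitated numerically. The sketch below uses a hypothetical grouped frequency table and joins the plotted points with straight lines, so its answers are estimates that may differ slightly from readings taken off a hand-drawn smooth curve. The function name `read_off` is made up for this example.

```python
from itertools import accumulate

upper_boundaries = [10, 20, 30, 40, 50]   # hypothetical class upper boundaries
frequencies = [4, 10, 16, 7, 3]           # hypothetical class frequencies
lowest_boundary = 0                       # lower boundary of the first class

cumulative = list(accumulate(frequencies))   # running total of frequencies
n = cumulative[-1]                           # total frequency

def read_off(position):
    """Estimate the data value at a cumulative-frequency position.

    Uses linear interpolation between plotted points and assumes
    there are no empty classes.
    """
    xs = [lowest_boundary] + upper_boundaries
    cfs = [0] + cumulative
    for i in range(1, len(xs)):
        if position <= cfs[i]:
            fraction = (position - cfs[i - 1]) / (cfs[i] - cfs[i - 1])
            return xs[i - 1] + fraction * (xs[i] - xs[i - 1])
    raise ValueError("position exceeds the total frequency")

print(cumulative)            # [4, 14, 30, 37, 40]
print(read_off(n / 2))       # 23.75  (median)
print(read_off(n / 4))       # 16.0   (Q1)
print(read_off(3 * n / 4))   # 30.0   (Q3)
```

The same function gives the \(k\)th percentile when called with `k * n / 100`.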
⚠ Representation Pitfalls:
- Box plot scale: Always use consistent, clearly labeled scale
- Outliers: Mark outliers separately, don't include in whiskers
- Cumulative frequency: Never decreases; starts at 0 and ends at \(n\)
- Reading from graph: Interpolate carefully between plotted points
📋 Statistics Quick Reference
| Measure | Formula | What it Measures |
|---|---|---|
| Mean | \(\bar{x} = \frac{\sum x}{n}\) | Average value |
| Median | Middle value when ordered | Central value (robust) |
| Range | Max - Min | Total spread |
| IQR | \(Q_3 - Q_1\) | Spread of middle 50% |
| Variance | \(s^2 = \frac{\sum(x-\bar{x})^2}{n-1}\) | Average squared deviation |
| Std Dev | \(s = \sqrt{s^2}\) | Typical deviation from mean |
🎯 IB Exam Strategy
Common Question Types:
- "Calculate mean, median, mode": Show working for each measure
- "Find quartiles and IQR": Order data first, find positions carefully
- "Draw box plot": Calculate five-number summary, mark outliers separately
- "Identify outliers": Use \(Q_1 - 1.5 \times \text{IQR}\) and \(Q_3 + 1.5 \times \text{IQR}\)
- "From cumulative frequency graph": Read median at \(\frac{n}{2}\), quartiles at \(\frac{n}{4}\) and \(\frac{3n}{4}\)
Key Reminders:
- Always order data before finding median or quartiles
- Use GDC for standard deviation and variance calculations
- IQR and median are robust (not affected by outliers)
- Mean and standard deviation are affected by outliers
- Show all working—partial credit available
🎉 Master Statistics!
Statistics transforms raw data into meaningful insights. From medical research determining drug effectiveness to businesses analyzing customer behavior, statistical methods provide the foundation for evidence-based decision-making. Mastering measures of central tendency, spread, and visual representations prepares you for advanced study and real-world applications!
Key Success Factors:
- ✓ Mean: sum divided by count (affected by outliers)
- ✓ Median: middle value when ordered (robust to outliers)
- ✓ IQR = \(Q_3 - Q_1\) (spread of middle 50%)
- ✓ Standard deviation measures typical distance from mean
- ✓ Outlier if outside \(Q_1 - 1.5 \times \text{IQR}\) to \(Q_3 + 1.5 \times \text{IQR}\)
- ✓ Box plot shows five-number summary at a glance
Order First • Calculate Carefully • Interpret Meaningfully
Master statistics and excel in IB Mathematics! 🚀