IB Mathematics AA – Topic 4: Statistics & Probability

Comprehensive Guide to Statistics

Introduction to Statistics

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. From medical trials and economic forecasts to sports analytics and quality control, statistical methods help us extract meaningful patterns from numerical information and quantify uncertainty in our conclusions.

Key concepts: Measures of central tendency (mean, median, mode) describe the "center" or typical value of a dataset. Measures of dispersion (range, interquartile range, variance, standard deviation) quantify how spread out the data is. Visual representations like box plots, stem-and-leaf diagrams, and cumulative frequency curves reveal patterns and distributions at a glance.

Why statistics matters: Every scientific study, business decision, and policy evaluation relies on statistical analysis. Understanding variability helps assess reliability—a medicine with consistent effects is more trustworthy than one with wildly varying results. Comparing datasets requires both central tendency and spread to tell the complete story.

In this guide: We'll calculate and interpret the mean, median, and mode; use variance and standard deviation as measures of spread; calculate the interquartile range and use it to identify outliers; create and interpret stem-and-leaf plots and box-and-whisker diagrams; and work with cumulative frequency graphs and percentiles. All of these are essential skills for IB exam success.

1. Measures of Central Tendency

The Three Averages

Central Tendency Measures

Mean (Arithmetic Average)

\(\bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\)

Properties:

  • Uses all data values
  • Affected by outliers (extreme values)
  • Most commonly used measure
  • Can be any value, not necessarily in the dataset

Median (Middle Value)

Order data from smallest to largest, then find middle value

If \(n\) is odd: median is the middle value at position \(\frac{n+1}{2}\)

If \(n\) is even: median is the average of the two middle values, at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\)

Properties:

  • Not affected by outliers (robust)
  • Is an actual data value when \(n\) is odd; may fall between two data values when \(n\) is even
  • Divides dataset into two equal halves
  • Better than mean for skewed distributions

Mode (Most Frequent Value)

The value that appears most often in the dataset

Properties:

  • Can have no mode (all values appear once)
  • Can have multiple modes (bimodal, multimodal)
  • Only measure that works for categorical data
  • Least affected by extreme values

⚠ Common Pitfalls:

  • Forgetting to order data: Must sort data before finding median
  • Even number of values: Median is average of two middle values, not either one
  • Mean of means: Can't average group means unless groups are equal size; otherwise weight each group mean by its group size
  • Choosing wrong measure: Use median for skewed data, mean for symmetric data
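
The mean-of-means pitfall can be checked numerically. The sketch below uses made-up class sizes and class means (they are assumptions for illustration, not data from this guide) to show that the simple average of two group means differs from the true combined mean when the groups are unequal in size.

```python
# Hypothetical data: (number of students, class mean) for two classes
groups = [(10, 80.0), (30, 60.0)]

# Naive approach: average the two means, ignoring class sizes
naive_mean = sum(mean for _, mean in groups) / len(groups)

# Correct approach: weight each mean by its group size
total_students = sum(size for size, _ in groups)
combined_mean = sum(size * mean for size, mean in groups) / total_students

print(naive_mean)     # 70.0
print(combined_mean)  # 65.0
```

The larger class pulls the combined mean toward its own mean, which the naive average misses.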

Example 1: Central Tendency

Problem: The test scores of 7 students are: 65, 72, 85, 90, 72, 68, 95

(a) Find the mean

(b) Find the median

(c) Find the mode

Solution:

(a) Mean:

\(\bar{x} = \frac{\sum x}{n} = \frac{65 + 72 + 85 + 90 + 72 + 68 + 95}{7}\)

\(= \frac{547}{7} \approx 78.14\)

Mean ≈ 78.14 (78.1 to 3 significant figures)

(b) Median:

Step 1: Order the data

65, 68, 72, 72, 85, 90, 95

Step 2: Find middle position

\(n = 7\) (odd), so position = \(\frac{7+1}{2} = 4\)

The 4th value in ordered list is 72

Median = 72

(c) Mode:

Count frequency of each value:

65 (once), 68 (once), 72 (twice), 85 (once), 90 (once), 95 (once)

72 appears most frequently (2 times)

Mode = 72
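
The three answers in Example 1 can be checked with Python's standard `statistics` module. This is only a verification sketch; in the exam you would use your GDC or show the working by hand.

```python
import statistics

scores = [65, 72, 85, 90, 72, 68, 95]

mean_score = statistics.mean(scores)      # sum of values divided by n
median_score = statistics.median(scores)  # sorts internally, then takes the middle value
modes = statistics.multimode(scores)      # list of all most-frequent values

print(round(mean_score, 2))  # 78.14
print(median_score)          # 72
print(modes)                 # [72]
```

`multimode` returns a list, so a bimodal dataset would give two values rather than raising an error.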

2. Measures of Spread (Dispersion)

Range and Interquartile Range

Simple Measures of Spread:

Range

\(\text{Range} = \text{Maximum} - \text{Minimum}\)

Simple but affected by outliers

Interquartile Range (IQR)

\(\text{IQR} = Q_3 - Q_1\)

\(Q_1\): Lower quartile (25th percentile) - median of the lower half (when \(n\) is odd, leave the median itself out of both halves)

\(Q_2\): Median (50th percentile)

\(Q_3\): Upper quartile (75th percentile) - median of the upper half

IQR measures spread of middle 50% of data (robust to outliers)
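
The "median of each half" method can be sketched in Python as follows. The function name `quartiles` and the sample data are illustrative assumptions. The sketch leaves the middle value out of both halves when \(n\) is odd, which is one common convention; other software may interpolate and give slightly different quartiles.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Needs at least two values so that each half is non-empty.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # skips the middle value when n is odd
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

q1, q2, q3 = quartiles([3, 5, 7, 8, 9, 11, 13, 15])
print(q1, q2, q3)   # 6.0 8.5 12.0
print(q3 - q1)      # IQR: 6.0
```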

Variance and Standard Deviation


Population Variance

\(\sigma^2 = \frac{\sum(x - \mu)^2}{N}\)

where \(\mu\) is population mean, \(N\) is population size

Sample Variance

\(s^2 = \frac{\sum(x - \bar{x})^2}{n-1}\)

where \(\bar{x}\) is sample mean, \(n\) is sample size (divide by \(n-1\) for unbiased estimate)

Standard Deviation

\(\sigma = \sqrt{\sigma^2}\) (population)

\(s = \sqrt{s^2}\) (sample)

Square root of variance; in same units as original data
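
To see the difference between dividing by \(N\) and by \(n-1\), the sketch below applies both definitions to a small made-up dataset (the values are an assumption chosen so the arithmetic is clean). It uses the standard `statistics` module, where the `p` prefix marks the population versions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5

pop_var = statistics.pvariance(data)   # sum of squared deviations divided by N
samp_var = statistics.variance(data)   # sum of squared deviations divided by n - 1
pop_sd = statistics.pstdev(data)       # square root of the population variance
samp_sd = statistics.stdev(data)       # square root of the sample variance

print(float(pop_var), pop_sd)                  # 4.0 2.0
print(round(samp_var, 3), round(samp_sd, 3))   # 4.571 2.138
```

The sample versions are always slightly larger because the same sum of squared deviations (32 here) is divided by 7 instead of 8.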

Alternative Formula (easier for calculation):

\(\sigma^2 = \frac{\sum x^2}{N} - \mu^2\)

\(s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1}\) (exact; the simpler \(\frac{\sum x^2}{n} - \bar{x}^2\) is only an approximation to \(s^2\) for large \(n\))
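
The population shortcut follows from expanding the square and using \(\sum x = N\mu\). A short derivation, written with the same symbols as the definitions above:

```latex
\begin{aligned}
\sigma^2 &= \frac{\sum (x - \mu)^2}{N}
          = \frac{\sum x^2 - 2\mu \sum x + N\mu^2}{N} \\
         &= \frac{\sum x^2}{N} - 2\mu \cdot \mu + \mu^2
          = \frac{\sum x^2}{N} - \mu^2
\end{aligned}
```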

Outliers

Outlier Detection Rule:

A value is an outlier if:

\(x < Q_1 - 1.5 \times \text{IQR}\)

or

\(x > Q_3 + 1.5 \times \text{IQR}\)
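
The rule translates directly into code. In this sketch the function names and the quartile values are hypothetical, chosen only to illustrate the two bounds (often called fences).

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3):
    """Apply the 1.5 x IQR rule to a single value."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper

# Hypothetical quartiles: Q1 = 40, Q3 = 60, so IQR = 20
print(outlier_fences(40, 60))   # (10.0, 90.0)
print(is_outlier(95, 40, 60))   # True
print(is_outlier(90, 40, 60))   # False: a value exactly on the bound is not an outlier
```

The strict inequalities match the rule as stated: only values strictly beyond a bound are flagged.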

💡 Spread Tips:

  • Variance has squared units; standard deviation has original units
  • Larger standard deviation = more spread out data
  • IQR is robust (not affected by outliers), range is not
  • GDC can calculate all these measures quickly—use for verification

Example 2: Measures of Spread (IB-Style)

Problem: A dataset has values: 12, 15, 18, 20, 22, 25, 30

(a) Find the range

(b) Find \(Q_1\), \(Q_2\), and \(Q_3\)

(c) Find the interquartile range

(d) Determine if there are any outliers

Solution:

(a) Range:

Range = Maximum - Minimum = 30 - 12 = 18

Range = 18

(b) Quartiles:

Data already ordered: 12, 15, 18, 20, 22, 25, 30

\(n = 7\) (odd)

\(Q_2\) (Median): Position \(\frac{7+1}{2} = 4\)

\(Q_2 = 20\)

\(Q_1\): Median of lower half (12, 15, 18)

\(Q_1 = 15\)

\(Q_3\): Median of upper half (22, 25, 30)

\(Q_3 = 25\)

\(Q_1 = 15\), \(Q_2 = 20\), \(Q_3 = 25\)

(c) Interquartile Range:

\(\text{IQR} = Q_3 - Q_1 = 25 - 15 = 10\)

IQR = 10

(d) Outliers:

Lower bound: \(Q_1 - 1.5 \times \text{IQR} = 15 - 1.5(10) = 15 - 15 = 0\)

Upper bound: \(Q_3 + 1.5 \times \text{IQR} = 25 + 1.5(10) = 25 + 15 = 40\)

All values (12 to 30) fall within [0, 40]

No outliers

3. Data Representation

Stem-and-Leaf Plots

What is a Stem-and-Leaf Plot?

Displays data by splitting each value into a "stem" (leading digit(s)) and "leaf" (trailing digit)

Example: For data 23, 25, 31, 34, 37, 42

Stem | Leaf
2 | 3 5
3 | 1 4 7
4 | 2

Key: 2|3 means 23

Advantages: Shows actual data values, easy to find median and mode, reveals distribution shape
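
The splitting step can be sketched in Python. The function name `stem_and_leaf` is an illustrative assumption, and the sketch only handles non-negative whole numbers with a single-digit leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for non-negative whole numbers."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 23 -> stem 2, leaf 3
        rows[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the data stay visible
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

for row in stem_and_leaf([23, 25, 31, 34, 37, 42]):
    print(row)
# 2 | 3 5
# 3 | 1 4 7
# 4 | 2
```

Sorting first guarantees the leaves in each row appear in increasing order, which is what makes the median easy to count off.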

Box-and-Whisker Plots (Box Plots)

Box Plot Components

A box plot displays five key numbers:

  • Minimum: Smallest value (excluding outliers)
  • \(Q_1\): Lower quartile (start of box)
  • Median (\(Q_2\)): Line inside box
  • \(Q_3\): Upper quartile (end of box)
  • Maximum: Largest value (excluding outliers)

Box: Represents middle 50% of data (IQR)

Whiskers: Extend to the smallest and largest data values that are not outliers (that is, values no more than 1.5×IQR below \(Q_1\) or above \(Q_3\))

Outliers: Plotted as individual points beyond whiskers
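
The values needed to draw a box plot can be collected in one step. The sketch below assumes the median-of-halves convention for quartiles and uses a made-up dataset with one extreme value; the function name `box_plot_values` is hypothetical.

```python
import statistics

def box_plot_values(data):
    """Return the values needed to draw a box plot with outliers marked separately."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = statistics.median(ordered[:n // 2])            # median of lower half
    q3 = statistics.median(ordered[n // 2 + n % 2:])    # median of upper half
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "whisker_low": inside[0],      # smallest non-outlier
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        "whisker_high": inside[-1],    # largest non-outlier
        "outliers": [x for x in ordered if x < lower_fence or x > upper_fence],
    }

summary = box_plot_values([12, 15, 18, 20, 22, 25, 30, 60])
print(summary["whisker_low"], summary["whisker_high"])   # 12 30
print(summary["outliers"])                               # [60]
```

Here the upper whisker stops at 30, not 60, because 60 lies above \(Q_3 + 1.5 \times \text{IQR} = 44\) and is plotted as a separate point.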

Cumulative Frequency

Cumulative Frequency Graphs:

Cumulative frequency is the running total of frequencies up to a given value

Using Cumulative Frequency to find:

  • Median: Value at \(\frac{n}{2}\)th position
  • \(Q_1\): Value at \(\frac{n}{4}\)th position
  • \(Q_3\): Value at \(\frac{3n}{4}\)th position
  • Percentiles: \(k\)th percentile at \(\frac{kn}{100}\)th position

Cumulative Frequency Curve (Ogive):

  • Plot cumulative frequency against upper class boundaries
  • Join points with smooth S-shaped curve
  • Read off median, quartiles from graph
  • Curve never decreases (it rises or stays level as the value increases)
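
Reading values off a cumulative frequency graph can be imitated in code. This sketch uses made-up grouped data and joins the plotted points with straight lines instead of a smooth curve, so its answers are estimates that may differ slightly from a hand-drawn graph. The names `boundaries`, `frequencies` and `value_at` are illustrative assumptions.

```python
from itertools import accumulate

# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40, 40-50
boundaries = [0, 10, 20, 30, 40, 50]   # lowest boundary, then each upper class boundary
frequencies = [4, 10, 16, 14, 6]

# Running totals, starting from 0 at the lowest boundary
cumulative = [0] + list(accumulate(frequencies))   # [0, 4, 14, 30, 44, 50]
n = cumulative[-1]

def value_at(position):
    """Estimate the data value at a cumulative-frequency position by linear interpolation."""
    for i in range(1, len(cumulative)):
        if position <= cumulative[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            c0, c1 = cumulative[i - 1], cumulative[i]
            return x0 + (position - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("position exceeds total frequency")

print(value_at(n / 2))                  # median estimate: 26.875
print(value_at(n / 4))                  # Q1 estimate: 18.5
print(round(value_at(3 * n / 4), 2))    # Q3 estimate: 35.36
```

The same function gives any percentile: the \(k\)th percentile is `value_at(k * n / 100)`.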

⚠ Representation Pitfalls:

  • Box plot scale: Always use consistent, clearly labeled scale
  • Outliers: Mark outliers separately, don't include in whiskers
  • Cumulative frequency: Never decreases, starts at 0, ends at \(n\)
  • Reading from graph: Interpolate carefully between plotted points

📋 Statistics Quick Reference

Measure | Formula | What it Measures
Mean | \(\bar{x} = \frac{\sum x}{n}\) | Average value
Median | Middle value when ordered | Central value (robust)
Range | Max - Min | Total spread
IQR | \(Q_3 - Q_1\) | Spread of middle 50%
Variance | \(s^2 = \frac{\sum(x-\bar{x})^2}{n-1}\) | Average squared deviation
Std Dev | \(s = \sqrt{s^2}\) | Typical deviation from mean

🎯 IB Exam Strategy

Common Question Types:

  • "Calculate mean, median, mode": Show working for each measure
  • "Find quartiles and IQR": Order data first, find positions carefully
  • "Draw box plot": Calculate five-number summary, mark outliers separately
  • "Identify outliers": Use \(Q_1 - 1.5 \times \text{IQR}\) and \(Q_3 + 1.5 \times \text{IQR}\)
  • "From cumulative frequency graph": Read median at \(\frac{n}{2}\), quartiles at \(\frac{n}{4}\) and \(\frac{3n}{4}\)

Key Reminders:

  • Always order data before finding median or quartiles
  • Use GDC for standard deviation and variance calculations
  • IQR and median are robust (not affected by outliers)
  • Mean and standard deviation are affected by outliers
  • Show all working—partial credit available

🎉 Master Statistics!

Statistics transforms raw data into meaningful insights. From medical research determining drug effectiveness to businesses analyzing customer behavior, statistical methods provide the foundation for evidence-based decision-making. Mastering measures of central tendency, spread, and visual representations prepares you for advanced study and real-world applications!

Key Success Factors:

  • ✓ Mean: sum divided by count (affected by outliers)
  • ✓ Median: middle value when ordered (robust to outliers)
  • ✓ IQR = \(Q_3 - Q_1\) (spread of middle 50%)
  • ✓ Standard deviation measures typical distance from mean
  • ✓ Outlier if outside \(Q_1 - 1.5 \times \text{IQR}\) to \(Q_3 + 1.5 \times \text{IQR}\)
  • ✓ Box plot shows five-number summary at a glance

Order First • Calculate Carefully • Interpret Meaningfully
