IB Mathematics AI – Topic 4

Statistics & Probability: Descriptive Statistics

Measures of Central Tendency

Mean (Arithmetic Average)

Definition: The sum of all values divided by the number of values. Of the three measures of central tendency, it is the most sensitive to outliers.

For ungrouped data:

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

For grouped data (frequency table):

\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} \]

where \(f_i\) is the frequency of \(x_i\), and \(x_i\) is the data value (or the class midpoint for grouped intervals).

⚠️ Common Pitfalls & Tips:

  • The mean is highly affected by outliers – consider using median for skewed data
  • For grouped data, use the midpoint of each class interval
  • Always check units and round appropriately (typically 3 significant figures for IB)

Median

Definition: The middle value when data is arranged in ascending order. Not affected by extreme values.

For ungrouped data:

  • If \(n\) is odd: Median = \(\left(\frac{n+1}{2}\right)^{\text{th}}\) value
  • If \(n\) is even: Median = mean of the \(\left(\frac{n}{2}\right)^{\text{th}}\) and \(\left(\frac{n}{2}+1\right)^{\text{th}}\) values

From a cumulative frequency curve (grouped data): Median position = \(\frac{n}{2}\)

⚠️ Common Pitfalls & Tips:

  • Always order the data first before finding the median
  • For even-sized datasets, take the mean of the two middle values
  • Use cumulative frequency curves to estimate median graphically

Mode

Definition: The value(s) that occur most frequently in a dataset.

  • Unimodal: One mode
  • Bimodal: Two modes
  • Multimodal: More than two modes
  • No mode: All values occur with equal frequency

For grouped data: The modal class is the interval with the highest frequency

⚠️ Common Pitfalls & Tips:

  • Mode is the only measure of central tendency that can be used for categorical data
  • A dataset may have no mode or multiple modes
  • For histograms, the modal class is the bar with the highest frequency

📝 Worked Example 1: Ungrouped Data

Question: The test scores of 9 students are: 78, 85, 92, 78, 88, 95, 82, 78, 90

Find: (a) Mean (b) Median (c) Mode

Solution:

(a) Mean:

\[ \bar{x} = \frac{78 + 85 + 92 + 78 + 88 + 95 + 82 + 78 + 90}{9} \]

\[ \bar{x} = \frac{766}{9} = 85.1 \text{ (3 s.f.)} \]

(b) Median:

First, order the data: 78, 78, 78, 82, 85, 88, 90, 92, 95

Since \(n = 9\) (odd), median position = \(\frac{9+1}{2} = 5^{\text{th}}\) value

Median = 85

(c) Mode:

The value 78 appears 3 times (most frequent)

Mode = 78
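The three answers above can be cross-checked in a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names (`scores`, `modes`) are illustrative choices, not part of the syllabus.

```python
import statistics

# Test scores from Worked Example 1
scores = [78, 85, 92, 78, 88, 95, 82, 78, 90]

mean = statistics.mean(scores)        # sum of values / number of values
median = statistics.median(scores)    # sorts internally, then takes the middle value
modes = statistics.multimode(scores)  # list of every most-frequent value

print(f"Mean   = {mean:.3g}")   # formatted to 3 significant figures
print(f"Median = {median}")
print(f"Mode   = {modes}")
```

`multimode` returns a list, so a bimodal dataset would show both modes rather than only the first one found.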

📝 Worked Example 2: Grouped Data

Question: The frequency table shows the ages of 50 people at a concert:

| Age | Frequency | Midpoint | \(f \times x\) |
|---|---|---|---|
| 10-19 | 8 | 14.5 | 116 |
| 20-29 | 15 | 24.5 | 367.5 |
| 30-39 | 18 | 34.5 | 621 |
| 40-49 | 9 | 44.5 | 400.5 |
| Total | 50 | | 1505 |

Calculate the mean age.

Solution:

Using the formula for grouped data:

\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{1505}{50} = 30.1 \text{ years} \]

The modal class is 30-39 (highest frequency = 18)
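The grouped-data formula is just a weighted average of the class midpoints. The sketch below redoes the calculation in Python, assuming the midpoints and frequencies from the table; the list names are illustrative.

```python
# Class labels, midpoints and frequencies from Worked Example 2
labels = ["10-19", "20-29", "30-39", "40-49"]
midpoints = [14.5, 24.5, 34.5, 44.5]
frequencies = [8, 15, 18, 9]

# Sum of f*x and sum of f, exactly as in the formula for grouped data
total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
total_f = sum(frequencies)
grouped_mean = total_fx / total_f

# Modal class: the interval with the highest frequency
modal_class = labels[frequencies.index(max(frequencies))]

print(f"Sum of f*x = {total_fx}, sum of f = {total_f}")
print(f"Estimated mean age = {grouped_mean:.3g} years")
print(f"Modal class = {modal_class}")
```

Because midpoints stand in for the actual ages, the result is an estimate of the mean, not the exact value.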

Measures of Dispersion

Range

\[ \text{Range} = \text{Maximum value} - \text{Minimum value} \]

Quartiles & Interquartile Range (IQR)

Definition: Quartiles divide ordered data into four equal parts.

  • \(Q_1\) (Lower Quartile): 25% of data lies below this value
  • \(Q_2\) (Median): 50% of data lies below this value
  • \(Q_3\) (Upper Quartile): 75% of data lies below this value

Interquartile Range (IQR):

\[ \text{IQR} = Q_3 - Q_1 \]

IQR measures the spread of the middle 50% of data and is resistant to outliers.

Outlier Detection:

A data point is an outlier if:

  • It is less than \(Q_1 - 1.5 \times \text{IQR}\)
  • It is greater than \(Q_3 + 1.5 \times \text{IQR}\)

⚠️ Common Pitfalls & Tips:

  • Your GDC can calculate quartiles quickly – know how to access this function
  • \(Q_2\) is always equal to the median
  • IQR is preferred over range when outliers are present
  • Use cumulative frequency curves to find quartiles graphically

📝 Worked Example 3: Quartiles & IQR

Question: The daily temperatures (°C) recorded over 15 days are:

18, 20, 21, 22, 23, 23, 24, 25, 26, 27, 28, 29, 30, 31, 35

Find: (a) \( Q_1, Q_2, Q_3 \) (b) IQR (c) Identify any outliers

Solution:

(a) Finding Quartiles:

Data is already ordered. \(n = 15\)

\(Q_2\) (Median) = 8th value = 25°C

\(Q_1\) = Median of lower half (first 7 values) = 4th value = 22°C

\(Q_3\) = Median of upper half (last 7 values) = 12th value = 29°C

(b) IQR:

\[ \text{IQR} = Q_3 - Q_1 = 29 - 22 = 7°\text{C} \]

(c) Outliers:

Lower boundary: \(Q_1 - 1.5 \times \text{IQR} = 22 - 1.5(7) = 11.5°\text{C}\)

Upper boundary: \(Q_3 + 1.5 \times \text{IQR} = 29 + 1.5(7) = 39.5°\text{C}\)

Since all values lie between 11.5°C and 39.5°C, there are no outliers.
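The median-of-halves method used above can be sketched in Python as follows. The helper `quartiles` is a hypothetical function written for this example; calculators and software packages may use slightly different quartile conventions, so their results can differ for some datasets.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    When the number of values is odd, the middle value is left out of both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values before the median position
    upper = ordered[half + n % 2:]   # values after the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Daily temperatures from Worked Example 3
temps = [18, 20, 21, 22, 23, 23, 24, 25, 26, 27, 28, 29, 30, 31, 35]

q1, q2, q3 = quartiles(temps)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in temps if t < lower_fence or t > upper_fence]

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"Outlier boundaries: {lower_fence} and {upper_fence}")
print(f"Outliers: {outliers}")
```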

Data Visualization

Histograms

Definition: A graphical representation of grouped continuous data using bars where the area represents frequency.

Key Features:

  • Bars touch (no gaps) – continuous data
  • X-axis: Class intervals (bins)
  • Y-axis: Frequency or frequency density
  • Frequency density used when class widths are unequal

Frequency Density Formula:

\[ \text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}} \]

Area of bar = Frequency

\[ \text{Frequency} = \text{Frequency Density} \times \text{Class Width} \]

⚠️ Common Pitfalls & Tips:

  • Bars must touch in histograms (unlike bar charts)
  • When class widths vary, always use frequency density on y-axis
  • Modal class is the bar with the highest frequency (not necessarily tallest bar if using frequency density)
  • Read scales carefully – check if y-axis shows frequency or frequency density

Box and Whisker Plots (Box Plots)

Definition: A visual summary showing the five-number summary of a dataset.

Five-Number Summary:

  1. Minimum value
  2. \(Q_1\) (Lower Quartile)
  3. \(Q_2\) (Median)
  4. \(Q_3\) (Upper Quartile)
  5. Maximum value

Box Plot Structure:

|---------[     |     ]---------|
Min       Q₁    Q₂    Q₃       Max

  • Box spans from \(Q_1\) to \(Q_3\) (represents middle 50% of data)
  • Line inside box marks the median (\(Q_2\))
  • Whiskers extend to the minimum and maximum values (or, if outliers exist, to the most extreme values that are not outliers)
  • Outliers shown as individual points beyond whiskers

Interpreting Box Plots:

  • Symmetrical: Median in center of box, whiskers equal length
  • Positively skewed: Right whisker longer, median closer to \(Q_1\)
  • Negatively skewed: Left whisker longer, median closer to \(Q_3\)

⚠️ Common Pitfalls & Tips:

  • Box plots don't show individual data points or frequency
  • Always plot outliers separately beyond the whiskers
  • Use the scale carefully when drawing or reading box plots
  • Remember: The box contains 50% of the data (middle portion)

📝 Worked Example 4: Histogram with Unequal Class Widths

Question: The table shows the time (in minutes) students spent on homework:

| Time (minutes) | Frequency | Class Width | Frequency Density |
|---|---|---|---|
| 0-20 | 6 | 20 | 0.3 |
| 20-30 | 12 | 10 | 1.2 |
| 30-50 | 16 | 20 | 0.8 |
| 50-60 | 8 | 10 | 0.8 |

(a) Explain why frequency density must be used. (b) Find the modal class.

Solution:

(a) Frequency density must be used because the class widths are not equal. The classes have widths of 20, 10, 20, and 10 minutes respectively. Using frequency density ensures that the area of each bar represents the frequency correctly.

Calculation check:

For 20-30 interval: \(\text{Frequency Density} = \frac{12}{10} = 1.2\) ✓

(b) The modal class is 30-50 minutes because it has the highest frequency (16 students), not the 20-30 class which has the tallest bar (highest frequency density).
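The frequency density column can be reproduced with a short Python sketch. The tuple layout `(lower, upper, frequency)` is an assumption made for this illustration, and the modal class follows the convention used in this example (highest frequency).

```python
# (lower bound, upper bound, frequency) for each class in Worked Example 4
classes = [(0, 20, 6), (20, 30, 12), (30, 50, 16), (50, 60, 8)]

# Frequency density = frequency / class width
densities = [freq / (high - low) for low, high, freq in classes]

# Bar area = frequency density * class width, which recovers the frequency
areas = [d * (high - low) for d, (low, high, _) in zip(densities, classes)]

# Class with the highest frequency, and class with the tallest bar
highest_frequency = max(classes, key=lambda c: c[2])
tallest_bar = classes[densities.index(max(densities))]

for (low, high, freq), d in zip(classes, densities):
    print(f"{low}-{high}: frequency = {freq}, density = {d:.2f}")
print(f"Highest frequency: {highest_frequency[0]}-{highest_frequency[1]}")
print(f"Tallest bar: {tallest_bar[0]}-{tallest_bar[1]}")
```

The last two lines show why the scale matters: the class with the most students and the class with the tallest bar are different here.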

Cumulative Frequency

Cumulative Frequency Curves

Definition: A running total of frequencies, plotted to show how many data values fall below each upper class boundary.

Key Features:

  • Plot cumulative frequency against the upper class boundary
  • Join points with a smooth curve (ogive)
  • Final point has y-coordinate equal to total frequency (\(n\))
  • Used to estimate median, quartiles, and percentiles

Reading from Cumulative Frequency Curve:

  • Median: Find \(\frac{n}{2}\) on y-axis, read across to curve, then down to x-axis
  • \(Q_1\): Find \(\frac{n}{4}\) on y-axis
  • \(Q_3\): Find \(\frac{3n}{4}\) on y-axis
  • Percentiles: For \(k^{\text{th}}\) percentile, find \(\frac{kn}{100}\) on y-axis

⚠️ Common Pitfalls & Tips:

  • Always plot at upper class boundaries, not midpoints
  • Start cumulative frequency at 0 for the lower boundary of the first class
  • Draw a smooth S-shaped curve, not straight lines between points
  • Use a ruler for horizontal and vertical lines when reading values
  • Check that the last cumulative frequency equals the total \(n\)

📝 Worked Example 5: Cumulative Frequency & Box Plot

Question: The table shows the heights (in cm) of 80 students:

| Height (cm) | Frequency | Cumulative Frequency |
|---|---|---|
| 150 ≤ h < 155 | 8 | 8 |
| 155 ≤ h < 160 | 12 | 20 |
| 160 ≤ h < 165 | 24 | 44 |
| 165 ≤ h < 170 | 22 | 66 |
| 170 ≤ h < 180 | 14 | 80 |

Using the cumulative frequency curve, find: (a) Median (b) \(Q_1\) and \(Q_3\) (c) IQR (d) Estimate how many students have heights between 158 cm and 167 cm

Solution:

(a) Median:

Position of median = \(\frac{n}{2} = \frac{80}{2} = 40^{\text{th}}\) value

From cumulative frequency curve: Read across from 40 on y-axis to curve, then down to x-axis

Median ≈ 164 cm

(b) Quartiles:

\(Q_1\) position = \(\frac{n}{4} = \frac{80}{4} = 20^{\text{th}}\) value → \(Q_1\) = 160 cm (the curve passes through the plotted point (160, 20))

\(Q_3\) position = \(\frac{3n}{4} = \frac{3 \times 80}{4} = 60^{\text{th}}\) value → \(Q_3\) ≈ 168.5 cm

(c) IQR:

\[ \text{IQR} = Q_3 - Q_1 \approx 168.5 - 160 = 8.5 \text{ cm} \]

(d) Number of students between 158 cm and 167 cm:

From cumulative frequency curve:

• At 167 cm: cumulative frequency ≈ 53

• At 158 cm: cumulative frequency ≈ 15

Number of students ≈ 53 - 15 = 38 students

Values read from a hand-drawn curve are estimates, so answers within a small tolerance of these are acceptable.
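Readings from a cumulative frequency curve can be sanity-checked by linear interpolation between the plotted points. The sketch below does this in Python; `interpolate` is a hypothetical helper written for this example, and because it joins the points with straight lines rather than a smooth curve, its results will differ slightly from graph readings.

```python
def interpolate(x_points, y_points, x):
    """Linearly interpolate y at x, given points sorted by increasing x."""
    points = list(zip(x_points, y_points))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the plotted range")

# Plotted points: class boundaries against cumulative frequency,
# starting at 0 for the lower boundary of the first class
boundaries = [150, 155, 160, 165, 170, 180]
cum_freq = [0, 8, 20, 44, 66, 80]
n = cum_freq[-1]

# Height for a given cumulative frequency (read across, then down)
median = interpolate(cum_freq, boundaries, n / 2)
q1 = interpolate(cum_freq, boundaries, n / 4)
q3 = interpolate(cum_freq, boundaries, 3 * n / 4)

# Cumulative frequency for a given height (read up, then across)
below_167 = interpolate(boundaries, cum_freq, 167)
below_158 = interpolate(boundaries, cum_freq, 158)

print(f"Median = {median:.1f} cm, Q1 = {q1:.1f} cm, Q3 = {q3:.1f} cm")
print(f"IQR = {q3 - q1:.1f} cm")
print(f"Students between 158 cm and 167 cm = {below_167 - below_158:.0f}")
```

Swapping the two lists in the call to `interpolate` mirrors the two ways of reading the graph: from the y-axis to the x-axis for quartiles, and from the x-axis to the y-axis for counts.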

📝 Worked Example 6: Complete IB-Style Problem

Question: A box plot shows the distribution of scores for Class A in a mathematics test. Its five-number summary is:

Minimum = 42, \(Q_1\) = 58, Median = 67, \(Q_3\) = 78, Maximum = 95

(a) Calculate the interquartile range.

(b) Determine whether a score of 105 would be considered an outlier.

(c) Comment on the skewness of the distribution.

Solution:

(a) IQR:

\[ \text{IQR} = Q_3 - Q_1 = 78 - 58 = 20 \]

IQR = 20 marks

(b) Outlier Test:

Lower boundary: \(Q_1 - 1.5 \times \text{IQR} = 58 - 1.5(20) = 58 - 30 = 28\)

Upper boundary: \(Q_3 + 1.5 \times \text{IQR} = 78 + 1.5(20) = 78 + 30 = 108\)

Since \(105 < 108\), a score of 105 would NOT be considered an outlier.

(c) Skewness:

Looking at the box plot structure:

  • Median (67) is closer to \(Q_1\) (58) than to \(Q_3\) (78)
  • Distance from \(Q_1\) to Median = 67 - 58 = 9
  • Distance from Median to \(Q_3\) = 78 - 67 = 11
  • Right whisker (95 - 78 = 17) is longer than left whisker (58 - 42 = 16)

The distribution is slightly positively skewed (skewed to the right) because the upper half of the data is more spread out than the lower half.
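The outlier test and the skewness comparison both follow directly from the five-number summary. Here is a minimal Python sketch of that reasoning; `is_outlier` and the variable names are illustrative, and the skewness check is only the informal box-plot comparison used above, not a formal skewness statistic.

```python
# Five-number summary for Class A (Worked Example 6)
minimum, q1, median, q3, maximum = 42, 58, 67, 78, 95

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(value):
    """True if value lies outside the 1.5 * IQR boundaries."""
    return value < lower_fence or value > upper_fence

# Compare the two halves of the box and the two whiskers
lower_box = median - q1
upper_box = q3 - median
left_whisker = q1 - minimum
right_whisker = maximum - q3
positively_skewed = upper_box > lower_box and right_whisker > left_whisker

print(f"IQR = {iqr}, boundaries = {lower_fence} and {upper_fence}")
print(f"Is 105 an outlier? {is_outlier(105)}")
print(f"Box halves: {lower_box} and {upper_box}; whiskers: {left_whisker} and {right_whisker}")
print(f"Positively skewed? {positively_skewed}")
```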

Standard Deviation & Variance

Definition: Standard deviation (\(\sigma\) or \(s\)) measures how far data points typically lie from the mean. Formally, it is the square root of the mean squared deviation from the mean.

For ungrouped data (population):

\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum x_i^2}{n} - \mu^2} \]

Variance: \(\sigma^2\)

For grouped data:

\[ \sigma = \sqrt{\frac{\sum f_i(x_i - \bar{x})^2}{\sum f_i}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2} \]

Interpretation:

  • Larger standard deviation → more spread out data
  • Smaller standard deviation → data clustered around mean
  • Standard deviation = 0 → all values are identical

⚠️ Common Pitfalls & Tips:

  • Always use your GDC for standard deviation calculations in exams
  • Make sure you know the difference between \(\sigma_n\) (population) and \(\sigma_{n-1}\) (sample) on your calculator
  • Standard deviation has the same units as the original data
  • Variance is in squared units
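To see that the two population formulas agree, and how the population and sample versions differ, here is a short Python sketch. It reuses the nine test scores from Worked Example 1 as illustrative data and relies on the standard-library `statistics` module, where `pstdev` divides by \(n\) and `stdev` divides by \(n - 1\).

```python
import math
import statistics

# Illustrative data: the test scores from Worked Example 1
scores = [78, 85, 92, 78, 88, 95, 82, 78, 90]
n = len(scores)
mean = sum(scores) / n

# Definition form: square root of the mean squared deviation
sigma_definition = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)

# Computational form: square root of (mean of squares - square of mean)
sigma_shortcut = math.sqrt(sum(x * x for x in scores) / n - mean ** 2)

sigma_population = statistics.pstdev(scores)  # divides by n
s_sample = statistics.stdev(scores)           # divides by n - 1
variance = statistics.pvariance(scores)       # population variance, in squared units

print(f"Definition form:  {sigma_definition:.3g}")
print(f"Shortcut form:    {sigma_shortcut:.3g}")
print(f"Population sigma: {sigma_population:.3g}")
print(f"Sample s:         {s_sample:.3g}")
print(f"Variance:         {variance:.3g}")
```

The sample value is always slightly larger than the population value for the same data, which is a quick way to tell which one a calculator is displaying.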

📊 Quick Reference Summary

Central Tendency

  • Mean: Average, affected by outliers
  • Median: Middle value, resistant to outliers
  • Mode: Most frequent value

Dispersion

  • Range: Max - Min
  • IQR: \(Q_3 - Q_1\)
  • Standard Deviation: Spread from mean

Visualizations

  • Histograms: Frequency distribution
  • Box Plots: Five-number summary
  • Cumulative Frequency: Running totals

Key Formulas

  • Outliers: below \(Q_1 - 1.5 \times \text{IQR}\) or above \(Q_3 + 1.5 \times \text{IQR}\)
  • Freq Density: \(\frac{\text{Freq}}{\text{Width}}\)
  • Quartile positions: \(\frac{n}{4}\), \(\frac{3n}{4}\)

🖩 GDC Calculator Tips

  • Learn to enter data into lists for statistical analysis
  • Know how to find one-variable statistics (mean, median, \(Q_1\), \(Q_3\), \(\sigma\))
  • Practice creating box plots and histograms on your calculator
  • Understand the difference between \(\sigma_n\) and \(\sigma_{n-1}\) functions
  • Use statistical regression features for bivariate data

✍️ IB Exam Strategy

  1. Show all working – even when using GDC, write down formulas and key steps
  2. Use correct notation – \(\bar{x}\) for mean, \(Q_1\), \(Q_2\), \(Q_3\) for quartiles
  3. Include units – always state units in your final answer
  4. Round appropriately – typically 3 significant figures unless stated otherwise
  5. Draw accurate diagrams – use a ruler for box plots and axes
  6. Interpret in context – relate statistical findings back to the question scenario
  7. Check reasonableness – do your answers make sense in context?