IB Mathematics AA – Topic 4: Statistics & Probability

Comprehensive Guide to Bivariate Statistics

Introduction to Bivariate Statistics

Bivariate statistics analyzes the relationship between two variables—examining whether they move together, in opposite directions, or independently. From predicting house prices based on size to understanding the relationship between study hours and exam scores, bivariate analysis reveals patterns that single-variable statistics cannot detect.

Key concepts: A scatter plot visualizes paired data points to reveal patterns. The regression line (line of best fit) provides a mathematical model to predict one variable from another. The correlation coefficient \(r\) quantifies the strength and direction of the linear relationship, ranging from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation).

Why bivariate statistics matters: Understanding relationships between variables drives decision-making across all fields—economists predict GDP from unemployment rates, medical researchers link cholesterol to heart disease risk, and engineers optimize performance by analyzing multiple factors. However, correlation does not imply causation—a strong relationship doesn't prove one variable causes changes in the other.

In this guide: You will learn to create and interpret scatter plots; distinguish positive, negative, and zero correlation; find and apply the equation of the regression line \(y = ax + b\); calculate and interpret Pearson's correlation coefficient \(r\); assess the strength of linear relationships; make predictions using regression models; and explain the distinction between correlation and causation. All of these are essential for IB exam success.

1. Scatter Plots and Types of Correlation

What is Bivariate Data?

Bivariate Data:

Data consisting of pairs of values for two variables, often denoted as \((x, y)\)

  • Independent variable (\(x\)): Explanatory variable (what you control or measure first)
  • Dependent variable (\(y\)): Response variable (what you predict or observe as result)
  • Scatter plot (scatter diagram): Graph showing all data points plotted as coordinates

Types of Correlation

Correlation Patterns

Positive Correlation

  • As \(x\) increases, \(y\) tends to increase
  • Points slope upward from left to right
  • Correlation coefficient: \(0 < r \leq 1\)

Example: Height and weight, study hours and exam scores

Negative Correlation

  • As \(x\) increases, \(y\) tends to decrease
  • Points slope downward from left to right
  • Correlation coefficient: \(-1 \leq r < 0\)

Example: Car age and value, altitude and temperature

Zero (No) Correlation

  • No clear linear relationship between variables
  • Points scattered randomly
  • Correlation coefficient: \(r \approx 0\)

Example: Shoe size and IQ in adults

Strength of Correlation

Interpreting Correlation Strength:

Value of \(|r|\)    Strength          Description
0.9 to 1.0          Very Strong       Points very close to line
0.7 to 0.9          Strong            Clear linear pattern
0.5 to 0.7          Moderate          Some scatter around line
0.3 to 0.5          Weak              Considerable scatter
0 to 0.3            Very Weak/None    No clear pattern

Note: These are guidelines; interpretation can vary by context
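
The guideline bands can be sketched as a small lookup. This is only an illustration: the function name `describe_strength` is hypothetical, and because adjacent bands share an endpoint (for example 0.7), assigning each boundary value to the stronger band is an assumption made here, not a rule from the guide.

```python
def describe_strength(r: float) -> str:
    """Map a correlation coefficient to the guideline strength labels.

    Assumption: a boundary value such as 0.7 is placed in the stronger band.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength depends only on |r|; the sign gives direction
    if size >= 0.9:
        return "Very Strong"
    if size >= 0.7:
        return "Strong"
    if size >= 0.5:
        return "Moderate"
    if size >= 0.3:
        return "Weak"
    return "Very Weak/None"


print(describe_strength(-0.85))  # Strong (negative direction, strong relationship)
```

Remember that the labels are guidelines only, so an exam answer should still describe the correlation in the context of the data.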

2. Regression Line (Line of Best Fit)

Equation of the Regression Line

Regression Line Formula

\(y = ax + b\)

or equivalently: \(y = mx + c\)

where:

  • \(a\) (or \(m\)): gradient (slope) of the line
  • \(b\) (or \(c\)): y-intercept (value of \(y\) when \(x = 0\))

Finding the Regression Line

Method 1: Using GDC (Recommended for IB)

  1. Enter \(x\)-values in List 1 (L1)
  2. Enter \(y\)-values in List 2 (L2)
  3. Access Statistics menu → Calculation → Linear Regression
  4. Select LinReg (ax+b) or LinReg (y=mx+c)
  5. Calculator displays: \(a\) (or \(m\)), \(b\) (or \(c\)), and \(r\)

Method 2: Least Squares Formulas (For Understanding)

Gradient:

\(a = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}\)

Y-intercept:

\(b = \bar{y} - a\bar{x}\)

where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\)

Note: The regression line always passes through the point \((\bar{x}, \bar{y})\)
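
To see the least squares formulas at work, here is a minimal Python sketch. The function name `least_squares_line` and the four data pairs are hypothetical, chosen so the points lie exactly on \(y = 2x + 1\); in an exam you would still use the GDC.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the regression line y = ax + b using the sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    s_xy = sum_xy - sum_x * sum_y / n   # S_xy
    s_xx = sum_x2 - sum_x ** 2 / n      # S_xx
    if s_xx == 0:
        raise ValueError("all x-values are equal, so the gradient is undefined")
    a = s_xy / s_xx
    b = sum_y / n - a * sum_x / n       # b = y-bar - a * x-bar
    return a, b


# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = least_squares_line(xs, ys)
print(a, b)  # 2.0 1.0

# The line passes through the mean point (x-bar, y-bar)
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
print(abs(a * mean_x + b - mean_y) < 1e-12)  # True
```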

Using the Regression Line

Applications:

  • Interpolation: Predicting \(y\) for \(x\) values within the data range (generally reliable when the correlation is strong)
  • Extrapolation: Predicting \(y\) for \(x\) values outside the data range (less reliable)
  • Interpretation: Gradient shows how much \(y\) changes per unit change in \(x\)
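
The difference between interpolation and extrapolation can be checked mechanically by comparing the new \(x\) with the smallest and largest observed \(x\)-values. The sketch below assumes a hypothetical helper `predict` and a made-up line \(y = 2x + 1\) fitted to \(x\)-values from 1 to 4.

```python
def predict(a, b, x, observed_xs):
    """Return the predicted y and whether the prediction interpolates or extrapolates."""
    inside = min(observed_xs) <= x <= max(observed_xs)
    kind = "interpolation" if inside else "extrapolation"
    return a * x + b, kind


observed_xs = [1, 2, 3, 4]                  # hypothetical x-values used to fit the line
print(predict(2, 1, 2.5, observed_xs))      # (6.0, 'interpolation')
print(predict(2, 1, 10, observed_xs))       # (21, 'extrapolation')
```

The second prediction is still computed, but the label is a reminder to treat it with caution.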

⚠ Regression Pitfalls:

  • Extrapolation danger: Predictions far outside data range may be inaccurate
  • Causation assumption: Regression doesn't prove \(x\) causes \(y\)!
  • Outliers: Single extreme values can heavily influence the regression line
  • Non-linear data: Linear regression only appropriate for roughly linear relationships
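
The outlier pitfall is easy to demonstrate numerically. In this sketch the data are illustrative only: five points on a perfect line, then the same points with the last \(y\)-value replaced by an extreme one. The helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least squares gradient and intercept, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    return a, mean_y - a * mean_x


xs = [2, 4, 6, 8, 10]
clean = [50, 60, 70, 80, 90]         # illustrative data on a perfect line
with_outlier = [50, 60, 70, 80, 40]  # same data, last point replaced by an outlier

print(fit_line(xs, clean))           # (5.0, 40.0)
print(fit_line(xs, with_outlier))    # (0.0, 60.0)
```

One changed point is enough to flatten the gradient from 5 to 0, which is why a scatter plot should be checked for outliers before trusting the regression line.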

Example 1: Finding and Using Regression Line

Problem: A study records hours studied (\(x\)) and test scores (\(y\)) for 5 students:

Hours (\(x\))    Score (\(y\))
2                50
4                60
6                70
8                80
10               90

(a) Find the equation of the regression line

(b) Interpret the gradient

(c) Predict the score for a student who studies 7 hours

Solution:

(a) Finding regression line equation:

Using GDC:

Enter data: L1 = {2, 4, 6, 8, 10}, L2 = {50, 60, 70, 80, 90}

Run LinReg(ax+b)

GDC output: \(a = 5\), \(b = 40\)

Equation: \(y = 5x + 40\)

(b) Interpreting gradient:

The gradient \(a = 5\) means:

For each additional hour studied, the test score increases by 5 marks on average

(c) Prediction for 7 hours:

Substitute \(x = 7\) into \(y = 5x + 40\):

\(y = 5(7) + 40 = 35 + 40 = 75\)

Predicted score: 75 marks

Note: This is interpolation (7 is within the range 2 to 10), so prediction is reliable
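
As a check on the GDC output, the same result follows from the least squares formulas of Method 2. The sums below are computed directly from the five data pairs in the table:

```latex
\begin{align*}
n &= 5, \quad \sum x = 30, \quad \sum y = 350, \quad \sum xy = 2300, \quad \sum x^2 = 220 \\
S_{xy} &= \sum xy - \frac{(\sum x)(\sum y)}{n} = 2300 - \frac{(30)(350)}{5} = 2300 - 2100 = 200 \\
S_{xx} &= \sum x^2 - \frac{(\sum x)^2}{n} = 220 - \frac{30^2}{5} = 220 - 180 = 40 \\
a &= \frac{S_{xy}}{S_{xx}} = \frac{200}{40} = 5 \\
b &= \bar{y} - a\bar{x} = 70 - 5(6) = 40
\end{align*}
```

This agrees with \(y = 5x + 40\). Because every point lies exactly on this line, the correlation coefficient for this data set is \(r = 1\).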

3. Pearson's Correlation Coefficient (\(r\))

Definition and Formula

Pearson's Correlation Coefficient

The correlation coefficient \(r\) measures the strength and direction of linear relationship

\(-1 \leq r \leq +1\)

Formula (for understanding):

\(r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}\)

\(= \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}}\)

where \(n\) is the number of data pairs
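
The formula can be turned into a short Python sketch that mirrors it sum by sum. The function name `pearson_r` and the five data pairs are hypothetical and only serve to show the arithmetic; the GDC remains the expected tool in the exam.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the sum formulas: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: S_xy = 6, S_xx = 10, S_yy = 6, so r = 6 / sqrt(60)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))  # 0.775
```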

Interpreting \(r\)

Key Interpretations:

\(r = +1\): Perfect positive correlation

All points lie exactly on a line with positive slope

\(r = -1\): Perfect negative correlation

All points lie exactly on a line with negative slope

\(r = 0\): No linear correlation

No linear relationship (but may have non-linear relationship)

\(r > 0\): Positive correlation (both increase together)

\(r < 0\): Negative correlation (one increases as other decreases)
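
The warning that \(r = 0\) may hide a non-linear relationship can be seen with a small sketch. The data here are made up: \(y = x^2\) on \(x\)-values symmetric about zero, so \(y\) is completely determined by \(x\) and yet there is no linear trend.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]   # perfect quadratic relationship, not a linear one
r = pearson_r(xs, ys)
print(r)  # 0.0
```

A scatter plot of these points would show a clear parabola, which is why the plot should always be inspected alongside \(r\).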

Properties of \(r\)

Important Properties:

  • Unitless: \(r\) has no units, independent of measurement scales
  • Symmetric: Correlation of \(x\) with \(y\) equals correlation of \(y\) with \(x\)
  • Linear only: \(r\) measures linear relationships; may miss non-linear patterns
  • Sensitive to outliers: Extreme values can significantly affect \(r\)
  • \(r^2\): Coefficient of determination, the proportion of the variance in \(y\) that is explained by the linear model
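
Two of these properties, that \(r\) is unitless and symmetric, can be checked numerically. The heights and weights below are invented for illustration, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


heights_cm = [150, 160, 170, 180, 190]      # invented data
weights_kg = [52, 58, 69, 75, 86]           # invented data

r_cm = pearson_r(heights_cm, weights_kg)
heights_m = [h / 100 for h in heights_cm]   # same heights, different unit
r_m = pearson_r(heights_m, weights_kg)
r_swapped = pearson_r(weights_kg, heights_cm)

print(math.isclose(r_cm, r_m))        # True: changing units leaves r unchanged
print(math.isclose(r_cm, r_swapped))  # True: swapping x and y leaves r unchanged
print(round(r_cm ** 2, 3))            # coefficient of determination for this data
```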

💡 Correlation Tips:

  • Use GDC to calculate \(r\)—it's displayed with regression results
  • Strong correlation (\(|r| > 0.7\)) suggests good linear fit
  • Weak correlation (\(|r| < 0.3\)) suggests linear model may not be appropriate
  • Always interpret \(r\) in context of the problem

Example 2: Correlation Coefficient (IB-Style)

Problem: The table shows temperature (\(x\)°C) and ice cream sales (\(y\) units):

Temperature (\(x\))    Sales (\(y\))
15                     30
20                     45
25                     60
30                     70
35                     90

(a) Calculate the correlation coefficient \(r\)

(b) Interpret the value of \(r\)

(c) Find the regression line equation

(d) Predict sales when temperature is 28°C

Solution:

(a) Correlation coefficient:

Using GDC:

Enter L1 = {15, 20, 25, 30, 35}

Enter L2 = {30, 45, 60, 70, 90}

Run LinReg(ax+b)

GDC displays: \(r = 0.9959\)

\(r = 0.996\) (to 3 s.f.)

(b) Interpretation:

\(r = 0.996\) is very close to +1

There is a very strong positive linear correlation between temperature and ice cream sales

This means: as temperature increases, ice cream sales tend to increase in a very consistent linear pattern

(c) Regression equation:

From same GDC output:

\(a = 2.9\), \(b = -13.5\)

\(y = 2.9x - 13.5\)

(d) Prediction for 28°C:

Substitute \(x = 28\):

\(y = 2.9(28) - 13.5 = 81.2 - 13.5 = 67.7\)

Predicted sales: 68 units (rounded)

Note: This is interpolation and the strong correlation makes this prediction reliable
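
The values for this data set can be cross-checked away from the GDC. The sketch below applies the least squares and Pearson formulas to the five pairs in the table; the variable names are illustrative, and every figure it prints is computed from the table itself.

```python
import math

temps = [15, 20, 25, 30, 35]   # x-values from the table
sales = [30, 45, 60, 70, 90]   # y-values from the table

n = len(temps)
mean_x = sum(temps) / n        # 25.0
mean_y = sum(sales) / n        # 59.0

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales))  # 725.0
s_xx = sum((x - mean_x) ** 2 for x in temps)                           # 250.0
s_yy = sum((y - mean_y) ** 2 for y in sales)                           # 2120.0

a = s_xy / s_xx                      # gradient
b = mean_y - a * mean_x              # y-intercept
r = s_xy / math.sqrt(s_xx * s_yy)    # Pearson's correlation coefficient
predicted = a * 28 + b               # predicted sales at 28 degrees C

print(round(a, 2), round(b, 2), round(r, 4), round(predicted, 1))
# 2.9 -13.5 0.9959 67.7
```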

4. Correlation Does NOT Imply Causation

Critical Understanding

Just because two variables are correlated does NOT mean one causes the other!

Four possible explanations for correlation:

  • Causation: \(x\) directly causes changes in \(y\)
  • Reverse causation: \(y\) actually causes changes in \(x\)
  • Lurking variable: Third variable \(z\) causes both \(x\) and \(y\)
  • Coincidence: Correlation is purely by chance

Classic Examples of Spurious Correlation:

  • Ice cream sales and drowning deaths: Both increase in summer (lurking variable: weather/temperature)
  • Shoe size and reading ability in children: Both increase with age (lurking variable: age)
  • Number of firefighters and fire damage: Bigger fires cause more damage and also need more firefighters (lurking variable: size of the fire)

Bottom line: Always think critically about WHY variables might be correlated!
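
A lurking variable can be imitated with a small sketch. All numbers here are invented: a temperature list plays the role of \(z\), and both "ice cream sales" and "drownings" are built from temperature plus small fixed offsets. Neither series is computed from the other, yet they correlate strongly.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Invented lurking variable z: daily temperature in degrees C
temperature = [15, 18, 21, 24, 27, 30, 33]

# Both series depend only on temperature plus small invented offsets
ice_offsets = [2, -3, 1, -2, 3, -1, 0]
drown_offsets = [0.5, -0.4, 0.2, -0.6, 0.3, 0.1, -0.1]
ice_cream = [10 + 4 * t + d for t, d in zip(temperature, ice_offsets)]
drownings = [0.3 * t + e for t, e in zip(temperature, drown_offsets)]

r = pearson_r(ice_cream, drownings)
print(r > 0.95)  # True: strong correlation with no causal link between the two
```

The strong value of \(r\) comes entirely from the shared dependence on temperature, which is exactly the situation the ice cream and drowning example describes.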

📋 Bivariate Statistics Quick Reference

Concept                Formula/Description                        Key Points
Regression Line        \(y = ax + b\)                             Line of best fit; use for predictions
Gradient (\(a\))       Change in \(y\) per unit change in \(x\)   Rate of change
Y-intercept (\(b\))    Value of \(y\) when \(x = 0\)              Starting value
Correlation \(r\)      \(-1 \leq r \leq +1\)                      Strength & direction of linear relationship
Strong Correlation     \(|r| > 0.7\)                              Points close to line
Interpolation          Predict within data range                  Reliable
Extrapolation          Predict outside data range                 Less reliable

🎯 IB Exam Strategy

Common Question Types:

  • "Find the regression line": Use GDC LinReg, write equation \(y = ax + b\)
  • "Calculate correlation coefficient": Use GDC, interpret strength and direction
  • "Predict y when x = ...": Substitute into regression equation
  • "Interpret the gradient": Explain meaning in context (rate of change)
  • "Comment on strength of correlation": Use \(|r|\) value and descriptive terms
  • "Explain why correlation ≠ causation": Discuss lurking variables

Key Reminders:

  • Always use GDC for regression calculations—faster and more accurate
  • Correlation coefficient \(r\) is given automatically with LinReg output
  • Interpret findings in context of the problem
  • \(r\) close to ±1 means strong linear relationship
  • Interpolation (within range) is reliable; extrapolation (outside) is risky
  • Never conclude causation from correlation alone!

🎉 Master Bivariate Statistics!

Bivariate statistics reveals relationships hidden in paired data. From predicting outcomes to understanding patterns, regression analysis and correlation provide powerful tools for data-driven decision making. Remember: strong correlation shows association, but careful reasoning is needed to establish causation. Master these skills for IB success and real-world applications!

Key Success Factors:

  • ✓ Regression line: \(y = ax + b\) (use GDC LinReg)
  • ✓ Gradient \(a\) shows rate of change of \(y\) per unit of \(x\)
  • ✓ Correlation \(r\): \(-1 \leq r \leq +1\) (strength and direction)
  • ✓ Strong correlation: \(|r| > 0.7\); weak: \(|r| < 0.3\)
  • ✓ Interpolation reliable; extrapolation risky
  • ✓ CRITICAL: Correlation ≠ Causation!
