IB Mathematics AI – Topic 4

Statistics & Probability: Bivariate Statistics

Scatter Plots (Scatter Diagrams)

Definition & Purpose

Definition: A scatter plot is a graphical representation of bivariate data (two variables) that shows the relationship between an independent variable (x) and a dependent variable (y).

Key Components:

  • Independent variable (x): Explanatory variable, plotted on horizontal axis
  • Dependent variable (y): Response variable, plotted on vertical axis
  • Data points: Each point represents one observation (x, y)
  • Outliers: Points that don't follow the general pattern

Types of Relationships:

  • Positive correlation: As x increases, y tends to increase
  • Negative correlation: As x increases, y tends to decrease
  • No correlation: No clear relationship between x and y
  • Non-linear relationship: Variables related but not in a straight line
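The direction categories above can be checked numerically: the sign of the covariance between x and y matches the direction of any linear trend. A minimal Python sketch (the function name `direction` is illustrative, not from the syllabus):

```python
def direction(xs, ys):
    """Classify the direction of a linear trend by the sign of the covariance."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "none"

print(direction([1, 2, 3], [2, 4, 6]))  # positive
print(direction([1, 2, 3], [6, 4, 2]))  # negative
```

A near-zero covariance only rules out a linear trend; a strong curved relationship can still be present, which is one reason the scatter plot itself should always be inspected.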

⚠️ Common Pitfalls & Tips:

  • Always label axes clearly with variable names and units
  • Choose appropriate scales to make the pattern visible
  • Correlation does not imply causation – a relationship doesn't mean one variable causes the other
  • Look for outliers that may affect correlation and regression calculations
  • Your GDC can plot scatter diagrams quickly – learn this function

📝 Worked Example 1: Interpreting Scatter Plots

Question: A researcher collected data on the number of hours students studied (x) and their test scores (y). The scatter plot shows a clear upward trend with points closely clustered around an imaginary line. One point (2 hours, 95%) appears far from the others.

(a) Describe the relationship between study hours and test scores.

(b) Identify any outliers and suggest a possible explanation.

(c) Explain what this relationship suggests about studying and performance.

Solution:

(a) Relationship:

The scatter plot shows a strong positive correlation between study hours and test scores. As the number of study hours increases, test scores tend to increase as well. The points are closely clustered, suggesting the relationship is quite strong.

(b) Outliers:

The point (2, 95) is an outlier – a student who studied only 2 hours but scored 95%. This deviates significantly from the general pattern where low study hours correspond to lower scores.

Possible explanations: The student may have strong prior knowledge, exceptional natural ability, or may have studied more than recorded.

(c) Interpretation:

The positive correlation suggests that increased study time is associated with higher test scores. However, correlation does not prove causation – other factors (motivation, study quality, prior knowledge) may also play roles. The relationship indicates studying is beneficial, but the outlier shows it's not the only factor determining success.

Correlation

Pearson's Product-Moment Correlation Coefficient (r)

Definition: A numerical measure of the strength and direction of a linear relationship between two quantitative variables.

Formula:

\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} \]

Alternative formula (for calculations):

\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

Properties of r:

  • \(-1 \leq r \leq 1\)
  • \(r = 1\): Perfect positive linear correlation
  • \(r = -1\): Perfect negative linear correlation
  • \(r = 0\): No linear correlation
  • \(|r| > 0.8\): Strong correlation
  • \(0.5 < |r| < 0.8\): Moderate correlation
  • \(|r| < 0.5\): Weak correlation

⚠️ Common Pitfalls & Tips:

  • Pearson's r only measures linear relationships – it may be near zero even when a strong non-linear relationship exists
  • Extremely sensitive to outliers – one outlier can dramatically change r
  • Always use your GDC to calculate r – manual calculation is time-consuming
  • Check the scatter plot before relying on r value
  • The sign of r indicates direction; the magnitude indicates strength
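The calculation formula above translates directly into code. A short Python sketch for checking GDC output (manual calculation is not expected in the exam):

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via the calculation formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0  (points lie on y = 2x + 1)
print(pearson_r([1, 2, 3], [6, 4, 2]))        # -1.0 (points lie on y = -2x + 8)
```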

Spearman's Rank Correlation Coefficient (\(r_s\))

Definition: A non-parametric measure of correlation based on the ranks of data rather than actual values. Less sensitive to outliers and can detect monotonic (but not necessarily linear) relationships.

Formula:

\[ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \]

where \(d_i\) is the difference between ranks for each pair, and \(n\) is the number of data pairs

When to use Spearman's instead of Pearson's:

  • Data contains outliers that heavily influence Pearson's r
  • Relationship is monotonic but not linear
  • Data is ordinal (ranks rather than measurements)
  • Distribution is not normal

Interpretation: \(r_s\) is interpreted like Pearson's r: it ranges from -1 to 1, with the same strength descriptors.

⚠️ Common Pitfalls & Tips:

  • First rank all x-values, then rank all y-values separately
  • For tied ranks, assign the average of the ranks they would occupy
  • Spearman's \(r_s\) is more robust to outliers than Pearson's r
  • Know how to calculate both on your GDC
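The ranking procedure in the tips above can be sketched in Python. Note that the \(d^2\) formula is exact only when there are no tied ranks; with ties, the exact value is Pearson's r applied to the ranks. Helper names here are illustrative:

```python
def ranks(values):
    """Rank values (1 = smallest); tied values get the average of their ranks."""
    ordered = sorted(values)
    result = []
    for v in values:
        first = ordered.index(v) + 1         # lowest position v occupies
        last = first + ordered.count(v) - 1  # highest position (ties)
        result.append((first + last) / 2)
    return result

def spearman_rs(xs, ys):
    """Spearman's r_s via 1 - 6*sum(d^2) / (n(n^2 - 1)), assuming no ties."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(ranks([10, 20, 20, 30]))                        # [1.0, 2.5, 2.5, 4.0]
print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```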

📝 Worked Example 2: Calculating Correlation Coefficients

Question: The table shows the number of hours of sunshine (x) and ice cream sales in thousands of dollars (y) for 6 days:

| Day | Sunshine (hrs), x | Sales ($1000), y |
|-----|-------------------|------------------|
| 1   | 2                 | 4                |
| 2   | 5                 | 7                |
| 3   | 3                 | 5                |
| 4   | 8                 | 10               |
| 5   | 6                 | 8                |
| 6   | 4                 | 6                |

(a) Calculate Pearson's correlation coefficient r.

(b) Interpret this value in context.

Solution:

(a) Calculating r using GDC:

Step 1: Enter data into lists (L1 for x, L2 for y)

Step 2: Use statistical calculation function (STAT → CALC → LinReg)

Step 3: Read the r value from output

Manual verification (for understanding):

\(\sum x = 28, \quad \sum y = 40, \quad \sum xy = 213\)

\(\sum x^2 = 158, \quad \sum y^2 = 310, \quad n = 6\)

\[ r = \frac{6(213) - (28)(40)}{\sqrt{[6(158) - 28^2][6(310) - 40^2]}} \]

\[ r = \frac{1278 - 1120}{\sqrt{[948 - 784][1860 - 1600]}} = \frac{158}{\sqrt{164 \times 260}} \]

\[ r = \frac{158}{\sqrt{42640}} = \frac{158}{206.5} = 0.765 \]

r ≈ 0.765 (or 0.77 to 2 d.p.)

(b) Interpretation:

The Pearson correlation coefficient of 0.765 indicates a strong positive linear correlation between hours of sunshine and ice cream sales. This suggests that as the number of sunshine hours increases, ice cream sales tend to increase. Note that r itself is not a percentage; squaring it gives the coefficient of determination \(r^2 \approx 0.585\), the proportion of variation in sales explained by sunshine hours (see the coefficient of determination section).
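The arithmetic in part (a) can be checked in a few lines of Python, using the summary sums quoted in the solution:

```python
import math

# Summary statistics quoted in the worked solution
n, sx, sy = 6, 28, 40
sxy, sxx, syy = 213, 158, 310

num = n * sxy - sx * sy                                     # 1278 - 1120 = 158
den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))  # sqrt(164 * 260)
r = num / den
print(round(r, 3))  # 0.765
```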

Linear Regression

The Least Squares Regression Line

Definition: The line of best fit that minimizes the sum of squared vertical distances (residuals) between data points and the line. Used to model the relationship between variables and make predictions.

Equation of regression line (y on x):

\[ y = ax + b \quad \text{or} \quad y = mx + c \]

where a (or m) is the gradient/slope and b (or c) is the y-intercept

Formulas for coefficients:

\[ a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = r \cdot \frac{s_y}{s_x} \]

\[ b = \bar{y} - a\bar{x} \]

where \(s_x\) and \(s_y\) are standard deviations of x and y respectively

Key Properties:

  • The regression line always passes through the point \((\bar{x}, \bar{y})\)
  • Gradient a has same sign as correlation coefficient r
  • Used for interpolation (predicting within data range) – reliable
  • Used for extrapolation (predicting outside data range) – unreliable

⚠️ Common Pitfalls & Tips:

  • y on x vs x on y: y = ax + b predicts y from x; x = cy + d predicts x from y (different lines!)
  • Use your GDC to find regression equation – save time
  • Be cautious with extrapolation – relationships may not hold outside observed range
  • Always write the equation in context if given (e.g., Sales = 1.2(Temperature) + 3)
  • Round coefficients to 3-4 significant figures
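The coefficient formulas can be verified with a few lines of Python; the example data (a hypothetical perfect line y = 2x + 1) makes the expected output easy to check by hand:

```python
def linreg(xs, ys):
    """Least-squares coefficients for y = a*x + b via the calculation formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - a * sx / n  # b = ybar - a*xbar, so the line passes through (xbar, ybar)
    return a, b

print(linreg([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0), recovering y = 2x + 1
```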

📝 Worked Example 3: Linear Regression

Question: Using the ice cream sales data from Example 2:

(a) Find the equation of the regression line of y on x.

(b) Interpret the gradient and y-intercept in context.

(c) Predict the ice cream sales when there are 7 hours of sunshine.

(d) Comment on the reliability of predicting sales when there are 15 hours of sunshine.

Solution:

(a) Regression equation:

Using GDC (LinReg function with data from Example 2):

Given: \(n = 6\), \(\sum x = 28\), \(\sum y = 40\), \(\sum xy = 213\), \(\sum x^2 = 158\)

Calculate gradient a:

\[ a = \frac{6(213) - (28)(40)}{6(158) - (28)^2} = \frac{1278 - 1120}{948 - 784} = \frac{158}{164} = 0.963 \]

Calculate y-intercept b:

\(\bar{x} = \frac{28}{6} = 4.667\), \(\bar{y} = \frac{40}{6} = 6.667\)

\[ b = \bar{y} - a\bar{x} = 6.667 - 0.963(4.667) = 6.667 - 4.494 = 2.173 \]

Regression equation: \(y = 0.963x + 2.17\) (or \(y = 0.96x + 2.2\) to 2 s.f.)

(b) Interpretation:

Gradient (0.963): For each additional hour of sunshine, ice cream sales increase by approximately $963 (or $0.963 thousand).

Y-intercept (2.17): When there are 0 hours of sunshine, the model predicts ice cream sales of approximately $2,170. (Note: This may not be realistic as the model is based on positive sunshine hours.)

(c) Prediction for x = 7:

\[ y = 0.963(7) + 2.17 = 6.741 + 2.17 = 8.91 \]

Predicted sales: $8,910 (or $8.91 thousand)

This is interpolation (7 is within the data range 2-8), so this prediction is reasonably reliable.

(d) Reliability for x = 15:

Predicting sales for 15 hours of sunshine would be extrapolation (15 is well outside the observed range of 2-8 hours). This prediction would be unreliable because:

  • The linear relationship may not hold beyond the observed data range
  • Other factors may influence sales at extreme values
  • We have no evidence the pattern continues beyond 8 hours
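Parts (c) and (d) can be sketched as a small helper that also flags extrapolation. The coefficients 0.963 and 2.17 and the data range 2–8 come from the worked example; the function name `predict` is illustrative:

```python
def predict(x, a=0.963, b=2.17, x_min=2, x_max=8):
    """Predict y = a*x + b, flagging values of x outside the observed range."""
    y = a * x + b
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation (unreliable)"
    return round(y, 2), kind

print(predict(7))   # (8.91, 'interpolation')
print(predict(15))  # (16.61, 'extrapolation (unreliable)')
```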

Coefficient of Determination (\(r^2\))

Definition & Interpretation

Definition: The coefficient of determination is the square of the correlation coefficient. It measures the proportion of variance in the dependent variable that is predictable from the independent variable.

Formula:

\[ r^2 = (\text{correlation coefficient})^2 \]

Interpretation:

  • \(r^2\) ranges from 0 to 1 (or 0% to 100%)
  • \(r^2 = 0.85\) means 85% of variation in y is explained by variation in x
  • The remaining \((1 - r^2)\) is unexplained variation (due to other factors or randomness)
  • Higher \(r^2\) indicates better fit of regression line to data

General Guidelines:

  • \(r^2 > 0.9\): Excellent fit – model explains most variation
  • \(0.7 < r^2 < 0.9\): Good fit – model is useful for predictions
  • \(0.5 < r^2 < 0.7\): Moderate fit – model has some predictive power
  • \(r^2 < 0.5\): Weak fit – model explains little variation

⚠️ Common Pitfalls & Tips:

  • \(r^2\) is always positive (even when r is negative)
  • Express \(r^2\) as a percentage when interpreting: "explains X% of the variation"
  • High \(r^2\) doesn't prove causation – just association
  • Your GDC usually displays \(r^2\) automatically with regression output
  • Can be used to compare different models (higher \(r^2\) = better fit)
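The "proportion of variance explained" reading of \(r^2\) can be made concrete: fit the least-squares line, then compare the residual sum of squares to the total sum of squares. A Python sketch with hypothetical data:

```python
def r_squared(xs, ys):
    """r^2 computed as 1 - SS_res / SS_tot: the proportion of variance explained."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - a * xbar
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_tot = sum((y - ybar) ** 2 for y in ys)                     # total variation
    return 1 - ss_res / ss_tot

print(round(r_squared([1, 2, 3, 4], [2, 4, 5, 9]), 3))  # 0.931
```

This value agrees with squaring Pearson's r for the same data, which is why the two definitions are interchangeable for linear regression.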

📝 Worked Example 4: Coefficient of Determination

Question: From Example 2, we found that Pearson's correlation coefficient between sunshine hours and ice cream sales was r = 0.765.

(a) Calculate the coefficient of determination.

(b) Interpret this value in context.

(c) What percentage of variation in ice cream sales is NOT explained by sunshine hours?

Solution:

(a) Calculating \(r^2\):

\[ r^2 = (0.765)^2 = 0.585 \]

\(r^2 = 0.585\) or 58.5%

(b) Interpretation:

The coefficient of determination of 0.585 means that 58.5% of the variation in ice cream sales can be explained by the variation in sunshine hours through the linear regression model.

This indicates a moderate to good fit. The linear relationship between sunshine hours and ice cream sales accounts for more than half of the observed variation in sales.

(c) Unexplained variation:

\[ 1 - r^2 = 1 - 0.585 = 0.415 \text{ or } 41.5\% \]

41.5% of the variation in ice cream sales is NOT explained by sunshine hours. This variation could be due to:

  • Other factors (temperature, day of week, special events)
  • Random variation
  • Measurement errors
  • Non-linear effects not captured by the linear model

Non-Linear Regression Models

Common Non-Linear Models

Definition: When data shows a curved pattern, non-linear models may provide better fit than linear regression. The IB Mathematics AI course includes several common non-linear models.

1. Quadratic Model:

\[ y = ax^2 + bx + c \]

Use when: Data shows parabolic shape (U-shaped or inverted U)

2. Exponential Model:

\[ y = ab^x \quad \text{or} \quad y = ae^{kx} \]

Use when: Data shows rapid growth or decay; rate of change proportional to current value

3. Logarithmic Model:

\[ y = a + b\ln(x) \quad \text{or} \quad y = a + b\log(x) \]

Use when: Data increases rapidly then levels off; diminishing returns

4. Power Model:

\[ y = ax^b \]

Use when: Data shows consistent proportional relationship; allometric growth

Choosing the Best Model:

  1. Visual inspection: Look at scatter plot shape
  2. Compare \(r^2\) values: Higher \(r^2\) indicates better fit
  3. Residual plots: Random scatter indicates good model
  4. Context: Consider what makes sense for the situation

⚠️ Common Pitfalls & Tips:

  • Always use your GDC to fit non-linear models – manual calculation is complex
  • Compare \(r^2\) values to determine which model fits best
  • Check that the model makes sense in context (e.g., exponential for population growth)
  • Extrapolation is even more dangerous with non-linear models
  • Know how to access different regression types on your calculator
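Exponential regression can be sketched by log-linearization: fit a straight line to \(\ln y\) against \(x\), then transform back (this assumes all y-values are positive; many calculators linearize the same way). As a check, the code below uses the bacterial-growth data from Worked Example 5:

```python
import math

def exp_reg(xs, ys):
    """Fit y = a * b**x by least squares on ln(y) = ln(a) + x*ln(b).
    Requires all y > 0."""
    n = len(xs)
    ls = [math.log(y) for y in ys]
    xbar, lbar = sum(xs) / n, sum(ls) / n
    slope = (sum((x - xbar) * (l - lbar) for x, l in zip(xs, ls))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = lbar - slope * xbar
    return math.exp(intercept), math.exp(slope)  # a, b

a, b = exp_reg([0, 1, 2, 3, 4], [2.5, 5.2, 10.8, 22.5, 46.8])
print(round(a, 2), round(b, 2))  # 2.5 2.08
```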

📝 Worked Example 5: Non-Linear Regression

Question: A biologist studying bacterial growth records the population (in thousands) at different times (in hours):

| Time (hrs), x | Population (1000s), y |
|---------------|-----------------------|
| 0             | 2.5                   |
| 1             | 5.2                   |
| 2             | 10.8                  |
| 3             | 22.5                  |
| 4             | 46.8                  |

(a) Explain why a linear model would not be appropriate.

(b) Find an exponential model of the form \(y = ab^x\) for this data.

(c) Calculate \(r^2\) and interpret.

(d) Use the model to predict the population after 5 hours.

Solution:

(a) Why not linear?

Examining the data, the population approximately doubles each hour:

  • Hour 0 → 1: 2.5 → 5.2 (×2.08)
  • Hour 1 → 2: 5.2 → 10.8 (×2.08)
  • Hour 2 → 3: 10.8 → 22.5 (×2.08)
  • Hour 3 → 4: 22.5 → 46.8 (×2.08)

This shows exponential growth (constant multiplication ratio) rather than linear growth (constant addition). A scatter plot would show an upward curve, not a straight line. A linear model would underestimate growth at later times.

(b) Exponential model using GDC:

Step 1: Enter data (L1 = x, L2 = y)

Step 2: STAT → CALC → ExpReg

Step 3: Read values of a and b

GDC Output:

\(a \approx 2.50\) and \(b \approx 2.08\)

Exponential model: \(y = 2.50(2.08)^x\)

(c) Coefficient of determination:

From GDC: \(r^2 = 0.9998\) (or 99.98%)

Interpretation: The exponential model explains 99.98% of the variation in bacterial population. This is an excellent fit, indicating the exponential model is highly appropriate for this data. Almost all variation in population is explained by time through exponential growth.

(d) Prediction for x = 5:

\[ y = 2.50(2.08)^5 = 2.50 \times 38.9 = 97.3 \]

Predicted population: approximately 97,300 bacteria

This is a slight extrapolation beyond the data (x = 0 to 4), but since \(r^2\) is very high and exponential growth is biologically plausible for bacteria, this prediction is reasonably reliable for the short term.

Residuals & Residual Plots

Understanding Residuals

Definition: A residual is the difference between an observed value and the value predicted by the regression model.

Formula:

\[ \text{Residual} = \text{Observed value} - \text{Predicted value} = y - \hat{y} \]

where \(\hat{y}\) is the value predicted by the regression equation

Properties of Residuals:

  • Positive residual: Observed value is above the regression line
  • Negative residual: Observed value is below the regression line
  • For a least-squares line fitted with an intercept, the residuals sum to zero: \(\sum(y - \hat{y}) = 0\)
  • Smaller residuals indicate better predictions

Residual Plots:

A residual plot graphs residuals against the independent variable (x) or predicted values (\(\hat{y}\)).

Interpreting Residual Plots:

  • Good model: Random scatter with no pattern, roughly equal spread
  • Poor model: Curved pattern, funnel shape, or systematic trends
  • Patterns in residual plots suggest the model doesn't fit well

⚠️ Common Pitfalls & Tips:

  • A residual plot with a pattern indicates the wrong model type was chosen
  • Use your GDC to calculate and plot residuals automatically
  • Large residuals indicate outliers or poor predictions
  • Always check residual plots when assessing model fit
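Residual calculation and the zero-sum property can be sketched in Python. The data and coefficients here are hypothetical; y = 2.2x − 0.5 is the least-squares line for this data, so its residuals sum to zero (up to rounding):

```python
def residuals(xs, ys, a, b):
    """Residuals y - y_hat for the fitted line y_hat = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

# Hypothetical data whose least-squares line is y = 2.2x - 0.5
res = residuals([1, 2, 3, 4], [2, 4, 5, 9], 2.2, -0.5)
print([round(r, 1) for r in res])  # [0.3, 0.1, -1.1, 0.7]
print(abs(sum(res)) < 1e-9)        # True: least-squares residuals sum to zero
```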

📝 Worked Example 6: Calculating Residuals

Question: Using the ice cream data from Example 3, the regression line was \(y = 0.963x + 2.17\).

(a) Calculate the residual for the point (5, 7).

(b) Interpret this residual in context.

(c) If all residuals are small and randomly scattered, what does this tell us?

Solution:

(a) Calculating the residual:

Step 1: Find the predicted value for x = 5

\[ \hat{y} = 0.963(5) + 2.17 = 4.815 + 2.17 = 6.985 \]

Step 2: Calculate residual

\[ \text{Residual} = y - \hat{y} = 7 - 6.985 = 0.015 \]

Residual = 0.015 (or approximately 0.02)

(b) Interpretation:

The positive residual of 0.015 means the actual ice cream sales ($7,000) were slightly higher than predicted by the model ($6,985). The observed value is above the regression line.

In context: On the day with 5 hours of sunshine, actual sales exceeded the model's prediction by $15. This is a very small difference, indicating the model predicted well for this data point.

(c) Interpretation of small, random residuals:

If all residuals are small and randomly scattered with no pattern, this indicates:

  • The linear model is appropriate for the data
  • The model makes accurate predictions
  • No systematic errors or patterns were missed
  • The assumptions of linear regression are satisfied
  • There's no need to try alternative (non-linear) models

📊 Quick Reference Summary

Correlation

  • Pearson's r: Linear relationships
  • Spearman's \(r_s\): Ranked data
  • Range: -1 to +1
  • Strength: |r| > 0.8 is strong

Regression Line

  • Equation: \(y = ax + b\)
  • Passes through \((\bar{x}, \bar{y})\)
  • Interpolation: Reliable
  • Extrapolation: Unreliable

Coefficient of Determination

  • \(r^2\): Square of correlation
  • Proportion of variance explained
  • Range: 0 to 1 (0% to 100%)
  • Higher = better fit

Non-Linear Models

  • Exponential: \(y = ab^x\)
  • Power: \(y = ax^b\)
  • Logarithmic: \(y = a + b\ln(x)\)
  • Compare \(r^2\) to choose best

🖩 GDC Calculator Tips for Bivariate Statistics

  • Enter data: Use lists (L1 for x, L2 for y) to store bivariate data
  • Scatter plots: Learn to create scatter diagrams quickly (STAT PLOT)
  • Correlation: Access both Pearson's r and Spearman's \(r_s\) functions
  • Regression: Know shortcuts for LinReg, ExpReg, PwrReg, LnReg
  • Predictions: Use the equation or "Value" function to predict y for given x
  • Residuals: Your calculator can compute and store residuals (RESID list)
  • Store equations: Save regression equations to Y= for quick graphing

✍️ IB Exam Strategy for Bivariate Statistics

  1. Context matters: Always interpret statistical results in the context of the question
  2. Show GDC work: Write "Using GDC" and state the function used (e.g., LinReg, ExpReg)
  3. Compare models: When asked which model fits best, compare \(r^2\) values and explain
  4. Distinguish causation vs correlation: Never claim one variable causes another without justification
  5. Interpolation vs extrapolation: Always comment on reliability of predictions
  6. Interpret \(r^2\) as percentage: "X% of variation in [dependent variable] is explained by [independent variable]"
  7. Check residuals: When assessing model fit, mention residual patterns
  8. Round appropriately: Correlation coefficients to 3 decimal places; equation coefficients to 3-4 s.f.

🔗 Important Connections & Common Exam Questions

Relationship between r and \(r^2\):

If \(r^2 = 0.64\), then \(r = \pm 0.8\). Need scatter plot or context to determine sign.

When correlation is misleading:

  • Outliers: One extreme point can create false correlation
  • Non-linear relationships: Can have r ≈ 0 but strong curved relationship
  • Lurking variables: Third variable causes both (e.g., ice cream sales and drownings both increase with temperature)

Model selection tips:

  • Exponential: Consistent percentage growth (doubling, halving)
  • Linear: Consistent absolute increase/decrease
  • Power: Allometric relationships (area vs volume, biology)
  • Logarithmic: Diminishing returns, leveling off