← Year 12: Statistical Analysis

NSW · Maths Advanced · Syllabus dot point

How do we describe and model the relationship between two numerical variables?

Construct scatter plots, calculate and interpret Pearson's correlation coefficient, and fit and use the least-squares regression line

A focused answer to the HSC Maths Advanced dot point on bivariate data. Scatter plots, the Pearson correlation coefficient, the least-squares regression line, prediction, and the limits of extrapolation, with worked examples and exam traps.

Generated by Claude Opus. Reviewed by Better Tuition Academy. 9 min answer.

What this dot point is asking

NESA wants you to take a paired dataset, draw or read a scatter plot, calculate the Pearson correlation coefficient $r$, fit the least-squares regression line of $y$ on $x$, and use it to predict values. You also need to interpret what $r$ and the line do and do not tell you.

The answer

Scatter plots

A scatter plot displays paired data $(x_i, y_i)$ as points in the plane. Read it for:

  • Direction. Positive (upward), negative (downward), or none.
  • Form. Linear, curved, or no clear pattern.
  • Strength. How tightly the points cluster around the pattern.
  • Outliers. Single points far from the bulk.

Pearson's correlation coefficient

The Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. It is defined by

$$r = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right),$$

where $s_x$ and $s_y$ are the sample standard deviations. In practice you compute $r$ on your calculator from the data list, not by hand.
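To see what the calculator is doing, here is a minimal Python sketch of the definition above. The function name `pearson_r` and the five data pairs are made up for illustration; `stdev` from the standard library is the sample standard deviation (it divides by $n - 1$).

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from the z-score definition, using sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # stdev divides by n - 1
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data: five (x, y) pairs invented for this sketch.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))  # 0.775
```

Each term in the sum is a product of two z-scores, so points that are above (or below) both means push $r$ up, and points above one mean but below the other pull it down.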

Key properties:

  • $-1 \le r \le 1$.
  • $r > 0$ means positive linear association, $r < 0$ means negative.
  • $|r|$ close to $1$ means a strong linear relationship, $|r|$ close to $0$ means weak or no linear relationship.
  • $r$ measures only linear association. A perfect parabola can have $r = 0$.
  • $r$ is unitless and unchanged by a change of units (a linear rescaling with a positive scale factor) of either variable.
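Two of these properties are easy to check numerically. The sketch below uses invented data and a hypothetical helper `pearson_r`: a parabola symmetric about $x = 0$ gives $r = 0$, and converting heights from centimetres to metres leaves $r$ unchanged.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r with sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

# A perfect parabola, symmetric about x = 0: a clear pattern, but not a linear one.
xs = [-2, -1, 0, 1, 2]
parabola = [x ** 2 for x in xs]
r_parabola = pearson_r(xs, parabola)

# Made-up heights and weights; the same heights are then expressed in metres.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]
heights_m = [h / 100 for h in heights_cm]
r_cm = pearson_r(heights_cm, weights_kg)
r_m = pearson_r(heights_m, weights_kg)

print(r_parabola)                     # 0.0
print(abs(r_cm - r_m) < 1e-9)         # True
```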

Rough verbal scale: $|r| \ge 0.9$ very strong, $0.7 \le |r| < 0.9$ strong, $0.5 \le |r| < 0.7$ moderate, $|r| < 0.5$ weak.
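If it helps to memorise the scale, it can be written as a small lookup. This is only a sketch: the function name `describe_strength` is hypothetical, and the cut-offs are the rough conventions quoted above, not fixed rules.

```python
def describe_strength(r):
    """Describe r in words using the rough verbal scale (cut-offs are conventions)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear association"
    size = abs(r)
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

print(describe_strength(-0.95))  # very strong negative
print(describe_strength(0.75))   # strong positive
```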

The least-squares regression line

The least-squares regression line of $y$ on $x$ is the line $y = a + b x$ that minimises the sum of squared vertical residuals $\sum (y_i - (a + b x_i))^2$. The solution is

$$b = r \cdot \frac{s_y}{s_x}, \qquad a = \bar{y} - b \bar{x}.$$

The line always passes through the point of means $(\bar{x}, \bar{y})$. Slope $b$ is the predicted change in $y$ per one-unit increase in $x$.
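The two formulas translate directly into code. The following is a minimal sketch with invented data; `least_squares_line` is a hypothetical helper, not a library function. The last line checks that the fitted line passes through the point of means.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y = a + b x, using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: five (x, y) pairs invented for this sketch.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 2), round(b, 2))  # 2.2 0.6

# Substituting x_bar into the line should return y_bar.
print(abs((a + b * mean(xs)) - mean(ys)) < 1e-9)  # True
```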

Prediction, interpolation and extrapolation

Once you have $y = a + b x$, substitute any $x$ to predict $y$. Prediction inside the observed range of $x$ is called interpolation and is usually safe. Prediction outside the observed range is extrapolation and is risky: the linear pattern may not continue.
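A prediction routine can carry the interpolation check with it. In this sketch the function `predict`, the line $y = 44 + 2x$ and the observed range of 6 to 18 hours are all assumptions chosen for illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Return (predicted y, label) for the line y = a + b x.

    x_min and x_max are the smallest and largest x values in the data.
    """
    y_hat = a + b * x
    label = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, label

# Hypothetical line y = 44 + 2x, fitted to data with x between 6 and 18 hours.
print(predict(44, 2, 15, 6, 18))  # (74, 'interpolation')
print(predict(44, 2, 40, 6, 18))  # (124, 'extrapolation')
```

The second prediction is a mark of 124, which is impossible on a test out of 100. The arithmetic is fine; the problem is that $x = 40$ is far outside the data.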

Correlation is not causation

A strong $r$ tells you the two variables move together. It does not establish that one causes the other. Lurking variables, reverse causation, and pure coincidence can all produce strong correlations.

Worked examples

Reading a scatter plot

A plot of height (cm) against shoe size has points rising from lower left to upper right, tightly clustered. Direction positive, form linear, strength strong, no obvious outliers. Estimate $r \approx 0.9$.

Computing $r$ and the regression line

Suppose a small dataset gives $\bar{x} = 5$, $\bar{y} = 20$, $s_x = 2$, $s_y = 6$ and $r = 0.8$.

Slope: $b = 0.8 \cdot \frac{6}{2} = 2.4$.

Intercept: $a = 20 - 2.4 \cdot 5 = 20 - 12 = 8$.

Line: $y = 8 + 2.4 x$.

Predict $y$ when $x = 7$: $y = 8 + 2.4 \cdot 7 = 24.8$.
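The same working can be checked in a few lines of Python. This is a sketch only; `line_from_summary` is a hypothetical helper that takes the five summary statistics from this example.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for y = a + b x from summary statistics."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

a, b = line_from_summary(x_bar=5, y_bar=20, s_x=2, s_y=6, r=0.8)
y_at_7 = a + b * 7

# Rounded to remove floating-point noise.
print(round(a, 2), round(b, 2), round(y_at_7, 2))  # 8.0 2.4 24.8
```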

Interpreting slope and intercept

For a regression of exam mark $y$ on hours studied $x$ with $y = 44 + 2 x$:

  • Slope $b = 2$: each extra hour of study is associated with a predicted mark increase of $2$.
  • Intercept $a = 44$: the predicted mark for a student who studies $0$ hours. This is an extrapolation if no student in the data studied near $0$ hours, and should be treated with caution.

Spotting an outlier

A scatter plot of weight against height has one point well above the line. That point pulls the regression line upward and inflates the sum of squared residuals. Refitting without it will usually increase $|r|$ and shift the slope. Check an outlier for data-entry errors before deciding whether to remove it.
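The effect is easy to see on a small invented dataset. In the sketch below the heights, weights and the helper `fit` are all made up for illustration; the point $(170, 95)$ plays the role of the outlier.

```python
from statistics import mean, stdev

def fit(xs, ys):
    """Return (r, a, b) for the least-squares line y = a + b x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    return r, y_bar - b * x_bar, b

# Made-up data: the point (170, 95) sits well above the trend of the others.
heights = [150, 155, 160, 165, 170, 175, 180]
weights = [50, 54, 57, 61, 95, 68, 72]
r_with, a_with, b_with = fit(heights, weights)

# Refit with the suspect point left out.
kept = [(h, w) for h, w in zip(heights, weights) if (h, w) != (170, 95)]
r_without, a_without, b_without = fit([h for h, _ in kept], [w for _, w in kept])

print(round(r_with, 3), round(b_with, 3))
print(round(r_without, 3), round(b_without, 3))
```

For this data $r$ rises from roughly $0.67$ to above $0.99$ once the point is removed, and the slope falls from about $0.94$ to about $0.73$.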

Common traps

Confusing strong with steep. A nearly horizontal line through tightly clustered points still has $|r|$ close to $1$. Strength is about closeness to the line, not slope size.
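A quick numerical contrast makes the point. Both datasets below are invented, and `fit` is a hypothetical helper: the first is almost flat but perfectly tight, the second is steep but scattered.

```python
from statistics import mean, stdev

def fit(xs, ys):
    """Return (r, b): Pearson's r and the least-squares slope."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    return r, r * s_y / s_x

xs = [1, 2, 3, 4, 5, 6]

# Nearly horizontal, but every point lies on the line y = 10 + 0.01x.
flat_tight = [10.01, 10.02, 10.03, 10.04, 10.05, 10.06]
r_flat, b_flat = fit(xs, flat_tight)

# Steep upward trend, but with a lot of scatter.
steep_loose = [5, 40, 10, 60, 30, 55]
r_steep, b_steep = fit(xs, steep_loose)

print(round(r_flat, 3), round(b_flat, 3))
print(round(r_steep, 3), round(b_steep, 3))
```

The flat dataset has a slope of about $0.01$ and $r$ essentially equal to $1$; the steep one has a slope of about $7.7$ but $r$ of only about $0.63$.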

Treating a low $r$ as no relationship. $r$ measures only linear association. A clear curved pattern can give $r \approx 0$.

Extrapolating without warning. Predicting $y$ for $x$ values far outside the data range can give nonsense (negative weights, marks above $100$). Always check that the $x$ value you substitute sits inside the data range, or flag the caveat.

Swapping the slope formula. It is $b = r \cdot \frac{s_y}{s_x}$, not $r \cdot \frac{s_x}{s_y}$. The units must work out: $b$ is measured in units of $y$ per unit of $x$, rise over run.

Claiming causation. "Correlation does not imply causation" is a standard one-mark response to any question that asks what $r$ tells you about cause and effect.

In one sentence

For paired data, the Pearson correlation $r$ measures the strength and direction of a linear relationship, and the least-squares regression line $y = a + b x$ with $b = r \cdot s_y / s_x$ and $a = \bar{y} - b \bar{x}$ is the best linear predictor of $y$ from $x$ within the observed range.

Past exam questions, worked

Real questions from past NESA papers on this dot point, with our answer explainer.

2022 HSC Q26 (4 marks)

The scatter plot of weekly study hours $x$ and exam mark $y$ for ten students has $\bar{x} = 12$, $\bar{y} = 68$, $s_x = 3$, $s_y = 8$ and correlation $r = 0.75$. Find the equation of the least-squares regression line of $y$ on $x$, and use it to predict the mark of a student who studies 15 hours.

The least-squares slope is $b = r \cdot \frac{s_y}{s_x} = 0.75 \cdot \frac{8}{3} = 2$.

The line passes through $(\bar{x}, \bar{y}) = (12, 68)$, so $a = \bar{y} - b \bar{x} = 68 - 2(12) = 44$.

Equation: $y = 44 + 2 x$.

Prediction at $x = 15$: $y = 44 + 30 = 74$.

Markers reward the slope formula, the intercept calculation through the means, the equation written cleanly, and a numerical prediction with the right substitution.

2020 HSC Q27 (3 marks)

A scatter plot of two variables shows points clustered tightly along a downward straight line. Estimate the Pearson correlation coefficient and explain what it tells you about the relationship.

Tight clustering means the points sit close to the line, so $|r|$ is close to $1$. The downward trend means $r$ is negative.

A reasonable estimate is $r \approx -0.95$.

Interpretation: there is a very strong negative linear relationship. As one variable increases, the other decreases in an almost perfectly straight-line fashion. Correlation does not imply causation, only association.

Markers expect the sign, a magnitude close to $1$, and the caveat that correlation is not causation.
