Year 12: Statistical Analysis
How do we describe and model the relationship between two numerical variables?
Construct scatter plots, calculate and interpret Pearson's correlation coefficient, and fit and use the least-squares regression line
A focused answer to the HSC Maths Advanced dot point on bivariate data. Scatter plots, the Pearson correlation coefficient, the least-squares regression line, prediction, and the limits of extrapolation, with worked examples and exam traps.
What this dot point is asking
NESA wants you to take a paired dataset, draw or read a scatter plot, calculate the Pearson correlation coefficient $r$, fit the least-squares regression line of $y$ on $x$, and use it to predict values. You also need to interpret what $r$ and the line do and do not tell you.
The answer
Scatter plots
A scatter plot displays paired data as points in the plane. Read it for:
- Direction. Positive (upward), negative (downward), or none.
- Form. Linear, curved, or no clear pattern.
- Strength. How tightly the points cluster around the pattern.
- Outliers. Single points far from the bulk.
Pearson's correlation coefficient
The Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. It is defined by
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y},$$
where $s_x$ and $s_y$ are the sample standard deviations. In practice you compute $r$ on your calculator from the data list, not by hand.
Key properties:
- $-1 \le r \le 1$.
- $r > 0$ means positive linear association, $r < 0$ means negative.
- $|r|$ close to $1$ means a strong linear relationship, $|r|$ close to $0$ means weak or no linear relationship.
- $r$ measures only linear association. A perfect parabola can have $r = 0$.
- $r$ is unitless and unchanged by linear rescaling of either variable (with a positive scale factor, such as a change of units).
Rough verbal scale for $|r|$, from largest to smallest: very strong, strong, moderate, weak.
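The definition and two of the properties above can be checked numerically. The sketch below is only an illustration, not how you would work in an exam: the function name `pearson_r` and the small datasets are assumptions made up for the demonstration, and it uses sample standard deviations (dividing by $n - 1$).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the definition, using sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / ((n - 1) * s_x * s_y)

# Hypothetical data, chosen only to illustrate the properties.
xs = [-2, -1, 0, 1, 2]
line = [3 * x + 1 for x in xs]      # exact straight line with positive slope
parabola = [x ** 2 for x in xs]     # exact curve, symmetric about x = 0

r_line = pearson_r(xs, line)          # perfect positive linear association
r_parabola = pearson_r(xs, parabola)  # perfect pattern, but not a linear one
# Rescaling x with a positive factor (a change of units) leaves r unchanged.
r_rescaled = pearson_r([100 * x + 5 for x in xs], line)

print(r_line, r_parabola, r_rescaled)
```

The straight line gives $r = 1$ (up to rounding) and the symmetric parabola gives $r = 0$, even though both patterns are perfectly predictable.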
The least-squares regression line
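The least-squares regression line of $y$ on $x$ is the line $\hat{y} = a + bx$ that minimises the sum of squared vertical residuals $\sum (y_i - \hat{y}_i)^2$. The solution is
$$b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}.$$
The line always passes through the point of means $(\bar{x}, \bar{y})$. Slope $b$ is the predicted change in $y$ per one-unit increase in $x$.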
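As a rough check of these claims, the sketch below fits a line $\hat{y} = a + bx$ to a small made-up dataset using the standard least-squares formulas (slope = sum of products of deviations divided by sum of squared $x$-deviations, which equals $r\,s_y/s_x$; intercept = $\bar{y} - b\,\bar{x}$). The names `fit_line` and `sum_squared_residuals` and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx          # algebraically the same as r * s_y / s_x
    a = y_bar - b * x_bar    # forces the line through the point of means
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
# Nudging either coefficient away from the fitted value makes the fit worse.
worse_intercept = sum_squared_residuals(xs, ys, a + 0.1, b)
worse_slope = sum_squared_residuals(xs, ys, a, b + 0.1)

print(a, b, best, worse_intercept, worse_slope)
```

For this dataset the means are $\bar{x} = 3$ and $\bar{y} = 4$, and the fitted line satisfies $a + 3b = 4$, so it passes through the point of means as expected.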
Prediction, interpolation and extrapolation
Once you have the line $\hat{y} = a + bx$, substitute any $x$ to predict $y$. Prediction inside the observed range of $x$ is called interpolation and is usually safe. Prediction outside the observed range is extrapolation and is risky: the linear pattern may not continue.
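One way to build the habit of checking the range is to label every prediction as you make it. The sketch below assumes a hypothetical fitted line with intercept `a` and slope `b` and a hypothetical observed range of $x$ values; the function name `predict` is also just an illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Return (prediction, kind) for the line y_hat = a + b * x.

    kind is "interpolation" when x lies inside the observed range
    [x_min, x_max], and "extrapolation" otherwise.
    """
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

# Hypothetical fitted line and observed x-range, for illustration only.
a, b = 2.2, 0.6
x_min, x_max = 1, 5

inside = predict(a, b, 3, x_min, x_max)    # x = 3 is within the data
outside = predict(a, b, 12, x_min, x_max)  # x = 12 is well beyond the data

print(inside, outside)
```

The arithmetic is identical in both cases. Only the label changes, which is the point: the line will always return a number, so the caveat has to come from you.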
Correlation is not causation
A strong $r$ tells you the two variables move together. It does not establish that one causes the other. Lurking variables, reverse causation, and pure coincidence can all produce strong correlations.
Worked examples
Reading a scatter plot
A plot of height (cm) against shoe size has points rising from lower left to upper right, tightly clustered. Direction positive, form linear, strength strong, no obvious outliers. A sensible estimate is a positive $r$ fairly close to $1$.
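Computing the regression line
Suppose a small dataset gives the summary statistics $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ and $r$.
Slope: $b = r\,\dfrac{s_y}{s_x}$.
Intercept: $a = \bar{y} - b\,\bar{x}$.
Line: $\hat{y} = a + bx$.
Predict $y$ when $x = x_0$: $\hat{y} = a + b\,x_0$.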
Interpreting slope and intercept
For a regression of exam mark $y$ on hours studied $x$ with line $\hat{y} = a + bx$:
- Slope $b$: each extra hour of study is associated with a predicted mark increase of $b$.
- Intercept $a$: the predicted mark for a student who studies $0$ hours. This is an extrapolation if no student in the data studied near $0$ hours, and should be treated with caution.
Spotting an outlier
A scatter plot of weight against height has one point well above the line. That point pulls the regression line upward and has a large residual. Refitting without it will usually increase $|r|$ and shift the slope. Outliers should be checked for data entry errors before any decision to remove.
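The effect of a single outlier can be seen by fitting twice, with and without the suspect point. The sketch below uses made-up heights (cm) and weights (kg); the function name `r_and_slope` and every data value are assumptions for illustration, not real measurements.

```python
import math

def r_and_slope(points):
    """Return (r, b): Pearson correlation and least-squares slope of y on x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return s_xy / math.sqrt(s_xx * s_yy), s_xy / s_xx

# Hypothetical (height, weight) pairs that follow a tight linear pattern.
clean = [(150, 50), (160, 57), (170, 63), (180, 70), (190, 77)]
# One hypothetical point sitting well above that pattern.
outlier = (185, 110)

r_with, slope_with = r_and_slope(clean + [outlier])
r_without, slope_without = r_and_slope(clean)

print(round(r_with, 3), round(slope_with, 3))
print(round(r_without, 3), round(slope_without, 3))
```

With these values, dropping the outlier raises $r$ noticeably and changes the slope, which matches the description above. It does not tell you whether dropping the point is justified; that still depends on checking the data.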
Common traps
Confusing strong with steep. A nearly horizontal line through tightly clustered points still has $|r|$ close to $1$. Strength is about closeness to the line, not slope size.
Treating a low $r$ as no relationship. $r$ measures only linear association. A clear curved pattern can give $r \approx 0$.
Extrapolating without warning. Predicting $y$ for $x$ values far outside the data range can give nonsense (negative weights, marks above $100$). Always check that the $x$ value you substitute sits inside the data range, or flag the caveat.
Swapping the slope formula. It is $b = r\,s_y/s_x$, not $b = r\,s_x/s_y$. The units must work out: rise over run.
Claiming causation. "Correlation does not imply causation" is a standard one-mark response to any question that asks what $r$ tells you about cause and effect.
In one sentence
For paired data, the Pearson correlation $r$ measures the strength and direction of a linear relationship, and the least-squares regression line $\hat{y} = a + bx$ with $b = r\,s_y/s_x$ and $a = \bar{y} - b\,\bar{x}$ is the best linear predictor of $y$ from $x$ within the observed range.
Past exam questions, worked
Real questions from past NESA papers on this dot point, with our answer explainer.
2022 HSC Q26, 4 marks
The scatter plot of weekly study hours $x$ and exam mark $y$ for ten students has $\bar{x} = 12$, $\bar{y} = 68$, $s_x = 3$, $s_y = 8$ and correlation $r = 0.75$. Find the equation of the least-squares regression line of $y$ on $x$, and use it to predict the mark of a student who studies 15 hours.
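The least-squares slope is $b = r\,\dfrac{s_y}{s_x} = 0.75 \times \dfrac{8}{3} = 2$.
The line passes through $(\bar{x}, \bar{y}) = (12, 68)$, so $a = 68 - 2 \times 12 = 44$.
Equation: $\hat{y} = 44 + 2x$.
Prediction at $x = 15$: $\hat{y} = 44 + 2 \times 15 = 74$.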
Markers reward the slope formula, the intercept calculation through the means, the equation written cleanly, and a numerical prediction with the right substitution.
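The arithmetic for this question can be double-checked in a few lines. The sketch below takes the summary statistics straight from the question and applies the slope formula $b = r\,s_y/s_x$ and the intercept formula $a = \bar{y} - b\,\bar{x}$; the variable names are just illustrative choices.

```python
# Summary statistics given in the question.
x_bar, y_bar = 12, 68
s_x, s_y = 3, 8
r = 0.75

b = r * s_y / s_x          # slope: rise (s_y) over run (s_x), scaled by r
a = y_bar - b * x_bar      # intercept: line must pass through the means
prediction = a + b * 15    # predicted mark for 15 hours of study

print(b, a, prediction)    # 2.0 44.0 74.0
```

Note that 15 hours is one standard deviation above the mean of 12, so unless the question says otherwise it is reasonable to treat it as inside the data range.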
2020 HSC Q27, 3 marks
A scatter plot of two variables shows points clustered tightly along a downward straight line. Estimate the Pearson correlation coefficient and explain what it tells you about the relationship.
Tight clustering means the points sit close to the line, so $|r|$ is close to $1$. The downward trend means $r$ is negative.
A reasonable estimate is a value of $r$ close to $-1$.
Interpretation: there is a very strong negative linear relationship. As one variable increases, the other decreases in an almost perfectly straight-line fashion. Correlation does not imply causation, only association.
Markers expect the sign, a magnitude close to $1$, and the caveat that correlation is not causation.