How do we describe and model the relationship between two numerical variables?

Construct scatter plots, calculate and interpret Pearson's correlation coefficient, and fit and use the least-squares regression line

A focused answer to the HSC Maths Advanced dot point on bivariate data. Scatter plots, the Pearson correlation coefficient, the least-squares regression line, prediction, and the limits of extrapolation, with worked examples and exam traps.

Generated by Claude Opus 4.8, 14 min answer

Reviewed by: AI editorial process; not yet individually human-reviewed

What this dot point is asking

NESA wants you to take a paired dataset, draw or read a scatter plot, calculate the Pearson correlation coefficient $r$, fit the least-squares regression line of $y$ on $x$, and use it to predict values. You also need to interpret what $r$ and the line do and do not tell you. The three skills form one pipeline: the scatter plot is the picture, $r$ puts a single number on how linear and how tight that picture is, and the regression line turns the relationship into a formula you can predict with. Read the plot well and the calculator work that follows is almost automatic.

The answer

[Figure: Scatterplot with least-squares line and strong positive correlation. Twelve data points rising from lower left to upper right, clustered closely around the least-squares regression line $y = 43 + 2.4x$, with faint residual stubs. Horizontal axis: study hours per week (0 to 24); vertical axis: mark out of 100 (40 to 100). Strong positive linear association, $r \approx 0.95$. The line is the best linear predictor of mark from hours.]

Scatter plots

A scatter plot displays paired data $(x_i, y_i)$ as points in the plane, one dot per observation, and is the first thing you draw because your eyes catch structure that a single number hides. By convention the independent (explanatory) variable goes on the $x$-axis and the dependent (response) variable on the $y$-axis. Never join the dots: a scatter plot is a cloud of separate observations, not a quantity changing over time, and joining them turns it into a line graph.

Read every scatter plot for four things:

  • Direction. Positive (the cloud rises to the right), negative (it falls to the right), or none.
  • Form. Linear (the points hug a straight band), curved, or no clear pattern.
  • Strength. How tightly the points cluster around the pattern. Strength is about scatter, not steepness: a gentle but tight band is strong, a steep but loose one is weak.
  • Outliers. Single points sitting far from the bulk.

The reason the picture comes first is that the correlation coefficient and the regression line both assume the relationship is a straight line. The scatter plot is the check that this assumption is reasonable before you let the calculator fit anything.

Pearson's correlation coefficient

The Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. It is defined by

$$r = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right),$$

where $s_x$ and $s_y$ are the sample standard deviations. The formula is a sum of products of $z$-scores: each point contributes a positive term when both coordinates sit on the same side of their means and a negative term when they sit on opposite sides, so the total is large and positive for a tidy upward cloud and large and negative for a tidy downward one. In practice you compute $r$ on your calculator from the data list, not by hand, but knowing what the formula is doing tells you why the value behaves as it does.
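To see the $z$-score reading of the formula in action, here is a minimal Python sketch that computes $r$ exactly as the definition is written. The five data pairs and the function name `pearson_r` are made up for illustration; they are not the twelve-student dataset behind the figures.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as a sum of z-score products, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)  # z_x * z_y for one point
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)

# Hypothetical data: weekly study hours and marks for five students.
hours = [2, 4, 6, 8, 10]
marks = [50, 56, 58, 68, 68]

r = pearson_r(hours, marks)
print(round(r, 3))  # 0.964
```

In this made-up dataset every point sits on the same side of both means (or exactly on the mean of $x$), so no term in the sum is negative and $r$ comes out close to $1$.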

Key properties:

  • $-1 \le r \le 1$.
  • $r > 0$ means positive linear association, $r < 0$ means negative.
  • $|r|$ close to $1$ means a strong linear relationship, $|r|$ close to $0$ means weak or no linear relationship.
  • $r$ measures only linear association. A perfect parabola can have $r = 0$.
  • $r$ is unitless and unchanged by any linear rescaling of either variable with a positive scale factor, so converting hours to minutes or marks to percentages leaves $r$ exactly the same.
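Two of these properties are easy to check numerically. The sketch below uses made-up data and a hypothetical helper `pearson_r` (one possible implementation, not a prescribed method) to show a perfect parabola giving $r = 0$ and a change of units from hours to minutes leaving $r$ unchanged.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from the z-score definition (sample standard deviations)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# A perfect parabola y = x^2, symmetric about x = 0: an exact relationship,
# but with no linear trend at all.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
r_parabola = pearson_r(xs, ys)

# Hypothetical study data, measured first in hours and then in minutes.
hours = [2, 4, 6, 8, 10]
marks = [50, 56, 58, 68, 68]
minutes = [60 * h for h in hours]
r_hours = pearson_r(hours, marks)
r_minutes = pearson_r(minutes, marks)

print(abs(r_parabola) < 1e-12)            # r is zero up to rounding error
print(abs(r_hours - r_minutes) < 1e-12)   # same r in either unit
```

The parabola is the reason a scatter plot comes before any calculation: the points follow a rule perfectly, yet $r$ reports nothing.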

Rough verbal scale: $|r| \ge 0.9$ very strong, $0.7 \le |r| < 0.9$ strong, $0.5 \le |r| < 0.7$ moderate, $|r| < 0.5$ weak.
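If it helps to see the scale as a rule, here is a small Python sketch that turns a value of $r$ into the wording an exam answer would use. The function name `describe_r` is hypothetical, and the cut-offs are just the rough ones quoted above, not an official standard.

```python
def describe_r(r):
    """Translate a correlation coefficient into a verbal description."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear association"
    size = abs(r)
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear association"

print(describe_r(0.95))   # very strong positive linear association
print(describe_r(-0.75))  # strong negative linear association
```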

The least-squares regression line

The least-squares regression line of $y$ on $x$ is the line $y = a + b x$ that minimises the sum of squared vertical residuals $\sum (y_i - (a + b x_i))^2$. The residuals are the vertical gaps from each point to the line; squaring them before adding stops positive and negative gaps from cancelling and penalises big misses heavily, so the result is the single straight line that fits the cloud best overall. The solution is

$$b = r \cdot \frac{s_y}{s_x}, \qquad a = \bar{y} - b \bar{x}.$$

Two facts fall straight out of these formulas and are worth committing to memory. First, the line always passes through the point of means $(\bar{x}, \bar{y})$, because substituting $x = \bar{x}$ gives $y = \bar{y}$. Second, the slope $b$ carries the sign of $r$ (since $s_x, s_y > 0$), so a positive correlation gives an upward line and a negative correlation a downward one. The slope itself is the predicted change in $y$ per one-unit increase in $x$.
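Both facts can be checked on a small example. The following Python sketch applies $b = r \cdot s_y / s_x$ and $a = \bar{y} - b\bar{x}$ to a made-up five-point dataset; the data and the function name `least_squares_line` are assumptions for illustration only.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b x of y on x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope carries the sign of r
    a = y_bar - b * x_bar    # forces the line through the point of means
    return a, b, r

# Hypothetical data: weekly study hours and marks for five students.
hours = [2, 4, 6, 8, 10]
marks = [50, 56, 58, 68, 68]

a, b, r = least_squares_line(hours, marks)
print(round(a, 2), round(b, 2))  # 45.6 2.4

# Fact 1: substituting the mean of x returns the mean of y.
at_mean = a + b * mean(hours)
print(round(at_mean, 6) == mean(marks))  # True

# Fact 2: slope and correlation share a sign.
print((b > 0) == (r > 0))  # True
```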

Prediction, interpolation and extrapolation

Once you have $y = a + b x$, substitute any $x$ to predict $y$. The prediction is the model's best estimate; the actual value usually differs a little because the line is a best fit over the whole dataset, not a guarantee for any single case. Prediction inside the observed range of $x$ is called interpolation and is usually safe. Prediction outside the observed range is extrapolation and is risky: nothing guarantees the linear pattern continues, and the further out you go the worse it can be (a study-hours model predicting a mark above $100$, or a negative weight, has clearly broken down).
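As a rough illustration of the interpolation check, the Python sketch below predicts from the line $y = 43 + 2.4x$ used in this answer's figures and labels each prediction. The observed range of 2 to 24 study hours is an assumption made for the example, and `predict` is a hypothetical helper name.

```python
def predict(a, b, x, x_min, x_max):
    """Return the predicted y and whether x lies inside the observed range."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

A, B = 43, 2.4          # line from the worked figures: y = 43 + 2.4x
X_MIN, X_MAX = 2, 24    # ASSUMED observed range of study hours

for x in (10, 30):
    mark, kind = predict(A, B, x, X_MIN, X_MAX)
    sensible = 0 <= mark <= 100   # a mark out of 100 cannot leave this range
    print(x, round(mark, 1), kind, "ok" if sensible else "model has broken down")
```

Under these assumptions, 10 hours is an interpolation with a predicted mark of about 67, while 30 hours is an extrapolation that predicts about 115, an impossible mark out of 100.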

Correlation is not causation

A strong $r$ tells you the two variables move together. It does not establish that one causes the other. Lurking variables, reverse causation, and pure coincidence can all produce strong correlations: ice-cream sales and shark attacks rise together every summer, but hot weather drives both. Use cautious language ("is associated with", "tends to") unless the question hands you an actual mechanism.

Build the analysis, stage by stage

Here is the full routine on one dataset: weekly study hours ($x$) against an HSC trial mark out of $100$ ($y$) for a class of $12$ students. Watch it go from a blank grid to a fitted line and an interpreted correlation.

Stage 1, set up and scale the axes, then plot the first points. Put the independent variable (study hours) on the horizontal axis and the response (mark) on the vertical axis. Choose scales that spread the data across the whole plot: hours run $0$ to about $24$, marks from the mid $40$s to the mid $90$s, so the mark axis need not start at zero. Then plot the first few pairs, here $(2, 48)$, $(4, 56)$ and $(5, 50)$, as single dots.

[Figure, Step 1: Set up the axes and plot the first points. Axes labelled study hours per week against trial mark out of one hundred, with the first three data pairs plotted as accent dots. Hours on the $x$-axis, mark on the $y$-axis. Plot $(2, 48)$, $(4, 56)$, $(5, 50)$ as single dots.]

Stage 2, plot every data pair. Continue until all $12$ points are on the plot, each pair one dot, left unconnected. The shape now emerges: a cloud running from the lower left to the upper right, which says the direction is positive and the form looks linear.

[Figure, Step 2: Plot every data pair. The full scatterplot with all twelve data pairs plotted as separate dots rising from lower left to upper right. Plot all 12 pairs as separate dots; never join them. The cloud rises to the right: a positive linear trend.]

Stage 3, fit the least-squares line. Enter the pairs into the calculator and read off the slope and intercept; here $b \approx 2.4$ and $a \approx 43$, so the line is $y = 43 + 2.4 x$. It passes through the point of means and is drawn straight through the middle of the cloud. The short dashed stubs are the residuals, the vertical gaps from each point to the line, and least squares is precisely the line that makes the total of their squares as small as possible.

[Figure, Step 3: Fit the least-squares regression line. The same scatterplot with the least-squares regression line $y = 43 + 2.4x$ drawn straight through the middle of the cloud, with faint dashed residual stubs from each point to the line. The calculator gives the best-fit line through the means. Each stub is a residual: the gap the line minimises.]
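The claim that least squares makes the total of the squared residuals as small as possible can be tested directly. This Python sketch fits a line to a made-up five-point dataset (not the twelve-student data in the figures), then nudges the intercept and slope and confirms the sum of squares only goes up. The helper name `sse` is hypothetical, and the slope is computed in the equivalent form $b = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$.

```python
from statistics import mean

def sse(a, b, xs, ys):
    """Sum of squared vertical residuals for the line y = a + b x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: weekly study hours and marks for five students.
hours = [2, 4, 6, 8, 10]
marks = [50, 56, 58, 68, 68]

# Least-squares coefficients for this dataset.
x_bar, y_bar = mean(hours), mean(marks)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
     / sum((x - x_bar) ** 2 for x in hours))
a = y_bar - b * x_bar

best = sse(a, b, hours, marks)

# Shift the line up, down, steeper and flatter: every change raises the total.
nudges = [(1, 0), (-1, 0), (0, 0.2), (0, -0.2)]
nudged = [sse(a + da, b + db, hours, marks) for da, db in nudges]

print(round(best, 2))                 # 17.6
print(all(v > best for v in nudged))  # True
```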

Stage 4, read the correlation from the scatter. The same stubs tell you the strength. They are all short, so the points sit close to the line, which means $|r|$ is close to $1$; the cloud rises, so $r$ is positive. The calculator confirms $r \approx 0.95$, a strong positive linear relationship. Strength and direction together are exactly what $r$ packages into one number.

[Figure, Step 4: Read the correlation from the scatter. The fitted line with short dashed residual stubs from each point to the line; the points sit close to the line, so the correlation coefficient is strong and positive, $r \approx 0.95$. Short stubs mean tight clustering around the line, so $r$ is close to $+1$: a strong positive relationship.]

How exam questions ask about bivariate data

The wording varies but the task is almost always one of these, and the verb tells you the method:

  • "Describe the relationship shown in the scatter plot." Give direction, form and strength, each with a one-line justification from the plot.
  • "Calculate the correlation coefficient" or "comment on the value of $r$." Read $r$ off the calculator, state its sign and magnitude, and translate it into words ("strong negative linear association").
  • "Find the equation of the least-squares regression line." Use $b = r \cdot s_y / s_x$ and $a = \bar{y} - b\bar{x}$ if given the summary statistics, or read $a$ and $b$ from the calculator if given the data, then write $y = a + b x$ explicitly.
  • "Use the line to predict $y$ when $x = \dots$" Substitute and state the result with units; note whether it is interpolation (safe) or extrapolation (flag it).
  • "Interpret the slope / intercept." The slope is the average change in $y$ per one-unit increase in $x$ with units; the intercept is the predicted $y$ at $x = 0$, with an extrapolation caveat if $x = 0$ sits outside the data.
  • "Does this show that $x$ causes $y$?" No: correlation is association, not cause. Mention a possible lurking variable and answer in cautious language.

Edge cases worth knowing

  • A low $r$ with an obvious pattern. A clear curve (a U-shape or a hump) can give $r \approx 0$, because $r$ only measures the linear part. The scatter plot, not $r$, is what tells you the relationship is real but non-linear, and a straight-line model would mislead.
  • A subgroup effect. Two distinct clouds plotted together (weekday and weekend data, say) can produce a misleading single $r$ and a meaningless single line. Describe or model the groups separately.
  • A prediction that is impossible. A model predicting a mark above $100$ or a negative quantity has been extrapolated past where it makes sense; report the value but state that the model has broken down there.
  • Reversing the variables. The least-squares line of $y$ on $x$ is not the line of $x$ on $y$. Fit with the response as $y$, matching what the question asks to predict.
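The last edge case is the easiest to overlook, so here is a short Python sketch of it. Using a made-up five-point dataset and a hypothetical helper `fit`, it fits the line of $y$ on $x$, then the line of $x$ on $y$, and shows that rearranging the first does not give the second.

```python
from statistics import mean

def fit(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data: weekly study hours and marks for five students.
hours = [2, 4, 6, 8, 10]
marks = [50, 56, 58, 68, 68]

a_yx, b_yx = fit(hours, marks)   # mark predicted from hours
a_xy, b_xy = fit(marks, hours)   # hours predicted from mark

# Rearranging mark = a + b * hours for hours would give slope 1 / b_yx,
# which is not the slope least squares finds when hours is the response.
print(round(1 / b_yx, 3), round(b_xy, 3))  # 0.417 0.387
```

The two slopes would agree only if every point lay exactly on one line, that is, if $|r| = 1$.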

Exam-style practice questions

Practice questions written in the style of NESA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

2022 HSC Q26, 4 marks
The scatter plot of weekly study hours $x$ and exam mark $y$ for ten students has $\bar{x} = 12$, $\bar{y} = 68$, $s_x = 3$, $s_y = 8$ and correlation $r = 0.75$. Find the equation of the least-squares regression line of $y$ on $x$, and use it to predict the mark of a student who studies 15 hours.
Worked answer

The least-squares slope is $b = r \cdot \frac{s_y}{s_x} = 0.75 \cdot \frac{8}{3} = 2$.

The line passes through $(\bar{x}, \bar{y}) = (12, 68)$, so $a = \bar{y} - b \bar{x} = 68 - 2(12) = 44$.

Equation: $y = 44 + 2 x$.

Prediction at $x = 15$: $y = 44 + 30 = 74$.

Markers reward the slope formula, the intercept calculation through the means, the equation written cleanly, and a numerical prediction with the right substitution.
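If you want to check this working away from the calculator, the same arithmetic can be written as a few lines of Python. This is only a verification sketch using the summary statistics given in the question; the variable names are arbitrary.

```python
# Summary statistics from the question.
x_bar, y_bar = 12, 68
s_x, s_y = 3, 8
r = 0.75

b = r * s_y / s_x          # slope: 0.75 * 8 / 3
a = y_bar - b * x_bar      # intercept: line passes through (12, 68)
prediction = a + b * 15    # predicted mark for 15 hours of study

print(f"y = {a:g} + {b:g}x; predicted mark at x = 15: {prediction:g}")
# y = 44 + 2x; predicted mark at x = 15: 74
```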

2020 HSC Q27, 3 marks
A scatter plot of two variables shows points clustered tightly along a downward straight line. Estimate the Pearson correlation coefficient and explain what it tells you about the relationship.
Worked answer

Tight clustering means the points sit close to the line, so $|r|$ is close to $1$. The downward trend means $r$ is negative.

A reasonable estimate is $r \approx -0.95$.

Interpretation: there is a very strong negative linear relationship. As one variable increases, the other decreases in an almost perfectly straight-line fashion. Correlation does not imply causation, only association.

Markers expect the sign, a magnitude close to $1$, and the caveat that correlation is not causation.
