NSW Maths Advanced Syllabus dot point

How do we describe and model the relationship between two numerical variables?

Construct scatter plots, calculate and interpret Pearson's correlation coefficient, and fit and use the least-squares regression line

Q (2022 HSC Q26): The scatter plot of weekly study hours $x$ and exam mark $y$ for ten students has $\bar{x} = 12$, $\bar{y} = 68$, $s_x = 3$, $s_y = 8$ and correlation $r = 0.75$. Find the equation of the least-squares regression line of $y$ on $x$, and use it to predict the mark of a student who studies 15 hours.

The least-squares slope is $b = r \cdot \frac{s_y}{s_x} = 0.75 \cdot \frac{8}{3} = 2$. The line passes through $(\bar{x}, \bar{y}) = (12, 68)$, so $a = \bar{y} - b \bar{x} = 68 - 2(12) = 44$. Equation: $y = 44 + 2 x$. Prediction at $x = 15$: $y = 44 + 30 = 74$. Markers reward the slope formula, the intercept calculation through the means, the equation written cleanly, and a numerical prediction with the right substitution.

Q (2020 HSC Q27): A scatter plot of two variables shows points clustered tightly along a downward straight line. Estimate the Pearson correlation coefficient and explain what it tells you about the relationship.

Tight clustering means the points sit close to the line, so $|r|$ is close to $1$. The downward trend means $r$ is negative. A reasonable estimate is $r \approx -0.95$. Interpretation: there is a very strong negative linear relationship. As one variable increases, the other decreases in an almost perfectly straight-line fashion. Correlation does not imply causation, only association. Markers expect the sign, a magnitude close to $1$, and the caveat that correlation is not causation.

A focused answer to the HSC Maths Advanced dot point on bivariate data. It covers scatter plots, the Pearson correlation coefficient, the least-squares regression line, prediction, and the limits of extrapolation, with worked examples and exam traps.

Generated by Claude Opus. Reviewed by Better Tuition Academy. 9 min answer. Updated 2026-05-18.


What this dot point is asking

NESA wants you to take a paired dataset, draw or read a scatter plot, calculate the Pearson correlation coefficient $r$, fit the least-squares regression line of $y$ on $x$, and use it to predict values. You also need to interpret what $r$ and the line do and do not tell you.

The answer

Scatter plots

A scatter plot displays paired data $(x_i, y_i)$ as points in the plane. Read it for:

Direction. Positive (upward), negative (downward), or none.
Form. Linear, curved, or no clear pattern.
Strength. How tightly the points cluster around the pattern.
Outliers. Single points far from the bulk.

Pearson's correlation coefficient

The Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. It is defined by

$$r = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right),$$

where $s_x$ and $s_y$ are the sample standard deviations. In practice you compute $r$ on your calculator from the data list, not by hand.
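To see what the calculator is doing, the definition can be sketched directly in Python. This is a minimal illustration only: the function name `pearson_r` and the small dataset of hours and marks are made up for the example and do not come from any exam paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from the definition: sum of z-score products over n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1), matching s_x and s_y above.
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Multiply the standardised deviations pairwise, add them up, divide by n - 1.
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Illustrative (made-up) data: hours studied and exam marks for five students.
hours = [2, 4, 6, 8, 10]
marks = [50, 58, 61, 70, 76]
r = pearson_r(hours, marks)
print(round(r, 3))
```

The points rise almost in a straight line, so the value printed should be positive and close to $1$.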

Key properties:

$-1 \le r \le 1$.
$r > 0$ means positive linear association, $r < 0$ means negative.
$|r|$ close to $1$ means a strong linear relationship, $|r|$ close to $0$ means weak or no linear relationship.
$r$ measures only linear association. A perfect parabola can have $r = 0$.
$r$ is unitless and unchanged by a change of units (any linear rescaling with a positive scale factor) of either variable.
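Two of these properties are easy to check numerically. The sketch below uses a hypothetical helper `pearson_r` (an algebraically equivalent form of the definition, with the $n - 1$ factors cancelled) and made-up height and weight data; it is an illustration, not a required method.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# 1. A perfect parabola, symmetric about x = 0, has no linear association.
xs = [-3, -2, -1, 0, 1, 2, 3]
parabola = [x ** 2 for x in xs]
r_parabola = pearson_r(xs, parabola)

# 2. Changing units (cm to m, kg to g) leaves r unchanged.
heights_cm = [150, 160, 165, 172, 180]   # made-up data
weights_kg = [50, 56, 61, 68, 77]        # made-up data
r_metric = pearson_r(heights_cm, weights_kg)
r_rescaled = pearson_r([h / 100 for h in heights_cm],
                       [w * 1000 for w in weights_kg])

# 3. Multiplying one variable by a negative number flips the sign of r.
r_flipped = pearson_r([-h for h in heights_cm], weights_kg)

print(r_parabola, r_metric, r_rescaled, r_flipped)
```

The parabola is a perfect relationship, yet its $r$ is zero because the relationship is not linear.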

Rough verbal scale: $|r| \ge 0.9$ very strong, $0.7 \le |r| < 0.9$ strong, $0.5 \le |r| < 0.7$ moderate, $|r| < 0.5$ weak.

The least-squares regression line

The least-squares regression line of $y$ on $x$ is the line $y = a + b x$ that minimises the sum of squared vertical residuals $\sum (y_i - (a + b x_i))^2$. The solution is

$$b = r \cdot \frac{s_y}{s_x}, \qquad a = \bar{y} - b \bar{x}.$$

The line always passes through the point of means $(\bar{x}, \bar{y})$. Slope $b$ is the predicted change in $y$ per one-unit increase in $x$.
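As a rough check of these formulas, the sketch below fits a line to a small made-up dataset, confirms that it passes through the point of means, and compares its sum of squared residuals with two nearby lines. The names `fit_line` and `sum_sq_residuals` are hypothetical helpers written for this illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (a, b) for y = a + b x using b = r * s_y / s_x, a = ybar - b * xbar."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    r = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = mean_y - b * mean_x
    return a, b


def sum_sq_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative (made-up) data: hours studied and exam marks.
hours = [2, 4, 6, 8, 10]
marks = [50, 58, 61, 70, 76]

a, b = fit_line(hours, marks)
mean_x, mean_y = sum(hours) / len(hours), sum(marks) / len(marks)

best = sum_sq_residuals(a, b, hours, marks)
shifted = sum_sq_residuals(a + 1, b, hours, marks)    # same slope, moved up
steeper = sum_sq_residuals(a, b + 0.1, hours, marks)  # same intercept, steeper

print(a, b, best, shifted, steeper)
```

Any other choice of $a$ and $b$ gives a larger sum of squared residuals, which is what "least squares" means.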

Prediction, interpolation and extrapolation

Once you have $y = a + b x$, substitute any $x$ to predict $y$. Prediction inside the observed range of $x$ is called interpolation and is usually safe. Prediction outside the observed range is extrapolation and is risky: the linear pattern may not continue.
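The distinction can be made mechanical. The sketch below uses the line $y = 44 + 2x$ for exam mark against study hours and assumes, purely for illustration, that the observed hours ran from 6 to 18; the function `predict` and that range are assumptions, not part of any exam question.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y = a + b x and label the prediction by where x sits."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Line y = 44 + 2x; the observed range of 6 to 18 hours is assumed.
A, B = 44, 2
X_MIN, X_MAX = 6, 18

inside = predict(A, B, 15, X_MIN, X_MAX)   # 15 hours is within the data
outside = predict(A, B, 40, X_MIN, X_MAX)  # 40 hours is far beyond the data

print(inside)
print(outside)
```

At 40 hours the line predicts a mark of 124, which is impossible on a 100-mark exam. That is the kind of nonsense extrapolation can produce.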

Correlation is not causation

A strong $r$ tells you the two variables move together. It does not establish that one causes the other. Lurking variables, reverse causation, and pure coincidence can all produce strong correlations.

Worked examples

Reading a scatter plot

A plot of height (cm) against shoe size has points rising from lower left to upper right, tightly clustered. Direction positive, form linear, strength strong, no obvious outliers. Estimate $r \approx 0.9$.

Computing $r$ and the regression line

Suppose a small dataset gives $\bar{x} = 5$, $\bar{y} = 20$, $s_x = 2$, $s_y = 6$ and $r = 0.8$.

Slope: $b = 0.8 \cdot \frac{6}{2} = 2.4$.

Intercept: $a = 20 - 2.4 \cdot 5 = 20 - 12 = 8$.

Line: $y = 8 + 2.4 x$.

Predict $y$ when $x = 7$: $y = 8 + 2.4 \cdot 7 = 24.8$.
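The same arithmetic can be checked in a few lines of Python. The helper name `line_from_summary` is hypothetical; the numbers are the summary statistics of this worked example. Results are compared with a small tolerance because floating-point arithmetic may not give exactly $2.4$ or $24.8$.

```python
from math import isclose


def line_from_summary(mean_x, mean_y, s_x, s_y, r):
    """Return (a, b) for y = a + b x from summary statistics."""
    b = r * s_y / s_x        # slope: r times (s_y over s_x)
    a = mean_y - b * mean_x  # intercept: forces the line through the means
    return a, b


a, b = line_from_summary(mean_x=5, mean_y=20, s_x=2, s_y=6, r=0.8)
prediction = a + b * 7

print(isclose(b, 2.4), isclose(a, 8), isclose(prediction, 24.8))
```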

Interpreting slope and intercept

For a regression of exam mark $y$ on hours studied $x$ with $y = 44 + 2 x$:

Slope $b = 2$ : each extra hour of study is associated with a predicted mark increase of $2$ .
Intercept $a = 44$ : the predicted mark for a student who studies $0$ hours. This is an extrapolation if no student in the data studied near $0$ hours, and should be treated with caution.

Spotting an outlier

A scatter plot of weight against height has one point well above the line. That point pulls the regression line toward itself and has a large residual. Refitting without it will usually increase $|r|$ and change the slope. Check an outlier for data entry errors before deciding whether to remove it.
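The effect of a single outlier can be sketched with made-up data. Below, seven nearly collinear height and weight points are fitted with and without one extra point at $(160, 90)$; the helper `fit` and all data values are invented for the illustration.

```python
from math import sqrt


def fit(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b = s_xy / s_xx  # equals r * s_y / s_x
    a = mean_y - b * mean_x
    r = s_xy / sqrt(s_xx * s_yy)
    return a, b, r


# Made-up data: heights (cm) and weights (kg) that are nearly collinear.
heights = [150, 155, 160, 165, 170, 175, 180]
weights = [50, 55, 57, 63, 66, 71, 74]

# The same data plus one point well above the trend.
heights_out = heights + [160]
weights_out = weights + [90]

a_clean, b_clean, r_clean = fit(heights, weights)
a_out, b_out, r_out = fit(heights_out, weights_out)

# Predicted weight at 160 cm under each fit.
pred_clean = a_clean + b_clean * 160
pred_out = a_out + b_out * 160

print(round(r_clean, 2), round(r_out, 2))
print(round(b_clean, 2), round(b_out, 2))
```

In this sketch the outlier drags the fitted line up near $x = 160$, changes the slope, and lowers $r$ sharply, even though it is only one point in eight.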

Common traps

Confusing strong with steep. A nearly horizontal line through tightly clustered points still has $|r|$ close to $1$. Strength is about closeness to the line, not slope size.

Treating a low $r$ as no relationship. $r$ measures only linear association. A clear curved pattern can give $r \approx 0$.

Extrapolating without warning. Predicting $y$ for $x$ values far outside the data range can give nonsense (negative weights, marks above $100$). Always check that the $x$ value you substitute sits inside the data range, or flag the caveat.

Swapping the slope formula. It is $b = r \cdot \frac{s_y}{s_x}$, not $r \cdot \frac{s_x}{s_y}$. The units must work out: rise over run.

Claiming causation. "Correlation does not imply causation" is a standard one-mark response to any question that asks what $r$ tells you about cause and effect.

In one sentence

For paired data, the Pearson correlation $r$ measures the strength and direction of a linear relationship, and the least-squares regression line $y = a + b x$ with $b = r \cdot s_y / s_x$ and $a = \bar{y} - b \bar{x}$ is the best linear predictor of $y$ from $x$ within the observed range.

Past exam questions, worked

Real questions from past NESA papers on this dot point, with our answer explainer.

2022 HSC Q26 (4 marks): The scatter plot of weekly study hours $x$ and exam mark $y$ for ten students has $\bar{x} = 12$, $\bar{y} = 68$, $s_x = 3$, $s_y = 8$ and correlation $r = 0.75$. Find the equation of the least-squares regression line of $y$ on $x$, and use it to predict the mark of a student who studies 15 hours.


The least-squares slope is $b = r \cdot \frac{s_y}{s_x} = 0.75 \cdot \frac{8}{3} = 2$.

The line passes through $(\bar{x}, \bar{y}) = (12, 68)$, so $a = \bar{y} - b \bar{x} = 68 - 2(12) = 44$.

Equation: $y = 44 + 2 x$.

Prediction at $x = 15$: $y = 44 + 30 = 74$.

Markers reward the slope formula, the intercept calculation through the means, the equation written cleanly, and a numerical prediction with the right substitution.

2020 HSC Q27 (3 marks): A scatter plot of two variables shows points clustered tightly along a downward straight line. Estimate the Pearson correlation coefficient and explain what it tells you about the relationship.