How do we describe and model the relationship between two numerical variables?
Construct scatter plots, calculate and interpret Pearson's correlation coefficient, and fit and use the least-squares regression line
A focused answer to the HSC Maths Advanced dot point on bivariate data. Scatter plots, the Pearson correlation coefficient, the least-squares regression line, prediction, and the limits of extrapolation, with worked examples and exam traps.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
NESA wants you to take a paired dataset, draw or read a scatter plot, calculate the Pearson correlation coefficient $r$, fit the least-squares regression line of $y$ on $x$, and use it to predict values. You also need to interpret what $r$ and the line do and do not tell you. The three skills form one pipeline: the scatter plot is the picture, $r$ puts a single number on how linear and how tight that picture is, and the regression line turns the relationship into a formula you can predict with. Read the plot well and the calculator work that follows is almost automatic.
The answer
Scatter plots
A scatter plot displays paired data as points in the plane, one dot per observation, and is the first thing you draw because your eyes catch structure that a single number hides. By convention the independent (explanatory) variable goes on the $x$-axis and the dependent (response) variable on the $y$-axis. Never join the dots: a scatter plot is a cloud of separate observations, not a quantity changing over time, and joining them turns it into a line graph.
Read every scatter plot for four things:
- Direction. Positive (the cloud rises to the right), negative (it falls to the right), or none.
- Form. Linear (the points hug a straight band), curved, or no clear pattern.
- Strength. How tightly the points cluster around the pattern. Strength is about scatter, not steepness: a gentle but tight band is strong, a steep but loose one is weak.
- Outliers. Single points sitting far from the bulk.
The reason the picture comes first is that the correlation coefficient and the regression line both assume the relationship is a straight line. The scatter plot is the check that this assumption is reasonable before you let the calculator fit anything.
Pearson's correlation coefficient
The Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. It is defined by
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
where $s_x$ and $s_y$ are the sample standard deviations. The formula is a sum of products of $z$-scores: each point contributes a positive term when both coordinates sit on the same side of their means and a negative term when they sit on opposite sides, so the total is large and positive for a tidy upward cloud and large and negative for a tidy downward one. In practice you compute $r$ on your calculator from the data list, not by hand, but knowing what the formula is doing tells you why the value behaves as it does.
Key properties:
- $-1 \le r \le 1$.
- $r > 0$ means positive linear association, $r < 0$ means negative.
- $|r|$ close to $1$ means a strong linear relationship, $|r|$ close to $0$ means weak or no linear relationship.
- $r$ measures only linear association. A perfect parabola can have $r = 0$.
- $r$ is unitless and unchanged by positive linear rescaling of either variable, so converting hours to minutes or marks to percentages leaves $r$ exactly the same.
Rough verbal scale: very strong, strong, moderate, weak.
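The properties above can be checked numerically. The following is a minimal sketch, assuming Python's standard `statistics` module; the function name `pearson_r` and both datasets are hypothetical, invented for illustration and not taken from any exam. It computes $r$ as a sum of products of $z$-scores, then checks that converting hours to minutes leaves $r$ unchanged and that a perfect parabola gives $r$ of essentially zero.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented for illustration only.
hours = [2, 4, 5, 7, 9, 10]
marks = [52, 58, 61, 70, 74, 80]

r = pearson_r(hours, marks)

# Rescaling hours to minutes is a positive linear rescaling, so r should not move.
r_minutes = pearson_r([60 * h for h in hours], marks)

# A perfect parabola, symmetric about x = 0: an exact relationship with no linear part.
xs = [-3, -2, -1, 0, 1, 2, 3]
r_parabola = pearson_r(xs, [x * x for x in xs])

print(round(r, 3), round(r_minutes, 3), round(r_parabola, 3))
```

In an exam the calculator does this work; the sketch is only there to show why the value behaves the way the list of properties says it does.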
The least-squares regression line
The least-squares regression line of $y$ on $x$ is the line $\hat{y} = mx + c$ that minimises the sum of squared vertical residuals $\sum (y_i - \hat{y}_i)^2$. The residuals are the vertical gaps from each point to the line; squaring them before adding stops positive and negative gaps cancelling and penalises big misses heavily, so the result is the single straight line that fits the cloud best overall. The solution is
$$m = r\,\frac{s_y}{s_x}, \qquad c = \bar{y} - m\bar{x}$$
Two facts fall straight out of these formulas and are worth committing to memory. First, the line always passes through the point of means $(\bar{x}, \bar{y})$, because substituting $x = \bar{x}$ gives $\hat{y} = m\bar{x} + c = m\bar{x} + \bar{y} - m\bar{x} = \bar{y}$. Second, the slope carries the sign of $r$ (since $s_x > 0$ and $s_y > 0$), so a positive correlation gives an upward line and a negative correlation a downward one. The slope itself is the predicted change in $y$ per one-unit increase in $x$.
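As a rough check of these facts, the sketch below builds the line from the summary statistics (slope equal to $r$ times $s_y / s_x$, intercept chosen so the line passes through the means) and confirms two things: the line goes through the point of means, and nudging the slope or intercept makes the sum of squared residuals larger. The data and the helper names `least_squares_line` and `sse` are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # forces the line through the means
    return slope, intercept

def sse(xs, ys, slope, intercept):
    """Sum of squared vertical residuals for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
hours = [2, 4, 5, 7, 9, 10]
marks = [52, 58, 61, 70, 74, 80]

slope, intercept = least_squares_line(hours, marks)
at_mean = slope * mean(hours) + intercept  # should equal mean(marks)
best = sse(hours, marks, slope, intercept)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"line at mean of x = {at_mean:.3f}, mean of y = {mean(marks):.3f}")
```

Comparing `best` with the value of `sse` for a slightly different slope or intercept is a quick way to see what "least squares" means in practice.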
Prediction, interpolation and extrapolation
Once you have the equation of the line, substitute any $x$ to predict $y$. The prediction is the model's best estimate; the actual value usually differs a little because the line is a best fit over the whole dataset, not a guarantee for any single case. Prediction inside the observed range of $x$ is called interpolation and is usually safe. Prediction outside the observed range is extrapolation and is risky: nothing guarantees the linear pattern continues, and the further out you go the worse it can be (a study-hours model predicting a mark above the maximum possible, or a negative weight, has clearly broken down).
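The distinction can be made mechanical. The sketch below is illustrative only: the fitted line, the observed range of $x$ and the maximum mark are all assumed values, and `predict` is a hypothetical helper. It labels each prediction as interpolation or extrapolation and flags a value that cannot occur.

```python
def predict(x, slope, intercept, x_min, x_max):
    """Predict y from the fitted line and label the prediction by where x falls."""
    y_hat = slope * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

# Hypothetical fitted line and observed range of x (illustrative values only).
SLOPE, INTERCEPT = 3.4, 44.6
X_MIN, X_MAX = 2, 10
MAX_MARK = 100  # assumed maximum possible mark

for hours in (6, 20):
    y_hat, kind = predict(hours, SLOPE, INTERCEPT, X_MIN, X_MAX)
    broken = y_hat > MAX_MARK or y_hat < 0
    print(f"{hours} h: predicted mark {y_hat:.1f} ({kind})"
          + (" - impossible value, model has broken down" if broken else ""))
```

With these assumed numbers, 6 hours sits inside the observed range, while 20 hours lies well outside it and produces a mark above the assumed maximum.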
Correlation is not causation
A strong $r$ tells you the two variables move together. It does not establish that one causes the other. Lurking variables, reverse causation, and pure coincidence can all produce strong correlations: ice-cream sales and shark attacks rise together every summer, but hot weather drives both. Use cautious language ("is associated with", "tends to") unless the question hands you an actual mechanism.
Build the analysis, stage by stage
Here is the full routine on one dataset: weekly study hours ($x$) against HSC trial mark ($y$) for a class of students. Watch it go from a blank grid to a fitted line and an interpreted correlation.
Stage 1, set up and scale the axes, then plot the first points. Put the independent variable (study hours) on the horizontal axis and the response (mark) on the vertical axis. Choose scales that spread the data across the whole plot: the marks occupy only part of the possible range, so the mark axis need not start at zero. Then plot the first few pairs as single dots.
Stage 2, plot every data pair. Continue until all points are on the plot, each pair one dot, left unconnected. The shape now emerges: a cloud running from the lower left to the upper right, which says the direction is positive and the form looks linear.
Stage 3, fit the least-squares line. Enter the pairs into the calculator, read off the slope and intercept, and write down the equation of the line. It passes through the point of means $(\bar{x}, \bar{y})$ and is drawn straight through the middle of the cloud. The short dashed stubs are the residuals, the vertical gaps from each point to the line, and least squares is precisely the line that makes the total of their squares as small as possible.
Stage 4, read the correlation from the scatter. The same stubs tell you the strength. They are all short, so the points sit close to the line, which means $|r|$ is close to $1$; the cloud rises, so $r$ is positive. The calculator confirms a strong positive linear relationship. Strength and direction together are exactly what $r$ packages into one number.
How exam questions ask about bivariate data
The wording varies but the task is almost always one of these, and the verb tells you the method:
- "Describe the relationship shown in the scatter plot." Give direction, form and strength, each with a one-line justification from the plot.
- "Calculate the correlation coefficient" or "comment on the value of $r$." Read $r$ off the calculator, state its sign and magnitude, and translate it into words ("strong negative linear association").
- "Find the equation of the least-squares regression line." Use slope $m = r\,\dfrac{s_y}{s_x}$ and intercept $c = \bar{y} - m\bar{x}$ if given the summary statistics, or read the slope and intercept from the calculator if given the data, then write the equation $\hat{y} = mx + c$ explicitly.
- "Use the line to predict $y$ for a given value of $x$." Substitute and state the result with units; note whether it is interpolation (safe) or extrapolation (flag it).
- "Interpret the slope / intercept." The slope is the average change in $y$ per one-unit increase in $x$, with units; the intercept is the predicted $y$ at $x = 0$, with an extrapolation caveat if $x = 0$ sits outside the data.
- "Does this show that $x$ causes $y$?" No: correlation is association, not cause. Mention a possible lurking variable and answer in cautious language.
Edge cases worth knowing
- A low $r$ with an obvious pattern. A clear curve (a U-shape or a hump) can give $r \approx 0$, because $r$ only measures the linear part. The scatter plot, not $r$, is what tells you the relationship is real but non-linear, and a straight-line model would mislead.
- A subgroup effect. Two distinct clouds plotted together (weekday and weekend data, say) can produce a misleading single $r$ and a meaningless single line. Describe or model the groups separately.
- A prediction that is impossible. A model predicting a mark above the maximum possible or a negative quantity has been extrapolated past where it makes sense; report the value but state that the model has broken down there.
- Reversing the variables. The least-squares line of $y$ on $x$ is not the line of $x$ on $y$. Fit with the response as $y$, matching what the question asks to predict.
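The last point is easy to see numerically. The following sketch, on hypothetical data invented for illustration, fits the line both ways with a small helper `fit` (an assumed name). It uses the equivalent form of the slope, the sum of products of deviations divided by the sum of squared deviations of the explanatory variable, which equals $r\,s_y / s_x$. Rearranging the line of $x$ on $y$ into the form "$y$ in terms of $x$" gives a different slope unless $|r| = 1$.

```python
from statistics import mean

def fit(xs, ys):
    """Least-squares line predicting ys from xs: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx  # same value as r * s_y / s_x
    return slope, y_bar - slope * x_bar

# Hypothetical data with a moderate amount of scatter (illustration only).
hours = [2, 4, 5, 7, 9, 10]
marks = [55, 50, 68, 60, 80, 71]

m_yx, c_yx = fit(hours, marks)  # line of y on x: predicts marks from hours
m_xy, c_xy = fit(marks, hours)  # line of x on y: predicts hours from marks

# Slope of the x-on-y line once rearranged to give y in terms of x.
rearranged_slope = 1 / m_xy

print(f"y on x slope: {m_yx:.3f}")
print(f"x on y slope, rearranged: {rearranged_slope:.3f}")
print(f"product of the two fitted slopes (equals r squared): {m_yx * m_xy:.3f}")
```

The two lines cross at the point of means but tilt differently, so predicting a mark from hours with the wrong one gives a different answer.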
Exam-style practice questions
Practice questions written in the style of NESA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
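2022 HSC Q26 (4 marks)
The scatter plot of weekly study hours $x$ and exam mark $y$ for ten students is summarised by the means $\bar{x}$ and $\bar{y}$, the standard deviations $s_x$ and $s_y$, and the correlation $r$. Find the equation of the least-squares regression line of $y$ on $x$, and use it to predict the mark of a student who studies 15 hours.
The least-squares slope is $m = r\,\dfrac{s_y}{s_x}$.
The line passes through $(\bar{x}, \bar{y})$, so the intercept is $c = \bar{y} - m\bar{x}$.
Equation: $\hat{y} = mx + c$.
Prediction at $x = 15$: $\hat{y} = 15m + c$.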
Markers reward the slope formula, the intercept calculation through the means, the equation written cleanly, and a numerical prediction with the right substitution.
2020 HSC Q27 (3 marks)
A scatter plot of two variables shows points clustered tightly along a downward straight line. Estimate the Pearson correlation coefficient and explain what it tells you about the relationship.
Tight clustering means the points sit close to the line, so $|r|$ is close to $1$. The downward trend means $r$ is negative.
A reasonable estimate is a value of $r$ close to $-1$.
Interpretation: there is a very strong negative linear relationship. As one variable increases, the other decreases in an almost perfectly straight-line fashion. Correlation does not imply causation, only association.
Markers expect the sign, a magnitude close to $1$, and the caveat that correlation is not causation.