
How do we fit the best straight line to data and use it to predict?

Determine and interpret the least-squares regression line, use it to make predictions, and assess fit using residuals.

How to find and interpret the least-squares line y = a + bx, use it for prediction, distinguish interpolation from extrapolation, and read residuals to judge the fit.

Generated by Claude Opus 4.7 (7 min answer)

Reviewed by: AI editorial process; not yet individually human-reviewed

Jump to a section
  1. What this dot point is asking
  2. The least-squares line
  3. Prediction, interpolation and extrapolation
  4. Using residuals to assess fit
  5. Linking back to correlation

What this dot point is asking

You must find the regression line (usually with a calculator), interpret its slope and intercept, use it to predict, and judge its fit using residuals.

The least-squares line

The least-squares line is written y = a + bx, where b is the slope and a is the vertical intercept. It is the unique line that makes the total of the squared vertical distances from the points to the line as small as possible.

In practice you read a and b from a calculator, but you must interpret both:

  • The slope b is the predicted change in y for each one-unit increase in x.
  • The intercept a is the predicted value of y when x = 0.
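To see what the calculator is doing behind the scenes, here is a minimal Python sketch of the standard least-squares formulas (slope = sum of cross-deviations divided by sum of squared x-deviations, intercept chosen so the line passes through the point of means). The data set (hours studied and test scores) and the function name least_squares_line are made up purely for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    # The least-squares line always passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
a, b = least_squares_line(hours, scores)
print(f"score = {a:.2f} + {b:.2f} * hours")  # score = 47.70 + 4.10 * hours
```

For this invented data the interpretation would read: each extra hour of study is associated with a predicted increase of 4.1 marks (slope), and a student who studies 0 hours is predicted to score 47.7 (intercept).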

Prediction, interpolation and extrapolation

Substituting an x-value inside the range of the data is interpolation, which is usually reliable. Substituting beyond the data range is extrapolation, which is risky because the linear pattern may not continue.
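The distinction can be sketched in a few lines of Python. The line y = 47.7 + 4.1x, the data range of x from 1 to 5, and the helper name predict are all hypothetical, chosen only to show the check.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y = a + b*x and label the prediction by where x falls."""
    y_hat = a + b * x
    # Inside the observed x-range is interpolation; outside is extrapolation.
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Hypothetical line y = 47.7 + 4.1x fitted to data with x from 1 to 5.
for x in (3, 9):
    y_hat, kind = predict(47.7, 4.1, x, x_min=1, x_max=5)
    print(f"x = {x}: predicted y = {y_hat:.1f} ({kind})")
```

The arithmetic is identical in both cases; only the trust you place in the answer changes.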

Using residuals to assess fit

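After fitting the line, residuals tell you how good the fit is. A residual is the actual y-value minus the y-value predicted by the line (residual = actual y - predicted y), so points above the line have positive residuals and points below it have negative residuals. A residual plot graphs each residual against x.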

  • If the residuals are scattered randomly above and below zero with no pattern, a linear model is appropriate.
  • If the residual plot shows a clear curve or pattern, the relationship is not really linear and a straight line is the wrong model.
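As a rough illustration of the second case, the sketch below fits a straight line to deliberately curved, made-up data (y equal to x squared) and prints the sign of each residual in order of x. The function names fit_line and residuals are illustrative only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def residuals(xs, ys):
    """Residual = actual y minus the y predicted by the least-squares line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical curved data (y = x squared), deliberately not linear.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
res = residuals(xs, ys)
print(res)  # [2.0, -1.0, -2.0, -1.0, 2.0]

# Signs in order of x: positive, then a run of negatives, then positive.
signs = "".join("+" if r > 0 else "-" for r in res)
print(signs)  # +---+
```

The residuals start positive, dip negative, then return positive. Plotted against x that is a U shape, which is the kind of pattern that rules out a linear model.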

Linking back to correlation

The coefficient of determination r^2 from the correlation work tells you the proportion of variation in y explained by the line. A high r^2 together with a patternless residual plot means the linear model is a strong fit.
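Both checks are needed because r^2 on its own can mislead. The sketch below computes r^2 for a straight-line fit as the square of the correlation coefficient, using made-up curved data (y equal to x squared); the function name r_squared is illustrative.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    # r = s_xy / sqrt(s_xx * s_yy), so r^2 = s_xy^2 / (s_xx * s_yy).
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical curved data (y = x squared), deliberately not linear.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(f"r^2 = {r_squared(xs, ys):.3f}")  # r^2 = 0.963
```

Here r^2 is about 0.96, which looks strong, yet the data are curved and a residual plot would show a U-shaped pattern. A high r^2 alone does not confirm that a line is the right model.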

Exam-style practice questions

Practice questions written in the style of SACE Board exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

2022 SACE Stage 2 (2 marks)
The area of Arctic sea ice (S million km^2) is recorded against the year of observation (D) from 1980 to 2020. Using correct variables, state the equation for the least squares regression line (line of best fit, y = ax + b) for the linear model.
Worked answer:

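Enter the (D, S) data pairs into a calculator's linear regression. With D as the explanatory variable and S as the response, the calculator gives approximately a = -0.0872 and b = 178. Note that in the question's form y = ax + b, a is the slope and b is the intercept, which is the reverse of the y = a + bx convention used earlier on this page.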

Writing it with the correct variables:
S = -0.0872 D + 178 (values rounded; small rounding differences are accepted).

Award 1 mark for the correct slope and intercept from the regression, and 1 mark for stating the equation using the context variables S and D rather than generic x and y. The negative slope reflects the steady decline in sea-ice area over time.

2023 SACE Stage 2 (1 mark)
Using the least squares regression line N = 2.01t + 79.5 for the number of potoroos N after t months in a breeding program, predict the time when the number of potoroos would reach 300.
Worked answer:

Substitute N = 300 and solve for t.

300 = 2.01 t + 79.5
300 - 79.5 = 2.01 t
220.5 = 2.01 t
t = 220.5 / 2.01 = 109.7 months.

So the model predicts about 110 months for the population to reach 300. The single mark is for correctly substituting 300 and solving to get approximately 110 months. Note this is an extrapolation well beyond the 60-month data range, so it should be treated with caution.
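The same rearrangement can be checked in Python. This is only a sketch: the function name time_to_reach is made up, and the 60-month upper limit is taken from the note above rather than from any data set.

```python
def time_to_reach(target, slope=2.01, intercept=79.5):
    """Rearrange N = slope*t + intercept to t = (N - intercept) / slope."""
    return (target - intercept) / slope


t = time_to_reach(300)
print(round(t, 1))  # 109.7

# Upper end of the data range, as stated in the worked answer above.
data_max_months = 60
print("extrapolation" if t > data_max_months else "interpolation")  # extrapolation
```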

2019 SACE Stage 2 (2 marks)
Netflix subscriber data (millions, y) is recorded against years since 2007 (x), with linear model y = 9.35x - 4.44 (r squared = 0.919). Using residuals, the r squared value, and the residual plots provided, discuss which of a linear or exponential model best fits the Netflix data.
Worked answer:

The exponential model is the better fit, and the discussion must use the supplied evidence.

r squared: the exponential model has an r squared closer to 1 than the linear model's 0.919, indicating it explains more of the variation.

Residual plot: the linear model's residuals show a clear curved (U-shaped) pattern, which signals that a straight line is not appropriate. The exponential model's residual plot is more randomly scattered about zero with no obvious pattern.

Conclusion: because the exponential residual plot shows no pattern and its r squared is higher, the exponential model fits the Netflix subscriber growth better. Award 1 mark for citing the residual-pattern evidence and 1 mark for a justified conclusion favouring the exponential model.

2021 SACE Stage 2 (1 mark)
After removing an outlier, the least squares regression equation for relative humidity H against temperature T is found. Predictions are then made for the temperature at 100% humidity and for the humidity at T = 32 degrees C. Both are extrapolations. Explain why one of these predictions is the less reliable of the two.
Worked answer:

Both predictions lie outside the range of the collected data, so both are extrapolations, but reliability depends on how far outside the data range each one is.

The recorded temperatures span roughly 12 to 24 degrees C and humidity roughly 33% to 84%. Predicting the temperature at 100% humidity, or the humidity at 32 degrees C, requires going beyond these limits.

The prediction that lies further from the observed data range is the less reliable, because the linear relationship is only verified within the sampled range and may not hold beyond it. Award the mark for identifying the further extrapolation and explaining that reliability drops the further you move outside the data range.