How do we fit the best straight line to data and use it to predict?
Determine and interpret the least-squares regression line, use it to make predictions, and assess fit using residuals.
How to find and interpret the least-squares line y = a + bx, use it for prediction, distinguish interpolation from extrapolation, and read residuals to judge the fit.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
You must find the regression line (usually with a calculator), interpret its slope and intercept, use it to predict, and judge its fit using residuals.
The least-squares line
The least-squares line is written y = a + bx, where b is the slope and a is the vertical intercept. It is the unique line that makes the total of the squared vertical distances from the points to the line as small as possible.
In practice you read a and b from a calculator, but you must interpret both:
- The slope b is the predicted change in y for each one-unit increase in x.
- The intercept a is the predicted value of y when x = 0.
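To see what the calculator is doing, here is a minimal sketch of the standard least-squares formulas, b = Sxy / Sxx and a = (mean of y) - b × (mean of x). The data set (hours studied against test score) and the function name `least_squares_line` are hypothetical, chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sxy: sum of products of deviations; Sxx: sum of squared x deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = y_mean - b * x_mean    # intercept: the line passes through (x_mean, y_mean)
    return a, b


# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

a, b = least_squares_line(hours, score)
print(f"y = {a:.2f} + {b:.2f}x")  # y = 47.70 + 4.10x
```

In context, the slope says each extra hour is associated with a predicted increase of about 4.1 marks, and the intercept says the predicted score for 0 hours is about 47.7.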
Prediction, interpolation and extrapolation
Substituting an x-value inside the range of the data is interpolation, which is usually reliable. Substituting an x-value beyond the data range is extrapolation, which is risky because the linear pattern may not continue.
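The distinction can be sketched in a few lines: make the prediction, then check whether the x-value sits inside the observed range. The fitted line (a = 47.7, b = 4.1), the data range and both function names are hypothetical.

```python
def predict(a, b, x_new):
    """Predicted y from the line y = a + b*x."""
    return a + b * x_new


def prediction_type(x_new, xs):
    """Label a prediction by whether x_new lies inside the observed x range."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"
    return "extrapolation"


# Hypothetical observed x-values and fitted line, for illustration only.
hours = [1, 2, 3, 4, 5]
a, b = 47.7, 4.1

for x_new in (3.5, 12):
    print(x_new, round(predict(a, b, x_new), 2), prediction_type(x_new, hours))
```

The line will happily return a number for x = 12, but nothing in the data supports the linear pattern that far out, which is why the label matters more than the arithmetic.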
Using residuals to assess fit
After fitting the line, residuals tell you how good the fit is. A residual is the observed y-value minus the value predicted by the line. A residual plot graphs each residual against x.
- If the residuals are scattered randomly above and below zero with no pattern, a linear model is appropriate.
- If the residual plot shows a clear curve or pattern, the relationship is not really linear and a straight line is the wrong model.
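As a rough illustration of a patterned residual plot, the sketch below computes residuals (observed minus predicted) for a hypothetical data set that actually follows y = x². The line y = -7 + 6x is the least-squares line for these five points; the data and the function name `residuals` are assumptions made for the example.

```python
def residuals(xs, ys, a, b):
    """Residual = observed y minus predicted y, for the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical curved data (y = x squared), for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

# Least-squares line for these points: y = -7 + 6x.
a, b = -7, 6

print(residuals(xs, ys, a, b))  # [2, -1, -2, -1, 2]
```

The residuals run positive, negative, negative, negative, positive. That U-shape is the kind of pattern that signals a straight line is the wrong model, even though the residuals still sum to zero.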
Linking back to correlation
The coefficient of determination r² (r squared) from the correlation work tells you the proportion of variation in y explained by the line. A high r² together with a patternless residual plot means the linear model is a strong fit.
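One way to see "proportion of variation explained" is to compute it as 1 - (sum of squared residuals) / (total sum of squares about the mean of y); for a least-squares line this equals the square of the correlation coefficient. The sketch below uses hypothetical data and a hypothetical fitted line (a = 47.7, b = 4.1).

```python
def r_squared(xs, ys, a, b):
    """Proportion of variation in y explained by the line y = a + b*x."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    ss_tot = sum((y - y_mean) ** 2 for y in ys)                   # total variation
    return 1 - ss_res / ss_tot


# Hypothetical data and its least-squares line, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
a, b = 47.7, 4.1

r2 = r_squared(hours, score, a, b)
print(f"r squared = {r2:.3f}")  # r squared = 0.989
```

Here about 98.9% of the variation in score is explained by the linear relationship with hours. A high value alone is not enough: the residual plot still needs to be checked for a pattern.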
Exam-style practice questions
Practice questions written in the style of SACE Board exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
2022 SACE Stage 2 (2 marks)
The area of Arctic sea ice (S million km²) is recorded against the year of observation (D) from 1980 to 2020. Using correct variables, state the equation for the least squares regression line (line of best fit, y = ax + b) for the linear model.
Worked answer:
Enter the (D, S) data pairs into a calculator's linear regression. With D as the explanatory variable and S as the response, the calculator gives approximately a = -0.0872 and b = 178.
Writing it with the correct variables:
S = -0.0872 D + 178 (values rounded; small rounding differences are accepted).
Award 1 mark for the correct slope and intercept from the regression, and 1 mark for stating the equation using the context variables S and D rather than generic x and y. The negative slope reflects the steady decline in sea-ice area over time.
2023 SACE Stage 2 (1 mark)
Using the least squares regression line N = 2.01t + 79.5 for the number of potoroos N after t months in a breeding program, predict the time when the number of potoroos would reach 300.
Worked answer:
Substitute N = 300 and solve for t.
300 = 2.01 t + 79.5
300 - 79.5 = 2.01 t
220.5 = 2.01 t
t = 220.5 / 2.01 = 109.7 months.
So the model predicts about 110 months for the population to reach 300. The single mark is for correctly substituting 300 and solving to get approximately 110 months. Note this is an extrapolation well beyond the 60-month data range, so it should be treated with caution.
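The same rearrangement can be checked with a short sketch. The slope, intercept, target and 60-month data range come from the question and answer above; the function name `time_to_reach` is hypothetical.

```python
def time_to_reach(target, slope, intercept):
    """Solve target = slope * t + intercept for t."""
    return (target - intercept) / slope


DATA_RANGE_MONTHS = 60  # upper end of the observed data, as stated in the worked answer

t = time_to_reach(300, 2.01, 79.5)
print(round(t, 1))  # 109.7

if t > DATA_RANGE_MONTHS:
    print("extrapolation: treat with caution")
```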
2019 SACE Stage 2 (2 marks)
Netflix subscriber data (millions, y) is recorded against years since 2007 (x), with linear model y = 9.35x - 4.44 (r squared = 0.919). Using residuals, the r squared value, and the residual plots provided, discuss which of a linear or exponential model best fits the Netflix data.
Worked answer:
The exponential model is the better fit, and the discussion must use the supplied evidence.
r squared: the exponential model has an r squared closer to 1 than the linear model's 0.919, indicating it explains more of the variation.
Residual plot: the linear model's residuals show a clear curved (U-shaped) pattern, which signals that a straight line is not appropriate. The exponential model's residual plot is more randomly scattered about zero with no obvious pattern.
Conclusion: because the exponential residual plot shows no pattern and its r squared is higher, the exponential model fits the Netflix subscriber growth better. Award 1 mark for citing the residual-pattern evidence and 1 mark for a justified conclusion favouring the exponential model.
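The comparison can be sketched on made-up data. The values below are hypothetical (they double each step) and are not the Netflix figures, which are not reproduced here. The exponential model is fitted by running linear regression on ln(y) against x, with its r squared measured on that log scale; that is an assumption about how the model is fitted, and a particular calculator may report it differently.

```python
import math


def fit_and_r2(xs, ys):
    """Least-squares line y = a + b*x and its r squared."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    return a, b, s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data that doubles each year, for illustration only.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 2, 4, 8, 16, 32]

# Linear model: y against x.
a, b, r2_lin = fit_and_r2(xs, ys)
lin_residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Exponential model: ln(y) against x (assumed fitting method).
_, _, r2_exp = fit_and_r2(xs, [math.log(y) for y in ys])

print(f"linear r squared      = {r2_lin:.3f}")
print(f"exponential r squared = {r2_exp:.3f}")
print("linear residual signs:", ["+" if r > 0 else "-" for r in lin_residuals])
```

On data like this, the linear residuals are positive at both ends and negative in the middle, and the exponential r squared is higher. That is the same two-part evidence the worked answer relies on.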
2021 SACE Stage 2 (1 mark)
After removing an outlier, the least squares regression equation for relative humidity H against temperature T is found. Predictions are then made for T at H = 100% and for H at T = 32 degrees C. Both are extrapolations. Explain why one of these predictions is the less reliable of the two.
Worked answer:
Both predictions lie outside the range of the collected data, so both are extrapolations, but reliability depends on how far outside the data range each one is.
The recorded temperatures span roughly 12 to 24 degrees C and humidity roughly 33% to 84%. Predicting the temperature at 100% humidity, or the humidity at 32 degrees C, requires going beyond these limits.
The prediction that lies further from the observed data range is the less reliable, because the linear relationship is only verified within the sampled range and may not hold beyond it. Award the mark for identifying the further extrapolation and explaining that reliability drops the further you move outside the data range.