How is the least-squares regression line calculated, and how is it used to model a linear relationship between two variables?
Find and use the equation of the least-squares regression line to model a linear relationship between two variables
A focused answer to the HSC Maths Standard 2 dot point on the least-squares regression line. It covers the equation of the line, finding the gradient and intercept on a calculator, predicting and reading values off the line, interpreting the gradient and intercept, interpolation versus extrapolation, and worked Australian examples.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
NESA wants you to do four things. Find the equation of the least-squares regression line using your calculator's statistics functions. Write it in gradient-intercept form. Use it to predict \(y\) from \(x\). Interpret the gradient and intercept in the context of the worded problem. This is where the scatterplot and the correlation coefficient pay off: once the data looks linear and the correlation is reasonably strong, the regression line turns the relationship into a formula you can predict with.
The answer
The least-squares regression line
Bivariate data is data where each item has two measurements, an \(x\) and a \(y\). For such data, the least-squares regression line is the straight line that makes the sum of the squared vertical distances from the data points to the line as small as possible. Those vertical distances are called residuals: how far each actual point sits above or below the line. Squaring them before adding gives big misses extra weight, so the line cannot drift away from the points. The result is the single straight line that fits the cloud best overall. It is the standard "line of best fit" for linear data.
The equation has the form:
\[y = mx + c\]
where \(m\) is the gradient and \(c\) is the \(y\)-intercept. The whole point of finding it is that it lets you predict: feed in an \(x\), get the model's best estimate of \(y\).
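To make the "smallest sum of squared residuals" idea concrete, here is a minimal Python sketch. The data values and both function names are invented for illustration, and it uses the standard textbook formulas rather than anything specific to a calculator. It fits the line, then confirms that a slightly different line has a larger sum of squared residuals.

```python
def fit_least_squares(xs, ys):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    intercept = y_bar - gradient * x_bar  # the line passes through the mean point
    return gradient, intercept


def sum_squared_residuals(xs, ys, gradient, intercept):
    """Add up the squared vertical gaps between each point and the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this sketch.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

gradient, intercept = fit_least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, gradient, intercept)
# A line with a slightly steeper gradient misses the points by more overall.
nudged = sum_squared_residuals(xs, ys, gradient + 0.1, intercept)

print(round(gradient, 2), round(intercept, 2))  # 0.6 2.2
print(best < nudged)  # True
```

In the exam the calculator does this work for you; the sketch only shows what "best fit" means.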
Finding the line on a calculator
You will not be asked to compute the gradient and intercept by hand. The procedure on a NESA-approved scientific calculator:
- Enter statistics mode (typically MODE STAT, then a 2-variable option).
- Enter the \((x, y)\) pairs into the statistical lists.
- Read off the gradient and the \(y\)-intercept from the regression-result menu; the letters used to label them differ between calculator models.
- Read off \(r\) (the correlation coefficient) at the same time, to check the fit is worth using.
Different calculator models use different labels, and some even swap which letter stands for the gradient and which for the intercept, so check which one is the gradient on your model. Practise on the exact calculator you will use in the exam, and clear any previous data first.
Build and use the line, stage by stage
Here is the full routine on one dataset: daily ice-cream van sales (in hundreds of dollars) against the daily maximum temperature in degrees Celsius over summer days. Watch it go from a cloud of dots to a usable prediction.
Stage 1, plot the bivariate data. Put the independent variable (temperature) on the horizontal axis and the response (sales) on the vertical axis, then plot each pair. The cloud rises to the right, so a positive linear model looks reasonable, which is the green light to fit a line.
Stage 2, find and draw the least-squares line. Enter the pairs into the calculator and read off the gradient and intercept; here and , so the line is . Draw it straight through the middle of the cloud. It is the best linear fit, not a line forced through any two particular points.
Stage 3, read a prediction off the line. To predict sales on a degree day, you can either substitute into the equation, (so about $650), or read it straight off the graph: go up from on the temperature axis to the line, then across to the sales axis, landing at . Both routes agree, and on a graphical question you show the dashed read-off lines.
Stage 4, predict only inside the data range. The line can be extended forever, but you can only trust it across the temperatures the data actually covers (about to degrees), shaded below. A prediction inside that band is called interpolation and is safe. A prediction outside it, say at degrees, is called extrapolation and may be wrong. Nothing guarantees the trend keeps going: sales might level off or even fall once it is too hot to queue outside.
Predicting \(y\) from \(x\)
Substitute the \(x\) value into the line equation. The result is the model's predicted \(y\) at that \(x\). The actual value will usually differ a little, because the line is the best fit over the whole dataset, not a guarantee for any single case. Quote the prediction with its units and round sensibly for the context.
Interpreting the gradient
The gradient has units of (\(y\) units) per (\(x\) unit). For every \(1\)-unit increase in \(x\), the model says \(y\) changes by the value of the gradient, in \(y\) units, on average.
Always include the word "average" or "on average" and the units. The "on average" matters because the line is a best fit across many points, not an exact rule for any one observation, and markers reward it explicitly.
Interpreting the \(y\)-intercept
The intercept is the predicted \(y\) value when \(x = 0\). In context this is sometimes meaningful (a base salary at zero years of experience, say) and sometimes pure extrapolation (predicted food spending at zero income, which no real household has).
If \(x = 0\) lies well outside the range of the data, say so: the intercept is an extrapolation and may carry no real-world meaning. State it as a number, then add the caveat.
Interpolation versus extrapolation
This is the single most common follow-up, so make it a habit:
- Interpolation: predicting \(y\) for an \(x\) inside the range of the data. Generally reliable, because the line is fitted there.
- Extrapolation: predicting \(y\) for an \(x\) outside the range of the data. Treat with caution; the linear trend may not continue, and the further out you go, the worse it can be.
Whenever you predict, glance at whether the \(x\) value sits inside the data range. If it does not, name the prediction as extrapolation and flag that it may be unreliable. (The separate interpolation and extrapolation dot point goes deeper on this.)
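The habit described above can be sketched as a small Python helper. The function name, the line \(y = 2x + 1\) and the data range are all hypothetical, chosen only to show the check.

```python
def predict(gradient, intercept, x, x_min, x_max):
    """Return the predicted y and whether x lies inside the observed x range."""
    y_hat = gradient * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Hypothetical line y = 2x + 1, fitted to data whose x values ran from 10 to 20.
print(predict(2, 1, 15, x_min=10, x_max=20))  # (31, 'interpolation')
print(predict(2, 1, 30, x_min=10, x_max=20))  # (61, 'extrapolation')
```

The arithmetic is the same in both calls; only the position of \(x\) relative to the data range changes how far the answer can be trusted.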
When to use the line
The least-squares line is appropriate when:
- the scatterplot suggests an approximately linear relationship,
- the correlation coefficient \(r\) is moderately strong or stronger,
- there are no extreme outliers distorting the fit.
If the scatterplot is clearly non-linear, the line will be a poor model even if \(r\) is not zero, and a prediction from it can be badly off.
How exam questions ask about the regression line
- "Write the equation of the least-squares regression line." Put the calculator's gradient and intercept values into the equation explicitly.
- "Use the line to predict \(y\) for a given \(x\)." Substitute and compute, then state the result with units; note if it is extrapolation.
- "Interpret the gradient." Average change in \(y\) per \(1\)-unit increase in \(x\), in context, with units, including "on average".
- "Interpret the \(y\)-intercept." Predicted \(y\) at \(x = 0\), in context, with a note if \(x = 0\) is outside the data (extrapolation).
- "Is this prediction reliable?" Reliable if interpolating; flag it if extrapolating or if the fit (\(r\)) is weak.
Edge cases worth knowing
- A prediction that comes out negative or impossible. A model can predict a negative resale value or negative sales at extreme \(x\). That is a sign you have extrapolated past where the line makes sense; report the value but note the model has broken down.
- A weak fit. If \(|r|\) is small, the line still exists but its predictions are poor. Mention the weak correlation when judging reliability.
- Outliers tilting the line. Because least squares squares the residuals, one far-off point can swing the gradient and intercept noticeably. Identify outliers from the scatterplot first.
- Reversing the variables. The regression line of \(y\) on \(x\) is not the same as \(x\) on \(y\). Fit the line with the response variable as \(y\), matching the question.
Exam-style practice questions
Practice questions written in the style of NESA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
2022 HSC-style (4 marks). For a dataset of pairs, calculator output gives gradient and intercept for the least-squares regression line. Write the equation, predict when , and interpret the gradient.
Worked answer:
Equation: .
At : .
Interpretation: for every increase of in , increases by (on average, according to the model).
Markers reward the equation, the substitution, and an interpretation of the gradient that uses the word "average" or "on average" to acknowledge the model is a best fit, not exact.
2023 HSC-style (3 marks). A linear model of weekly food spending (\(y\), in $) on weekly income (\(x\), in $) for Sydney households gives the line \(y = 0.18x + 95\). Interpret the gradient and the \(y\)-intercept in this context.
Worked answer:
Gradient \(0.18\): for every extra dollar of weekly income, the household spends, on average, an additional $0.18 on food. In other words, about 18 cents of each additional dollar of income goes to food.
\(y\)-intercept \(95\): a household with zero income is predicted to spend $95 per week on food. This is the model's baseline. In practice, $0 income is well outside the dataset, so the intercept may not be reliable (extrapolation).
Markers reward the gradient interpretation in context with units, the intercept interpretation in context, and a brief caveat about extrapolation for the intercept.
Practice questions
Original practice questions graded from foundation to exam level, each with a full worked solution. Try them before revealing the solution.
Foundation (2 marks). A least-squares regression line is given by . (a) Write down the gradient and the \(y\)-intercept. (b) Use the line to predict when .
Worked solution:
Read off the gradient and intercept. The line is in the form , so the gradient is and the -intercept is .
Predict by substituting. Put into the equation:
Check. The gradient is positive, so a larger should give a larger ; the predicted is above the intercept of , as expected. So the gradient is , the intercept is , and when .
Foundation (3 marks). A phone plan's monthly bill is modelled by the least-squares line \(y = 0.15x + 20\), where \(x\) is the number of call minutes used and \(y\) is the bill in dollars. (a) Predict the bill for a month with \(300\) call minutes. (b) Interpret the gradient in this context.
Worked solution:
Predict the bill at \(x = 300\). Substitute into the line:
\[y = 0.15(300) + 20 = 45 + 20 = 65\]
So the predicted bill is $65.
Interpret the gradient. The gradient is \(0.15\). For each extra call minute, the bill rises by an average of $0.15 (that is, \(15\) cents per minute). The word "average" matters because the line is a best fit, not an exact rule for every month.
Check. The intercept $20 is the fixed monthly charge at \(0\) minutes; adding \(300\) minutes at \(15\) cents each is \(45\) dollars, and \(20 + 45 = 65\), confirming the prediction.
Foundation (3 marks). For a group of students, the least-squares line relating exam mark to weekly study hours is . (a) Predict the mark for a student who studies hours per week. (b) Interpret the gradient in context.
Worked solution:
Predict the mark at . Substitute into the line:
So the predicted exam mark is .
Interpret the gradient. The gradient is . For each extra hour of weekly study, the model predicts an exam mark that is, on average, marks higher. Including "on average" acknowledges that the line is a best fit across the whole group, not a guarantee for any one student.
Check. The gradient is positive, so more study should mean a higher mark; the predicted sits above the intercept of , which is consistent. The predicted mark is .
Foundation (3 marks). An electrician's charge is modelled by the least-squares line \(y = 85x + 70\), where \(x\) is the number of hours worked and \(y\) is the total charge in dollars. (a) Predict the charge for a \(3\)-hour job. (b) Interpret the \(y\)-intercept in this context.
Worked solution:
Predict the charge at \(x = 3\). Substitute into the line:
\[y = 85(3) + 70 = 255 + 70 = 325\]
So the predicted charge is $325.
Interpret the intercept. The \(y\)-intercept is \(70\). It is the predicted charge when \(x = 0\) hours, so it represents a fixed call-out fee of $70 that applies before any time is billed. Here \(x = 0\) is meaningful, so the intercept has a sensible real-world reading.
Check. The call-out fee of $70 plus \(3\) hours at $85 each is \(70 + 255 = 325\), matching the prediction. The charge is $325.
Core (4 marks). Five students recorded weekly study hours and exam mark as the pairs , , , , . A calculator gives the least-squares line as . (a) State the gradient and intercept. (b) Predict the mark for a student who studies hours. (c) Interpret the gradient.
Worked solution:
State the gradient and intercept. From the calculator line , the gradient is and the -intercept is .
Predict the mark at . Substitute :
So the predicted mark is about (to the nearest whole mark).
Interpret the gradient. The gradient means that for each extra hour of weekly study, the exam mark increases by an average of marks.
Check. The line should pass through the mean point. The mean of the values is and the mean of the values is ; substituting gives , which matches , so the line is correct.
Core (4 marks). Over five days a kiosk recorded the maximum temperature (in degrees Celsius) and cold-drink sales (in hundreds of dollars): , , , , . A calculator gives the least-squares line . (a) Predict sales when the temperature is degrees. (b) Predict sales at degrees. (c) Interpret the gradient in context.
Worked solution:
Predict sales at . Substitute into the line:
so predicted sales are \(34\) hundreds of dollars, that is $3400.
Predict sales at . Substitute :
so predicted sales are \(46\) hundreds of dollars, that is $4600.
Interpret the gradient. The gradient is \(2\). For each extra degree of maximum temperature, sales rise by an average of \(2\) hundred dollars, that is $200 per degree.
Check. The mean temperature is and the mean sales figure is ; substituting gives , matching , so the line passes through the mean point as it should.
Core (4 marks). A used-car dealer records the age \(x\) (in years) and value \(y\) (in thousands of dollars) of six cars of one model: , , , , , . A calculator gives the least-squares line \(y = -3x + 29\). (a) Predict the value of a car that is \(4.5\) years old. (b) Interpret the gradient. (c) Interpret the \(y\)-intercept, commenting on its reliability.
Worked solution:
Predict the value at \(x = 4.5\). Substitute into the line:
\[y = -3(4.5) + 29 = -13.5 + 29 = 15.5\]
so the predicted value is \(15.5\) thousand dollars, that is $15,500.
Interpret the gradient. The gradient is \(-3\). For each extra year of age, the value drops by an average of \(3\) thousand dollars, that is $3000 per year. The negative sign shows value falls as age rises.
Interpret the intercept. The intercept is \(29\), the predicted value at \(x = 0\), a brand-new car: about $29,000. The youngest car in the data is \(1\) year old, so \(x = 0\) is just outside the data range; the intercept is a mild extrapolation and should be treated as an estimate.
Check. The mean age is and the mean value is ; substituting gives , matching , so the line is correct.
Core (4 marks). For a set of bivariate data the mean of the \(x\) values is and the mean of the \(y\) values is . The least-squares regression line has gradient . (a) Using the fact that the line passes through the point of means \((\bar{x}, \bar{y})\), find the \(y\)-intercept. (b) Write the equation of the line. (c) Predict \(y\) when .
Worked solution:
Use the mean point to find the intercept. The least-squares line always passes through \((\bar{x}, \bar{y})\). With , substitute the mean point:
so .
Write the equation. With gradient and intercept :
Predict at . Substitute :
Check. Substituting the mean back into the line gives , confirming the intercept is correct. So when .
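The method in this question, finding the intercept from the gradient and the point of means, can be sketched in a few lines of Python. The gradient, the two means and the function name below are hypothetical values chosen for illustration; they are not the figures from the question.

```python
def intercept_from_means(gradient, x_bar, y_bar):
    """The line passes through (x_bar, y_bar), so intercept = y_bar - gradient * x_bar."""
    return y_bar - gradient * x_bar


# Hypothetical values, not taken from the question above.
gradient = 3
x_bar, y_bar = 10, 50

intercept = intercept_from_means(gradient, x_bar, y_bar)
prediction = gradient * 12 + intercept  # predicted y at x = 12

print(intercept, prediction)  # 20 56
# Check: substituting the mean x value returns the mean y value.
print(gradient * x_bar + intercept == y_bar)  # True
```

The final check mirrors the one in the worked solution: putting the mean \(x\) back into the line should return the mean \(y\).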
Exam (6 marks). A researcher records the daily maximum temperature (in degrees Celsius) and the number of visitors (in hundreds) at a pool on six days: , , , , , . A calculator gives the least-squares line . (a) Predict the number of visitors when the temperature is degrees. (b) Predict the number of visitors at degrees. (c) State which of your two predictions is more reliable, and explain why.
Worked solution:
Predict at . Substitute into the line:
so about hundred visitors, that is visitors.
Predict at . Substitute :
so about hundred visitors, that is visitors.
Compare reliability. The temperatures in the data run from to degrees. The prediction at lies inside this range, so it is interpolation and is reliable. The prediction at lies outside the range, so it is extrapolation; the linear trend may not continue (a temperature of degrees is also physically unrealistic), so that prediction is unreliable.
Check. The mean temperature is and the mean visitor figure is ; substituting gives , so the line passes through the mean point and the prediction is sound.
Exam (5 marks). A least-squares regression line relating a worker's productivity score to years of experience is . (a) Predict the productivity score for a worker with years of experience. (b) Use the line to estimate the experience needed for a productivity score of , giving your answer to one decimal place. (c) The data covered to years of experience. Comment on whether part (a) is interpolation or extrapolation.
Worked solution:
Predict the score at . Substitute into the line:
so the predicted productivity score is about .
Solve for the experience giving . Set and solve for :
So about years of experience is needed (to one decimal place).
Classify part (a). The data covered to years. Since lies inside this range, part (a) is interpolation and is reliable.
Check. Substituting back gives , confirming the estimate. So the score at years is about , and needs about years.
Exam (6 marks). A gym records each member's number of weekly sessions and a fitness score for seven members: , , , , , , . A calculator gives the least-squares line . (a) State the gradient and intercept. (b) Interpret the gradient in context. (c) Predict the fitness score for a member doing sessions a week. (d) Predict the score for sessions a week and explain why this prediction should be treated with caution.
Worked solution:
State the gradient and intercept. From , the gradient is and the \(y\)-intercept is .
Interpret the gradient. For each extra weekly session, the fitness score rises by an average of points.
Predict at . Substitute :
so the predicted fitness score is .
Predict at and caution. Substitute :
The data covered to sessions, so is well outside the range. This is extrapolation: the linear trend may not continue (fitness gains often level off), so the predicted is unreliable.
Check. The mean number of sessions is and the mean score is ; substituting gives , so the line passes through the mean point and the in-range prediction is sound.
Exam (7 marks). An analyst records weekly income \(x\) (in hundreds of dollars) and weekly savings \(y\) (in dollars) for six people: , , , , , . A calculator gives the least-squares line \(y = 6x + 2\). (a) Interpret the gradient, taking care with the units. (b) Predict the weekly savings of a person with an income of $1100 per week. (c) Predict the savings for an income of $2000 per week, and state whether this is interpolation or extrapolation. (d) Interpret the \(y\)-intercept and comment on its reliability.
Worked solution:
Interpret the gradient. Here \(x\) is income in hundreds of dollars, so \(1\) unit of \(x\) is $100 of weekly income. The gradient \(6\) means that for each extra $100 of weekly income, weekly savings rise by an average of $6.
Predict at an income of $1100. An income of $1100 is \(x = 11\) (hundreds). Substitute \(x = 11\):
\[y = 6(11) + 2 = 68\]
so the predicted weekly savings are $68.
Predict at an income of $2000. An income of $2000 is \(x = 20\). Substitute \(x = 20\):
\[y = 6(20) + 2 = 122\]
so about $122. The data covered incomes from \(x = 6\) to \(x = 16\) (that is $600 to $1600), so \(x = 20\) is outside the range: this is extrapolation.
Interpret the intercept. The intercept is \(2\), the predicted savings at \(x = 0\) (zero income): about $2. No one in the data has zero income, so \(x = 0\) is well outside the range; the intercept is an extrapolation and should not be read literally.
Check. The mean income is and the mean savings figure is ; substituting gives , so the line passes through the mean point and the $1100 prediction is sound.
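The unit conversion is where this question usually goes wrong, so here is a short Python sketch of it. It takes the gradient as 6 and the intercept as 2, the values this worked solution describes ($6 of savings per extra $100 of income, about $2 at zero income); the function name and constants are illustrative only.

```python
GRADIENT = 6           # dollars of savings per extra $100 of weekly income
INTERCEPT = 2          # predicted savings in dollars at zero income
X_MIN, X_MAX = 6, 16   # data range in hundreds of dollars ($600 to $1600)


def predict_savings(income_dollars):
    """Convert income to hundreds of dollars, then substitute into the line."""
    x = income_dollars / 100
    savings = GRADIENT * x + INTERCEPT
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return savings, kind


print(predict_savings(1100))  # (68.0, 'interpolation')
print(predict_savings(2000))  # (122.0, 'extrapolation')
```

Substituting 1100 directly instead of 11 would give a prediction of over $6000 in weekly savings, which is a quick sanity check that the units were missed.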
Exam (6 marks). For summer days, a researcher records the daily maximum temperature (in degrees Celsius) and the region's peak electricity demand (in hundreds of megawatts). No raw points are given, only these summary statistics: , , , , and the correlation coefficient . The temperatures in the data range from to degrees. (a) Using \(\text{gradient} = r \times \frac{s_y}{s_x}\) and the fact that the line passes through the point of means, find the equation of the least-squares regression line. (b) Interpret the gradient and the \(y\)-intercept in context. (c) Predict the peak demand on a degree day and state, with a reason, whether the prediction is reliable.
Worked solution:
Find the gradient from the correlation and spreads. The gradient of the least-squares line is the correlation coefficient scaled by the ratio of the standard deviations:
\[\text{gradient} = r \times \frac{s_y}{s_x}\]
Find the intercept using the point of means. The least-squares line always passes through . With , substitute the mean point and solve for :
So the equation is .
Interpret the gradient. The gradient is . For each extra degree of maximum temperature, peak demand rises by an average of hundred megawatts, that is MW per degree. The word "average" matters because the line is a best fit, not an exact rule for any single day.
Interpret the intercept. The intercept is , the predicted demand at \(0\) degrees: hundred megawatts, that is MW. Since the data only covers to degrees, \(x = 0\) is far outside that range, so this baseline is an extrapolation and should not be read literally.
Predict at and judge reliability. Substitute into the line:
so about hundred megawatts, that is roughly MW. Because lies inside the data range to , this is interpolation, and with a strong correlation of the prediction is reliable.
Answer: the line is ; demand rises by an average of MW per degree and the intercept of MW at degrees is an unreliable extrapolation; the predicted demand at degrees is about MW, and this interpolation is reliable.
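The two-step method used here (gradient from \(r\) and the standard deviations, then intercept from the point of means) can be sketched in Python as follows. All five summary statistics below are hypothetical stand-ins, not the figures from the question, and the function name is illustrative.

```python
def line_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Build the least-squares line of y on x from summary statistics."""
    gradient = r * s_y / s_x                # correlation scaled by the ratio of spreads
    intercept = y_bar - gradient * x_bar    # line passes through the point of means
    return gradient, intercept


# Hypothetical summary statistics, invented for this sketch.
gradient, intercept = line_from_summary(r=0.8, s_x=5, s_y=10, x_bar=30, y_bar=50)

print(round(gradient, 2), round(intercept, 2))  # 1.6 2.0
```

Note the order of the ratio: the standard deviation of \(y\) goes on top, because the gradient is measured in \(y\) units per \(x\) unit.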
Exam (6 marks). A surf club studies junior members. For each it records , the number of training sessions attended per week, and , a rescue-skills score out of . The data give , , , , and , with attendance ranging from to sessions. (a) Comment on the strength and direction of the linear relationship. (b) Find the equation of the least-squares regression line using . (c) Predict the score of a member who trains times a week. (d) Predict the score for sessions a week and explain, referring to both the data range and the value of \(r\), why this prediction is the less trustworthy of the two.
Worked solution:
Comment on the relationship. Since \(r\) is positive and close to \(1\), there is a strong positive linear relationship: members who train more often tend to score higher. As a guide, , so about of the variation in scores is explained by the linear model.
Find the gradient and intercept. The gradient is
The line passes through , so
giving the equation .
Predict at . Substitute :
so the predicted score is out of .
Predict at and judge it. Substitute :
The attendance data runs only from to sessions, so is well outside the range: this is extrapolation. The linear trend may not continue (scores cannot exceed and improvement tends to level off), and even where the model does apply the correlation is only , not perfect, so predictions carry error. The prediction is therefore the least trustworthy, whereas is interpolation inside the data and is reliable.
Answer: the relationship is strong and positive (); the line is ; the predicted score at sessions is , while the predicted at sessions is an unreliable extrapolation beyond the data range.
Exam (5 marks). Six Australian households were surveyed. For each, is the number of residents and is the average daily electricity use in kilowatt-hours. The number of residents took the values , , , , across five of the homes (the sixth is omitted here) and the corresponding electricity figures were , , , , . These five pairs have , and . (a) Find the mean number of residents and the mean daily usage from the raw values. (b) Hence find the equation of the least-squares regression line, using and the point of means. (c) Predict the daily usage of a person household and of a person household, classifying each prediction as interpolation or extrapolation.
Worked solution:
Find the two means from the raw values. Add the residents and divide by , then do the same for the usage figures:
Find the gradient and intercept. The gradient is
The line passes through , so
giving the equation .
Predict at . Substitute :
so about kWh per day. Since lies inside the data range to residents, this is interpolation.
Predict at . Substitute :
so about kWh per day. Since is outside the range to , this is extrapolation and should be treated with caution.
Answer: the means are residents and kWh; the line is ; the predicted usage is about kWh for residents (interpolation, reliable) and about kWh for residents (extrapolation, unreliable).
