ExamExplained
NSW · Maths Standard 2

How is the least-squares regression line calculated, and how is it used to model a linear relationship between two variables?

Find and use the equation of the least-squares regression line to model a linear relationship between two variables

A focused answer to the HSC Maths Standard 2 dot point on the least-squares regression line. It covers the equation \(y = mx + b\), finding the gradient and intercept on a calculator, predicting and reading values off the line, interpreting the gradient and intercept, interpolation versus extrapolation, and worked Australian examples.

Reviewed by: AI editorial process; not yet individually human-reviewed


What this dot point is asking

NESA wants you to do four things. Find the equation of the least-squares regression line using your calculator's statistics functions. Write it in \(y = mx + b\) form. Use it to predict \(y\) from \(x\). Interpret the gradient and intercept in the context of the worded problem. This is where the scatterplot and the correlation coefficient pay off: once the data looks linear and \(r\) is reasonably strong, the regression line turns the relationship into a formula you can predict with.

The answer

[Figure: scatterplot with least-squares regression line. The data points trend upward and are overlaid with the best-fit straight line \(y = mx + b\), which minimises the sum of squared vertical distances from each point to the line. Axes: \(x\) (horizontal) and \(y\) (vertical). The vertical distances to the line are the residuals.]

The least-squares regression line

Bivariate data is data where each item has two measurements, an \(x\) and a \(y\). For such data, the least-squares regression line is the straight line that makes the sum of the squared vertical distances from the data points to the line as small as possible. Those vertical distances are called residuals: how far each actual point sits above or below the line. Squaring them before adding stops positive and negative residuals cancelling out and gives big misses extra weight, so the line cannot drift far from any part of the cloud. The result is the single straight line that fits the cloud best overall. It is the standard "line of best fit" for linear data.
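To make "smallest sum of squared residuals" concrete, here is a minimal Python sketch. It uses the study-hours data and the calculator line \(y = 5.5x + 45.5\) from one of the practice questions further down this page; the rival line \(y = 5x + 47\) and the function name are made up purely for comparison.

```python
# Study-hours data (weekly hours, exam mark) from a practice question on this page.
points = [(1, 52), (2, 55), (3, 62), (4, 68), (5, 73)]


def sum_squared_residuals(m, b, data):
    """Add up the squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in data)


# The least-squares line quoted in the practice question.
least_squares = sum_squared_residuals(5.5, 45.5, points)

# A hypothetical rival line that also looks plausible by eye (an assumption).
rival = sum_squared_residuals(5.0, 47.0, points)

print(least_squares, rival)  # 3.5 6.0
```

The least-squares line scores 3.5 and the rival scores 6.0. No other straight line scores lower than the least-squares line on this data.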

The equation has the form:

\[y = mx + b\]

where \(m\) is the gradient and \(b\) is the \(y\)-intercept. The whole point of finding it is that it lets you predict: feed in an \(x\), get the model's best estimate of \(y\).

Finding the line on a calculator

You will not be asked to compute the gradient and intercept by hand. The procedure on a NESA-approved scientific calculator:

  1. Enter statistics mode (typically MODE STAT, then a 2-variable or \(A + Bx\) option).
  2. Enter the \((x, y)\) pairs into the statistical lists.
  3. Read off \(m\) (sometimes labelled \(a\) or \(B\)) and \(b\) (sometimes labelled \(A\) or \(a\)) from the regression-result menu.
  4. Read off \(r\) (the correlation coefficient) at the same time, to check the fit is worth using.

Different calculator models use different labels, and some even swap the roles of \(A\) and \(B\), so check which one is the gradient on your model. Practise on the exact calculator you will use in the exam, and clear any previous data first.
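You are not expected to do this by hand, but it can help to see roughly what the calculator is working out. The sketch below uses the standard textbook least-squares formulas, which are not necessarily how any particular calculator implements them. The function name is illustrative, and the data is the kiosk dataset from a practice question on this page.

```python
def least_squares_line(points):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    m = s_xy / s_xx
    b = y_bar - m * x_bar  # forces the line through the point of means
    return m, b


# (temperature in degrees Celsius, sales in hundreds of dollars)
kiosk = [(15, 22), (18, 32), (21, 37), (24, 42), (27, 47)]
m, b = least_squares_line(kiosk)
print(m, b)  # 2.0 -6.0
```

This matches the line \(y = 2x - 6\) quoted in that question, which is a handy way to check you have keyed data into your calculator correctly.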

Build and use the line, stage by stage

Here is the full routine on one dataset: daily ice-cream van sales (in hundreds of dollars) against the daily maximum temperature in degrees Celsius over 15 summer days. Watch it go from a cloud of dots to a usable prediction.

Stage 1, plot the bivariate data. Put the independent variable (temperature) on the horizontal axis and the response (sales) on the vertical axis, then plot each pair. The cloud rises to the right, so a positive linear model looks reasonable, which is the green light to fit a line.

[Figure, Step 1: Plot the bivariate data. A scatterplot of ice-cream sales ($100s) against daily maximum temperature (°C) for fifteen days. Plot each (temperature, sales) pair; the cloud rises from lower left to upper right.]

Stage 2, find and draw the least-squares line. Enter the pairs into the calculator and read off the gradient and intercept; here \(m = 0.33\) and \(b = -3.4\), so the line is \(y = 0.33x - 3.4\). Draw it straight through the middle of the cloud. It is the best linear fit, not a line forced through any two particular points.

[Figure, Step 2: Draw the least-squares line. The same scatterplot with the regression line \(y = 0.33x - 3.4\) drawn straight through the middle of the cloud. The calculator gives \(m = 0.33\) and \(b = -3.4\).]

Stage 3, read a prediction off the line. To predict sales on a 30 degree day, you can either substitute into the equation, \(y = 0.33 \times 30 - 3.4 = 6.5\) (so about $650), or read it straight off the graph: go up from 30 on the temperature axis to the line, then across to the sales axis, landing at 6.5. Both routes agree, and on a graphical question you show the dashed read-off lines.

[Figure, Step 3: Read a prediction off the line. From 30 on the temperature axis a dashed line goes up to the regression line and then across to the sales axis at 6.5, giving predicted sales of about $650.]

Stage 4, predict only inside the data range. The line can be extended forever, but you can only trust it across the temperatures the data actually covers (about 18 to 39 degrees), shaded in the figure. A prediction inside that band is called interpolation and is generally safe. A prediction outside it, say at 45 degrees, is called extrapolation and may be wrong. Nothing guarantees the trend keeps going: sales might level off or even fall once it is too hot to queue outside.

[Figure, Step 4: Interpolation versus extrapolation. The regression line with the temperature range covered by the data shaded as the interpolation zone (safe). The regions beyond the data on each side, for example 45°C, are marked as extrapolation, where predictions are unreliable.]

Predicting \(y\) from \(x\)

Substitute the \(x\) value into the line equation. The result is the model's predicted \(y\) at that \(x\). The actual value will usually differ a little, because the line is the best fit over the whole dataset, not a guarantee for any single case. Quote the prediction with its units and round sensibly for the context.
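As a rough illustration, substitution can be written as a one-line function. This sketch uses the ice-cream line \(y = 0.33x - 3.4\) from the staged example above; the function name and default arguments are illustrative choices, not part of the syllabus.

```python
def predict(x, m=0.33, b=-3.4):
    """Predicted y on the line y = m*x + b (ice-cream example by default)."""
    return m * x + b


# Sales are recorded in hundreds of dollars, so convert for the final answer.
sales_hundreds = predict(30)
sales_dollars = round(sales_hundreds * 100)

print(round(sales_hundreds, 1), sales_dollars)  # 6.5 650
```

The rounding step mirrors the advice above: state the answer in sensible units (about $650) instead of quoting every decimal place the arithmetic produces.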

Interpreting the gradient

The gradient \(m\) has units of (\(y\) units) per (\(x\) unit). For every 1-unit increase in \(x\), the model says \(y\) changes by \(m\) units on average.

Always include the word "average" or "on average" and the units. The "on average" matters because the line is a best fit across many points, not an exact rule for any one observation, and markers reward it explicitly.

Interpreting the \(y\)-intercept

The intercept \(b\) is the predicted \(y\) value when \(x = 0\). In context this is sometimes meaningful (a base salary at zero years of experience, say) and sometimes pure extrapolation (predicted food spending at zero income, which no real household has).

If \(x = 0\) lies well outside the range of the data, say so: the intercept is an extrapolation and may carry no real-world meaning. State it as a number, then add the caveat.

Interpolation versus extrapolation

This is the single most common follow-up, so make it a habit:

  • Interpolation: predicting for an \(x\) inside the range of the data. Generally reliable, because the line is fitted there.
  • Extrapolation: predicting for an \(x\) outside the range of the data. Treat with caution; the linear trend may not continue, and the further out you go, the worse it can be.

Whenever you predict, glance at whether the \(x\) sits inside the data range. If it does not, name the prediction as extrapolation and flag that it may be unreliable. (The separate interpolation and extrapolation dot point goes deeper on this.)
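The range check is mechanical enough to sketch in a few lines of Python. The function name is hypothetical, and the temperatures are the \(x\) values from the pool-visitor practice question on this page.

```python
def classify_prediction(x, data_xs):
    """Label a prediction by whether x lies inside the observed x range."""
    low, high = min(data_xs), max(data_xs)
    return "interpolation" if low <= x <= high else "extrapolation"


# Temperatures (degrees Celsius) recorded in the pool-visitor question.
pool_temps = [10, 20, 30, 40, 50, 60]

print(classify_prediction(35, pool_temps))  # interpolation
print(classify_prediction(80, pool_temps))  # extrapolation
```

The check only tells you which label applies. Judging how far to trust an extrapolated value still depends on the context, as in the examples above.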

When to use the line

The least-squares line is appropriate when:

  • the scatterplot suggests an approximately linear relationship,
  • the correlation coefficient \(|r|\) is moderately strong or stronger,
  • there are no extreme outliers distorting the fit.

If the scatterplot is clearly non-linear, the line will be a poor model even if \(r\) is not zero, and a prediction from it can be badly off.
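In the exam \(r\) comes straight from the calculator, but for reference here is a sketch of the standard Pearson formula behind it. The function name is illustrative, and the data is the gym dataset from a practice question on this page.

```python
from math import sqrt


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy)


# (weekly sessions, fitness score)
gym = [(2, 14), (4, 19), (6, 21), (8, 26), (10, 30), (12, 33), (14, 39)]
r = correlation(gym)
print(round(r, 3))  # 0.996
```

A value this close to 1 indicates a very strong positive linear relationship, so a regression line is a reasonable model for this dataset. The number alone does not replace looking at the scatterplot.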

How exam questions ask about the regression line

  • "Write the equation of the least-squares regression line." Put the calculator's \(m\) and \(b\) into \(y = mx + b\) explicitly.
  • "Use the line to predict \(y\) when \(x = \dots\)" Substitute and compute, then state the result with units; note if it is extrapolation.
  • "Interpret the gradient." Average change in \(y\) per 1-unit increase in \(x\), in context, with units, including "on average".
  • "Interpret the \(y\)-intercept." Predicted \(y\) at \(x = 0\), in context, with a note if \(x = 0\) is outside the data (extrapolation).
  • "Is this prediction reliable?" Reliable if interpolating; flag it if extrapolating or if the fit (\(r\)) is weak.

Edge cases worth knowing

  • A prediction that comes out negative or impossible. A model can predict a negative resale value or negative sales at extreme \(x\). That is a sign you have extrapolated past where the line makes sense; report the value but note the model has broken down.
  • A weak fit. If \(|r|\) is small, the line still exists but its predictions are poor. Mention the weak correlation when judging reliability.
  • Outliers tilting the line. Because least squares squares the residuals, one far-off point can swing the gradient and intercept noticeably. Identify outliers from the scatterplot first.
  • Reversing the variables. The regression line of \(y\) on \(x\) is not the same as \(x\) on \(y\). Fit the line with the response variable as \(y\), matching the question.
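The last point is easy to check numerically. The sketch below fits the used-car data from a practice question on this page both ways round; the helper name is illustrative, and the gradient formula is the standard textbook one.

```python
def gradient(points):
    """Least-squares gradient for predicting the second coordinate from the first."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    return s_xy / s_xx


# (age in years, value in thousands of dollars)
cars = [(1, 24), (2, 24), (3, 21), (4, 18), (5, 15), (6, 9)]

m_y_on_x = gradient(cars)                          # value predicted from age
m_x_on_y = gradient([(y, x) for x, y in cars])     # age predicted from value

print(m_y_on_x)                                    # -3.0
print(round(1 / m_y_on_x, 3), round(m_x_on_y, 3))  # -0.333 -0.31
```

If the two lines were the same, rearranging \(y = -3x + 29\) to make \(x\) the subject would give the second gradient exactly. It gives about \(-0.333\), while the fitted line of \(x\) on \(y\) has gradient about \(-0.31\), so the roles of the variables matter.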

Exam-style practice questions

Practice questions written in the style of NESA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

2022 HSC-style · 4 marks

For a dataset of 50 pairs, calculator output gives gradient \(m = 2.5\) and intercept \(b = -8\) for the least-squares regression line. Write the equation, predict \(y\) when \(x = 12\), and interpret the gradient.

Worked answer

Equation: \(y = 2.5x - 8\).

At \(x = 12\): \(y = 2.5 \times 12 - 8 = 30 - 8 = 22\).

Interpretation: for every increase of 1 in \(x\), \(y\) increases by 2.5 (on average, according to the model).

Markers reward the equation, the substitution, and an interpretation of the gradient that uses the word "average" or "on average" to acknowledge the model is a best fit, not exact.

2023 HSC-style · 3 marks

A linear model of weekly food spending (\(y\), in $) on weekly income (\(x\), in $) for 80 Sydney households gives the line \(y = 0.18x + 95\). Interpret the gradient and the \(y\)-intercept in this context.

Worked answer

Gradient 0.18: for every extra dollar of weekly income, the household spends, on average, an additional $0.18 on food. In other words, about 18% of each additional dollar of income goes to food.

\(y\)-intercept 95: a household with zero income is predicted to spend $95 per week on food. This is the model's baseline. In practice, $0 income is well outside the dataset, so the intercept may not be reliable (extrapolation).

Markers reward the gradient interpretation in context with units, the intercept interpretation in context, and a brief caveat about extrapolation for the intercept.

Practice questions

Original practice questions graded from foundation to exam level, each with a full worked solution. Try them before revealing the solution.

foundation · 2 marks

A least-squares regression line is given by \(y = 4x + 12\). (a) Write down the gradient and the \(y\)-intercept. (b) Use the line to predict \(y\) when \(x = 8\).

Worked solution

Read off the gradient and intercept. The line is in the form \(y = mx + b\), so the gradient is \(m = 4\) and the \(y\)-intercept is \(b = 12\).

Predict by substituting. Put \(x = 8\) into the equation:

\[y = 4 \times 8 + 12 = 32 + 12 = 44.\]

Check. The gradient is positive, so a larger \(x\) should give a larger \(y\); the predicted 44 is above the intercept of 12, as expected. So the gradient is 4, the intercept is 12, and \(y = 44\) when \(x = 8\).

foundation · 3 marks

A phone plan's monthly bill is modelled by the least-squares line \(y = 0.15x + 20\), where \(x\) is the number of call minutes used and \(y\) is the bill in dollars. (a) Predict the bill for a month with 300 call minutes. (b) Interpret the gradient in this context.

Worked solution

Predict the bill at \(x = 300\). Substitute \(x = 300\) into the line:

\[y = 0.15 \times 300 + 20 = 45 + 20 = 65.\]

So the predicted bill is $65.

Interpret the gradient. The gradient is 0.15. For each extra call minute, the bill rises by an average of $0.15 (that is, 15 cents per minute). The word "average" matters because the line is a best fit, not an exact rule for every month.

Check. The intercept $20 is the fixed monthly charge at 0 minutes; adding 300 minutes at 15 cents each is \(300 \times 0.15 = 45\) dollars, and \(20 + 45 = 65\), confirming the prediction.

foundation · 3 marks

For a group of students, the least-squares line relating exam mark \(y\) to weekly study hours \(x\) is \(y = 6.2x + 58\). (a) Predict the mark for a student who studies 5 hours per week. (b) Interpret the gradient in context.

Worked solution

Predict the mark at \(x = 5\). Substitute \(x = 5\) into the line:

\[y = 6.2 \times 5 + 58 = 31 + 58 = 89.\]

So the predicted exam mark is 89.

Interpret the gradient. The gradient is 6.2. For each extra hour of weekly study, the model predicts an exam mark that is, on average, 6.2 marks higher. Including "on average" acknowledges that the line is a best fit across the whole group, not a guarantee for any one student.

Check. The gradient is positive, so more study should mean a higher mark; the predicted 89 sits above the intercept of 58, which is consistent. The predicted mark is 89.

foundation · 3 marks

An electrician's charge is modelled by the least-squares line \(y = 85x + 70\), where \(x\) is the number of hours worked and \(y\) is the total charge in dollars. (a) Predict the charge for a 3 hour job. (b) Interpret the \(y\)-intercept in this context.

Worked solution

Predict the charge at \(x = 3\). Substitute \(x = 3\) into the line:

\[y = 85 \times 3 + 70 = 255 + 70 = 325.\]

So the predicted charge is $325.

Interpret the intercept. The \(y\)-intercept is 70. It is the predicted charge when \(x = 0\) hours, so it represents a fixed call-out fee of $70 that applies before any time is billed. Here \(x = 0\) is meaningful, so the intercept has a sensible real-world reading.

Check. The call-out fee of $70 plus 3 hours at $85 each is \(70 + 3 \times 85 = 70 + 255 = 325\), matching the prediction. The charge is $325.

core · 4 marks

Five students recorded weekly study hours \(x\) and exam mark \(y\) as the pairs \((1, 52)\), \((2, 55)\), \((3, 62)\), \((4, 68)\), \((5, 73)\). A calculator gives the least-squares line as \(y = 5.5x + 45.5\). (a) State the gradient and intercept. (b) Predict the mark for a student who studies 3.5 hours. (c) Interpret the gradient.

Worked solution

State the gradient and intercept. From the calculator line \(y = 5.5x + 45.5\), the gradient is 5.5 and the \(y\)-intercept is 45.5.

Predict the mark at \(x = 3.5\). Substitute \(x = 3.5\):

\[y = 5.5 \times 3.5 + 45.5 = 19.25 + 45.5 = 64.75.\]

So the predicted mark is about 65 (to the nearest whole mark).

Interpret the gradient. The gradient 5.5 means that for each extra hour of weekly study, the exam mark increases by an average of 5.5 marks.

Check. The line should pass through the mean point. The mean of the \(x\) values is 3 and the mean of the \(y\) values is 62; substituting \(x = 3\) gives \(5.5 \times 3 + 45.5 = 16.5 + 45.5 = 62\), which matches \(\bar{y}\), so the line is correct.

core · 4 marks

Over five days a kiosk recorded the maximum temperature \(x\) (in degrees Celsius) and cold-drink sales \(y\) (in hundreds of dollars): \((15, 22)\), \((18, 32)\), \((21, 37)\), \((24, 42)\), \((27, 47)\). A calculator gives the least-squares line \(y = 2x - 6\). (a) Predict sales when the temperature is 20 degrees. (b) Predict sales at 26 degrees. (c) Interpret the gradient in context.

Worked solution

Predict sales at \(x = 20\). Substitute \(x = 20\) into the line:

\[y = 2 \times 20 - 6 = 40 - 6 = 34,\]

so predicted sales are 34 hundreds of dollars, that is $3400.

Predict sales at \(x = 26\). Substitute \(x = 26\):

\[y = 2 \times 26 - 6 = 52 - 6 = 46,\]

so predicted sales are 46 hundreds of dollars, that is $4600.

Interpret the gradient. The gradient is 2. For each extra degree of maximum temperature, sales rise by an average of 2 hundred dollars, that is $200 per degree.

Check. The mean temperature is 21 and the mean sales figure is 36; substituting \(x = 21\) gives \(2 \times 21 - 6 = 36\), matching \(\bar{y}\), so the line passes through the mean point as it should.

core · 4 marks

A used-car dealer records the age \(x\) (in years) and value \(y\) (in thousands of dollars) of six cars of one model: \((1, 24)\), \((2, 24)\), \((3, 21)\), \((4, 18)\), \((5, 15)\), \((6, 9)\). A calculator gives the least-squares line \(y = -3x + 29\). (a) Predict the value of a car that is 4.5 years old. (b) Interpret the gradient. (c) Interpret the \(y\)-intercept, commenting on its reliability.

Worked solution

Predict the value at \(x = 4.5\). Substitute \(x = 4.5\) into the line:

\[y = -3 \times 4.5 + 29 = -13.5 + 29 = 15.5,\]

so the predicted value is 15.5 thousand dollars, that is $15,500.

Interpret the gradient. The gradient is \(-3\). For each extra year of age, the value drops by an average of 3 thousand dollars, that is $3000 per year. The negative sign shows value falls as age rises.

Interpret the intercept. The intercept is 29, the predicted value at \(x = 0\), a brand-new car: about $29,000. The youngest car in the data is 1 year old, so \(x = 0\) is just outside the data range; the intercept is a mild extrapolation and should be treated as an estimate.

Check. The mean age is 3.5 and the mean value is 18.5; substituting \(x = 3.5\) gives \(-3 \times 3.5 + 29 = -10.5 + 29 = 18.5\), matching \(\bar{y}\), so the line is correct.

core · 4 marks

For a set of bivariate data the mean of the \(x\) values is 12 and the mean of the \(y\) values is 48. The least-squares regression line has gradient 2.5. (a) Using the fact that the line passes through the point of means \((12, 48)\), find the \(y\)-intercept. (b) Write the equation of the line. (c) Predict \(y\) when \(x = 20\).

Worked solution

Use the mean point to find the intercept. The least-squares line always passes through \((\bar{x}, \bar{y}) = (12, 48)\). With \(y = 2.5x + b\), substitute the mean point:

\[48 = 2.5 \times 12 + b = 30 + b,\]

so \(b = 48 - 30 = 18\).

Write the equation. With gradient 2.5 and intercept 18:

\[y = 2.5x + 18.\]

Predict at \(x = 20\). Substitute \(x = 20\):

\[y = 2.5 \times 20 + 18 = 50 + 18 = 68.\]

Check. Substituting the mean \(x = 12\) back into the line gives \(2.5 \times 12 + 18 = 30 + 18 = 48 = \bar{y}\), confirming the intercept is correct. So \(y = 68\) when \(x = 20\).
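The point-of-means step can be sketched in Python using the numbers from this question. The function name is illustrative; the only fact relied on is that the least-squares line passes through \((\bar{x}, \bar{y})\).

```python
def intercept_from_means(m, x_bar, y_bar):
    """Intercept of the line with gradient m that passes through (x_bar, y_bar)."""
    return y_bar - m * x_bar


b = intercept_from_means(2.5, 12, 48)
print(b)  # 18.0

# Use the completed line y = 2.5x + 18 to predict at x = 20.
prediction = 2.5 * 20 + b
print(prediction)  # 68.0
```

Rearranging \(\bar{y} = m\bar{x} + b\) into \(b = \bar{y} - m\bar{x}\) is the same algebra as the worked solution, just written once as a reusable rule.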

exam · 6 marks

A researcher records the daily maximum temperature \(x\) (in degrees Celsius) and the number of visitors \(y\) (in hundreds) at a pool on six days: \((10, 38)\), \((20, 51)\), \((30, 61)\), \((40, 72)\), \((50, 79)\), \((60, 89)\). A calculator gives the least-squares line \(y = x + 30\). (a) Predict the number of visitors when the temperature is 35 degrees. (b) Predict the number of visitors at 80 degrees. (c) State which of your two predictions is more reliable, and explain why.

Worked solution

Predict at \(x = 35\). Substitute \(x = 35\) into the line:

\[y = 35 + 30 = 65,\]

so about 65 hundred visitors, that is 6500 visitors.

Predict at \(x = 80\). Substitute \(x = 80\):

\[y = 80 + 30 = 110,\]

so about 110 hundred visitors, that is 11 000 visitors.

Compare reliability. The temperatures in the data run from 10 to 60 degrees. The prediction at \(x = 35\) lies inside this range, so it is interpolation and is reliable. The prediction at \(x = 80\) lies outside the range, so it is extrapolation; the linear trend may not continue (a temperature of 80 degrees is also physically unrealistic), so that prediction is unreliable.

Check. The mean temperature is 35 and the mean visitor figure is 65; substituting \(x = 35\) gives \(35 + 30 = 65 = \bar{y}\), so the line passes through the mean point and the \(x = 35\) prediction is sound.

exam · 5 marks

A least-squares regression line relating a worker's productivity score \(y\) to years of experience \(x\) is \(y = 1.9x + 12.4\). (a) Predict the productivity score for a worker with 18 years of experience. (b) Use the line to estimate the experience needed for a productivity score of 50, giving your answer to one decimal place. (c) The data covered 2 to 20 years of experience. Comment on whether part (a) is interpolation or extrapolation.

Worked solution

Predict the score at \(x = 18\). Substitute \(x = 18\) into the line:

\[y = 1.9 \times 18 + 12.4 = 34.2 + 12.4 = 46.6,\]

so the predicted productivity score is about 46.6.

Solve for the experience giving \(y = 50\). Set \(y = 50\) and solve for \(x\):

\[50 = 1.9x + 12.4 \implies 1.9x = 37.6 \implies x = \frac{37.6}{1.9} \approx 19.8.\]

So about 19.8 years of experience is needed (to one decimal place).

Classify part (a). The data covered 2 to 20 years. Since \(x = 18\) lies inside this range, part (a) is interpolation and is reliable.

Check. Substituting \(x = 19.8\) back gives \(1.9 \times 19.8 + 12.4 = 37.62 + 12.4 = 50.02 \approx 50\), confirming the estimate. So the score at 18 years is about 46.6, and 50 needs about 19.8 years.
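Working backwards from a target \(y\) to the \(x\) that produces it is just the line equation rearranged. A minimal sketch using the productivity line from this question, with an illustrative function name:

```python
def x_for_target(y_target, m, b):
    """Solve y_target = m*x + b for x (needs a non-zero gradient)."""
    if m == 0:
        raise ValueError("a horizontal line cannot be solved for x")
    return (y_target - b) / m


years = x_for_target(50, 1.9, 12.4)
print(round(years, 1))  # 19.8
```

Rounding happens only at the end, which is the safer habit in the exam too: carry the calculator value through and round the final answer.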

exam · 6 marks

A gym records each member's number of weekly sessions \(x\) and a fitness score \(y\) for seven members: \((2, 14)\), \((4, 19)\), \((6, 21)\), \((8, 26)\), \((10, 30)\), \((12, 33)\), \((14, 39)\). A calculator gives the least-squares line \(y = 2x + 10\). (a) State the gradient and intercept. (b) Interpret the gradient in context. (c) Predict the fitness score for a member doing 9 sessions a week. (d) Predict the score for 25 sessions a week and explain why this prediction should be treated with caution.

Worked solution

State the gradient and intercept. From \(y = 2x + 10\), the gradient is 2 and the \(y\)-intercept is 10.

Interpret the gradient. For each extra weekly session, the fitness score rises by an average of 2 points.

Predict at \(x = 9\). Substitute \(x = 9\):

\[y = 2 \times 9 + 10 = 18 + 10 = 28,\]

so the predicted fitness score is 28.

Predict at \(x = 25\) and caution. Substitute \(x = 25\):

\[y = 2 \times 25 + 10 = 50 + 10 = 60.\]

The data covered 2 to 14 sessions, so \(x = 25\) is well outside the range. This is extrapolation: the linear trend may not continue (fitness gains often level off), so the predicted 60 is unreliable.

Check. The mean number of sessions is 8 and the mean score is 26; substituting \(x = 8\) gives \(2 \times 8 + 10 = 26 = \bar{y}\), so the line passes through the mean point and the in-range prediction is sound.

exam · 7 marks

An analyst records weekly income \(x\) (in hundreds of dollars) and weekly savings \(y\) (in dollars) for six people: \((6, 35)\), \((8, 50)\), \((10, 65)\), \((12, 77)\), \((14, 86)\), \((16, 95)\). A calculator gives the least-squares line \(y = 6x + 2\). (a) Interpret the gradient, taking care with the units. (b) Predict the weekly savings of a person with an income of $1100 per week. (c) Predict the savings for an income of $2000 per week, and state whether this is interpolation or extrapolation. (d) Interpret the \(y\)-intercept and comment on its reliability.

Worked solution

Interpret the gradient. Here \(x\) is income in hundreds of dollars, so 1 unit of \(x\) is $100 of weekly income. The gradient 6 means that for each extra $100 of weekly income, weekly savings rise by an average of $6.

Predict at an income of $1100. An income of $1100 is \(x = 11\) (hundreds). Substitute \(x = 11\):

\[y = 6 \times 11 + 2 = 66 + 2 = 68,\]

so the predicted weekly savings are $68.

Predict at an income of $2000. An income of $2000 is \(x = 20\). Substitute \(x = 20\):

\[y = 6 \times 20 + 2 = 120 + 2 = 122,\]

so about $122. The data covered incomes from \(x = 6\) to \(x = 16\) (that is $600 to $1600), so \(x = 20\) is outside the range: this is extrapolation.

Interpret the intercept. The intercept is 2, the predicted savings at \(x = 0\) (zero income): about $2. No one in the data has zero income, so \(x = 0\) is well outside the range; the intercept is an extrapolation and should not be read literally.

Check. The mean income is \(x = 11\) and the mean savings figure is 68; substituting \(x = 11\) gives \(6 \times 11 + 2 = 68 = \bar{y}\), so the line passes through the mean point and the $1100 prediction is sound.

exam · 6 marks

For 14 summer days, a researcher records the daily maximum temperature \(x\) (in degrees Celsius) and the region's peak electricity demand \(y\) (in hundreds of megawatts). No raw points are given, only these summary statistics: \(\bar{x} = 20\), \(\bar{y} = 60\), \(s_x = 5\), \(s_y = 15\), and the correlation coefficient \(r = 0.8\). The temperatures in the data range from 12 to 30 degrees. (a) Using \(m = r \times \dfrac{s_y}{s_x}\) and the fact that the line passes through the point of means, find the equation of the least-squares regression line. (b) Interpret the gradient and the \(y\)-intercept in context. (c) Predict the peak demand on a 28 degree day and state, with a reason, whether the prediction is reliable.

Worked solution

Find the gradient from the correlation and spreads. The gradient of the least-squares line is the correlation coefficient scaled by the ratio of the standard deviations:

\[m = r \times \frac{s_y}{s_x} = 0.8 \times \frac{15}{5} = 0.8 \times 3 = 2.4.\]

Find the intercept using the point of means. The least-squares line always passes through \((\bar{x}, \bar{y}) = (20, 60)\). With \(y = 2.4x + b\), substitute the mean point and solve for \(b\):

\[b = \bar{y} - m\bar{x} = 60 - 2.4 \times 20 = 60 - 48 = 12.\]

So the equation is \(y = 2.4x + 12\).

Interpret the gradient. The gradient is 2.4. For each extra degree of maximum temperature, peak demand rises by an average of 2.4 hundred megawatts, that is 240 MW per degree. The word "average" matters because the line is a best fit, not an exact rule for any single day.

Interpret the intercept. The intercept is 12, the predicted demand at \(x = 0\) degrees: 12 hundred megawatts, that is 1200 MW. Since the data only covers 12 to 30 degrees, \(x = 0\) is far outside that range, so this baseline is an extrapolation and should not be read literally.

Predict at \(x = 28\) and judge reliability. Substitute \(x = 28\) into the line:

\[y = 2.4 \times 28 + 12 = 67.2 + 12 = 79.2,\]

so about 79.2 hundred megawatts, that is roughly 7920 MW. Because 28 lies inside the data range 12 to 30, this is interpolation, and with a strong correlation of \(r = 0.8\) the prediction is reliable.

Answer: the line is \(y = 2.4x + 12\); demand rises by an average of 240 MW per degree and the intercept of 1200 MW at 0 degrees is an unreliable extrapolation; the predicted demand at 28 degrees is about 7920 MW, and this interpolation is reliable.
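The summary-statistics route can be sketched in Python with the numbers from this question. The function name is illustrative; the two formulas are the ones used in the worked solution, \(m = r \times s_y / s_x\) and \(b = \bar{y} - m\bar{x}\).

```python
def line_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Gradient and intercept of the least-squares line from summary statistics."""
    m = r * s_y / s_x
    b = y_bar - m * x_bar  # line passes through the point of means
    return m, b


m, b = line_from_summary(r=0.8, s_x=5, s_y=15, x_bar=20, y_bar=60)
demand = m * 28 + b  # predicted peak demand (hundreds of MW) at 28 degrees

# Round for display, since decimal fractions are not stored exactly.
print(round(m, 2), round(b, 2), round(demand, 2))  # 2.4 12.0 79.2
```

The same function covers any question that gives \(r\), the two standard deviations and the two means instead of raw data.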

exam · 6 marks

A surf club studies 15 junior members. For each it records \(x\), the number of training sessions attended per week, and \(y\), a rescue-skills score out of 100. The data give \(\bar{x} = 8\), \(\bar{y} = 55\), \(s_x = 3\), \(s_y = 12.5\), and \(r = 0.84\), with attendance ranging from 2 to 14 sessions. (a) Comment on the strength and direction of the linear relationship. (b) Find the equation of the least-squares regression line using \(m = r \times \dfrac{s_y}{s_x}\). (c) Predict the score of a member who trains 10 times a week. (d) Predict the score for 20 sessions a week and explain, referring to both the data range and the value of \(r\), why this prediction is the less trustworthy of the two.

Worked solution

Comment on the relationship. Since \(r = 0.84\) is positive and close to 1, there is a strong positive linear relationship: members who train more often tend to score higher. As a guide, \(r^2 = 0.84^2 \approx 0.71\), so about 71% of the variation in scores is explained by the linear model.

Find the gradient and intercept. The gradient is

\[m = r \times \frac{s_y}{s_x} = 0.84 \times \frac{12.5}{3} = 0.84 \times 4.1\overline{6} = 3.5.\]

The line passes through \((\bar{x}, \bar{y}) = (8, 55)\), so

\[b = \bar{y} - m\bar{x} = 55 - 3.5 \times 8 = 55 - 28 = 27,\]

giving the equation \(y = 3.5x + 27\).

Predict at \(x = 10\). Substitute \(x = 10\):

\[y = 3.5 \times 10 + 27 = 35 + 27 = 62,\]

so the predicted score is 62 out of 100.

Predict at \(x = 20\) and judge it. Substitute \(x = 20\):

\[y = 3.5 \times 20 + 27 = 70 + 27 = 97.\]

The attendance data runs only from 2 to 14 sessions, so \(x = 20\) is well outside the range: this is extrapolation. The linear trend may not continue (scores cannot exceed 100 and improvement tends to level off), and even where the model does apply the correlation is only 0.84, not perfect, so predictions carry error. The \(x = 20\) prediction is therefore the less trustworthy of the two, whereas \(x = 10\) is interpolation inside the data and is reliable.

Answer: the relationship is strong and positive (\(r = 0.84\)); the line is \(y = 3.5x + 27\); the predicted score at 10 sessions is 62, while the predicted 97 at 20 sessions is an unreliable extrapolation beyond the data range.

exam · 5 marks

Six Australian households were surveyed. For each, \(x\) is the number of residents and \(y\) is the average daily electricity use in kilowatt-hours. The number of residents took the values 2, 3, 4, 5, 6 across five of the homes (the sixth is omitted here) and the corresponding electricity figures were 33, 41, 49, 56, 66. These five pairs have \(s_x = 1.5\), \(s_y = 12\) and \(r = 0.9\). (a) Find the mean number of residents and the mean daily usage from the raw values. (b) Hence find the equation of the least-squares regression line, using \(m = r \times \dfrac{s_y}{s_x}\) and the point of means. (c) Predict the daily usage of a 3 person household and of a 10 person household, classifying each prediction as interpolation or extrapolation.

Worked solution

Find the two means from the raw values. Add the residents and divide by 5, then do the same for the usage figures:

\[\bar{x} = \frac{2 + 3 + 4 + 5 + 6}{5} = \frac{20}{5} = 4,\]

\[\bar{y} = \frac{33 + 41 + 49 + 56 + 66}{5} = \frac{245}{5} = 49.\]

Find the gradient and intercept. The gradient is

\[m = r \times \frac{s_y}{s_x} = 0.9 \times \frac{12}{1.5} = 0.9 \times 8 = 7.2.\]

The line passes through \((\bar{x}, \bar{y}) = (4, 49)\), so

\[b = \bar{y} - m\bar{x} = 49 - 7.2 \times 4 = 49 - 28.8 = 20.2,\]

giving the equation \(y = 7.2x + 20.2\).

Predict at \(x = 3\). Substitute \(x = 3\):

\[y = 7.2 \times 3 + 20.2 = 21.6 + 20.2 = 41.8,\]

so about 41.8 kWh per day. Since 3 lies inside the data range 2 to 6 residents, this is interpolation.

Predict at \(x = 10\). Substitute \(x = 10\):

\[y = 7.2 \times 10 + 20.2 = 72 + 20.2 = 92.2,\]

so about 92.2 kWh per day. Since 10 is outside the range 2 to 6, this is extrapolation and should be treated with caution.

Answer: the means are \(\bar{x} = 4\) residents and \(\bar{y} = 49\) kWh; the line is \(y = 7.2x + 20.2\); the predicted usage is about 41.8 kWh for 3 residents (interpolation, reliable) and about 92.2 kWh for 10 residents (extrapolation, unreliable).
