How do you measure the association between two numerical variables and fit a least-squares line to model and predict?
Investigate the association between two numerical variables using a scatterplot, the correlation coefficient and the coefficient of determination, fit a least-squares regression line, and interpret its slope, intercept and residuals
A focused answer to the VCE General Mathematics Unit 3 Data analysis key-knowledge point on bivariate data. It covers reading scatterplots, the correlation coefficient r, the coefficient of determination, fitting and interpreting a least-squares line, residual plots and the dangers of extrapolation.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
VCAA wants you to investigate the relationship between two numerical variables. You describe a scatterplot, quantify the strength of a linear association with the correlation coefficient $r$, fit a least-squares regression line, interpret its slope and intercept in context, use it to predict, and judge whether the line is a good model using the coefficient of determination and a residual plot. This is the heart of Unit 3 and the single most heavily examined area in the course.
The correlation coefficient
Pearson's correlation coefficient $r$ measures the strength and direction of a linear association. It satisfies $-1 \le r \le 1$. A value near $+1$ means a strong positive linear association, near $-1$ a strong negative one, and near $0$ no linear association. As a rough guide (using the cut-offs commonly quoted in VCE General Mathematics), $0.75 \le |r| \le 1$ is strong, $0.5 \le |r| < 0.75$ is moderate, and $0.25 \le |r| < 0.5$ is weak. Correlation requires both variables to be numerical and the relationship to be linear with no clear outliers.
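In the exam the correlation coefficient comes from a CAS calculator, but seeing the calculation once can make the definition more concrete. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the hours/scores data are hypothetical and chosen purely for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired numerical data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied (explanatory) and test score (response)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)
print(f"r = {r:.3f}")  # a value close to +1 indicates a strong positive linear association
```

Because the made-up points lie almost exactly on a rising straight line, the value printed is close to $+1$.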
The coefficient of determination
The coefficient of determination is $r^2$, usually quoted as a percentage. It states the proportion of the variation in the response variable that is explained by the variation in the explanatory variable through the linear model.
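As a quick worked illustration, take a correlation coefficient of $-0.8$ (an assumed value, not taken from any data set on this page):

$$r = -0.8 \quad\Longrightarrow\quad r^2 = (-0.8)^2 = 0.64$$

So 64% of the variation in the response variable would be explained by the variation in the explanatory variable. Squaring removes the sign, so the coefficient of determination on its own does not tell you the direction of the association.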
The least-squares regression line
The least-squares line minimises the sum of the squared vertical distances (residuals) from the data points to the line. Its equation is written

$$y = a + bx$$

where $y$ is the response variable, $x$ is the explanatory variable, $b$ is the slope and $a$ is the intercept. The slope and intercept can be found from

$$b = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}$$

Here $s_x$ and $s_y$ are the standard deviations, and $\bar{x}$ and $\bar{y}$ the means, of the two variables.
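The summary-statistics route to the line can be sketched in a few lines of Python. This is an illustration only: it assumes the usual formulas slope $b = r \cdot s_y / s_x$ and intercept $a = \bar{y} - b\bar{x}$ for a line written as $y = a + bx$, and the function name `least_squares_line` and the data are hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x, using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Pearson's r written in terms of the sample standard deviations
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Hypothetical data: hours studied (explanatory) and test score (response)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
a, b = least_squares_line(hours, scores)
print(f"score = {a:.1f} + {b:.1f} * hours")
```

For this made-up data the slope works out to 4.5 and the intercept to 46.9, so on average the predicted score rises by 4.5 marks for each extra hour of study.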
Interpreting the slope. The slope is the average change in the response variable for each one-unit increase in the explanatory variable.
Interpreting the intercept. The intercept is the predicted response when the explanatory variable equals zero, which is only meaningful if zero is in the range of the data.
Residuals and residual plots
A residual is the actual value minus the predicted value:

$$\text{residual} = y_{\text{actual}} - y_{\text{predicted}}$$
A positive residual means the point lies above the line. To check whether a straight line is the right model, plot the residuals against the explanatory variable. If the residual plot shows random scatter about zero, the linear model is appropriate. If it shows a clear pattern (such as a curve), a linear model is not appropriate and the data should be transformed.
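The residual calculation can be sketched as follows. The data and the line $y = 46.9 + 4.5x$ are hypothetical values used only to show the arithmetic, and `residuals` is an illustrative helper name.

```python
def residuals(xs, ys, a, b):
    """Residual = actual y minus predicted y, where predicted y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data and fitted line (intercept a, slope b)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
a, b = 46.9, 4.5

res = residuals(hours, scores, a, b)
for x, e in zip(hours, res):
    side = "above" if e > 0 else "below"
    print(f"x = {x}: residual = {e:+.1f} (point lies {side} the line)")
```

Plotting these residuals against the explanatory variable gives the residual plot. With only five made-up points no real conclusion about pattern can be drawn; the sketch is only meant to show how each residual is obtained.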
Interpolation, extrapolation and causation
Predicting within the range of the data is interpolation and is reasonably reliable. Predicting outside the range is extrapolation and is unreliable because there is no evidence the pattern continues. Finally, a strong correlation does not prove that one variable causes the other; a lurking third variable or coincidence may be responsible.
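One way to keep the interpolation and extrapolation distinction in mind is to attach a label to every prediction. The sketch below assumes a hypothetical fitted line $y = 46.9 + 4.5x$ built from data whose explanatory values ran from 1 to 5; the function name and all numbers are illustrative.

```python
def predict(x, a, b, x_min, x_max):
    """Predict y = a + b*x and say whether x lies inside the observed range of the data."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Hypothetical line fitted to data with explanatory values between 1 and 5
a, b = 46.9, 4.5
for x in (3.5, 10):
    y_hat, kind = predict(x, a, b, x_min=1, x_max=5)
    print(f"x = {x}: predicted y = {y_hat:.1f} ({kind})")
```

The second prediction uses a value well outside the data, so it should be flagged as unreliable in a written answer.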
Why this matters for the exams
Regression questions are worth a large share of every Data analysis exam and SAC. Markers reward correct interpretation in context far more than bare numbers, so practise writing slope, intercept and $r^2$ sentences. When the residual plot reveals curvature, the course moves on to data transformations (squaring, log, or reciprocal) to straighten the data before refitting, which is the natural next step from this dot point.
Exam-style practice questions
Practice questions written in the style of VCAA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
2023 VCAA, 1 mark
A least squares line can be used to model the birth rate (children per 1000 population) from the average daily food energy intake (megajoules). For a country with an intake of 8.53 megajoules the birth rate is 32.2 children per 1000, and for a country with an intake of 14.9 megajoules the birth rate is 9.9 children per 1000. The slope of this least squares line is closest to
A. -4.7
B. -3.5
C. -0.29
D. 2.7
E. 25
The slope of a line is the change in the response variable divided by the change in the explanatory variable.
Here the explanatory variable is food energy intake and the response variable is birth rate.
slope = (9.9 - 32.2) / (14.9 - 8.53) = -22.3 / 6.37 ≈ -3.50.
This is closest to -3.5, so the answer is B. The negative slope shows that as food energy intake rises, the predicted birth rate falls.
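The arithmetic can be checked with a short script. This is only a sketch of the rise-over-run calculation used in the worked answer, with the two points and the five options copied from the question.

```python
# The two points given in the question: (food energy intake, birth rate)
x1, y1 = 8.53, 32.2
x2, y2 = 14.9, 9.9

# Slope = change in response / change in explanatory
slope = (y2 - y1) / (x2 - x1)

# Pick the multiple-choice option closest to the calculated slope
options = {"A": -4.7, "B": -3.5, "C": -0.29, "D": 2.7, "E": 25}
closest = min(options, key=lambda k: abs(options[k] - slope))
print(f"slope = {slope:.2f}, closest option: {closest}")
```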
2023 VCAA, 1 mark
A study of Year 10 students shows a negative association between topic test scores and time spent on social media. The coefficient of determination is 0.72. From this information it can be concluded that
A. a decreased time spent on social media is associated with an increased topic test score.
B. less time spent on social media causes an increase in topic test performance.
C. an increased time spent on social media is associated with an increased topic test score.
D. too much time spent on social media causes a reduction in topic test performance.
E. a decreased time spent on social media is associated with a decreased topic test score.
The association is negative, so as one variable increases the other tends to decrease. This means less social media time is associated with higher test scores, and more social media time with lower scores.
The coefficient of determination measures only the strength of a linear association; it says nothing about cause and effect. Options B and D claim social media causes a change in performance, which is not justified by observational data, so they are wrong.
Option A correctly states an association (not causation) in the right direction, so the answer is A.
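Although the question does not ask for it, the correlation coefficient can be recovered from the information given. The coefficient of determination is 0.72 and the association is negative, so the negative square root is taken:

$$r = -\sqrt{0.72} \approx -0.849$$

This would usually be described as a strong negative linear association. Assuming time spent on social media is treated as the explanatory variable, it also means that 72% of the variation in topic test scores is explained by the variation in time spent on social media.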
2025 VCAA, 1 mark
In a scatterplot, games won is plotted against goals against for 12 A-League men's teams. The correlation coefficient between games won and goals against is r = -0.466. Based on the correlation coefficient, it can be concluded that
A. fewer goals scored against is associated with a smaller number of wins.
B. fewer goals scored against causes a smaller number of wins.
C. more goals scored against causes a smaller number of wins.
D. more goals scored against is associated with a smaller number of wins.
A negative correlation coefficient means the two variables move in opposite directions: as goals against increases, games won tends to decrease.
Correlation does not establish cause and effect, so options B and C, which use the word causes, are not justified by this data.
Option A describes the wrong direction (fewer goals against with fewer wins contradicts the negative association). Option D correctly states that more goals against is associated with fewer wins, so the answer is D.
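As an extension beyond what the question asks, the same value of the correlation coefficient gives the coefficient of determination:

$$r^2 = (-0.466)^2 \approx 0.217$$

So only about 21.7% of the variation in games won is explained by the variation in goals against, which is consistent with the association being a fairly loose one.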