Inquiry Question 2: How are machine learning systems used to develop solutions?
Explain how the quality and representativeness of training data affect a model, including the risks of bias and overfitting
A focused answer to the HSC Software Engineering Module 3 dot point on training data, covering sample bias, label bias, the train/test split, overfitting and underfitting, worked examples, and the traps markers look for.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
NESA wants you to explain how the data used to train a model shapes its behaviour, and identify two failure modes: bias (systematic errors against specific groups) and overfitting (memorising the training data instead of learning generalisable patterns).
The answer
Garbage in, garbage out
A model can only learn what is in the data. If the training data has flaws, the model will replicate or amplify them. Common data flaws:
- Sample bias: the training data does not represent the population. A model trained on Sydney pedestrians may misbehave in Tokyo.
- Label bias: the labels themselves are wrong or reflect human prejudice. A "good employee" label assigned by biased managers reproduces their bias.
- Measurement bias: a feature is measured differently in different contexts, so the same value means different things (for example, thermometers calibrated differently in two cities).
- Historical bias: the data reflects past inequities. A loan approval model trained on past approvals reflects past discrimination.
Real examples
- Amazon's hiring tool: trained on a decade of CVs, mostly from male engineers. The model learned to downrank CVs that mentioned "women's chess club" and similar phrases. Amazon scrapped the project.
- Facial analysis disparities: commercial gender-classification systems audited in 2018 had error rates under 1 percent for lighter-skinned men and over 30 percent for darker-skinned women in the worst case, traced to the training set composition.
- Healthcare risk scores: a US algorithm using healthcare spending as a proxy for need underestimated illness in Black patients, because they had less access to care and therefore lower past spending.
Sample bias in detail
If 90 percent of training data comes from group A and 10 percent from group B, the model's loss function rewards being right about group A more than group B. The model can achieve high overall accuracy by performing well on A and poorly on B.
Fix: oversample under-represented groups, undersample over-represented groups, or use class weights so each group contributes equally to the loss.
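The class-weight idea above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `balanced_group_weights` is hypothetical, and the formula is the one scikit-learn documents for `class_weight="balanced"`, applied here to groups rather than class labels.

```python
from collections import Counter

def balanced_group_weights(groups):
    """Return one weight per example so every group has the same total weight."""
    counts = Counter(groups)
    n_total = len(groups)
    n_groups = len(counts)
    # weight = n_total / (n_groups * group_count): rare groups get larger weights.
    return [n_total / (n_groups * counts[g]) for g in groups]

# Same 90/10 imbalance as in the text.
groups = ["A"] * 90 + ["B"] * 10
weights = balanced_group_weights(groups)

total_a = sum(w for g, w in zip(groups, weights) if g == "A")
total_b = sum(w for g, w in zip(groups, weights) if g == "B")
print(round(weights[0], 3), round(weights[-1], 3))  # 0.556 5.0
print(round(total_a, 1), round(total_b, 1))         # 50.0 50.0
```

Weights like these could be passed as `sample_weight` when fitting a scikit-learn estimator, so that each group contributes equally to the loss even though group B has far fewer examples.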
Overfitting and underfitting
Overfitting: the model memorises the training data. High training accuracy, low test accuracy. The model has learned noise, not patterns. Solutions: more data, smaller model, regularisation (dropout, L2 weight decay), early stopping.
Underfitting: the model is too simple to capture the patterns. Low training accuracy, low test accuracy. Solutions: more powerful model, more features, less aggressive regularisation.
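```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data so the example runs on its own.
X, y = make_classification(n_samples=1000, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Overfit: an unlimited-depth tree memorises the training data.
overfit = DecisionTreeClassifier(max_depth=None, random_state=0)
overfit.fit(X_train, y_train)
print("train:", overfit.score(X_train, y_train))  # very high
print("test:", overfit.score(X_test, y_test))  # much lower

# Better: limit depth to force generalisation.
fit_ok = DecisionTreeClassifier(max_depth=5, random_state=0)
fit_ok.fit(X_train, y_train)
print("train:", fit_ok.score(X_train, y_train))  # high
print("test:", fit_ok.score(X_test, y_test))  # close to train
```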
Per-group evaluation
A 95 percent overall accuracy can hide a 70 percent accuracy on one demographic. Always compute metrics per group:
```python
import pandas as pd

results = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,
    "correct": [True] * 88 + [False] * 2 + [True] * 7 + [False] * 3,
})
print(results.groupby("group")["correct"].mean())
# group
# A    0.977778
# B    0.700000
# Name: correct, dtype: float64
```
Group B's 70 percent accuracy is hidden by the roughly 98 percent accuracy on the majority group, which pulls the overall figure up to 95 percent.
Train, validation, test discipline
A clean split prevents data contamination:
- Training set (60-80 percent): used to fit the model.
- Validation set (10-20 percent): used during development to tune hyperparameters.
- Test set (10-20 percent): used once, at the end, to estimate real-world performance.
Looking at the test set during development leaks information and inflates reported accuracy.
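The three-way split described above can be sketched in plain Python. This is an illustrative version only: the function name `three_way_split` and the 70/15/15 proportions are assumptions chosen to sit inside the ranges listed, and in scikit-learn the same effect is usually achieved by calling `train_test_split` twice.

```python
import random

def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150

# No example appears in more than one set, so nothing leaks between them.
overlap = (set(train) & set(val)) | (set(train) & set(test)) | (set(val) & set(test))
print(len(overlap))  # 0
```

Tuning decisions would then use only `train` and `val`, with `test` scored once at the end.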
Data documentation
Every dataset should be accompanied by:
- A datasheet explaining where the data came from, who labelled it and how.
- The demographic composition of the data.
- Known limitations (sample bias, label noise, the time period covered).
- A statement of intended use and out-of-scope use cases.
This follows the "Datasheets for Datasets" proposal (Gebru et al., 2018), which has been widely influential on dataset documentation practice in industry.
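One way to keep such documentation next to the data is a small structured record. The sketch below is hypothetical: the `Datasheet` class, its field names and the example values are illustrative assumptions that mirror the four items listed above, not a format defined by the datasheets proposal.

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """Hypothetical record mirroring the documentation items listed above."""
    source: str
    labelling_process: str
    demographic_composition: dict
    known_limitations: list = field(default_factory=list)
    intended_use: str = ""
    out_of_scope_uses: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections left empty, so gaps are visible before release."""
        return [name for name, value in vars(self).items() if not value]

# Illustrative values only.
sheet = Datasheet(
    source="Hypothetical clinic survey",
    labelling_process="Two annotators per record, disagreements reviewed",
    demographic_composition={"A": 0.9, "B": 0.1},
)
print(sheet.missing_sections())
# ['known_limitations', 'intended_use', 'out_of_scope_uses']
```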
Exam-style practice questions
Practice questions written in the style of NESA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
2025 HSC (6 marks): Explain how unrepresentative training data can introduce bias into a machine learning system. Describe two methods to detect or mitigate this bias.
A machine learning model can only learn patterns that exist in its training data. If the training data does not represent the population the model will be used on, the model's predictions are systematically wrong for the under-represented group. This is sample bias.
Worked example: a facial recognition model trained mostly on photos of light-skinned faces will be much less accurate on dark-skinned faces, because the training data did not represent that group well. A medical diagnosis model trained on male patients may miss conditions that present differently in female patients. A hiring screening model trained on past hires reflects past biases (e.g. preferring men if the company historically hired men).
Mitigation 1: audit the training data. Measure the representation of each demographic group in the dataset. If a group is under-represented, collect more examples from that group or use weighted sampling so under-represented examples count more during training.
Mitigation 2: evaluate per group. Compute accuracy, false positive rate and false negative rate separately for each demographic group, not just overall. A model with 95 percent overall accuracy might be 99 percent accurate on group A and 70 percent on group B - the per-group view exposes the problem.
Other valid answers: fairness-aware training objectives, removing protected attributes from features (with the caveat that proxy features can still leak the attribute), engaging affected communities in the model design.
Markers reward a clear definition of bias originating from data, a real-world example, and two distinct detection or mitigation methods (data-side and evaluation-side are the cleanest).
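The per-group evaluation in Mitigation 2 can be sketched as follows. This is a minimal illustration on made-up counts: the function name `per_group_rates` and the records are assumptions, with 1 standing for the positive class.

```python
from collections import defaultdict

def per_group_rates(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "tn": 0, "fp": 0, "fn": 0})
    for group, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "tn" if y_pred == 0 else "fp"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / (positives + negatives),
            # False positive rate: share of true negatives predicted positive.
            "fpr": c["fp"] / negatives if negatives else None,
            # False negative rate: share of true positives predicted negative.
            "fnr": c["fn"] / positives if positives else None,
        }
    return rates

# Made-up results: 90 cases from group A, 10 from group B.
records = (
    [("A", 1, 1)] * 40 + [("A", 0, 0)] * 48 + [("A", 0, 1)] * 1 + [("A", 1, 0)] * 1
    + [("B", 1, 1)] * 3 + [("B", 0, 0)] * 4 + [("B", 0, 1)] * 1 + [("B", 1, 0)] * 2
)
rates = per_group_rates(records)
for group, r in rates.items():
    print(group, r)
```

In this made-up data the overall accuracy is 95 percent, yet group B has a 40 percent false negative rate, which a single overall figure would not show.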
Practice questions
Original practice questions graded from foundation to exam level, each with a full worked solution. Try them before revealing the solution.
Foundation (2 marks): Distinguish between sample bias and label bias, giving a one-sentence definition of each.
Sample bias: the training data does not represent the population the model will be used on (e.g. mostly one demographic group).
Label bias: the labels themselves are wrong or reflect human prejudice, even if the underlying examples are representative.
Marking criteria: 1 mark for a correct definition of sample bias, 1 mark for a correct definition of label bias that clearly distinguishes it from sample bias (labels being wrong, not the sample being unrepresentative).
Foundation (3 marks): A car insurer trains a claims-risk model using five years of policy data from drivers in Sydney only, then deploys it nationally. Identify the type of bias most likely to occur and explain its likely consequence.
This is sample bias: the training data (Sydney drivers only) does not represent the national population the model is deployed on (different road types, traffic density, weather, and driving conditions in regional and rural areas).
Consequence: the model's risk predictions will be systematically less accurate for non-Sydney drivers, e.g. it may overestimate risk for careful regional drivers whose conditions differ from Sydney traffic patterns, or underestimate risks specific to regional/rural driving (long distances, wildlife strikes, unsealed roads) that never appeared in the training data.
Marking criteria: 1 mark for correctly naming sample bias, 1 mark for identifying that the training population does not match the deployment population, 1 mark for a specific, plausible consequence (systematically wrong predictions for non-Sydney drivers).
Core (4 marks): The table shows a model's accuracy broken down by two demographic groups after deployment.
| Group | Number of cases | Accuracy |
|---|---|---|
| A | 900 | 98% |
| B | 100 | 62% |
(a) Calculate the overall accuracy across both groups combined. (b) Explain why reporting only the overall accuracy would hide a serious problem.
(a) Overall accuracy. Correct predictions = (900 × 0.98) + (100 × 0.62) = 882 + 62 = 944. Overall accuracy = 944 / 1000 = 94.4 percent.
(b) Why this hides a problem. The overall figure of 94.4 percent looks strong and would pass most casual quality checks, but it is dominated by Group A, which makes up 90 percent of the cases. Group B's accuracy of only 62 percent, far below an acceptable standard, is effectively averaged away. Anyone in Group B experiences a model that is wrong more than one time in three, a serious fairness and reliability issue that the single overall number completely conceals.
Marking criteria: 1 mark for correct working, 1 mark for the correct overall accuracy (94.4%), 1 mark for explaining that the larger group dominates the average, 1 mark for explaining the consequence for Group B specifically.
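The arithmetic for part (a) can be checked with a short sketch. The variable names are illustrative; the figures are the ones from the table.

```python
# (number of cases, accuracy) for each group, taken from the table.
groups = {"A": (900, 0.98), "B": (100, 0.62)}

correct = sum(n * acc for n, acc in groups.values())
total = sum(n for n, _ in groups.values())

# Overall accuracy weights each group by its size.
overall = correct / total
# Averaging the two group accuracies treats the groups equally instead.
unweighted = sum(acc for _, acc in groups.values()) / len(groups)

print(f"overall: {overall:.1%}")        # overall: 94.4%
print(f"unweighted: {unweighted:.1%}")  # unweighted: 80.0%
```

The gap between the two figures is one quick signal that a large group is masking poor performance on a small one.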
Core (5 marks): A decision tree classifier is trained with no depth limit on a dataset of 1,000 labelled examples, achieving 99.5 percent accuracy on the training set but 71 percent accuracy on a held-out test set of 250 examples. (a) Identify the failure mode shown. (b) Recommend two specific changes to the model or training process to address it, explaining how each helps.
(a) Failure mode. This is overfitting. The very large gap between training accuracy (99.5%) and test accuracy (71%) shows the tree has memorised the specific examples (including noise) in the training set rather than learning patterns that generalise to new data.
(b) Two mitigations.
- Limit the tree's max_depth (e.g. to around 5). A shallower tree cannot create a separate branch for every individual training example, forcing it to capture only the broader, generalisable patterns rather than memorising noise.
- Get more training data, or prune the tree (for example by requiring a minimum number of samples per leaf). More diverse examples make it harder for the model to memorise the whole dataset, and pruning removes branches that fit only a handful of training examples, which directly targets the mechanism causing the overfit.
Marking criteria: 1 mark for correctly identifying overfitting, 1 mark for correctly justifying it using the train/test gap, 1 mark each for two valid, specific mitigations (2 marks), 1 mark for explaining the mechanism by which at least one mitigation reduces overfitting.
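Choosing a depth limit, as in the first mitigation, can be sketched with a small sweep. This is an illustration under stated assumptions: the data is synthetic (generated by `make_classification`), the candidate depths are arbitrary, and a validation split is used for the choice so that a final test set would stay untouched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the 1,000 labelled examples.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for depth in [2, 4, 6, 8, None]:  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_val, y_val))

for depth, (train_acc, val_acc) in scores.items():
    print(depth, round(train_acc, 3), round(val_acc, 3))

# Pick the depth with the best validation accuracy, not the best training accuracy.
best_depth = max(scores, key=lambda d: scores[d][1])
print("chosen max_depth:", best_depth)
```

The unlimited tree reaches perfect training accuracy because it can isolate every example, which is exactly why training accuracy alone is a poor guide to the right depth.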
Exam (6 marks): A public hospital network builds a triage-priority model trained on ten years of historical patient records, using length of past hospital stay as a proxy label for 'severity of illness'. Explain how this design choice could produce a biased model, and evaluate whether removing the patient's postcode from the features would fix the bias.
- How the design choice introduces bias: Length of past hospital stay is not the same thing as true severity of illness; it also reflects factors such as access to care, insurance status, and whether a patient could afford or arrange follow-up outside hospital. If historically disadvantaged groups had shorter stays not because they were less severely ill but because they were discharged earlier due to under-resourcing, cost pressure, or systemic under-treatment, the model will learn to associate that group with lower severity. This is a form of historical/label bias: the label (length of stay) is a flawed proxy that encodes past inequity rather than genuine medical need, similar to the well-documented case of a healthcare risk-scoring algorithm that used past spending as a severity proxy and underestimated illness in Black patients because they had historically received less care.
- Would removing postcode fix it? Removing postcode alone would not reliably fix the bias. Postcode is one possible proxy feature for the disadvantage driving the bias, but the underlying problem is the flawed label (length of stay), not a single input feature. Even without postcode, other correlated features, such as insurance status, referring hospital, or even certain diagnosis codes, could still let the model reconstruct the same pattern indirectly. Removing one proxy feature can also make the bias harder to detect, since an obvious explanatory variable has been hidden without removing its underlying effect.
- Better approach: Replace or supplement the flawed label with a more clinically direct severity measure (e.g. validated clinical severity scores recorded at admission), and evaluate the model's error rates separately across demographic and socioeconomic groups before and after deployment, rather than relying on removing a single feature.
Marking criteria: 1 mark for identifying length of stay as a flawed proxy label, 1 mark for explaining the mechanism by which this encodes historical inequity (naming or clearly describing historical/label bias), 1 mark for a concrete real-world parallel or plausible consequence, 1 mark for correctly concluding that removing postcode alone is insufficient, 1 mark for explaining why (proxy features/correlated variables can still leak the same signal), 1 mark for proposing a genuinely better mitigation (better label, per-group evaluation).
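The point about proxy features can be illustrated with synthetic data. Everything here is made up for the sketch: `postcode_flag` stands in for the removed attribute, `proxy` for any correlated feature left in the dataset, and the noise level of 0.5 is an arbitrary assumption.

```python
import random

rng = random.Random(0)
n = 2000

# Hypothetical removed attribute (e.g. a disadvantaged-postcode flag): 0 or 1.
postcode_flag = [rng.randint(0, 1) for _ in range(n)]
# A correlated feature that stays in the data after the flag is dropped.
proxy = [flag + rng.gauss(0, 0.5) for flag in postcode_flag]

# Guess the removed flag from the proxy alone, using a midpoint threshold.
guessed = [1 if value > 0.5 else 0 for value in proxy]
recovered = sum(g == f for g, f in zip(guessed, postcode_flag)) / n

print(f"flag recovered from proxy: {recovered:.0%}")
```

Even this crude threshold recovers the removed flag far more often than the 50 percent expected from guessing, so a model given the proxy can still learn the same pattern indirectly.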
Exam (7 marks): Assess the claim that collecting more training data is always the best way to improve a machine learning model's fairness and accuracy.
This is an ASSESS question: markers reward a supported judgement, not a one-sided description.
- Plan (thesis): collecting more data is not always the best fix, and can actively worsen fairness, because the value of additional data depends entirely on whether it improves the data's representativeness and label quality, not merely its volume.
- Case for more data helping: More data generally helps a model learn more robust, generalisable patterns and reduces overfitting, because noise averages out over more examples and the model is less likely to memorise idiosyncrasies of a small dataset. For underfitting caused by too little signal, more diverse examples can also expose patterns a smaller dataset never showed.
- Case against "more data" as a universal fix: If the additional data is collected using the same flawed process as the original set, e.g. still drawn overwhelmingly from one demographic group, or still using the same biased labelling process, then the imbalance in the dataset is simply reproduced at a larger scale. This does not fix sample bias or label bias; it can entrench them further, because the model now has even stronger statistical evidence supporting the same skewed pattern. Amazon's scrapped hiring tool is a case in point: adding more historical CVs, almost all from men, would not have fixed its bias against female-associated language; it would have reinforced it. Similarly, collecting more data does little to fix overfitting caused by an excessively complex model relative to a genuinely small underlying signal, nor does it fix underfitting caused by choosing too simple a model or missing features.
- Model paragraph (excerpt): The claim treats "more data" as a single lever, but data quantity and data quality are separate variables that a good engineer must evaluate independently. A dataset that doubles in size while keeping the same demographic skew has not become more representative; it has simply given the model twice as much evidence for the same unbalanced pattern, and per-group evaluation would likely show the accuracy gap between groups persisting or widening rather than closing. The genuinely effective fix is targeted: identifying which group or label process is under-represented or flawed, and correcting that specific gap, whether through stratified collection, reweighting, or fixing the labelling method, rather than an undirected increase in volume.
Marker's note: top-band answers (1) take an explicit position rather than agreeing uncritically with the claim, (2) distinguish overfitting/underfitting (where more data often does help) from bias (where it often does not), (3) use a specific named or described real-world example, and (4) close with a reasoned judgement about when more data helps versus when targeted, representative collection is required instead.
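The quantity-versus-representativeness distinction can be shown with simple arithmetic. The counts below are made up for illustration, and `share_of_b` is a hypothetical helper name.

```python
def share_of_b(n_a, n_b):
    """Fraction of the dataset that comes from under-represented group B."""
    return n_b / (n_a + n_b)

original = share_of_b(900, 100)
# Doubling the data with the same collection process keeps the same skew.
doubled_same_process = share_of_b(1800, 200)
# Collecting 800 extra examples from group B only changes the balance.
targeted = share_of_b(900, 100 + 800)

print(original, doubled_same_process, targeted)  # 0.1 0.1 0.5
```

Both expanded datasets are larger than the original, but only the targeted one is more representative, which is the distinction a top-band answer needs to draw.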
