Inquiry Question 2: How are machine learning systems used to develop solutions?
Explain how the quality and representativeness of training data affect a model, including the risks of bias and overfitting
A focused answer to the HSC Software Engineering Module 3 dot point on training data. Sample bias, label bias, the train/test split, overfitting and underfitting, the worked example, and the traps markers look for.
What this dot point is asking
NESA wants you to explain how the data used to train a model shapes its behaviour, and identify two failure modes: bias (systematic errors against specific groups) and overfitting (memorising the training data instead of learning generalisable patterns).
The answer
Garbage in, garbage out
A model can only learn what is in the data. If the training data has flaws, the model will replicate or amplify them. Common data flaws:
- Sample bias: the training data does not represent the population. A model trained on Sydney pedestrians may misbehave in Tokyo.
- Label bias: the labels themselves are wrong or reflect human prejudice. A "good employee" label assigned by biased managers reproduces their bias.
- Measurement bias: the features are measured differently across groups or contexts (for example, thermometers calibrated differently in two cities), so the same value does not mean the same thing everywhere.
- Historical bias: the data reflects past inequities. A loan approval model trained on past approvals reflects past discrimination.
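Sample bias in particular can often be caught before training by comparing each group's share of the dataset with its share of the population the model will serve. A minimal sketch, assuming a hypothetical helper `representation_gaps` and made-up population shares:

```python
from collections import Counter

def representation_gaps(groups, population_shares):
    """Return dataset share minus population share for each group."""
    counts = Counter(groups)
    total = len(groups)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in population_shares.items()
    }

# Hypothetical dataset: 90 examples from group A, 10 from group B.
training_groups = ["A"] * 90 + ["B"] * 10
# Hypothetical population the model will be deployed on.
population = {"A": 0.6, "B": 0.4}

gaps = representation_gaps(training_groups, population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
# A: +0.30
# B: -0.30
```

A large positive gap means the group is over-represented in the data; a large negative gap means it is under-represented.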
Real examples
- Amazon's hiring tool: trained on a decade of CVs, mostly from male engineers. The model learned to downrank CVs that mentioned "women's chess club" and similar phrases. Amazon scrapped the project.
- Facial recognition disparities: commercial systems audited in 2018 had error rates under 1 percent for lighter-skinned men and up to roughly 35 percent for darker-skinned women, traced to the training set composition.
- Healthcare risk scores: a US algorithm that used healthcare spending as a proxy for need underestimated illness in Black patients, because they had less access to care and therefore lower past spending.
Sample bias in detail
If 90 percent of training data comes from group A and 10 percent from group B, the model's loss function rewards being right about group A more than group B. The model can achieve high overall accuracy by performing well on A and poorly on B.
Fix: oversample under-represented groups, undersample over-represented groups, or use class weights so each group contributes equally to the loss.
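The weighting idea can be sketched in a few lines. This is an illustration only: `balanced_group_weights` is a hypothetical helper, and it uses the common "balanced" formula (total examples divided by number of groups times group size) so that each group's weights sum to the same total.

```python
from collections import Counter

def balanced_group_weights(groups):
    """Give each example a weight so every group has equal total weight."""
    counts = Counter(groups)
    n_total = len(groups)
    n_groups = len(counts)
    return [n_total / (n_groups * counts[g]) for g in groups]

# Same 90/10 imbalance as in the text.
groups = ["A"] * 90 + ["B"] * 10
weights = balanced_group_weights(groups)

total_a = sum(w for g, w in zip(groups, weights) if g == "A")
total_b = sum(w for g, w in zip(groups, weights) if g == "B")
print(round(total_a, 6), round(total_b, 6))  # 50.0 50.0
```

Each group B example now counts nine times as much as a group A example, so errors on group B are no longer cheap for the model to make. Many training libraries accept per-example weights of this kind.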
Overfitting and underfitting
Overfitting: the model memorises the training data. High training accuracy, low test accuracy. The model has learned noise, not patterns. Solutions: more data, smaller model, regularisation (dropout, L2 weight decay), early stopping.
Underfitting: the model is too simple to capture the patterns. Low training accuracy, low test accuracy. Solutions: more powerful model, more features, less aggressive regularisation.
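from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic noisy data stands in for a real dataset (an assumption for this demo).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Overfit: an unlimited-depth tree memorises the training data, noise included.
overfit = DecisionTreeClassifier(max_depth=None, random_state=0)
overfit.fit(X_train, y_train)
print("train:", overfit.score(X_train, y_train)) # very high
print("test:", overfit.score(X_test, y_test)) # much lower
# Better: limit depth to force generalisation.
fit_ok = DecisionTreeClassifier(max_depth=5, random_state=0)
fit_ok.fit(X_train, y_train)
print("train:", fit_ok.score(X_train, y_train)) # lower than the unlimited tree
print("test:", fit_ok.score(X_test, y_test)) # usually closer to train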
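The train/test comparison above can be turned into a rough rule of thumb. This is a sketch only: the helper `diagnose_fit` and its thresholds (`low` and `max_gap`) are illustrative assumptions, not standard values.

```python
def diagnose_fit(train_acc, test_acc, low=0.7, max_gap=0.1):
    """Label a model from its train and test accuracy (thresholds are illustrative)."""
    if train_acc < low:
        return "underfitting"   # cannot even fit the training data
    if train_acc - test_acc > max_gap:
        return "overfitting"    # fits training data but does not generalise
    return "reasonable fit"

print(diagnose_fit(1.00, 0.72))  # overfitting
print(diagnose_fit(0.62, 0.60))  # underfitting
print(diagnose_fit(0.88, 0.85))  # reasonable fit
```

Sensible thresholds depend on the task, so treat the numbers as placeholders.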
Per-group evaluation
A 95 percent overall accuracy can hide an 80 percent accuracy on one demographic. Always compute metrics per group:
import pandas as pd
results = pd.DataFrame({
"group": ["A"] * 90 + ["B"] * 10,
"correct": [True] * 88 + [False] * 2 + [True] * 7 + [False] * 3,
})
print(results.groupby("group")["correct"].mean())
# group
# A 0.977778
# B 0.700000
# Name: correct, dtype: float64
Overall accuracy is 95 percent, yet group B's 70 percent accuracy is hidden by the 98 percent on the majority group.
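Accuracy is not the only metric worth splitting by group. The sketch below computes the false positive rate and false negative rate per group for a binary classifier. The records are made up for illustration and `error_rates` is a hypothetical helper.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

# Made-up records: (group, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 0),
    ("B", 1, 0), ("B", 1, 1), ("B", 0, 1), ("B", 0, 0),
]

rates = {}
for group in sorted({g for g, _, _ in records}):
    y_true = [t for g, t, _ in records if g == group]
    y_pred = [p for g, _, p in records if g == group]
    rates[group] = error_rates(y_true, y_pred)
    fpr, fnr = rates[group]
    print(f"group {group}: FPR={fpr:.2f} FNR={fnr:.2f}")
# group A: FPR=0.00 FNR=0.00
# group B: FPR=0.50 FNR=0.50
```

Which error type matters more depends on the application: a false negative in medical screening and a false positive in fraud detection carry very different costs.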
Train, validation, test discipline
A clean split prevents data contamination:
- Training set (60-80 percent): used to fit the model.
- Validation set (10-20 percent): used during development to tune hyperparameters.
- Test set (10-20 percent): used once, at the end, to estimate real-world performance.
Looking at the test set during development leaks information and inflates reported accuracy.
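A three-way split can be sketched with the standard library alone. The helper name `three_way_split` and the 70/15/15 proportions are assumptions chosen to fall inside the ranges listed above.

```python
import random

def three_way_split(items, train=0.7, val=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

rows = list(range(100))  # stand-in for 100 labelled examples
train_set, val_set, test_set = three_way_split(rows)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15

# No example appears in more than one set.
assert not set(train_set) & set(val_set)
assert not set(train_set) & set(test_set)
assert not set(val_set) & set(test_set)
```

Shuffling with a fixed seed keeps the split reproducible, so the test set stays the same held-out examples across experiments.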
Data documentation
Every dataset should be accompanied by:
- A datasheet explaining where the data came from, who labelled it and how.
- The demographic composition of the data.
- Known limitations (sample bias, label noise, the time period covered).
- A statement of intended use and out-of-scope use cases.
This follows the "Datasheets for Datasets" proposal (Gebru et al., 2018), which is now widely cited as good practice.
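One way to make this documentation hard to skip is to store it as a structured record next to the data. The sketch below is illustrative only: the `Datasheet` class and its field names are assumptions that mirror the list above, not a format defined by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """Minimal dataset documentation record (field names are illustrative)."""
    source: str
    labelling_process: str
    demographic_composition: dict
    known_limitations: list = field(default_factory=list)
    intended_use: str = ""
    out_of_scope_uses: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields left empty, so gaps are visible before release."""
        return [name for name, value in vars(self).items() if not value]

sheet = Datasheet(
    source="hypothetical pedestrian images collected in Sydney",
    labelling_process="two human annotators per image",
    demographic_composition={"adults": 0.9, "children": 0.1},
    known_limitations=["daytime images only"],
)
print(sheet.missing_fields())  # ['intended_use', 'out_of_scope_uses']
```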
Past exam questions, worked
Real questions from past NESA papers on this dot point, with our answer explainer.
2025 HSC (6 marks): Explain how unrepresentative training data can introduce bias into a machine learning system. Describe two methods to detect or mitigate this bias.
A machine learning model can only learn patterns that exist in its training data. If the training data does not represent the population the model will be used on, the model's predictions are systematically wrong for the under-represented group. This is sample bias.
Worked example: a facial recognition model trained mostly on photos of light-skinned faces will be much less accurate on dark-skinned faces, because the training data did not represent that group well. A medical diagnosis model trained on male patients may miss conditions that present differently in female patients. A hiring screening model trained on past hires reflects past biases (e.g. preferring men if the company historically hired men).
Mitigation 1: audit the training data. Measure the representation of each demographic group in the dataset. If a group is under-represented, collect more examples from that group or use weighted sampling so under-represented examples count more during training.
Mitigation 2: evaluate per group. Compute accuracy, false positive rate and false negative rate separately for each demographic group, not just overall. A model with 95 percent overall accuracy might be 99 percent accurate on group A and 70 percent on group B - the per-group view exposes the problem.
Other valid answers: fairness-aware training objectives, removing protected attributes from features (with the caveat that proxy features can still leak the attribute), engaging affected communities in the model design.
Markers reward a clear definition of bias originating from data, a real-world example, and two distinct detection or mitigation methods (data-side and evaluation-side are the cleanest).