Module 3: Software Automation
Quick questions on Training data quality and bias explained: HSC Software Engineering Module 3
12 short Q&A pairs drawn directly from our worked dot-point answer. For full context and worked exam questions, read the parent dot-point page.
What is garbage in, garbage out?
A model can only learn what is in the data. If the training data has flaws, the model will replicate or amplify them. Common data flaws:
What is sample bias in detail?
If 90 percent of training data comes from group A and 10 percent from group B, the model's loss function rewards being right about group A more than group B. The model can achieve high overall accuracy by performing well on A and poorly on B.
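As a rough worked example of that arithmetic, overall accuracy is the share-weighted average of the per-group accuracies. The 98 percent and 60 percent figures below are hypothetical, chosen only to show how a 90/10 split lets a strong result on group A mask a weak one on group B.

```python
def overall_accuracy(group_shares, group_accuracies):
    """Overall accuracy as the share-weighted average of per-group accuracies."""
    return sum(group_shares[g] * group_accuracies[g] for g in group_shares)


# 90/10 split from the answer above; the accuracies are made-up illustrations.
shares = {"A": 0.9, "B": 0.1}
accuracies = {"A": 0.98, "B": 0.60}

overall = overall_accuracy(shares, accuracies)
print(f"Overall accuracy: {overall:.3f}")  # 0.9 * 0.98 + 0.1 * 0.60
```

Under these assumed numbers the headline figure stays above 94 percent even though group B is served far worse than group A.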
What are overfitting and underfitting?
Overfitting: the model memorises the training data. High training accuracy, low test accuracy. The model has learned noise, not patterns. Solutions: more data, smaller model, regularisation (dropout, L2 weight decay), early stopping. Underfitting: the model is too simple to capture the patterns. Low training accuracy, low test accuracy.
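Early stopping, one of the remedies listed above, can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function name `early_stopping_epoch`, the `patience` parameter and the validation-loss values are all assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss,
    stopping once the loss has failed to improve `patience` times in a row."""
    best_epoch = None
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss is rising: likely overfitting
    return best_epoch


# Hypothetical validation losses: they fall, then start to climb again.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(val_losses)
print(f"Keep the model from epoch {best}")
```

The idea is that training loss usually keeps falling while validation loss turns upward once the model starts memorising noise, so the weights from the best validation epoch are the ones to keep.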
What is per-group evaluation?
A 95 percent overall accuracy can hide an 80 percent accuracy on one demographic. Always compute metrics per group.
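One possible way to compute accuracy per group is sketched below using only the standard library. The function name `accuracy_by_group` and the tiny labelled dataset are hypothetical; a real evaluation would use the model's predictions on a held-out test set.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group label to the accuracy on that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        if truth == pred:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up labels: eight examples from group A, two from group B.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(f"Overall: {overall:.2f}")
for group, accuracy in sorted(per_group.items()):
    print(f"Group {group}: {accuracy:.2f}")
```

Reporting the per-group figures next to the overall figure is what exposes the gap; the same pattern applies to other metrics such as precision or recall.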
What is train, validation, test discipline?
A clean split prevents data contamination.
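A minimal sketch of such a split follows, assuming a simple shuffled 70/15/15 partition. The fractions, the fixed seed and the function name `train_val_test_split` are illustrative choices, not values taken from the syllabus. The usual convention is that the training set fits the model, the validation set guides tuning, and the test set is touched only once for the final evaluation.

```python
import random


def train_val_test_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of the rows and cut it into three non-overlapping parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 labelled examples
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))

# No example may appear in more than one split.
assert not set(train) & set(val)
assert not set(train) & set(test)
assert not set(val) & set(test)
```

Because every row lands in exactly one list, nothing the model is tested on can have been seen during training or tuning.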
What is data documentation?
Every dataset should be accompanied by:
What is overfitting?
The model memorises the training data. High training accuracy, low test accuracy. The model has learned noise, not patterns.
What is underfitting?
The model is too simple to capture the patterns. Low training accuracy, low test accuracy. Solutions: more powerful model, more features, less aggressive regularisation.
Why is reporting only overall accuracy a mistake?
A model can have 99 percent overall accuracy and 50 percent on a minority group. Markers want per-group evaluation.
Why is it a mistake to confuse overfitting with bias?
Overfitting is when a model memorises noise in the training data. Bias is when the data systematically misrepresents the world. They are different failure modes, though both originate in the training set.
Why is it a mistake to believe "more data" always fixes bias?
More biased data does not fix bias - it amplifies it. The data must be more representative, not just more abundant.
Why does deleting the protected feature fail to remove bias?
"Don't include gender in features" does not fix bias if other features (postcode, occupation, vocabulary) correlate with gender. Mitigation requires representative data and group-aware evaluation.
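One rough way to check whether a remaining feature acts as a proxy is to ask how well it predicts the protected attribute on its own. The sketch below is an assumption-laden illustration: the function name `proxy_leakage`, the occupation labels and the group labels are all invented, and real audits use more careful statistics.

```python
from collections import Counter, defaultdict


def proxy_leakage(proxy_values, protected_values):
    """Fraction of rows whose protected attribute is guessed correctly by
    predicting the most common protected value for each proxy value."""
    by_proxy = defaultdict(Counter)
    for proxy, protected in zip(proxy_values, protected_values):
        by_proxy[proxy][protected] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_proxy.values())
    return correct / len(protected_values)


# Made-up data: the protected column has been "deleted" from the features,
# but occupation still carries information about it.
occupation = ["job_x"] * 4 + ["job_y"] * 4
protected = ["G1", "G1", "G1", "G2", "G2", "G2", "G2", "G1"]

leakage = proxy_leakage(occupation, protected)
baseline = max(Counter(protected).values()) / len(protected)

print(f"Guessing from occupation: {leakage:.2f}")
print(f"Guessing the majority group: {baseline:.2f}")
```

When the first number is well above the second, the model can still recover the protected attribute from the proxy, which is why dropping the column alone is not a fix.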