Inquiry Question 1: How do machine learning systems work?
Describe the basic structure of a neural network, including neurons, layers, weights, activation functions and training by backpropagation
A full study guide to the HSC Software Engineering Module 3 dot point on neural networks: neurons, layers, weights, activation functions, a network diagram, the forward pass and backpropagation, worked numeric examples and graded practice.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
NESA wants you to describe the basic architecture of a feed-forward neural network and the mechanics of how it is trained. You do not need to derive gradients, but you should know the components by name and what each does.
The answer
A feed-forward neural network is a stack of layers. The diagram shows a small network with an input layer of four features, one hidden layer of five neurons, and an output layer of three classes. Every neuron in one layer is connected to every neuron in the next.
The artificial neuron
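The basic unit. It takes inputs from the previous layer, multiplies each by a weight, sums them, adds a bias, and applies an activation function:
$$y = f\left(\sum_i w_i x_i + b\right)$$
where $x_i$ are the inputs, $w_i$ are the weights, $b$ is the bias and $f$ is the activation function.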
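The neuron's calculation can be sketched in a few lines of plain Python. This is an illustration only: the function names and the input, weight and bias values are hypothetical, chosen to keep the arithmetic easy to follow.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # pre-activation
    return activation(z)


# Hypothetical values, chosen only to illustrate the arithmetic:
# 0.5*1.0 + (-0.25)*2.0 + 0.5 = 0.5, and ReLU leaves a positive value unchanged.
example = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.5)
print(example)
```

Passing the activation in as a parameter mirrors the idea that the weighted sum is the same for every neuron and only the activation function changes between layers.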
Common activation functions:
- ReLU: $f(z) = \max(0, z)$. The default in hidden layers.
- Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$. Squashes the output to $(0, 1)$. Used for binary classification output.
- Softmax: turns a vector of scores into probabilities summing to 1. Used for multi-class classification output.
- Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. Squashes to $(-1, 1)$. Older default.
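As a rough sketch, the four activation functions listed above can be written with only Python's standard `math` module. The function names are this example's own, and real frameworks provide optimised versions of each.

```python
import math


def relu(z):
    """max(0, z): negatives become 0, positives pass through."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def softmax(scores):
    """Turns a list of scores into probabilities that sum to 1."""
    # Subtracting the largest score avoids overflow and does not change the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
```

Note that softmax works on a whole layer's scores at once, while the other three are applied to each neuron independently.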
Layers
A neural network is a stack of layers:
- Input layer: one neuron per feature. For an image, that might be 28 x 28 = 784 input neurons.
- Hidden layers: one or more layers between input and output. Each neuron is connected to every neuron in the previous layer (in a fully connected network).
- Output layer: one neuron for regression, $k$ neurons for $k$-class classification.
A "deep" network has many hidden layers. Each layer learns increasingly abstract features.
Forward pass
To make a prediction, feed the input through every layer in turn. Each layer computes its weighted sums and activations. The output layer produces the prediction.
For a 784-input, 128-hidden, 10-output digit classifier:
- Input: 784 pixel values, normalised to $[0, 1]$.
- Hidden layer: 128 neurons, each computing a weighted sum of the 784 inputs and applying ReLU.
- Output layer: 10 neurons (one per digit 0-9), each computing a weighted sum of the 128 hidden activations and applying softmax.
The output is a probability distribution over the 10 digits. The predicted digit is the one with the highest probability.
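The same forward pass can be sketched in plain Python on a much smaller network. The example below assumes 4 inputs, 5 hidden neurons and 3 output classes instead of 784, 128 and 10, and uses untrained random weights, so the predicted class is meaningless; the point is only the flow of data from layer to layer. All names and values are illustrative.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: one weighted sum (plus bias) per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu_all(values):
    return [max(0.0, v) for v in values]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an untrained layer)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden, output):
    h = relu_all(dense(x, *hidden))      # hidden layer: weighted sums, then ReLU
    return softmax(dense(h, *output))    # output layer: weighted sums, then softmax


rng = random.Random(0)
hidden = random_layer(4, 5, rng)
output = random_layer(5, 3, rng)
probs = forward([0.2, 0.7, 0.1, 0.9], hidden, output)
predicted_class = probs.index(max(probs))  # class with the highest probability
print(probs, predicted_class)
```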
Loss
Measures how wrong the prediction is.
- Cross-entropy loss for classification: low when the predicted probability of the correct class is high.
- Mean squared error for regression: low when the predicted value is close to the true value.
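Both losses are short enough to write by hand. The sketch below assumes the classifier outputs a list of probabilities and the label is a class index; the probability values are made up to show that a confident correct prediction scores a lower loss than a confident wrong one.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical predictions for an example whose true class is 1.
confident_right = cross_entropy([0.05, 0.90, 0.05], true_class=1)
confident_wrong = cross_entropy([0.90, 0.05, 0.05], true_class=1)
print(confident_right, confident_wrong)
```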
Backpropagation
The training algorithm. For each batch of training examples:
- Forward pass: compute predictions and the loss.
- Backward pass: compute the gradient of the loss with respect to every weight in the network, using the chain rule of calculus.
- Update: adjust every weight by a small step opposite to its gradient. The step size is the learning rate.
After many passes through the training data (epochs), the weights settle into values that produce good predictions.
The optimiser controls how the updates are applied. Stochastic gradient descent (SGD) updates the weights after each mini-batch. Adam is a popular variant that adapts the step size for each weight.
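To make the three steps concrete, here is a hand-worked sketch for the smallest possible case: one sigmoid neuron, one training example and a squared-error loss. The starting weight, input, target and learning rate are all assumed values, and a real network repeats the same chain-rule logic across every weight in every layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_for(w, b, x, target):
    """Squared error of a single sigmoid neuron on one example."""
    return (sigmoid(w * x + b) - target) ** 2


def gradients(w, b, x, target):
    """Chain rule: dL/dw = dL/dy * dy/dz * dz/dw (and similarly for b)."""
    y = sigmoid(w * x + b)
    dL_dy = 2.0 * (y - target)
    dy_dz = y * (1.0 - y)                       # derivative of the sigmoid
    return dL_dy * dy_dz * x, dL_dy * dy_dz     # dz/dw = x, dz/db = 1


# Assumed starting values for illustration.
w, b, x, target, learning_rate = 0.5, 0.0, 2.0, 1.0, 0.1

before = loss_for(w, b, x, target)              # forward pass and loss
grad_w, grad_b = gradients(w, b, x, target)     # backward pass
w_new = w - learning_rate * grad_w              # update: step against the gradient
b_new = b - learning_rate * grad_b
after = loss_for(w_new, b_new, x, target)
print(before, after)
```

After the single update the loss on this example is lower than before, which is the behaviour the update rule is designed to produce when the learning rate is small.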
A worked code example
A minimal feed-forward network in PyTorch:
```python
import torch
import torch.nn as nn
import torch.optim as optim


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # input -> hidden
        self.fc2 = nn.Linear(128, 10)   # hidden -> output

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # raw scores; softmax is applied inside the loss


model = MLP()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch so the step below runs; real code would load these from a dataset.
batch_x = torch.rand(32, 784)           # 32 flattened 28 x 28 images
batch_y = torch.randint(0, 10, (32,))   # 32 digit labels

# One training step:
predictions = model(batch_x)            # forward pass
loss = loss_fn(predictions, batch_y)    # compute loss
loss.backward()                         # backpropagation
optimizer.step()                        # update weights
optimizer.zero_grad()                   # reset gradients
```
The framework handles the gradient calculations automatically.
Hyperparameters
Choices the developer makes that affect training:
- Number of layers and neurons per layer (network architecture).
- Activation functions (ReLU, sigmoid, tanh).
- Learning rate (how big each weight update is).
- Batch size (examples per gradient step).
- Number of epochs (passes through the training data).
- Regularisation (dropout, L2 weight decay) to prevent overfitting.
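One way to keep these choices together is a small configuration object. The sketch below is an assumption about how a developer might organise them, not a standard API; the class name, field names and default values are all hypothetical. It also shows how batch size and epochs together determine the number of weight updates.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    hidden_layer_sizes: tuple = (128,)
    activation: str = "relu"
    learning_rate: float = 1e-3
    batch_size: int = 32
    epochs: int = 10
    dropout: float = 0.0
    weight_decay: float = 0.0  # L2 regularisation strength

    def gradient_steps(self, n_training_examples):
        """Total weight updates: batches per epoch times number of epochs."""
        batches_per_epoch = math.ceil(n_training_examples / self.batch_size)
        return batches_per_epoch * self.epochs


config = Hyperparameters(batch_size=64, epochs=5)
steps = config.gradient_steps(n_training_examples=60_000)
print(steps)
```

With 60,000 examples and a batch size of 64 there are 938 batches per epoch (the last one is partly filled), so 5 epochs give 4,690 updates.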
Overfitting
A neural network with enough parameters can memorise the training data exactly. Such a model has zero training error but performs poorly on new data. Detection: training loss keeps falling while validation loss starts rising. Prevention: more training data, smaller network, dropout, regularisation, early stopping.
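Early stopping can be sketched as a simple check on the validation loss after each epoch. The function below is a minimal illustration with a hypothetical `patience` setting (how many non-improving epochs to tolerate) and made-up loss values; libraries that offer early stopping typically add options this sketch leaves out.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` epochs in a row."""
    best_index, best_loss, waited = 0, float("inf"), 0
    for index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss, waited = index, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_index


# Hypothetical per-epoch validation losses: falling, then rising.
losses = [0.95, 0.60, 0.42, 0.35, 0.30, 0.33, 0.41, 0.52]
best = early_stopping_epoch(losses)
print(best)
```

The weights saved at the best epoch (index 4 here, counting from 0) are the ones to keep, since later epochs only improve the fit to the training data.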
Beyond feed-forward
For images and other structured inputs, specialised architectures perform far better:
- Convolutional neural networks (CNNs) for images.
- Recurrent neural networks (RNNs) for sequences.
- Transformers for language and many other domains.
These are out of HSC scope at the architectural level, but you should recognise the names.
Exam-style practice questions
Practice questions written in the style of NESA exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
2024 HSC (6 marks): Describe the structure of a simple feed-forward neural network with one hidden layer and explain how it learns from training data.
A feed-forward neural network has three kinds of layer. The input layer has one neuron per feature. The hidden layer(s) contain artificial neurons that combine inputs from the previous layer. The output layer produces the prediction - one neuron for regression, one neuron per class for classification.
Each neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function (typically ReLU in hidden layers, softmax or sigmoid in the output). Mathematically: $y = f\left(\sum_i w_i x_i + b\right)$, where $w_i$ are learned weights, $x_i$ are the inputs from the previous layer, $b$ is the bias and $f$ is the activation function.
Training has two phases per batch of examples.
Forward pass: feed the input through every layer in turn to produce a prediction. Compare the prediction to the true label to compute a loss (cross-entropy for classification, mean squared error for regression).
Backward pass (backpropagation): compute the gradient of the loss with respect to every weight, using the chain rule of calculus. Update each weight by a small step in the direction that reduces the loss. The learning rate sets the size of the step, and the optimiser (typically stochastic gradient descent or Adam) controls how the update is applied.
Repeat for many epochs (passes through the training data) until the loss stops decreasing on a held-out validation set. Stopping at that point gives the network the best chance of generalising to new examples.
Markers reward the three-layer structure, weights/bias/activation in the neuron description, both forward and backward pass, and recognising that backpropagation uses the chain rule to compute gradients.
Practice questions
Original practice questions graded from foundation to exam level, each with a full worked solution. Try them before revealing the solution.
Foundation (3 marks): A neuron has inputs and , weights and , and bias . Calculate the neuron's pre-activation value and its output after a ReLU activation.
Step 1: pre-activation (weighted sum plus bias).
Step 2: apply ReLU.
Since $z = 2.4$ is already positive, ReLU leaves it unchanged, so the output is 2.4.
Marking criteria: 1 mark for the correct weighted sum, 1 mark for correctly adding the bias, 1 mark for correctly applying ReLU (2.4).
Foundation (3 marks): Explain why a neuron needs a learnable bias term as well as its weights.
Without a bias, a neuron's pre-activation is $z = \sum_i w_i x_i$, which is forced to equal 0 whenever every input is 0, regardless of the weights chosen. A bias adds a constant, $b$, letting the neuron shift its activation threshold up or down independently of the inputs. This gives the network far more flexibility to fit a relationship that does not naturally pass through the origin, in the same way a straight line needs the intercept term $c$ in $y = mx + c$ to fit data that does not pass through $(0, 0)$.
Marking criteria: 1 mark for stating that without bias, zero input forces zero pre-activation, 1 mark for explaining the bias shifts the activation threshold, 1 mark for a valid analogy or consequence (e.g. the network could not fit data offset from the origin).
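The argument can be checked with a short sketch. The weights and bias below are arbitrary values picked for illustration: with an all-zero input, no choice of weights moves the pre-activation away from 0, but a bias does.

```python
def pre_activation(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus an optional bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


zero_input = [0.0, 0.0]

# Three arbitrary weight choices, no bias: the result is 0 every time.
without_bias = [pre_activation(zero_input, weights)
                for weights in ([1.0, 1.0], [-3.0, 7.5], [100.0, -100.0])]

# The same zero input with a bias: the pre-activation is shifted to the bias value.
with_bias = pre_activation(zero_input, [1.0, 1.0], bias=-0.5)
print(without_bias, with_bias)
```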
Core (4 marks): The table below shows training and validation loss for a network over 20 epochs.
| Epoch | Training loss | Validation loss |
|-------|----------------|-------------------|
| 1 | 0.90 | 0.95 |
| 5 | 0.40 | 0.42 |
| 10 | 0.15 | 0.30 |
| 15 | 0.05 | 0.45 |
| 20 | 0.02 | 0.60 |
(a) Identify the approximate epoch at which the network begins to overfit, and justify your answer using the table. (b) Propose one specific fix.
(a) Epoch and justification. Overfitting begins at around epoch 10. Up to epoch 10, both training loss and validation loss fall together (0.90 to 0.15 and 0.95 to 0.30). After epoch 10, training loss keeps falling (to 0.02 by epoch 20) while validation loss rises again (from 0.30 back up to 0.60). This divergence, where training loss keeps improving but validation loss gets worse, is the defining signature of overfitting.
(b) Fix. Apply early stopping: monitor validation loss during training and stop (or restore the weights from) around epoch 10, before validation loss starts rising, instead of training all the way to epoch 20. (Dropout, L2 regularisation, or a smaller network would also be accepted.)
Marking criteria: 1 mark for identifying epoch 10 (or a value in that range) as the turning point, 2 marks for justifying it using the divergence between training and validation loss (not just quoting one row), 1 mark for a specific, workable fix.
Core (5 marks): A network has two inputs and , one hidden neuron with weights , , bias and ReLU activation, feeding a single output neuron with weight , bias and sigmoid activation. Calculate the network's output.
Step 1: hidden neuron pre-activation.
Step 2: hidden neuron activation (ReLU).
Step 3: output neuron pre-activation.
Step 4: output neuron activation (sigmoid).
The network's output is approximately 0.55.
Marking criteria: 1 mark for the correct hidden pre-activation, 1 mark for correctly applying ReLU, 1 mark for the correct output pre-activation, 2 marks for correctly evaluating the sigmoid to approximately 0.55.
Exam (6 marks): During backpropagation, a particular weight currently has value . The gradient of the loss with respect to this weight is calculated (via the chain rule) as , and the learning rate is . (a) Calculate the updated weight. (b) Explain why the gradient is subtracted, not added. (c) Explain what would likely go wrong if the learning rate were set to instead.
(a) Updated weight.
(b) Why subtract. The gradient points in the direction of steepest increase of the loss. Since training aims to minimise the loss, the weight is moved in the opposite direction to the gradient (subtracted), so that a positive gradient (increasing $w$ would increase the loss) correctly decreases $w$, and a negative gradient correctly increases $w$.
(c) Learning rate too large. With a learning rate of 5.0, the update would be , giving , a huge swing in a single step. Such large steps tend to overshoot the minimum of the loss repeatedly, causing the loss to oscillate or diverge (increase) instead of steadily decreasing, so the network fails to train.
Marking criteria: 1 mark for the correct calculation (0.775), 2 marks for correctly explaining the gradient points toward increasing loss so it is subtracted, 3 marks for explaining that too large a learning rate causes overshooting and unstable or diverging training, with a supporting calculation.
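The effect of the learning rate can be seen on a toy problem. The sketch below does not use the question's numbers: it assumes a made-up loss $L(w) = w^2$ (minimum at $w = 0$, gradient $2w$), a starting weight of 1.0 and a small learning rate of 0.1, and compares that with the learning rate of 5.0 from part (c).

```python
def gradient_descent(w, learning_rate, steps):
    """Minimise the toy loss L(w) = w**2, whose gradient is 2*w."""
    history = [w]
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient   # step against the gradient
        history.append(w)
    return history


small = gradient_descent(w=1.0, learning_rate=0.1, steps=5)
large = gradient_descent(w=1.0, learning_rate=5.0, steps=5)
print(small)  # shrinks towards the minimum at 0
print(large)  # flips sign and grows every step: 1, -9, 81, -729, ...
```

With the small learning rate each step multiplies the weight by 0.8, so it approaches the minimum steadily. With the large one each step multiplies it by -9, so it overshoots the minimum by more every time and the loss diverges.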
Exam (7 marks): Evaluate the statement "a neural network with more hidden layers always produces a better model." Refer to the forward pass, overfitting, and computational cost in your answer.
This is an EVALUATE question: markers reward a clear judgement supported by specific mechanisms, not a list of vaguely related facts.
- Thesis: The statement is false as an unconditional claim. Adding hidden layers increases a network's capacity to represent complex functions, but capacity is not the only thing that determines whether a model is actually "better" on new data.
- Forward pass and representational power: Each additional hidden layer lets the network compose simpler features into more abstract ones during the forward pass, in principle allowing it to model more complex relationships between inputs and outputs than a shallow network could. This is the argument in favour of more layers: greater depth can, up to a point, reduce the error a network can achieve on a genuinely complex problem.
- Overfitting: However, more layers also mean more weights to learn. With a fixed amount of training data, a deeper network can fit the training set (including its noise) more and more closely, which is exactly the signature of overfitting: training loss keeps falling while validation loss stops improving or starts rising. Beyond some depth, extra layers actively make the model worse on new data, not better, because the network is memorising rather than generalising.
- Computational cost: Every additional layer also increases the number of weight updates needed during backpropagation and the compute required for both training and inference. A model that is only marginally more accurate but many times slower or more expensive to run may not be "better" in any practical industrial sense, especially for real-time systems like fraud detection or predictive maintenance, where inference speed matters as much as raw accuracy.
- Judgement: More hidden layers can improve a model only up to the point supported by the available training data, regularisation, and the compute budget available for training and serving; beyond that point extra depth increases overfitting risk and cost without improving real-world performance. The statement should therefore be rejected as an overgeneralisation, with depth being one hyperparameter to be tuned against validation performance, not maximised.
Marker's note: top-band answers (1) give an explicit thesis, (2) address all three named factors (forward pass, overfitting, cost) with the correct mechanism for each, (3) explain the trade-off rather than treating "more layers" as purely good or purely bad, and (4) end with a direct judgement on the original statement.
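The cost side of the argument can be made concrete by counting weights and biases. The sketch below assumes a fully connected network; the 784-128-10 sizes come from the digit classifier described earlier in this guide, while the deeper variant with two extra 128-neuron hidden layers is a hypothetical comparison.

```python
def count_parameters(layer_sizes):
    """Weights plus biases in a fully connected network.

    layer_sizes lists the neurons per layer, input first and output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


shallow = count_parameters([784, 128, 10])
deeper = count_parameters([784, 128, 128, 128, 10])
print(shallow, deeper)
```

Every one of those extra parameters has to be stored, updated during training and used at inference, which is why depth is a trade-off to tune, not a quantity to maximise.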
