Inquiry Question 1: How do machine learning systems work?

Describe the basic structure of a neural network, including neurons, layers, weights, activation functions and training by backpropagation

A focused answer to the HSC Software Engineering Module 3 dot point on neural networks. It covers neurons, layers, weights, activation functions, the forward pass, backpropagation, a worked example, and the traps markers look for.

Generated by Claude Opus | Reviewed by Better Tuition Academy | 6 min answer | Updated 2026-05-20

What this dot point is asking

NESA wants you to describe the basic architecture of a feed-forward neural network and the mechanics of how it is trained. You do not need to derive gradients, but you should know the components by name and what each does.

The answer

A feed-forward neural network is a stack of layers. The diagram shows a small network with an input layer of four features, one hidden layer of five neurons, and an output layer of three classes. Every neuron in one layer is connected to every neuron in the next.

The artificial neuron

The basic unit. It takes inputs $x_1, x_2, \dots, x_n$ from the previous layer, multiplies each by a weight, sums them, adds a bias, and applies an activation function:

$$a = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
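
As a rough illustration, the formula can be written out in a few lines of plain Python. The function name `neuron_output` and the sample numbers are made up for this sketch, and ReLU is used as the activation purely as an example.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Compute a = f(sum(w_i * x_i) + b) for a single artificial neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only: three inputs, three weights, one bias.
a = neuron_output(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 1.0], bias=0.5)
print(a)  # 0.5*1 - 1*2 + 1*3 + 0.5 = 2.0, which ReLU leaves unchanged
```

Swapping in a different `activation` changes only the last step; the weighted sum plus bias stays the same.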

Common activation functions:

ReLU: $f(z) = \max(0, z)$. The default in hidden layers.
Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$. Squashes the output to $(0, 1)$. Used for binary classification output.
Softmax: turns a vector of scores into probabilities summing to 1. Used for multi-class classification output.
Tanh: $f(z) = \tanh(z)$. Squashes to $(-1, 1)$. Older default.
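
The four functions above could be sketched in plain Python as follows. This is a minimal illustration using only the standard `math` module; the sample scores passed to `softmax` are arbitrary.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Subtracting the largest score first avoids overflow in exp()
    # and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
print(math.tanh(0.0))          # 0.0
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))       # three probabilities that sum to 1 (up to rounding)
```

Note that ReLU, sigmoid and tanh act on one number at a time, while softmax needs the whole vector of scores at once.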

Layers

A neural network is a stack of layers:

Input layer: one neuron per feature. For a 28 x 28 pixel greyscale image, that is 28 x 28 = 784 input neurons.
Hidden layers: one or more layers between input and output. Each neuron is connected to every neuron in the previous layer (in a fully connected network).
Output layer: one neuron for regression, $n$ neurons for $n$-class classification.

A "deep" network has many hidden layers. Successive layers tend to learn increasingly abstract features.
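
Because every neuron connects to every neuron in the previous layer, the number of learnable values follows directly from the layer sizes. The sketch below counts them for the 4-5-3 network described earlier; the helper name `count_parameters` is made up for this example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the neurons per layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron in the receiving layer
    return total


print(count_parameters([4, 5, 3]))  # (4*5 + 5) + (5*3 + 3) = 43
```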

Forward pass

To make a prediction, feed the input through every layer in turn. Each layer computes its weighted sums and activations. The output layer produces the prediction.

For a 784-input, 128-hidden, 10-output digit classifier:

Input: 784 pixel values, normalised to $[0, 1]$.
Hidden layer: 128 neurons, each computing a weighted sum of the 784 inputs and applying ReLU.
Output layer: 10 neurons (one per digit 0-9), each computing a weighted sum of the 128 hidden activations. Softmax is then applied across the 10 sums to turn them into probabilities.

The output is a probability distribution over the 10 digits. The predicted digit is the one with the highest probability.
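
The same sequence of steps can be sketched in plain Python. To keep it small, this example assumes a 4-5-3 network rather than the 784-128-10 classifier, and the weights are untrained random placeholders, so the prediction it prints is meaningless; only the flow of data matters. All function names are made up for the sketch.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias for each neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights (small random values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = random_layer(4, 5, rng)   # input (4) -> hidden (5)
w2, b2 = random_layer(5, 3, rng)   # hidden (5) -> output (3)

x = [0.2, 0.7, 0.1, 0.9]                 # one example with four features
hidden = relu(dense(x, w1, b1))          # hidden activations
probs = softmax(dense(hidden, w2, b2))   # class probabilities
predicted = probs.index(max(probs))      # class with the highest probability
print(probs, predicted)
```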

Loss

Measures how wrong the prediction is.

Cross-entropy loss for classification: low when the predicted probability of the correct class is high.
Mean squared error for regression: low when the predicted value is close to the true value.
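
Both losses are short enough to write by hand. The sketch below is a plain-Python illustration for a single example (cross-entropy) and a small batch (mean squared error); the probabilities and targets are invented.

```python
import math


def cross_entropy(probs, true_class):
    """Cross-entropy for one example: -log of the probability given to the correct class."""
    return -math.log(probs[true_class])


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


confident = cross_entropy([0.1, 0.8, 0.1], true_class=1)   # about 0.22
unsure = cross_entropy([0.4, 0.3, 0.3], true_class=1)      # about 1.20
print(confident < unsure)  # True

print(mean_squared_error([2.5, 0.0], [3.0, 1.0]))  # (0.25 + 1.0) / 2 = 0.625
```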

Backpropagation

The training algorithm. For each batch of training examples:

Forward pass: compute predictions and the loss.
Backward pass: compute the gradient of the loss with respect to every weight in the network, using the chain rule of calculus.
Update: adjust every weight by a small step opposite to its gradient. The step size is the learning rate.

After many passes through the training data (epochs), the weights settle into values that produce good predictions.
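
To make the three steps concrete, here is a hand-written sketch for the simplest possible case: one neuron with one weight, one bias and no activation function, trained with squared error. A real network repeats the same chain-rule step backwards through every layer. The data, learning rate and epoch count are all made up for the illustration.

```python
# Toy data generated from y = 2x + 1 (made up for this sketch).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0          # start from arbitrary values
learning_rate = 0.05


def average_loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_loss = average_loss(w, b)

for epoch in range(500):
    for x, y in data:
        a = w * x + b                # forward pass
        dloss_da = 2 * (a - y)       # derivative of (a - y)^2 with respect to a
        grad_w = dloss_da * x        # chain rule: da/dw = x
        grad_b = dloss_da * 1.0      # chain rule: da/db = 1
        w -= learning_rate * grad_w  # step opposite to the gradient
        b -= learning_rate * grad_b

final_loss = average_loss(w, b)
print(initial_loss, final_loss)  # the loss falls sharply
print(w, b)                      # w and b approach 2 and 1
```

Here each example is treated as its own batch of size one, which keeps the arithmetic visible.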

The optimiser controls how the updates are applied. Stochastic gradient descent (SGD) updates after each mini-batch. Adam is a popular adaptive variant.

A worked code example

A minimal feed-forward network in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # softmax applied via the loss

model = MLP()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

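# Placeholder batch so the snippet runs on its own; real data would come from a dataset.
batch_x = torch.rand(32, 784)             # 32 flattened 28 x 28 images, values in [0, 1)
batch_y = torch.randint(0, 10, (32,))     # 32 integer class labels (digits 0-9)

# One training step:
predictions = model(batch_x)              # forward pass (raw scores, one per class)
loss = loss_fn(predictions, batch_y)      # compute loss
loss.backward()                           # backpropagation
optimizer.step()                          # update weights
optimizer.zero_grad()                     # reset gradients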

The framework handles the gradient calculations automatically.

Hyperparameters

Choices the developer makes that affect training:

Number of layers and neurons per layer (network architecture).
Activation functions (ReLU, sigmoid, tanh).
Learning rate (how big each weight update is).
Batch size (examples per gradient step).
Number of epochs (passes through the training data).
Regularisation (dropout, L2 weight decay) to prevent overfitting.
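
One way to keep these choices together is a small configuration object. The sketch below is only an illustration: the class name, field names and default values are hypothetical, and the 60,000-example dataset size is an assumption. It also shows how batch size and epochs together fix the number of weight updates.

```python
import math
from dataclasses import dataclass


@dataclass
class Hyperparameters:
    hidden_layers: tuple = (128,)   # neurons in each hidden layer
    activation: str = "relu"
    learning_rate: float = 1e-3
    batch_size: int = 32
    epochs: int = 10
    dropout: float = 0.0            # 0.0 means no dropout


def total_update_steps(n_examples, hp):
    """Number of weight updates: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(n_examples / hp.batch_size)
    return steps_per_epoch * hp.epochs


hp = Hyperparameters()
print(total_update_steps(60_000, hp))  # ceil(60000 / 32) * 10 = 1875 * 10 = 18750
```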

Overfitting

A neural network with enough parameters can memorise its training data. Such a model reaches near-zero training error but performs poorly on new data. Detection: training loss keeps falling while validation loss starts rising. Prevention: more training data, a smaller network, dropout, L2 weight decay, early stopping.
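
The detection rule and early stopping can be sketched together. In the example below the loss curves are invented, and the `patience` parameter (how many non-improving epochs to tolerate) is one common way to implement the idea rather than the only one.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss,
    stopping once the loss has failed to improve for `patience` epochs in a row."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch


# Made-up loss curves: training keeps falling, validation turns upward.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.95, 0.70, 0.55, 0.58, 0.66, 0.80]
print(early_stopping_epoch(val_losses))  # 2 (the lowest validation loss, 0.55)
```

In practice the weights saved at the returned epoch would be the ones kept as the final model.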

Beyond feed-forward

For images and other structured inputs, specialised architectures perform far better:

Convolutional neural networks (CNNs) for images.
Recurrent neural networks (RNNs) for sequences.
Transformers for language and many other domains.

These are out of HSC scope at the architectural level, but you should recognise the names.

Worked example

A neural network is trained to classify HSC essays by predicted band (1-6). It has 50 input features (word count, sentence length, vocabulary richness, etc.), one hidden layer of 32 neurons, and 6 output neurons.

After training, training accuracy is 95 percent but validation accuracy is 60 percent. Diagnose and propose two fixes.

Diagnosis: the network has overfit the training data. It memorised idiosyncrasies of the training essays rather than learning patterns that generalise.

Fixes (any two):

More training data: collect more graded essays.
Smaller network: reduce hidden neurons (32 -> 8).
Regularisation: add dropout (randomly disable neurons during training) or L2 weight decay.
Early stopping: monitor validation loss and stop training once it starts rising.
Data augmentation: simulate more examples by paraphrasing or perturbing existing ones.
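
Of these fixes, dropout is the easiest to show in code. The sketch below assumes the "inverted" form of dropout, in which surviving activations are scaled up during training so that nothing needs to change at inference time. The activation values and the drop probability are made up.

```python
import random


def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p during training
    and scale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return list(activations)   # at inference time nothing is dropped
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
hidden = [0.5, 1.2, 0.0, 2.0, 0.7, 1.1, 0.3, 0.9]
dropped = dropout(hidden, p=0.5, rng=rng)
print(dropped)                                           # each value is either 0.0 or doubled
print(dropout(hidden, p=0.5, rng=rng, training=False))   # unchanged
```

Because a different random subset of neurons is disabled on every training step, the network cannot rely on any single neuron to memorise a training essay.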

Common traps

Calling backpropagation "the network": Backpropagation is the training algorithm. The trained network (weights and architecture) is the model.
Forgetting the bias term: Each neuron has a learnable bias, not just weights. Without it, a neuron's weighted sum is always zero when all of its inputs are zero, so the function it fits is forced through the origin.
Treating layers as identical: Different layers can have different activations. Convolutional layers, pooling layers, normalisation layers are different from fully connected layers.
Saying neural networks "think like a brain": The biological metaphor is loose. Artificial neurons are simple weighted sums; biological neurons are far more complex. Use the analogy carefully.
Confusing training and inference: Training adjusts weights; inference applies the trained weights to new data. Inference is much cheaper than training.

Past exam questions, worked

Real questions from past NESA papers on this dot point, with our answer explainer.

2024 HSC | 6 marks | Describe the structure of a simple feed-forward neural network with one hidden layer and explain how it learns from training data.

A feed-forward neural network has three kinds of layer. The input layer has one neuron per feature. The hidden layer(s) contain artificial neurons that combine inputs from the previous layer. The output layer produces the prediction - one neuron for regression, one neuron per class for classification.

Each neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function (typically ReLU in hidden layers, softmax or sigmoid in the output). Mathematically: $a = f(\sum_i w_i x_i + b)$ where $w_i$ are learned weights, $x_i$ are the inputs from the previous layer, $b$ is the bias and $f$ is the activation function.

Training has two phases per batch of examples.

Forward pass: feed the input through every layer in turn to produce a prediction. Compare the prediction to the true label to compute a loss (cross-entropy for classification, mean squared error for regression).

Backward pass (backpropagation): compute the gradient of the loss with respect to every weight and bias, using the chain rule of calculus. Update each one by a small step in the direction that reduces the loss. The learning rate sets the size of the step, and the optimiser (typically stochastic gradient descent or Adam) controls how the update is applied.

Repeat for many epochs (passes through the training data) until the loss stops decreasing on a held-out validation set. Stopping at that point gives the network the best chance of generalising to new examples.

Markers reward the three-layer structure, weights/bias/activation in the neuron description, both forward and backward pass, and recognising that backpropagation uses the chain rule to compute gradients.