
NSW Software Engineering syllabus dot point

Inquiry Question 1: How do machine learning systems work?

Describe the basic structure of a neural network, including neurons, layers, weights, activation functions and training by backpropagation

A focused answer to the HSC Software Engineering Module 3 dot point on neural networks. Neurons, layers, weights, activation functions, forward pass, backpropagation, the worked example, and the traps markers look for.

Generated by Claude Opus. Reviewed by Better Tuition Academy. 6 min answer.


What this dot point is asking

NESA wants you to describe the basic architecture of a feed-forward neural network and the mechanics of how it is trained. You do not need to derive gradients, but you should know the components by name and what each does.

The answer

A feed-forward neural network is a stack of layers. The diagram shows a small network with an input layer of four features, one hidden layer of five neurons, and an output layer of three classes. Every neuron in one layer is connected to every neuron in the next.

[Figure: Feed-forward neural network with one hidden layer. Input layer (four features, x1 to x4), hidden layer (five neurons, ReLU activation), output layer (three neurons, y1 to y3, softmax). Every neuron in one layer is connected to every neuron in the next; each connection carries a weight w and each neuron has a bias b. The forward pass runs from input to output.]

The artificial neuron

The basic unit. It takes inputs $x_1, x_2, \dots, x_n$ from the previous layer, multiplies each by a weight, sums them, adds a bias, and applies an activation function:

$$a = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

Common activation functions:

  • ReLU: $f(z) = \max(0, z)$. The default in hidden layers.
  • Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$. Squashes the output to $(0, 1)$. Used for binary classification output.
  • Softmax: turns a vector of scores into probabilities summing to 1. Used for multi-class classification output.
  • Tanh: $f(z) = \tanh(z)$. Squashes to $(-1, 1)$. Older default.
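The neuron formula and the activation functions listed above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names and the input, weight and bias values are made up for the example.

```python
import math

def relu(z):
    """ReLU: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid: squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Softmax: turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation):
    """a = f(sum of w_i * x_i, plus b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only (assumed for this sketch).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1
# Weighted sum: 0.5 - 2.0 + 0.75 + 0.1 = -0.65

a_relu = neuron(x, w, b, relu)          # 0.0, because the sum is negative
a_sigmoid = neuron(x, w, b, sigmoid)    # about 0.343
a_tanh = neuron(x, w, b, math.tanh)     # a value between -1 and 0
probs = softmax([2.0, 1.0, 0.1])        # three probabilities summing to 1
```

The same weighted sum gives a different output under each activation, which is why the choice of activation function is part of the network's design.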

Layers

A neural network is a stack of layers:

  • Input layer: one neuron per feature. For an image, that might be 28 x 28 = 784 input neurons.
  • Hidden layers: one or more layers between input and output. Each neuron is connected to every neuron in the previous layer (in a fully connected network).
  • Output layer: one neuron for regression, $n$ neurons for $n$-class classification.

A "deep" network has many hidden layers. Each layer learns increasingly abstract features.

Forward pass

To make a prediction, feed the input through every layer in turn. Each layer computes its weighted sums and activations. The output layer produces the prediction.
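As a rough sketch of a forward pass, the code below pushes one input through a network with four inputs, five hidden neurons and three outputs. The weights are random placeholders rather than trained values, and the helper names (`dense`, `forward`, `random_layer`) are invented for this example.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def softmax(scores):
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias for each neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w1, b1, w2, b2):
    """Input -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = [relu(z) for z in dense(x, w1, b1)]
    scores = dense(hidden, w2, b2)
    return softmax(scores)

def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; a real network would learn these."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)               # fixed seed so the run is repeatable
w1, b1 = random_layer(4, 5, rng)     # 4 inputs -> 5 hidden neurons
w2, b2 = random_layer(5, 3, rng)     # 5 hidden neurons -> 3 output classes

probs = forward([0.2, 0.7, 0.1, 0.9], w1, b1, w2, b2)
predicted_class = probs.index(max(probs))  # index of the most probable class
```

Because the weights are untrained, the predicted class here is meaningless. The point is only the order of operations: weighted sums, activation, then the next layer.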

For a 784-input, 128-hidden, 10-output digit classifier:

  • Input: 784 pixel values, normalised to $[0, 1]$.
  • Hidden layer: 128 neurons, each computing a weighted sum of the 784 inputs and applying ReLU.
  • Output layer: 10 neurons (one per digit 0-9), each computing a weighted sum of the 128 hidden activations. Softmax is then applied across the 10 resulting scores.

The output is a probability distribution over the 10 digits. The predicted digit is the one with the highest probability.
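The layer sizes also determine how many values the network has to learn. In a fully connected layer, every input-output pair has one weight and every output neuron has one bias. A small sketch of that arithmetic follows; `count_parameters` is a hypothetical helper written for this example.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron in the receiving layer
    return total

# 784 * 128 + 128 = 100480 for the hidden layer,
# 128 * 10 + 10 = 1290 for the output layer.
digit_classifier = count_parameters([784, 128, 10])  # 101770

# A tiny network with 4 inputs, 5 hidden neurons and 3 outputs:
# (4 * 5 + 5) + (5 * 3 + 3) = 25 + 18
tiny_network = count_parameters([4, 5, 3])           # 43
```

Even this small digit classifier has just over 100,000 adjustable parameters, which is why training is automated rather than done by hand.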

Loss

Measures how wrong the prediction is.

  • Cross-entropy loss for classification: low when the predicted probability of the correct class is high.
  • Mean squared error for regression: low when the predicted value is close to the true value.
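Both losses can be written in a few lines. The sketch below uses made-up predictions to show that cross-entropy is small for a confident correct prediction and large for a confident wrong one. The function names are illustrative, and cross-entropy is shown for a single example whose true class is given as an index.

```python
import math

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Classification: the true class is index 1 in both cases.
confident_right = cross_entropy([0.05, 0.90, 0.05], true_class=1)  # about 0.105
confident_wrong = cross_entropy([0.90, 0.05, 0.05], true_class=1)  # about 3.0

# Regression: errors of -0.5 and -1.0 give (0.25 + 1.0) / 2.
mse = mean_squared_error([2.5, 0.0], [3.0, 1.0])                   # 0.625
```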

Backpropagation

The core of the training algorithm. It computes the gradients that tell the optimiser how to change each weight. For each batch of training examples:

  1. Forward pass: compute predictions and the loss.
  2. Backward pass: compute the gradient of the loss with respect to every weight in the network, using the chain rule of calculus.
  3. Update: adjust every weight by a small step opposite to its gradient. The step size is the learning rate.

After many passes through the training data (epochs), the weights settle into values that produce good predictions.

The optimiser controls how the updates are applied. Stochastic gradient descent (SGD) updates after each mini-batch. Adam is a popular adaptive variant.
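To make the three steps concrete, here is a hand-written version for the smallest possible case: one sigmoid neuron with one weight and one bias, trained on a single example with squared-error loss. The loss choice, the starting values and the learning rate are all assumptions picked for illustration. Real networks apply the same chain rule layer by layer, and frameworks automate it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_gradients(w, b, x, y):
    # 1. Forward pass: prediction and loss.
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2

    # 2. Backward pass: chain rule, dL/dw = dL/da * da/dz * dz/dw.
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)       # derivative of the sigmoid
    dL_dw = dL_da * da_dz * x   # dz/dw = x
    dL_db = dL_da * da_dz       # dz/db = 1
    return loss, dL_dw, dL_db

# Illustrative starting point and training example (assumed values).
w, b = 0.5, 0.0
x, y = 2.0, 1.0
learning_rate = 0.5

losses = []
for step in range(100):
    loss, dL_dw, dL_db = loss_and_gradients(w, b, x, y)
    losses.append(loss)
    # 3. Update: step each parameter opposite to its gradient.
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db
```

After the loop, `losses[-1]` is smaller than `losses[0]`: each update nudges the weight and bias in the direction that reduces the loss.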

A worked code example

A minimal feed-forward network in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # softmax applied via the loss

model = MLP()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

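# Placeholder batch so the step runs on its own (random data, not real digits):
batch_x = torch.rand(32, 784)             # 32 flattened 28 x 28 images in [0, 1]
batch_y = torch.randint(0, 10, (32,))     # 32 integer class labels, 0-9

# One training step:
predictions = model(batch_x)              # forward pass
loss = loss_fn(predictions, batch_y)      # compute loss
loss.backward()                           # backpropagation
optimizer.step()                          # update weights
optimizer.zero_grad()                     # reset gradients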

The framework handles the gradient calculations automatically.

Hyperparameters

Choices the developer makes that affect training:

  • Number of layers and neurons per layer (network architecture).
  • Activation functions (ReLU, sigmoid, tanh).
  • Learning rate (how big each weight update is).
  • Batch size (examples per gradient step).
  • Number of epochs (passes through the training data).
  • Regularisation (dropout, L2 weight decay) to prevent overfitting.
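One way to see how these choices interact is to record them in one place and work out how many weight updates they imply. The values below are hypothetical, including the training-set size of 60,000 examples, and `total_update_steps` is a helper invented for this sketch.

```python
import math

# Hypothetical settings for a small digit classifier.
hyperparameters = {
    "layer_sizes": [784, 128, 10],  # architecture
    "activation": "relu",
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 5,
    "dropout": 0.2,                 # regularisation
}

def total_update_steps(n_examples, batch_size, epochs):
    """Number of gradient steps: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

# 60000 / 32 = 1875 batches per epoch, times 5 epochs.
steps = total_update_steps(
    60_000, hyperparameters["batch_size"], hyperparameters["epochs"]
)  # 9375
```

Doubling the batch size halves the number of updates per epoch, which is one reason batch size and learning rate are usually tuned together.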

Overfitting

A neural network with enough parameters can memorise the training data almost exactly. Such a model has near-zero training error but performs poorly on new data. Detection: training loss keeps falling while validation loss starts rising. Prevention: more training data, smaller network, dropout, regularisation, early stopping.
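The detection rule and early stopping can be sketched together. The loss values below are invented to show the typical pattern, and the `patience` parameter (how many epochs without improvement to tolerate) is an assumption of this example rather than a fixed standard.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan
    once `patience` epochs have passed without any improvement."""
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch

# Made-up curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.90, 0.60, 0.40, 0.30, 0.20, 0.12, 0.07]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.53, 0.58, 0.66]

stop_at = early_stopping_epoch(val_losses)  # 3, where validation loss is lowest
```

From epoch 4 onward the training loss still improves while the validation loss gets worse, which is the signature of overfitting. Keeping the weights from epoch 3 avoids it.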

Beyond feed-forward

For images and other structured inputs, specialised architectures perform far better:

  • Convolutional neural networks (CNNs) for images.
  • Recurrent neural networks (RNNs) for sequences.
  • Transformers for language and many other domains.

These are out of HSC scope at the architectural level, but you should recognise the names.

Past exam questions, worked

Real questions from past NESA papers on this dot point, with our answer explainer.

2024 HSC (6 marks): Describe the structure of a simple feed-forward neural network with one hidden layer and explain how it learns from training data.

A feed-forward neural network has three kinds of layer. The input layer has one neuron per feature. The hidden layer(s) contain artificial neurons that combine inputs from the previous layer. The output layer produces the prediction - one neuron for regression, one neuron per class for classification.

Each neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function (typically ReLU in hidden layers, softmax or sigmoid in the output). Mathematically: $a = f\left(\sum_i w_i x_i + b\right)$, where $w_i$ are learned weights, $x_i$ are the inputs from the previous layer, $b$ is the bias and $f$ is the activation function.

Training has two phases per batch of examples.

Forward pass: feed the input through every layer in turn to produce a prediction. Compare the prediction to the true label to compute a loss (cross-entropy for classification, mean squared error for regression).

Backward pass (backpropagation): compute the gradient of the loss with respect to every weight, using the chain rule of calculus. Update each weight by a small step in the direction that reduces the loss. The learning rate sets the size of the step, and the optimiser (typically stochastic gradient descent or Adam) controls how the update is applied.

Repeat for many epochs (passes through the training data) until the loss stops decreasing on a held-out validation set. At that point the network should generalise to new examples rather than only fitting the training set.

Markers reward the three-layer structure, weights/bias/activation in the neuron description, both forward and backward pass, and recognising that backpropagation uses the chain rule to compute gradients.
