
How do we display a single set of data and summarise its centre, spread and shape?

Display and summarise univariate data using frequency and cumulative-frequency tables, histograms, ogives, the mean and standard deviation, the five-number summary, box plots and outliers

A focused answer to the HSC Maths Advanced dot point on displaying and summarising one-variable data. Frequency and cumulative-frequency tables, histograms and ogives, the mean and standard deviation from data and a frequency table, the median, quartiles and IQR, the five-number summary, box and parallel box plots, the 1.5 IQR outlier rule and shape, with worked examples and traps.

Generated by Claude Opus 4.818 min answer

Reviewed by: AI editorial process; not yet individually human-reviewed

What this dot point is asking

NESA wants you to take a single set of numbers (univariate data), display it well, and summarise it numerically. Displaying means frequency and cumulative-frequency tables, histograms and ogives (cumulative-frequency polygons). Summarising means a measure of centre (the mean or the median), a measure of spread (the standard deviation or the interquartile range), and a sense of shape (symmetric or skewed). The five-number summary feeds the box plot, and the $1.5 \times \text{IQR}$ rule flags outliers. The single idea that ties it together is that one variable can be described by where its values sit (centre), how widely they vary (spread), and the overall pattern (shape), and that every display and every statistic is a tool for reading one of those three things.

The answer

Frequency and cumulative-frequency tables

A frequency table lists each value (or score) $x$ alongside its frequency $f$, the number of times it occurs. The cumulative frequency at a value is the running total of frequencies up to and including that value, so it answers "how many readings are this value or less". The final cumulative frequency must equal $N = \sum f$, the total number of readings, which is a quick check that the table is right.

Data come in types, and the type decides the display. A variable is categorical if its values are labels (eye colour, suburb), and numeric if its values are numbers. A numeric variable is discrete if its values can be listed ($0, 1, 2, \dots$ pets), and continuous if it is measured on a scale and can take any value in a range (height, time, mass). Continuous data are almost always grouped before being displayed, because nearly every measured value is unique.

Here are the numbers of pets owned by 40 Year 12 students, a discrete variable, as a frequency and cumulative-frequency table.

| Pets $x$ | Frequency $f$ | Cumulative frequency |
| --- | --- | --- |
| 0 | 6 | 6 |
| 1 | 11 | 17 |
| 2 | 9 | 26 |
| 3 | 8 | 34 |
| 4 | 4 | 38 |
| 5 | 2 | 40 |

The cumulative-frequency column rises to $40 = N$, as it must. Reading it off: 17 students own at most 1 pet, and $40 - 26 = 14$ own at least 3.
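The running total and the two readings above can be checked mechanically. The following is a minimal sketch in Python using only the standard library; the variable names are illustrative and not part of the syllabus.

```python
from itertools import accumulate

# Pet-ownership table from above: each value x and its frequency f.
pets = [0, 1, 2, 3, 4, 5]
freq = [6, 11, 9, 8, 4, 2]

# Cumulative frequency is the running total of the frequency column.
cum_freq = list(accumulate(freq))
n_total = sum(freq)

# Check: the last cumulative entry must equal N.
assert cum_freq[-1] == n_total

# "At most 1 pet" is read straight off the cumulative column;
# "at least 3 pets" is N minus the cumulative frequency at 2.
at_most_1 = cum_freq[pets.index(1)]
at_least_3 = n_total - cum_freq[pets.index(2)]

print(cum_freq)               # [6, 17, 26, 34, 38, 40]
print(at_most_1, at_least_3)  # 17 14
```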

Grouped data, class centres and the modal class

When a variable is continuous, or when a discrete variable takes too many distinct values, group the data into class intervals of equal width. Each class is represented by its class centre, the midpoint of the interval, which stands in for every reading in that class. Grouping trades detail for clarity: you gain a readable overview but you lose the exact values, so any statistic computed from grouped data is an estimate. Never discard the original data.

The class that contains the most readings is the modal class. A boundary value (a reading sitting exactly on a class edge) must be assigned by a stated convention, usually into the upper class, and you note the convention if it matters.

Here are the commute times of 50 Sydney workers, a continuous variable grouped into 10-minute classes. The notation 10 to 20 means $10 \le t < 20$.

| Commute (min) | Class centre $x$ | Frequency $f$ | Cumulative frequency |
| --- | --- | --- | --- |
| 10 to 20 | 15 | 4 | 4 |
| 20 to 30 | 25 | 10 | 14 |
| 30 to 40 | 35 | 16 | 30 |
| 40 to 50 | 45 | 11 | 41 |
| 50 to 60 | 55 | 6 | 47 |
| 60 to 70 | 65 | 3 | 50 |

The modal class is 30 to 40 minutes (frequency 16), and the cumulative column reaches $50 = N$.
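As a rough illustration of class centres, the modal class and the boundary convention, here is a short Python sketch for the commute-time table. The helper `class_of` is a hypothetical name introduced for this example; it applies the "boundary value goes into the upper class" convention described above.

```python
# Commute-time classes as (lower, upper) pairs, meaning lower <= t < upper.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freq = [4, 10, 16, 11, 6, 3]

# The class centre is the midpoint of each interval.
centres = [(lo + hi) / 2 for lo, hi in classes]

# The modal class is the class with the largest frequency.
modal_class = classes[freq.index(max(freq))]

def class_of(t):
    """Return the class containing t; a boundary value goes into the upper class."""
    for lo, hi in classes:
        if lo <= t < hi:
            return (lo, hi)
    raise ValueError(f"{t} is outside every class")

print(centres)       # [15.0, 25.0, 35.0, 45.0, 55.0, 65.0]
print(modal_class)   # (30, 40)
print(class_of(30))  # (30, 40)
```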

Histograms and ogives (cumulative-frequency polygons)

A histogram displays the frequency table as a bar chart in which the bars touch (no gaps), because the horizontal axis is a continuous numeric scale. For grouped data each bar is centred on its class interval; for ungrouped discrete data each bar is centred on the value.

An ogive (cumulative-frequency polygon) is the graph of cumulative frequency against the upper boundary of each class, joined by straight segments and started on the axis at the lower boundary of the first class. It rises from $0$ to $N$ in a characteristic S shape. Its real power is reading positions off it: go up to a cumulative frequency, across to the curve, and down to the value. The median sits at a cumulative frequency of $\frac{N}{2}$, the lower quartile at $\frac{N}{4}$ and the upper quartile at $\frac{3N}{4}$.

The diagram overlays the histogram and the ogive for the commute-time data, with frequency on the left axis and cumulative frequency on the right.

[Figure: Histogram and ogive of 50 commute times. A frequency histogram of commute times grouped into ten-minute classes, with the cumulative-frequency polygon (ogive) rising in an S shape to the total of fifty on the right-hand cumulative axis. Horizontal axis: commute time (minutes); left axis: frequency; right axis: cumulative frequency.]

Reading the ogive at cumulative frequency 25 (which is $\frac{N}{2}$) gives a median of about 37 minutes; at 12.5 ($\frac{N}{4}$) the lower quartile is about 28.5 minutes, and at 37.5 ($\frac{3N}{4}$) the upper quartile is about 47 minutes.
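Reading an ogive made of straight segments is linear interpolation between neighbouring points. The sketch below, in Python, reproduces the graph readings for the commute data; `read_ogive` is a hypothetical helper written for this example, and its results should land close to values read by eye from the diagram.

```python
# Ogive points for the commute data: class boundary and cumulative frequency.
# The polygon starts on the axis at the lower boundary of the first class.
boundaries = [10, 20, 30, 40, 50, 60, 70]
cum_freq = [0, 4, 14, 30, 41, 47, 50]

def read_ogive(target):
    """Value at which the straight-segment ogive reaches cumulative frequency `target`."""
    points = list(zip(boundaries, cum_freq))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            # Go the same fraction of the way along the class as up the segment.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target must lie between 0 and N")

n_total = cum_freq[-1]
q1 = read_ogive(n_total / 4)
median = read_ogive(n_total / 2)
q3 = read_ogive(3 * n_total / 4)

print(round(q1, 1), round(median, 1), round(q3, 1))  # 28.5 36.9 46.8
```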

Mean and standard deviation from data

The mean is the balance point of the data, $\bar{x} = \dfrac{\sum x}{n}$ for a raw list. The standard deviation measures the typical distance of a reading from the mean. In this course you use the population standard deviation, which divides by $n$:

$$\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}.$$

On a calculator this is the $\sigma_n$ (or $\sigma_x$) key, not the $s_{n-1}$ (sample) key, and choosing the wrong one is the single most common error here. You are expected to put data into the calculator's statistics mode and read $\bar{x}$ and $\sigma_n$ off it rather than computing the sum of squares by hand, but knowing the formula tells you what the machine is doing.
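To see the difference between the two keys, here is a small Python sketch. The eight readings are made up for illustration and do not come from the text; `statistics.pstdev` divides by $n$ (the $\sigma_n$ key) and `statistics.stdev` divides by $n - 1$ (the $s_{n-1}$ key).

```python
import math
import statistics

# Illustrative data only (not from the text): eight readings.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = sum(data) / n
# Population standard deviation straight from the formula: divide by n.
sigma_n = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# The standard library agrees with the formula for the population version.
assert math.isclose(sigma_n, statistics.pstdev(data))

# The sample version divides by n - 1, so it comes out slightly larger.
s_n_minus_1 = statistics.stdev(data)

print(mean, sigma_n)          # 5.0 2.0
print(round(s_n_minus_1, 3))  # 2.138
```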

Mean and standard deviation from a frequency table

When data are tabulated, every value $x$ occurs $f$ times, so weight by frequency. With $N = \sum f$,

$$\bar{x} = \frac{\sum fx}{N}, \qquad \sigma = \sqrt{\frac{\sum fx^2}{N} - \bar{x}^2}.$$

The second form of the variance, $\dfrac{\sum fx^2}{N} - \bar{x}^2$, is the practical one: it needs only the two column totals $\sum fx$ and $\sum fx^2$, with no need to compute each squared deviation. For grouped data, use the class centre as $x$; the answer is then an estimate.
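For readers who want to see why the shortcut form agrees with the definition, here is a short derivation. It starts from the frequency-weighted version of the squared-deviation formula and uses only $\sum f = N$ and $\sum fx = N\bar{x}$.

```latex
\begin{aligned}
\sigma^2 &= \frac{\sum f(x - \bar{x})^2}{N}
          = \frac{\sum fx^2 - 2\bar{x}\sum fx + \bar{x}^2 \sum f}{N} \\
         &= \frac{\sum fx^2}{N} - 2\bar{x}\cdot\bar{x} + \bar{x}^2
          = \frac{\sum fx^2}{N} - \bar{x}^2 .
\end{aligned}
```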

For the pet-ownership table above, $N = 40$, and adding the weighted columns gives $\sum fx = 79$ and $\sum fx^2 = 233$. So

$$\bar{x} = \frac{79}{40} = 1.975, \qquad \sigma = \sqrt{\frac{233}{40} - 1.975^2} = \sqrt{5.825 - 3.900625} \approx 1.39 \text{ pets}.$$
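The column totals and the two statistics for the pet-ownership table can be reproduced in a few lines of Python. This is a sketch of the by-hand method, not a replacement for the calculator's statistics mode; the variable names are illustrative.

```python
import math

# Pet-ownership table: each value x and its frequency f.
pets = [0, 1, 2, 3, 4, 5]
freq = [6, 11, 9, 8, 4, 2]

# Column totals: N, sum of f*x, and sum of f*x^2.
n_total = sum(freq)
sum_fx = sum(f * x for x, f in zip(pets, freq))
sum_fx2 = sum(f * x * x for x, f in zip(pets, freq))

# Frequency-weighted mean and the shortcut form of the population sd.
mean = sum_fx / n_total
sigma = math.sqrt(sum_fx2 / n_total - mean ** 2)

print(n_total, sum_fx, sum_fx2)  # 40 79 233
print(mean, round(sigma, 2))     # 1.975 1.39
```

For grouped data the same code applies with the class centres in place of the values, and the results are then estimates.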

Median, quartiles and the interquartile range

The median $Q_2$ is the middle value once the data are ordered: the middle reading if $n$ is odd, or the average of the two middle readings if $n$ is even. The quartiles split the ordered list into quarters. To find them, split the ordered list at the median into a lower half and an upper half. If $n$ is odd, exclude the middle value from both halves. The lower quartile $Q_1$ is the median of the lower half and the upper quartile $Q_3$ is the median of the upper half.

The interquartile range is $\text{IQR} = Q_3 - Q_1$, the range of the middle $50\%$ of the data. Unlike the full range (maximum minus minimum), the IQR ignores the extreme values, so it is not distorted by one unusual reading; that is exactly why it pairs with the median for skewed data.
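The median-of-halves method can be sketched in Python as follows. The function name `quartiles` and the nine readings are assumptions made for this example. Note that library quartile routines often use other conventions and can give slightly different values, so this sketch codes the method described above directly.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) by the median-of-halves method.

    When n is odd the middle value is left out of both halves.
    Needs at least two readings.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # readings below the median position
    upper = ordered[-half:]   # readings above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Illustrative data only (not from the text): n = 9, so the median 12 is excluded.
readings = [3, 5, 7, 8, 12, 13, 14, 18, 21]
q1, q2, q3 = quartiles(readings)
iqr = q3 - q1

print(q1, q2, q3)  # 6.0 12 16.0
print(iqr)         # 10.0
```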

The five-number summary and box plots

The five-number summary is the minimum, $Q_1$, the median $Q_2$, $Q_3$, and the maximum. A box-and-whisker plot (box plot) draws it: a box from $Q_1$ to $Q_3$ with a line at the median, and whiskers reaching out to the extreme values. The box length is the IQR, and the whole plot shows the range.

The diagram below builds the box plot for 20 daily maximum temperatures recorded at Sydney Observatory Hill (in degrees Celsius): 20, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 34. The plot is drawn from the values 20, 22, 23, 24.5, 26 (the minimum, $Q_1$, the median, $Q_3$, and the largest value that is not an outlier), with the true maximum, 34, plotted separately as an outlier.

[Figure: Box plot of twenty daily maximum temperatures. A box-and-whisker plot. The box spans the lower quartile twenty-two to the upper quartile twenty-four point five with the median at twenty-three. The left whisker reaches the minimum twenty and the right whisker stops at twenty-six, the largest value that is not an outlier; the value thirty-four is plotted as a separate dot beyond the upper fence. Horizontal axis: daily maximum temperature (°C).]

Outliers by the 1.5 IQR rule

An outlier is a reading that sits unusually far from the rest. The standard criterion in this course is based on the IQR: a value is an outlier if it lies more than $1.5 \times \text{IQR}$ below $Q_1$ or more than $1.5 \times \text{IQR}$ above $Q_3$. The two cut-offs

$$Q_1 - 1.5 \times \text{IQR} \quad \text{and} \quad Q_3 + 1.5 \times \text{IQR}$$

are called the lower and upper fences. For the temperature data, $\text{IQR} = 24.5 - 22 = 2.5$, so the fences are $22 - 1.5(2.5) = 18.25$ and $24.5 + 1.5(2.5) = 28.25$. The value 34 exceeds 28.25, so it is an outlier; every other reading lies inside the fences. On the box plot an outlier is drawn as a separate dot beyond the whisker, and the whisker is shortened to stop at the most extreme reading that is not an outlier (here 26). An outlier is flagged and discussed, not silently deleted.
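The fence calculation for the temperature data can be sketched in Python. The quartiles are taken as given from the text rather than recomputed, and the variable names are illustrative.

```python
# The 20 daily maximum temperatures from the text (degrees Celsius).
temps = [20, 21, 21, 21, 22, 22, 22, 22, 23, 23,
         23, 23, 24, 24, 24, 25, 25, 25, 26, 34]
q1, q3 = 22, 24.5  # quartiles quoted in the text

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers lie strictly outside the fences; they are flagged, not deleted.
outliers = [t for t in temps if t < lower_fence or t > upper_fence]

# Whiskers stop at the most extreme readings that are not outliers.
inside = [t for t in temps if lower_fence <= t <= upper_fence]
whisker_ends = (min(inside), max(inside))

print(lower_fence, upper_fence)  # 18.25 28.25
print(outliers)                  # [34]
print(whisker_ends)              # (20, 26)
```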

Shape and skew

The shape of a distribution is read from the histogram, the box plot, or the relationship between the mean and the median.

  • A symmetric distribution has the mean and median roughly equal, with the median centred in the box and whiskers of similar length.
  • A right-skewed (positively skewed) distribution has a long tail of high values, which pulls the mean above the median; the median sits towards the left of the box and the right whisker is longer.
  • A left-skewed (negatively skewed) distribution has a long low tail, the mean below the median, and a longer left whisker.

The rule of thumb, that $\text{mean} > \text{median}$ indicates right skew and $\text{mean} < \text{median}$ indicates left skew, follows because the mean is dragged towards the long tail while the median is not.
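A quick way to see the mean being dragged towards a tail is to compute both statistics with and without an extreme value. The sketch below uses the temperature data from the box-plot section; removing the reading 34 is done here only to illustrate the effect, not as a recommendation to delete outliers.

```python
import statistics

# The 20 daily maximum temperatures from the box-plot section.
temps = [20, 21, 21, 21, 22, 22, 22, 22, 23, 23,
         23, 23, 24, 24, 24, 25, 25, 25, 26, 34]
without_34 = temps[:-1]  # same data with the extreme value left out

mean_all = statistics.mean(temps)
median_all = statistics.median(temps)
mean_trim = statistics.mean(without_34)
median_trim = statistics.median(without_34)

# With 34 included the mean sits above the median (a sign of right skew);
# the median itself does not move when 34 is left out.
print(mean_all, median_all)              # 23.5 23.0
print(round(mean_trim, 2), median_trim)  # 22.95 23
```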

Comparing distributions with parallel box plots

To compare two groups, draw their box plots on a common scale, one above the other: a parallel box plot. You compare centre (median against median), spread (IQR and range), and shape (skew and outliers). Below are daily rainfall totals on 11 wet days in two suburbs (in millimetres). North is 6, 8, 9, 11, 12, 14, 15, 17, 19, 22, 40; South is 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.

[Figure: Parallel box plots of rainfall in two suburbs. Two box-and-whisker plots, North and South, on a common scale, with an outlier at 40 mm marked for North. Horizontal axis: daily rainfall on wet days (mm).]

The two suburbs have similar medians (North 14, South 16) and nearly equal means (15.7 against 15.9 mm), yet they are very different: North has a wider box ($\text{IQR} = 10$ against $6$) and an outlier at 40 mm that pulls its mean above its median, so its rainfall is more variable and right-skewed, while South is tight and roughly symmetric. This is the whole point of comparing displays: similar centres can hide very different spreads and shapes.
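The figures quoted for the two suburbs can be reproduced with a short Python sketch. The function `summarise` is a hypothetical helper for this example; it uses the median-of-halves quartile method and the $1.5 \times \text{IQR}$ fences described earlier.

```python
import statistics

north = [6, 8, 9, 11, 12, 14, 15, 17, 19, 22, 40]
south = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

def summarise(data):
    """Centre, spread and outliers for one data set (median-of-halves quartiles)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])   # middle value excluded when n is odd
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    return {
        "mean": round(statistics.mean(ordered), 1),
        "median": statistics.median(ordered),
        "iqr": iqr,
        "outliers": [x for x in ordered
                     if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr],
    }

print(summarise(north))  # {'mean': 15.7, 'median': 14, 'iqr': 10, 'outliers': [40]}
print(summarise(south))  # {'mean': 15.9, 'median': 16, 'iqr': 6, 'outliers': []}
```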

How exam questions ask about univariate statistics

  • "Complete the frequency / cumulative-frequency table." Add the $f$ column to get $N$, then accumulate down the column; the last cumulative entry must equal $N$.
  • "Find the mean and standard deviation" (raw data or a table). Enter the data in the calculator's statistics mode and read $\bar{x}$ and $\sigma_n$. For a table, enter the $x$ values against the $f$ values. Use $\sigma_n$ (population), not $s_{n-1}$.
  • "Estimate the mean / standard deviation from the grouped data." Use the class centres as $x$, then proceed as for a frequency table; state that the answer is an estimate.
  • "Use the ogive to find the median / quartiles." Read across from cumulative frequency $\frac{N}{2}$, $\frac{N}{4}$ and $\frac{3N}{4}$, then down to the value.
  • "Find the five-number summary" or "draw a box plot." Order the data, find the median, then the quartiles of each half. Draw the box from $Q_1$ to $Q_3$ with the median marked.
  • "Determine whether ... is an outlier." Compute $\text{IQR}$, then the fences $Q_1 - 1.5\,\text{IQR}$ and $Q_3 + 1.5\,\text{IQR}$, and compare the value to them.
  • "Compare the two data sets." Use parallel box plots and compare centre, spread and shape in words, quoting the medians and IQRs.
  • "Describe the shape / skew." Compare the mean and median, or look at the whisker lengths and the median's position in the box.

Edge cases worth knowing

  • Population versus sample standard deviation. This course uses $\sigma_n$, which divides by $n$. The sample version $s_{n-1}$ divides by $n - 1$ and gives a slightly larger value; selecting it by mistake is the commonest standard-deviation error.
  • Quartiles when $n$ is odd. Exclude the middle value (the median) from both halves before taking each half's median. Including it shifts the quartiles.
  • Grouped statistics are estimates. Replacing each reading with its class centre means the grouped mean, standard deviation, median and quartiles only approximate the true values. The finer the classes, the better the estimate, but information is always lost.
  • The ogive uses $\frac{N}{2}$, not $\frac{N+1}{2}$. When reading a median or quartiles off a cumulative-frequency curve you go up to $\frac{N}{2}$, $\frac{N}{4}$ and $\frac{3N}{4}$, because the curve treats the data as continuous.
  • Outliers are flagged, not deleted. Identify an outlier and draw it separately, but only remove it with a stated reason (such as a recording error). A genuine extreme value is still part of the data.
  • The median resists outliers; the mean does not. Adding one huge value barely moves the median but drags the mean towards it. For skewed data or data with outliers, the median and IQR describe the data more honestly than the mean and standard deviation.

Practice questions

Original practice questions graded from foundation to exam level, each with a full worked solution. Try them before revealing the solution.

foundation · 4 marks
A netballer scores these numbers of goals in 12 games: 2, 5, 1, 3, 2, 7, 4, 2, 3, 1, 5, 0. Find the five-number summary and the interquartile range, and state the shape suggested by the median's position in the box.
Worked solution
Order the data
Writing the 12 scores in increasing order gives 0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 7.
Median ($Q_2$)
With $n = 12$ (even), the median is the average of the 6th and 7th values, $\frac{2 + 3}{2} = 2.5$.
Quartiles
Split into a lower half 0, 1, 1, 2, 2, 2 and an upper half 3, 3, 4, 5, 5, 7. The lower quartile is the median of the lower half, $\frac{1 + 2}{2} = 1.5$, and the upper quartile is the median of the upper half, $\frac{4 + 5}{2} = 4.5$.
Five-number summary and IQR
Minimum 0, $Q_1 = 1.5$, median 2.5, $Q_3 = 4.5$, maximum 7. So $\text{IQR} = Q_3 - Q_1 = 4.5 - 1.5 = 3$, and the range is $7 - 0 = 7$.
Shape
The median 2.5 sits closer to $Q_1 = 1.5$ than to $Q_3 = 4.5$, and the upper whisker (to 7) is longer than the lower (to 0), so the distribution is skewed to the right (positively skewed).
core · 4 marks
The table shows the number of after-school activities done weekly by 25 students. Find the mean and the population standard deviation (to 2 decimal places), and write down the median.

| Activities $x$ | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- |
| Frequency $f$ | 3 | 5 | 8 | 6 | 2 | 1 |

Worked solution

Totals from the table. With $N = \sum f = 25$, build $\sum fx$ and $\sum fx^2$:

$\sum fx = 5(3) + 6(5) + 7(8) + 8(6) + 9(2) + 10(1) = 15 + 30 + 56 + 48 + 18 + 10 = 177$.

$\sum fx^2 = 25(3) + 36(5) + 49(8) + 64(6) + 81(2) + 100(1) = 75 + 180 + 392 + 384 + 162 + 100 = 1293$.

Mean
$\bar{x} = \dfrac{\sum fx}{N} = \dfrac{177}{25} = 7.08$.
Population standard deviation
$\sigma^2 = \dfrac{\sum fx^2}{N} - \bar{x}^2 = \dfrac{1293}{25} - 7.08^2 = 51.72 - 50.1264 = 1.5936$, so $\sigma = \sqrt{1.5936} \approx 1.26$.
Median
The cumulative frequencies are 3, 8, 16, 22, 24, 25. With $N = 25$ (odd) the median is the 13th value, which falls in the jump from 8 to 16, so the median is 7.

In the calculator's statistics mode, enter the $x$ list against the $f$ list and read $\bar{x} = 7.08$ and $\sigma_n \approx 1.26$ directly; the by-hand columns are shown so the method is clear and checkable.

core · 3 marks
Ten recent sales in a Sydney suburb had these prices, in thousands of dollars: 640, 680, 710, 720, 750, 760, 790, 820, 850, 1450. Use the $1.5 \times \text{IQR}$ rule to decide whether any price is an outlier.
Worked solution
Order and find the quartiles
The data are already increasing. With $n = 10$ (even), the lower half is 640, 680, 710, 720, 750 and the upper half is 760, 790, 820, 850, 1450. Each half has 5 values, so $Q_1$ is the middle of the lower half, 710, and $Q_3$ is the middle of the upper half, 820. The median is $\frac{750 + 760}{2} = 755$.
IQR
$\text{IQR} = Q_3 - Q_1 = 820 - 710 = 110$ (in thousands of dollars).
Fences
Lower fence $Q_1 - 1.5 \times \text{IQR} = 710 - 1.5(110) = 710 - 165 = 545$. Upper fence $Q_3 + 1.5 \times \text{IQR} = 820 + 165 = 985$.
Decision
Every value lies inside $[545, 985]$ except 1450, which is above the upper fence 985. So the 1450 sale (1.45 million dollars) is an outlier; the others are not. In a report it would be plotted as a separate dot beyond the right whisker, not dropped, since a genuine high sale still describes the market.
exam · 6 marks
The masses (kg) of 60 dogs seen at a vet clinic are grouped below.

| Mass (kg) | 0 to 5 | 5 to 10 | 10 to 15 | 15 to 20 | 20 to 25 | 25 to 30 |
| --- | --- | --- | --- | --- | --- | --- |
| Frequency | 7 | 15 | 18 | 12 | 6 | 2 |

(a) State the modal class. (b) Estimate the mean and population standard deviation (to 2 decimal places). (c) Construct the cumulative-frequency column and estimate the median from it.
Worked solution

(a) Modal class. The highest frequency is 18, so the modal class is 10 to 15 kg.

(b) Class centres and totals. Use the class centres 2.5, 7.5, 12.5, 17.5, 22.5, 27.5 as the representative value of each class, with $N = 60$.

$\sum fx = 2.5(7) + 7.5(15) + 12.5(18) + 17.5(12) + 22.5(6) + 27.5(2) = 17.5 + 112.5 + 225 + 210 + 135 + 55 = 755$.

$\sum fx^2 = 6.25(7) + 56.25(15) + 156.25(18) + 306.25(12) + 506.25(6) + 756.25(2) = 43.75 + 843.75 + 2812.5 + 3675 + 3037.5 + 1512.5 = 11925$.

Estimated mean $\bar{x} = \dfrac{755}{60} \approx 12.58$ kg. Estimated variance $\sigma^2 = \dfrac{11925}{60} - 12.5833^2 = 198.75 - 158.3403 = 40.4097$, so estimated $\sigma = \sqrt{40.4097} \approx 6.36$ kg.

(c) Cumulative frequency and median. Accumulating the frequencies at the upper boundaries 5, 10, 15, 20, 25, 30 gives 7, 22, 40, 52, 58, 60. With $N = 60$, the median is read from the ogive at a cumulative frequency of $\frac{N}{2} = 30$. That cumulative value falls in the 10 to 15 class (cumulative rises from 22 to 40 there), so by linear interpolation

$$\text{median} \approx 10 + \frac{30 - 22}{18} \times 5 = 10 + \frac{8}{18} \times 5 \approx 12.22 \text{ kg}.$$

The estimates from grouped data are close to the true values but not exact, because grouping replaces each reading with its class centre.
