What is the standard test for each component?

NSW Health and Movement Science: syllabus dot point

How are valid and reliable fitness tests selected and used to assess performance and prescribe training?

Select and justify valid, reliable fitness and performance tests for the components of fitness, and use the results to set baselines and prescribe training

A data-skills HSC Health and Movement Science answer on choosing valid, reliable fitness tests, field versus laboratory testing, the standard test for each fitness component (beep test, 1RM, vertical jump, Illinois, sit-and-reach), and using normative results to set baselines and prescribe training.

Generated by Claude Opus 4.8. 10 min answer. Updated 2026-06-02.

Reviewed by: AI editorial process; not yet individually human-reviewed


What this sub-topic is asking

NESA wants you to select the right test for each component of fitness, justify it in terms of validity, reliability and accuracy, weigh field tests against laboratory tests, interpret results against normative data, and then use the data to set a baseline and prescribe training. This is a scientific-investigation and data-handling sub-topic, so the marks live in the reasoning about the numbers, not in just naming a test.

In plain English

A fitness test is just a fair, repeatable way to put a number on how fit someone is at one thing - their running endurance, their jump height, how bendy they are. Two ideas decide whether the number is any good. Validity asks: is this test actually measuring the thing I care about? (Timing a sprint tells you about speed, not flexibility.) Reliability asks: if I did it again tomorrow, would I get nearly the same number? (Only if the conditions are the same each time.)

Think of it like weighing yourself. A scale that always reads the same is reliable; a scale that reads your true weight is valid and accurate. A broken scale can be reliable (it always says 60 kg) but useless if your true weight is 70 kg.

You also choose WHERE to test. A field test (like the beep test on a court) is cheap, quick and you can do a whole team at once - but it only estimates fitness. A laboratory test (a VO2max machine) is dead accurate but slow, costly and one person at a time. Once you have a trustworthy number, you compare it to a chart of typical scores for that age, find the athlete's weak spot, and train it - then test again to see if it worked.

The answer

Good testing is a chain: pick a valid test, run it reliably, rate it against normative data, set a baseline, prescribe to the weakest component, then re-test. Get any link wrong and the data misleads you.

Validity, reliability and accuracy

Validity = does the test measure the component it claims to? A 20 m sprint validly measures speed; it does not measure aerobic capacity.
Reliability = does the test repeat? Same athlete, same conditions, same result. Reliability comes from a standardised protocol (same surface, warm-up, order, equipment, time of day, rested state).
Accuracy = how close the measured value is to the true value (calibration, correct technique).

These are independent. A test can be reliable but not valid (a very repeatable but irrelevant measure), or valid but unreliable (right test, sloppy conditions). The strongest data is valid, reliable and accurate at once.
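The weighing-scale analogy can be put in numbers. The sketch below is a minimal illustration, assuming made-up readings: the spread across repeated readings stands in for reliability, and the gap between the mean reading and the true value stands in for accuracy. The readings, the 70 kg true weight and the function name are all illustrative assumptions, not data from a real test.

```python
from statistics import mean

def describe_readings(readings, true_value):
    """Return (spread, bias) for repeated readings of one quantity.

    spread = max - min   -> a small spread suggests a reliable (repeatable) measure
    bias   = mean - true -> a small bias suggests an accurate measure
    """
    spread = max(readings) - min(readings)
    bias = mean(readings) - true_value
    return spread, bias

TRUE_WEIGHT_KG = 70.0  # assumed true value, for illustration only

# A "broken scale": always reads 60 kg, so perfectly repeatable but 10 kg off.
broken = describe_readings([60.0, 60.0, 60.0], TRUE_WEIGHT_KG)

# A "jumpy scale": right on average, but the readings wander.
jumpy = describe_readings([66.0, 74.0, 70.0], TRUE_WEIGHT_KG)

print(broken)  # (0.0, -10.0)
print(jumpy)   # (8.0, 0.0)
```

Neither number says anything about validity: a perfectly repeatable, perfectly accurate scale is still the wrong tool for rating flexibility.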

The standard test for each component

Component	Standard test	Field or lab	What it measures
Aerobic capacity	Multistage beep test / direct VO2max	Field / lab	VO2max (estimated vs directly measured)
Anaerobic capacity	Wingate 30 s test	Lab	Peak and mean anaerobic power
Muscular strength	One-rep maximum (1RM)	Field/gym	Maximal force in one lift
Power	Vertical jump	Field	Explosive leg power
Speed	Timed 20-40 m sprint	Field	Maximal running speed
Agility	Illinois agility run	Field	Timed change of direction
Flexibility	Sit-and-reach	Field	Lower-back and hamstring range

The beep test (20 m shuttle run) is the most-tested example: athletes run between two lines 20 m apart in time with an audio beep that speeds up each level; the final level and shuttle reached are converted, using a prediction equation or conversion table, into an estimated VO2max. It is a field test - cheap, fast, whole-squad - but an estimate, so its validity depends on maximal effort and the prediction equation. The laboratory gold standard is a graded exercise test with a metabolic cart that directly measures expired-gas exchange to give a true VO2max.

Field tests versus laboratory tests

	Field tests	Laboratory tests
Example	Beep test, vertical jump, Illinois, sit-and-reach	Direct VO2max, Wingate
Cost / time	Low, fast, whole squad	High, slow, one at a time
Accuracy / validity	Estimate (good enough to monitor)	High (direct measurement)
Control of variables	Lower (weather, surface)	High (calibrated, controlled)
Best for	Screening, frequent re-testing, sport-specific	Definitive measure, research, talent ID

The choice is a trade-off: field tests buy practicality and frequent re-testing; laboratory tests buy accuracy and control. A coach screening a squad every block uses field tests; a sports institute confirming an elite athlete's true VO2max uses the lab.

From normative data to prescription

A raw score means little until it is rated against normative data - reference tables of typical scores by age and sex (e.g. a teenage male beep-test level of 6 might rate "below average", 9 "good"). That rating sets the athlete's baseline and exposes the limiting component. Prescription then targets the weakness: a low beep level drives an aerobic block; a low 1RM drives a strength block; a poor sit-and-reach drives a flexibility block. After the block, you re-test under the same protocol to evaluate the response - the monitoring cycle.
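The rating step can be sketched as a simple lookup. The cut-offs below are hypothetical placeholders, chosen only so that they agree with the illustrative ratings in the text (level 6 "below average", level 9 "good"). Real normative tables differ by age, sex and source, so the published table for the cohort would replace them. The function and variable names are also assumptions, and scores are passed as whole beep-test levels to keep the example simple.

```python
# Hypothetical beep-test cut-offs: (minimum level, rating), highest first.
# Illustrative only: NOT a published normative table.
BEEP_CUTOFFS = [
    (11.0, "excellent"),
    (9.0, "good"),
    (7.0, "average"),
]
LOWEST_RATING = "below average"

def rate(score, cutoffs, lowest=LOWEST_RATING):
    """Return the first rating whose minimum score is met (higher score = better)."""
    for minimum, rating in cutoffs:
        if score >= minimum:
            return rating
    return lowest

print(rate(6.0, BEEP_CUTOFFS))  # below average
print(rate(9.0, BEEP_CUTOFFS))  # good
```

The same function would not suit a test where a lower score is better (a sprint or agility time) without reversing the comparison.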

Worked exam answers

Worked example 1 - data response (5 marks)

Question. Two students complete pre-season tests (illustrative; normative rating in brackets). Student A: beep level 9.6, estimated VO2max 47 mL/kg/min (good); vertical jump 38 cm (average). Student B: beep level 6.4, estimated VO2max 35 mL/kg/min (below average); vertical jump 56 cm (excellent). (a) Compare the profiles. (b) Recommend a training priority for each, justified by the data.

Model answer. (a) The profiles are mirror images: Student A is the stronger aerobic athlete (beep 9.6, estimated VO2max 47 mL/kg/min = good) but only average for power (38 cm), while Student B is excellent for power (56 cm) but below average aerobically (estimated VO2max 35 mL/kg/min). Each is weakest where the other is strong. (b) Student A should prioritise power development - the 38 cm jump trails Student B by 18 cm and is only average - while maintaining the already-good aerobic base. Student B should prioritise aerobic conditioning, because a below-average estimated VO2max of 35 mL/kg/min will limit work rate and recovery in a field sport; the excellent power is maintained.

Marker's note: a data response must quote the actual scores AND their normative ratings for both athletes, and each prescription must name the limiting component with the figure behind it. A generic "both should train harder" or a priority that contradicts the data scores nothing.

Worked example 2 - distinguish (4 marks)

Question. Distinguish between validity and reliability, and explain how a test can be reliable but not valid.

Model answer. Validity is whether a test measures the component it claims to measure; reliability is whether it gives the same result on repeated trials under the same conditions. They are independent. A test can be reliable but not valid: rating an athlete's aerobic capacity from a sit-and-reach score would be highly repeatable (reliable) yet tells you nothing about aerobic capacity (not valid). To be useful, a test must be both - the right measure, taken consistently.

Marker's note: markers want both definitions correct, an explicit statement that the two are independent, and a worked reliable-but-not-valid example. Swapping the definitions caps the answer.

Worked example 3 - extended-response paragraph (analyse)

Question (paragraph from an 8-mark "Analyse how a coach selects valid, reliable tests and uses results to prescribe training").

Model paragraph. The coach first matches each component of fitness to a test that validly measures it, because an invalid test gives a precise but meaningless number - aerobic capacity to the beep test, leg power to the vertical jump, agility to the Illinois run. Each protocol is standardised (same surface, warm-up and equipment) so reliability is high and a post-season re-test reflects real adaptation rather than changed conditions. Raw scores are then rated against age-and-sex normative data to set each athlete's baseline, and the lowest-rated component becomes the priority: an athlete rated below average for aerobic capacity is prescribed an aerobic block, while one already rated good there maintains it and trains a weaker quality. The block is re-tested under the same protocol, so the data both diagnoses the need and verifies the response.

Marker's note: an analyse paragraph runs test choice to data quality to prescription to re-test, not a list of tests. Naming a valid test per component, treating validity and reliability as distinct, and citing a specific score-to-prescription decision are the marks of a top-band response.

Common traps

Confusing validity and reliability: Validity = right thing; reliability = repeatable. They are independent - state which you mean and do not swap them.
Naming a test for the wrong component: Sit-and-reach is flexibility, not power; the beep test is aerobic capacity, not speed. An invalid test wastes the data however neatly collected.
Treating the beep test as a direct VO2max measure: It ESTIMATES VO2max from the level reached; only a laboratory metabolic cart measures it directly. Say "estimated" in data answers.
Reporting a number with no normative rating: "Level 6.4" means nothing until you rate it (below average / good for that age and sex). Marks for interpretation need the rating.
Blaming a fitness change for what is really a reliability problem: If conditions changed between trials (wet surface, prior fatigue), a lower score is a protocol failure, not a loss of fitness.
Prescribing without using the data: A prescription must name the limiting component and cite the figure that flags it, then re-test - not "train harder".

Exam technique

Read the command word first. "Distinguish" validity from reliability needs an explicit contrast (right thing vs repeatable) plus an example; "compare" field and lab tests needs both similarities/differences AND a judgement of when each wins; "describe" a trend needs direction AND a quantified read of the data with units. In data/stimulus items, quote the number and its normative rating - never just "it went down" or "level 6". Use the testing chain (valid test - reliable protocol - normative rating - baseline - prescribe - re-test) as a ready-made structure for any "how is testing used" question. Always say whether a test is field or lab and what that buys (practicality vs accuracy). Anchor a prescription to the specific score that flags the weakest component, and flag any illustrative data as illustrative.

Practice questions

Original practice questions graded from foundation to exam level, each with a full worked solution. Attempt each one before reading its solution.

Foundation (3 marks). Identify the standard test used to assess each of the following components of fitness: (a) aerobic capacity (field test), (b) muscular strength, (c) agility.

Worked solution.

(a) Aerobic capacity (field): the multistage fitness test (20 m shuttle run / beep test), which estimates VO2max from the final level and shuttle reached.
(b) Muscular strength: the one-repetition maximum (1RM) for a nominated lift (e.g. bench press, squat).
(c) Agility: the Illinois agility run test (a timed change-of-direction course).

Marking criteria: 1 mark per correctly matched test (max 3). Naming a test that measures the wrong component (e.g. sit-and-reach for agility) earns nothing - the test must be valid for the stated component.

Foundation (4 marks). Distinguish between validity and reliability in fitness testing. Explain how a test could be reliable but not valid, using an example.

Worked solution.

Validity is whether a test measures the component it claims to measure. Reliability is whether the test gives the same result on repeated trials under the same conditions.

They are independent. A test can be reliable but not valid: for example, using a sit-and-reach score to rate an athlete's aerobic capacity would give a very repeatable number (reliable), but the number says nothing about aerobic capacity (not valid). Conversely, a valid test that is poorly standardised (different surfaces, no warm-up) becomes unreliable.

Marking criteria: 1 mark for a correct definition of validity; 1 mark for reliability; 1 mark for stating they are independent; 1 mark for a worked example showing reliable-but-not-valid. A definition that swaps the two terms caps at 1.

Core (5 marks). Compare field tests and laboratory tests for assessing aerobic capacity. In your answer, name one example of each and judge when each is the better choice.

Worked solution.

Field test - the beep test (20 m shuttle run): Cheap, quick, needs only a flat 20 m space and an audio track, and can test a whole squad at once in a sport-relevant setting. It estimates VO2max from the level reached, so it is less accurate and its validity depends on the prediction equation and on maximal effort.
Laboratory test - direct VO2max (graded test with a metabolic cart): Directly measures expired-gas exchange, so it is highly accurate and valid, and controls variables (temperature, gradient, calibration). But it is expensive, slow (one athlete at a time), and requires trained staff and equipment.
Judgement: For squad screening, repeated monitoring and limited budgets, the field test is the better choice (practical, lets you re-test often). For a definitive, research-grade or talent-identification measure where accuracy matters most, the laboratory test is better. The right choice trades accuracy/validity against cost and practicality.

Marking criteria: 1 mark for a correctly named field test, 1 for a named lab test, 1 for a valid advantage of each, up to 2 for a justified judgement that links the choice to purpose (screening vs definitive measurement). A list with no judgement caps at 3.

Core (5 marks). STIMULUS. Two Year 12 students complete a battery of pre-season tests. Results, with the matched age-and-sex normative rating in brackets, are illustrative: Student A - beep test level 9.6, estimated VO2max 47 mL/kg/min (good); vertical jump 38 cm (average); sit-and-reach +2 cm (below average). Student B - beep test level 6.4, estimated VO2max 35 mL/kg/min (below average); vertical jump 56 cm (excellent); sit-and-reach +14 cm (excellent). Both play the same field-sport position. (a) Compare the two fitness profiles. (b) Recommend a training priority for each, justified by the data.

Worked solution.

(a) Comparison. The profiles are almost mirror images. Student A has the stronger aerobic base (beep level 9.6, VO2max 47 = good) but weaker power (vertical jump 38 cm = average) and poor flexibility (+2 cm = below average). Student B has weak aerobic capacity (level 6.4, VO2max 35 = below average) but excellent power (56 cm) and excellent flexibility (+14 cm). Each is strong exactly where the other is weak.

(b) Priorities, justified by the data.

Student A: prioritise power/strength training (the vertical jump of 38 cm is only average and lags Student B by 18 cm) plus a flexibility block to lift the below-average sit-and-reach; the aerobic base is already good, so maintain rather than build it.
Student B: prioritise aerobic conditioning (a below-average VO2max of 35 will limit work rate and recovery in a field sport); the excellent power and flexibility can be maintained.

Each recommendation must be anchored to the specific test value that flags the weakness, not a generic "train harder".

Marking criteria: (a) up to 2 marks for a data-anchored comparison that quotes scores AND ratings for both athletes. (b) up to 3 marks for a justified priority for each student that names the limiting component and cites the figure behind it. Recommending a priority that contradicts the data (e.g. more aerobic work for Student A) loses the justification mark.
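The reasoning in this solution can be checked mechanically. The sketch below uses only the stimulus figures: each normative rating is mapped to a rank, and every component rated below "good" is flagged, weakest first. The rank ordering, the "good" target and all names are assumptions made for illustration.

```python
# Ratings ordered from weakest to strongest (assumed ordering for this sketch).
RANK = {"below average": 0, "average": 1, "good": 2, "excellent": 3}

# Stimulus data: component -> (score, unit, normative rating). Illustrative figures.
PROFILES = {
    "Student A": {
        "aerobic capacity": (47, "mL/kg/min", "good"),
        "power": (38, "cm", "average"),
        "flexibility": (2, "cm", "below average"),
    },
    "Student B": {
        "aerobic capacity": (35, "mL/kg/min", "below average"),
        "power": (56, "cm", "excellent"),
        "flexibility": (14, "cm", "excellent"),
    },
}

def needs_work(profile, target="good"):
    """Return the components rated below `target`, weakest first."""
    flagged = [c for c, (_, _, rating) in profile.items()
               if RANK[rating] < RANK[target]]
    return sorted(flagged, key=lambda c: RANK[profile[c][2]])

for name, profile in PROFILES.items():
    for component in needs_work(profile):
        score, unit, rating = profile[component]
        print(f"{name}: {component} = {score} {unit} ({rating})")

# Difference between the two vertical jump scores quoted in the solution.
jump_gap = PROFILES["Student B"]["power"][0] - PROFILES["Student A"]["power"][0]
print(f"Vertical jump gap: {jump_gap} cm")  # Vertical jump gap: 18 cm
```

The flags match the solution: power and flexibility for Student A, aerobic capacity for Student B. Which flagged component leads the program is still a coaching judgement about the demands of the sport; the ranking only shows where the data points.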

Core (5 marks). STIMULUS. A coach tests an athlete's beep test three times in one week to check the test before the season: trial 1 = level 8.5, trial 2 = level 8.6, trial 3 = level 6.1. Trial 3 was run on a wet, sloping car park after a heavy gym session; trials 1 and 2 were on a dry indoor court when the athlete was fresh. (a) What does this tell you about the reliability of the data? (b) Outline two steps to improve reliability for the season's testing.

Worked solution.

(a) Reliability. Trials 1 and 2 agree closely (8.5 and 8.6), but trial 3 (6.1) is far lower. The disagreement is not a real fitness change - it is caused by changed conditions (wet, sloping surface; fatigue from the prior gym session). So the testing as a whole is unreliable: the same athlete did not get a repeatable result because the protocol was not standardised. Trial 3 should be discarded as an unreliable result (the beep test itself is still a valid test of aerobic capacity); the athlete's true level is about 8.5 to 8.6.

(b) Two steps to improve reliability (any two):

Standardise the conditions - same flat, non-slip surface, same time of day, same temperature, same audio track and equipment each time.
Standardise the athlete's state - same warm-up, rested (no hard session beforehand), hydrated, same footwear and motivation/encouragement protocol.
Repeat trials and use the best/mean of consistent results, discarding clear outliers.

Marking criteria: (a) 1 mark for identifying poor reliability, 1 mark for attributing trial 3's drop to changed conditions (not real fitness). (b) 1 mark per sensible standardisation step (max 2), 1 mark for linking standardisation explicitly to reliability/repeatability. Saying the athlete "got worse" misreads the data and loses (a).
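The reliability judgement in part (a) can be shown with the trial numbers. This is a rough sketch: it treats the levels as plain decimals for simplicity (strictly, "8.5" means level 8, shuttle 5), and the 0.5-level tolerance is an arbitrary assumption for illustration, not a published criterion.

```python
from statistics import median

# Trial results from the stimulus (beep-test level reached).
trials = {"trial 1": 8.5, "trial 2": 8.6, "trial 3": 6.1}

TOLERANCE = 0.5  # assumed: how far from the median a trial may sit and still "agree"

def spread(values):
    """Range of the scores (max - min), rounded to one decimal place."""
    values = list(values)
    return round(max(values) - min(values), 1)

typical = median(trials.values())  # the middle score of the three trials

# Keep only the trials that sit within the tolerance of the typical score.
consistent = {name: level for name, level in trials.items()
              if abs(level - typical) <= TOLERANCE}

print(spread(trials.values()))      # 2.5 -> poor agreement across all three trials
print(sorted(consistent))           # ['trial 1', 'trial 2']
print(spread(consistent.values()))  # 0.1 -> close agreement once trial 3 is set aside
```

The arithmetic only flags trial 3 as out of line. Deciding that it reflects changed conditions rather than lost fitness still depends on the protocol notes (wet, sloping surface and prior fatigue).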

Exam (8 marks). Analyse how a strength-and-conditioning coach selects valid and reliable fitness tests for a field-sport squad, and uses the results to prescribe training. Refer to specific tests and to the components of fitness.

Worked solution.

This is an 8-mark extended response. Markers reward a sustained analysis (test choice linked to validity/reliability linked to a prescription decision), not a labelled list of tests.

Band 6 PLAN.

Thesis: good testing is purposeful - the coach matches each test to the component it validly measures, standardises it for reliability, rates results against normative data, and feeds the weakest components into the next training block. Testing is the evidence base for prescription, not a ritual.
Argument line 1 - VALIDITY (choose the right test). Each component needs a test that actually measures it: aerobic capacity via the beep test (field) or direct VO2max (lab); strength via 1RM; leg power via the vertical jump; agility via the Illinois run; speed via a 20 m timed sprint; flexibility via sit-and-reach. Using a test that measures the wrong quality (e.g. sit-and-reach for power) makes the data useless however neatly it is collected.
Argument line 2 - RELIABILITY and field-vs-lab trade-off. Standardise the protocol (same surface, order, warm-up, equipment, time of day, rested state) so a re-test reflects training change, not changed conditions. Field tests are chosen for squad-wide, repeatable, low-cost monitoring; a lab VO2max is reserved for a definitive measure where accuracy outweighs cost. Reliability is what lets pre-test and post-test be compared at all.
Argument line 3 - from data to prescription. Convert raw scores to normative ratings, set each athlete's baseline, identify the limiting component, and prescribe accordingly (low beep level = aerobic block; low 1RM = strength block), then re-test to evaluate the block and close the loop (the monitoring/evaluation cycle).
Synthesis: validity gets the right number, reliability makes the number trustworthy across re-tests, normative data makes it meaningful, and the prescription targets the weakness - judge the whole process as an evidence-driven cycle.

Model paragraph (validity and prescription line). The coach begins by matching each component of fitness to a test that validly measures it, because an invalid test produces a precise but meaningless number. Aerobic capacity is assessed with the multistage beep test, which estimates VO2max from the level reached and can screen a whole squad cheaply; leg power with the vertical jump; agility with the timed Illinois run. Each protocol is then standardised - same surface, warm-up and equipment - so that reliability is high and a post-season re-test reflects real adaptation rather than a change in conditions. Raw scores are converted to age-and-sex normative ratings to set every athlete's baseline, and the lowest-rated component becomes the training priority: a player rated below average for aerobic capacity is prescribed an aerobic-development block, while one already rated good there maintains it and trains the weaker quality instead. The block is then re-tested, so the data both diagnoses the need and verifies the response - testing drives the prescription rather than decorating it.

Marker's note: top-band answers (1) name a valid test for several components, (2) treat validity and reliability as distinct and explain the field-vs-lab trade-off, (3) carry the chain from a test result through a normative rating to a specific prescription and a re-test, and (4) keep answering the verb - ANALYSE means show how test choice, data quality and training decisions relate. Citing a specific score-to-prescription decision (e.g. a low beep level driving an aerobic block) lifts the response above a generic list of tests.