ExamExplained
NSW · Health and Movement Science

How are valid and reliable fitness tests selected and used to assess performance and prescribe training?

Select and justify valid, reliable fitness and performance tests for the components of fitness, and use the results to set baselines and prescribe training

A data-skills HSC Health and Movement Science answer on choosing valid, reliable fitness tests, field versus laboratory testing, the standard test for each fitness component (beep test, 1RM, vertical jump, Illinois, sit-and-reach), and using normative results to set baselines and prescribe training.

Reviewed by: AI editorial process; not yet individually human-reviewed

What this sub-topic is asking

NESA wants you to select the right test for each component of fitness, justify it in terms of validity, reliability and accuracy, weigh field tests against laboratory tests, interpret results against normative data, and then use the data to set a baseline and prescribe training. This is a scientific-investigation and data-handling sub-topic, so the marks live in the reasoning about the numbers, not in just naming a test.

The answer

Good testing is a chain: pick a valid test, run it reliably, rate it against normative data, set a baseline, prescribe to the weakest component, then re-test. Get any link wrong and the data misleads you.

Validity, reliability and accuracy

  • Validity = does the test measure the component it claims to? A 20 m sprint validly measures speed; it does not measure aerobic capacity.
  • Reliability = does the test give the same result when it is repeated? Same athlete, same conditions, same result. Reliability comes from a standardised protocol (same surface, warm-up, order, equipment, time of day, rested state).
  • Accuracy = how close the measured value is to the true value (calibration, correct technique).

These are independent. A test can be reliable but not valid (a very repeatable but irrelevant measure), or valid but unreliable (right test, sloppy conditions). The strongest data is valid, reliable and accurate at once.
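
The distinction between reliability and accuracy can be sketched with a few numbers. The following is a minimal illustration, not a standard sports-science procedure: the function name `summarise_trials`, the sprint times and the reference value are all hypothetical. It treats the spread of repeated trials as a rough indicator of reliability and the gap between the mean and a trusted reference as a rough indicator of accuracy. Neither number says anything about validity, which depends on whether the test matches the component.

```python
from statistics import mean, pstdev

def summarise_trials(trials, reference=None):
    """Summarise repeated trials of one test for one athlete.

    spread: standard deviation of the trials (small spread = repeats well).
    bias:   mean minus a trusted reference value, only when one is supplied
            (small bias = close to the true value).
    """
    avg = mean(trials)
    summary = {"mean": avg, "spread": pstdev(trials)}
    if reference is not None:
        summary["bias"] = avg - reference
    return summary

# Hypothetical 20 m sprint times in seconds from a hand-held stopwatch,
# compared with a hypothetical timing-gate reference of 3.31 s.
hand_timed = [3.10, 3.12, 3.11]
result = summarise_trials(hand_timed, reference=3.31)
print(result)
```

In this made-up case the three trials agree closely (reliable) but sit about 0.2 s below the reference (inaccurate), which is the pattern a consistently late stopwatch start would produce.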

The standard test for each component

Component | Standard test | Field or lab | What it measures
Aerobic capacity | Multistage beep test / direct VO2max | Field / lab | VO2max (estimated vs directly measured)
Anaerobic capacity | Wingate 30 s test | Lab | Peak and mean anaerobic power
Muscular strength | One-rep maximum (1RM) | Field/gym | Maximal force in one lift
Power | Vertical jump | Field | Explosive leg power
Speed | Timed 20-40 m sprint | Field | Maximal running speed
Agility | Illinois agility run | Field | Timed change of direction
Flexibility | Sit-and-reach | Field | Lower-back and hamstring range

The beep test (20 m shuttle run) is the most commonly examined example: athletes run between two lines 20 m apart in time with an audio beep, and the interval between beeps shortens at each level; the final level and shuttle reached are converted to an estimated VO2max using a prediction equation or table. It is a field test - cheap, fast, whole-squad - but an estimate, so its validity depends on maximal effort and the prediction equation. The laboratory gold standard is a graded exercise test with a metabolic cart that directly measures expired-gas exchange to give a true VO2max.

Figure: The fitness-testing chain (test choice → data quality → prescription → re-test). Six stages: (1) select a valid, reliable test matched to the component of fitness; (2) run it under a standardised protocol (same surface, warm-up, gear) for reliability; (3) rate the raw score against age-and-sex normative data, from poor to excellent; (4) set the athlete's baseline; (5) prescribe training to the lowest-rated component (e.g. low beep level → aerobic block); (6) re-test to evaluate the block, looping back to stage 2 as the monitoring cycle. Validity picks the right test; reliability lets pre- and post-tests be compared.

Field tests versus laboratory tests

Criterion | Field tests | Laboratory tests
Example | Beep test, vertical jump, Illinois, sit-and-reach | Direct VO2max, Wingate
Cost / time | Low, fast, whole squad | High, slow, one at a time
Accuracy / validity | Estimate (good enough to monitor) | High (direct measurement)
Control of variables | Lower (weather, surface) | High (calibrated, controlled)
Best for | Screening, frequent re-testing, sport-specific | Definitive measure, research, talent ID

The choice is a trade-off: field tests buy practicality and repeatability; laboratory tests buy accuracy and control. A coach screening a squad every block uses field tests; a sports institute confirming an elite athlete's true VO2max uses the lab.

From normative data to prescription

A raw score means little until it is rated against normative data - reference tables of typical scores by age and sex (e.g. a teenage male beep-test level of 6 might rate "below average", 9 "good"). That rating sets the athlete's baseline and exposes the limiting component. Prescription then targets the weakness: a low beep level drives an aerobic block; a low 1RM drives a strength block; a poor sit-and-reach drives a flexibility block. After the block, you re-test under the same protocol to evaluate the response - the monitoring cycle.
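
The step from raw score to rating to prescription can be sketched as a simple lookup. This is only an illustration under stated assumptions: the cut-off values in `BEEP_CUTOFFS` are hypothetical (chosen so that level 6 rates "below average" and level 9 rates "good", as in the example above), real normative tables differ by age, sex and source, and the rule that only a below-average rating triggers a training block is a simplification.

```python
# Hypothetical cut-offs for a teenage male beep-test level,
# as (minimum score, rating) pairs sorted from lowest to highest.
BEEP_CUTOFFS = [(7.0, "average"), (9.0, "good"), (11.0, "excellent")]

# Prescription pairs named in the text: weak component -> training block.
TRAINING_BLOCK = {
    "aerobic capacity": "aerobic block",
    "muscular strength": "strength block",
    "flexibility": "flexibility block",
}

def rate(score, cutoffs):
    """Return the highest rating whose minimum score is met."""
    rating = "below average"
    for minimum, label in cutoffs:
        if score >= minimum:
            rating = label
    return rating

def prescribe(component, rating):
    """Simplified rule: build a below-average component, otherwise maintain it."""
    if rating == "below average":
        return TRAINING_BLOCK[component]
    return "maintain"

baseline_rating = rate(6, BEEP_CUTOFFS)
print(baseline_rating, "->", prescribe("aerobic capacity", baseline_rating))
```

The same `rate` function would work for any test where a higher score is better; a timed test such as the Illinois run, where lower is better, would need the comparison reversed.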

Figure: Estimated VO2max (mL/kg/min) from the beep test, before and after an 8-week aerobic block, against an illustrative normative "good" threshold of 45. Athlete A rises from 35 to 41 (baseline below the threshold); Athlete B rises from 47 to 50 (already above the threshold). The larger gain for Athlete A illustrates prescribing the aerobic block to the athlete whose baseline was below the threshold. The values and the threshold are illustrative.
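
The re-test evaluation for Athletes A and B can be worked through numerically. The sketch below uses the illustrative pre and post values and the illustrative threshold of 45 quoted above; the function name `evaluate_block` and the choice to report a percentage gain are assumptions made for the example.

```python
def evaluate_block(pre, post, threshold):
    """Compare a post-block re-test with the baseline and a normative cut-off."""
    gain = post - pre
    return {
        "gain": gain,                                  # absolute change, mL/kg/min
        "percent_gain": round(100 * gain / pre, 1),    # change relative to baseline
        "meets_threshold": post >= threshold,          # at or above the "good" cut-off?
    }

athlete_a = evaluate_block(pre=35, post=41, threshold=45)
athlete_b = evaluate_block(pre=47, post=50, threshold=45)
print(athlete_a)  # gain 6, about 17.1 % of baseline, still below the threshold
print(athlete_b)  # gain 3, about 6.4 % of baseline, above the threshold
```

Athlete A shows the larger response but has not yet reached the threshold, so the data would support continuing the aerobic block for A while B maintains.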

Practice questions

Original practice questions graded from foundation to exam level, each with a full worked solution. Try each question before reading its solution.

Foundation (3 marks). Identify the standard test used to assess each of the following components of fitness: (a) aerobic capacity (field test), (b) muscular strength, (c) agility.
Worked solution:
  • (a) Aerobic capacity (field): the multistage fitness test (20 m shuttle run / beep test), which estimates VO2max from the final level and shuttle reached.
  • (b) Muscular strength: the one-repetition maximum (1RM) for a nominated lift (e.g. bench press, squat).
  • (c) Agility: the Illinois agility run test (a timed change-of-direction course).

Marking criteria: 1 mark per correctly matched test (max 3). Naming a test that measures the wrong component (e.g. sit-and-reach for agility) earns nothing - the test must be valid for the stated component.

Foundation (4 marks). Distinguish between validity and reliability in fitness testing. Explain how a test could be reliable but not valid, using an example.
Worked solution:

Validity is whether a test measures the component it claims to measure. Reliability is whether the test gives the same result on repeated trials under the same conditions.

They are independent. A test can be reliable but not valid: for example, using a sit-and-reach score to rate an athlete's aerobic capacity would give a very repeatable number (reliable), but the number says nothing about aerobic capacity (not valid). Conversely, a valid test that is poorly standardised (different surfaces, no warm-up) becomes unreliable.

Marking criteria: 1 mark for a correct definition of validity; 1 mark for reliability; 1 mark for stating they are independent; 1 mark for a worked example showing reliable-but-not-valid. A definition that swaps the two terms caps at 1.

Core (5 marks). Compare field tests and laboratory tests for assessing aerobic capacity. In your answer, name one example of each and judge when each is the better choice.
Worked solution:
Field test - the beep test (20 m shuttle run)
Cheap, quick, needs only a flat 20 m space and an audio track, and can test a whole squad at once in a sport-relevant setting. It estimates VO2max from the level reached, so it is less accurate and its validity depends on the prediction equation and on maximal effort.
Laboratory test - direct VO2max (graded test with a metabolic cart)
Directly measures expired-gas exchange, so it is highly accurate and valid, and controls variables (temperature, gradient, calibration). But it is expensive, slow (one athlete at a time), and requires trained staff and equipment.
Judgement
For squad screening, repeated monitoring and limited budgets, the field test is the better choice (practical, lets you re-test often). For a definitive, research-grade or talent-identification measure where accuracy matters most, the laboratory test is better. The right choice trades accuracy/validity against cost and practicality.

Marking criteria: 1 mark for a correctly named field test, 1 for a named lab test, 1 for a valid advantage of each, up to 2 for a justified judgement that links the choice to purpose (screening vs definitive measurement). A list with no judgement caps at 3.

Core (5 marks). STIMULUS. Two Year 12 students complete a battery of pre-season tests. The results are illustrative; the matched age-and-sex normative rating is given in brackets.

  • Student A: beep test level 9.6, estimated VO2max 47 mL/kg/min (good); vertical jump 38 cm (average); sit-and-reach +2 cm (below average).
  • Student B: beep test level 6.4, estimated VO2max 35 mL/kg/min (below average); vertical jump 56 cm (excellent); sit-and-reach +14 cm (excellent).

Both play the same field-sport position. (a) Compare the two fitness profiles. (b) Recommend a training priority for each, justified by the data.
Worked solution:

(a) Comparison. The profiles are almost mirror images. Student A has the stronger aerobic base (beep level 9.6, VO2max 47 = good) but weaker power (vertical jump 38 cm = average) and poor flexibility (+2 cm = below average). Student B has weak aerobic capacity (level 6.4, VO2max 35 = below average) but excellent power (56 cm) and excellent flexibility (+14 cm). Each is strong exactly where the other is weak.

(b) Priorities, justified by the data.

  • Student A: prioritise power/strength training (the vertical jump of 38 cm is only average and lags Student B by 18 cm) plus a flexibility block to lift the below-average sit-and-reach; the aerobic base is already good, so maintain rather than build it.
  • Student B: prioritise aerobic conditioning (a below-average VO2max of 35 will limit work rate and recovery in a field sport); the excellent power and flexibility can be maintained.

Each recommendation must be anchored to the specific test value that flags the weakness, not a generic "train harder".

Marking criteria: (a) up to 2 marks for a data-anchored comparison that quotes scores AND ratings for both athletes. (b) up to 3 marks for a justified priority for each student that names the limiting component and cites the figure behind it. Recommending a priority that contradicts the data (e.g. more aerobic work for Student A) loses the justification mark.
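
The reasoning in part (b) can be sketched as a small ranking exercise over the ratings given in the stimulus. This is an illustration only: the function name `training_priorities`, the four-step rating scale and the choice of "good" as the target are assumptions, and the order in which flagged components are actually trained remains a coaching judgement.

```python
# Assumed rating scale, weakest to strongest.
RATING_ORDER = ["below average", "average", "good", "excellent"]

def training_priorities(profile, target="good"):
    """List the components rated below `target`, weakest first."""
    target_rank = RATING_ORDER.index(target)
    flagged = [
        (component, rating)
        for component, rating in profile.items()
        if RATING_ORDER.index(rating) < target_rank
    ]
    return sorted(flagged, key=lambda item: RATING_ORDER.index(item[1]))

# Ratings copied from the stimulus.
student_a = {"aerobic capacity": "good", "power": "average", "flexibility": "below average"}
student_b = {"aerobic capacity": "below average", "power": "excellent", "flexibility": "excellent"}

print(training_priorities(student_a))  # flexibility and power are flagged
print(training_priorities(student_b))  # only aerobic capacity is flagged
```

Student A's good aerobic rating and Student B's excellent power and flexibility do not appear in the output, which matches the "maintain rather than build" recommendation in the worked solution.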

Core (5 marks). STIMULUS. A coach tests an athlete's beep test three times in one week to check the test before the season: trial 1 = level 8.5, trial 2 = level 8.6, trial 3 = level 6.1. Trial 3 was run on a wet, sloping car park after a heavy gym session; trials 1 and 2 were on a dry indoor court when the athlete was fresh. (a) What does this tell you about the reliability of the data? (b) Outline two steps to improve reliability for the season's testing.
Worked solution:

(a) Reliability. Trials 1 and 2 agree closely (8.5 and 8.6), but trial 3 (6.1) is far lower. The disagreement is not a real fitness change - it is caused by changed conditions (wet, sloping surface; fatigue from the prior gym session). So the testing as a whole is unreliable: the same athlete did not get a repeatable result because the protocol was not standardised. Trial 3 should be treated as an outlier and excluded; the athlete's true level is about 8.5 to 8.6.

(b) Two steps to improve reliability (any two):

  • Standardise the conditions - same flat, non-slip surface, same time of day, same temperature, same audio track and equipment each time.
  • Standardise the athlete's state - same warm-up, rested (no hard session beforehand), hydrated, same footwear and motivation/encouragement protocol.
  • Repeat trials and use the best/mean of consistent results, discarding clear outliers.

Marking criteria: (a) 1 mark for identifying poor reliability, 1 mark for attributing trial 3's drop to changed conditions (not real fitness). (b) 1 mark per sensible standardisation step (max 2), 1 mark for linking standardisation explicitly to reliability/repeatability. Saying the athlete "got worse" misreads the data and loses (a).
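
The "discard clear outliers and use the mean of consistent results" step can be sketched with the three trial scores from the stimulus. This is a rough illustration under stated assumptions: the tolerance of 0.5 is hypothetical, the function name `consistent_trials` is invented for the example, and the scores are treated as plain decimals even though beep-test results are normally recorded as level and shuttle.

```python
from statistics import mean, median

def consistent_trials(trials, tolerance=0.5):
    """Split trials into those within `tolerance` of the median and the rest."""
    centre = median(trials)
    kept = [t for t in trials if abs(t - centre) <= tolerance]
    outliers = [t for t in trials if abs(t - centre) > tolerance]
    return kept, outliers

trials = [8.5, 8.6, 6.1]          # trial scores from the stimulus
kept, outliers = consistent_trials(trials)
best_estimate = mean(kept)

print(kept)       # the two agreeing trials
print(outliers)   # the trial run under changed conditions
print(best_estimate)
```

A numerical flag like this only identifies which trial disagrees; deciding to exclude it still rests on knowing that the conditions changed, not on the number alone.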

Exam (8 marks). Analyse how a strength-and-conditioning coach selects valid and reliable fitness tests for a field-sport squad, and uses the results to prescribe training. Refer to specific tests and to the components of fitness.
Worked solution:

This is an 8-mark extended response. Markers reward a sustained analysis (test choice linked to validity/reliability linked to a prescription decision), not a labelled list of tests.

Band 6 PLAN.

  • Thesis: good testing is purposeful - the coach matches each test to the component it validly measures, standardises it for reliability, rates results against normative data, and feeds the weakest components into the next training block. Testing is the evidence base for prescription, not a ritual.
  • Argument line 1 - VALIDITY (choose the right test). Each component needs a test that actually measures it: aerobic capacity via the beep test (field) or direct VO2max (lab); strength via 1RM; leg power via the vertical jump; agility via the Illinois run; speed via a 20 m timed sprint; flexibility via sit-and-reach. Using a test that measures the wrong quality (e.g. sit-and-reach for power) makes the data useless however neatly it is collected.
  • Argument line 2 - RELIABILITY and field-vs-lab trade-off. Standardise the protocol (same surface, order, warm-up, equipment, time of day, rested state) so a re-test reflects training change, not changed conditions. Field tests are chosen for squad-wide, repeatable, low-cost monitoring; a lab VO2max is reserved for a definitive measure where accuracy outweighs cost. Reliability is what lets pre-test and post-test be compared at all.
  • Argument line 3 - from data to prescription. Convert raw scores to normative ratings, set each athlete's baseline, identify the limiting component, and prescribe accordingly (low beep level = aerobic block; low 1RM = strength block), then re-test to evaluate the block and close the loop (the monitoring/evaluation cycle).
  • Synthesis: validity gets the right number, reliability makes the number trustworthy across re-tests, normative data makes it meaningful, and the prescription targets the weakness - judge the whole process as an evidence-driven cycle.

Model paragraph (validity and prescription line). The coach begins by matching each component of fitness to a test that validly measures it, because an invalid test produces a precise but meaningless number. Aerobic capacity is assessed with the multistage beep test, which estimates VO2max from the level reached and can screen a whole squad cheaply; leg power with the vertical jump; agility with the timed Illinois run. Each protocol is then standardised - same surface, warm-up and equipment - so that reliability is high and a post-season re-test reflects real adaptation rather than a change in conditions. Raw scores are converted to age-and-sex normative ratings to set every athlete's baseline, and the lowest-rated component becomes the training priority: a player rated below average for aerobic capacity is prescribed an aerobic-development block, while one already rated good there maintains it and trains the weaker quality instead. The block is then re-tested, so the data both diagnoses the need and verifies the response - testing drives the prescription rather than decorating it.

Marker's note: top-band answers (1) name a valid test for several components, (2) treat validity and reliability as distinct and explain the field-vs-lab trade-off, (3) carry the chain from a test result through a normative rating to a specific prescription and a re-test, and (4) keep answering the verb - ANALYSE means show how test choice, data quality and training decisions relate. Citing a specific score-to-prescription decision (e.g. a low beep level driving an aerobic block) lifts the response above a generic list of tests.
