
How does the sample proportion behave as a random variable, and how do we use its normal approximation?

Use the sample proportion $\hat{p} = X/n$ as a random variable, with mean $p$, variance $pq/n$ and standard deviation $\sqrt{pq/n}$, and apply its normal approximation $\hat{p} \sim N(p,\, pq/n)$

A focused answer to the HSC Maths Extension 1 dot point on sample proportions. What $\hat{p} = X/n$ is, why it is a random variable, its mean $p$, variance $pq/n$ and standard deviation $\sqrt{pq/n}$, the normal approximation $\hat{p} \sim N(p, pq/n)$, computing probabilities about $\hat{p}$, the effect of sample size, and expected-range reasoning for polling and quality control.

Generated by Claude Opus 4.814 min answer

Reviewed by: AI editorial process; not yet individually human-reviewed


What this dot point is asking

NESA wants you to treat the sample proportion $\hat{p} = X/n$ as a random variable in its own right, know its mean $p$, its variance $\dfrac{pq}{n}$ and its standard deviation $\sqrt{\dfrac{pq}{n}}$, and use the normal approximation $\hat{p} \sim N\!\left(p,\, \dfrac{pq}{n}\right)$ to compute probabilities about how close a survey's estimate is likely to be to the true value. This is the one slice of the statistics chapter that is uniquely Extension 1, and it is where binomial theory turns into real polling and quality-control reasoning.

The answer

What a sample proportion is

Run a binomial experiment: $n$ independent Bernoulli trials, each a success with probability $p$. Let $X \sim B(n, p)$ count the successes. The population proportion $p$ is a fixed (usually unknown) number, for example the true fraction of Australians who support a policy. The sample proportion

$$\hat{p} = \frac{X}{n}$$

is the fraction of your sample that were successes. If you poll $n = 2000$ voters and $X = 700$ say they support a party, your sample proportion is $\hat{p} = \dfrac{700}{2000} = 0.35$, an estimate of the true $p$.

The crucial idea is that $\hat{p}$ is itself a random variable. Run the survey again with a fresh random sample and you get a different $X$, hence a different $\hat{p}$. So $\hat{p}$ has its own distribution, mean and spread, and those are exactly what tell us how trustworthy a single survey is.

Why the mean is $p$ and the variance is $pq/n$

These are not new results to memorise blindly; they fall straight out of the binomial mean and variance you already know, $E(X) = np$ and $\operatorname{Var}(X) = npq$, using the scaling rules $E(aX) = aE(X)$ and $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$ with the constant $a = \dfrac{1}{n}$:

$$E(\hat{p}) = E\!\left(\frac{X}{n}\right) = \frac{1}{n} E(X) = \frac{np}{n} = p,$$

$$\operatorname{Var}(\hat{p}) = \operatorname{Var}\!\left(\frac{X}{n}\right) = \frac{1}{n^2}\operatorname{Var}(X) = \frac{npq}{n^2} = \frac{pq}{n},$$

$$\text{SD}(\hat{p}) = \sqrt{\frac{pq}{n}} = \frac{\sqrt{npq}}{n} = \frac{\text{SD}(X)}{n}.$$

The mean result, $E(\hat{p}) = p$, is the whole reason surveys work: on average the sample proportion equals the population proportion, so $\hat{p}$ is an unbiased estimator of $p$. The variance result is the whole reason large samples are better: the spread carries a factor of $\dfrac{1}{n}$, so it shrinks as $n$ grows. Note the scale check too: $\hat{p}$ is a proportion (a number between $0$ and $1$), so its SD $\sqrt{pq/n}$ is also a small fraction, never the large count-sized spread $\sqrt{npq}$ that $X$ has.
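One way to see these two results in action is to simulate many repeated surveys and compare the observed mean and spread of the sample proportion with the formulas. The sketch below is only an illustration: the function name `simulate_sample_proportions`, the seed, and the choice of $n = 50$, $p = 0.4$ and 10,000 repeats are assumptions made for the example, not part of the syllabus.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_sample_proportions(n, p, surveys, seed=0):
    """Simulate `surveys` independent surveys of size n; return each p-hat = X/n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(surveys):
        # One draw of X ~ B(n, p): count the successes in n Bernoulli trials.
        successes = sum(1 for _trial in range(n) if rng.random() < p)
        results.append(successes / n)
    return results


n, p = 50, 0.4
p_hats = simulate_sample_proportions(n, p, surveys=10_000)

# The simulated values should sit close to p and sqrt(pq/n), not match them exactly.
print("simulated mean of p-hat:", round(mean(p_hats), 4), "| theory:", p)
print("simulated SD of p-hat:  ", round(pstdev(p_hats), 4),
      "| theory:", round(sqrt(p * (1 - p) / n), 4))
```

Each run of the inner loop plays the role of one fresh survey, so the list `p_hats` is a sample from the distribution of $\hat{p}$ itself.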

The distribution of $\hat{p}$ is the binomial, restretched

Because $\hat{p} = X/n$, each value of $\hat{p}$ corresponds to exactly one value of $X$ and carries the same probability:

$$P\!\left(\hat{p} = \frac{x}{n}\right) = P(X = x) = \binom{n}{x} p^x q^{\,n-x}.$$

So the probability graph of $\hat{p}$ is just the probability graph of $X$ with the horizontal axis relabelled (squashed from the integers $0, 1, \ldots, n$ onto the fractions $0, \tfrac1n, \ldots, 1$). Nothing about the probabilities changes, only the scale of the axis. That is why every binomial tool you have still applies: to find a probability about $\hat{p}$ exactly, convert it into the matching statement about $X$ and sum binomial terms; to find it quickly for large $n$, use the normal approximation below.
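The relabelling can be made concrete by tabulating the distribution of $\hat{p}$ directly from the binomial formula. The sketch below uses exact fractions so that no rounding creeps in; the helper name `phat_distribution` and the fair-coin example with $n = 4$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def phat_distribution(n, p):
    """Map each possible value x/n of p-hat to P(p-hat = x/n) = P(X = x)."""
    q = 1 - p
    return {Fraction(x, n): comb(n, x) * p**x * q**(n - x) for x in range(n + 1)}


# Fair coin tossed 4 times: p-hat takes the values 0, 1/4, 1/2, 3/4, 1.
dist = phat_distribution(4, Fraction(1, 2))
for value, prob in dist.items():
    print(f"p-hat = {value}: probability {prob}")

# The probabilities are the binomial ones, so they still add to 1.
print("total probability:", sum(dist.values()))
```

Only the dictionary keys differ from a table for $X$: the probabilities attached to them are untouched.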

The normal approximation $\hat{p} \sim N(p,\, pq/n)$

For large $n$ the binomial is well approximated by a normal distribution (the central limit theorem). Dividing through by $n$ carries that approximation over to $\hat{p}$: it becomes approximately normal with the same mean and variance we just derived,

$$\hat{p} \sim N\!\left(p,\ \frac{pq}{n}\right) \qquad\text{when } np \ge 5 \text{ and } nq \ge 5.$$

The validity conditions are the same $np \ge 5$, $nq \ge 5$ rule used for approximating $X$ itself, because $\hat{p}$ and $X$ have identical probabilities. The picture below is the sampling distribution of $\hat{p}$ for a national poll with true support $p = 0.4$ and $n = 1000$: a bell curve centred exactly on the true value $p$, with standard deviation $\sqrt{pq/n} \approx 0.0155$. The shaded band is the central $\pm 2$ standard deviations, the "expected range" almost every poll will fall inside.

[Figure: Sampling distribution of the sample proportion. A normal bell curve for $\hat{p} \sim N(p, pq/n)$ with true value $p = 0.4$ and sample size $n = 1000$. The curve is centred on $0.40$ with SD $\approx 0.0155$, and the central $\pm 2$ SD band from $0.369$ to $0.431$ (about $95\%$ of the area) is shaded.]
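The numbers in this picture can be reproduced with Python's standard-library `statistics.NormalDist`. This is a minimal sketch under the page's own validity rule; the helper name `phat_normal` is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist


def phat_normal(n, p):
    """Approximating normal for p-hat: mean p, standard deviation sqrt(pq/n)."""
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("need np >= 5 and nq >= 5 for the normal approximation")
    return NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))


dist = phat_normal(1000, 0.4)

# Central band of plus or minus 2 standard deviations around p.
lower = dist.mean - 2 * dist.stdev
upper = dist.mean + 2 * dist.stdev
prob_in_band = dist.cdf(upper) - dist.cdf(lower)

print("SD of p-hat:", round(dist.stdev, 4))
print("band:", round(lower, 3), "to", round(upper, 3))
print("probability inside band:", round(prob_in_band, 4))
```

The same `dist.cdf` call answers any "probability that $\hat{p}$ lies between..." question without a $z$-table, which makes it a handy way to check hand working.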

The effect of sample size $n$

The mean of $\hat{p}$ is always $p$, no matter the sample size, so a bigger sample does not move the centre; it sharpens the distribution around it. Since

$$\text{SD}(\hat{p}) = \sqrt{\frac{pq}{n}},$$

the spread is inversely proportional to $\sqrt{n}$. To halve the standard deviation you must quadruple the sample. The overlay below fixes $p = 0.5$ and stacks the sampling distributions for $n = 100$, $n = 400$ and $n = 1600$: each fourfold jump in $n$ halves the standard deviation (from $0.05$ to $0.025$ to $0.0125$) and the curve becomes correspondingly taller and tighter around $0.5$.

[Figure: Effect of sample size on the spread of the sample proportion. Three normal curves for $\hat{p}$ with true value $p = 0.5$, for $n = 100$ (SD $0.05$), $n = 400$ (SD $0.025$) and $n = 1600$ (SD $0.0125$), plotted for $\hat{p}$ from $0.30$ to $0.70$. As the sample size quadruples the standard deviation halves, so the curves become taller and narrower around $0.5$.]

This is the link to confidence-style "expected range" reasoning. Because $\hat{p}$ is roughly normal, about $95\%$ of the time a single survey's $\hat{p}$ lands within $\pm 2$ standard deviations of $p$ (more precisely $\pm 1.96$ SD). Read backwards, that interval is the "margin of error": if a poll of $n = 1000$ reports $\hat{p} = 0.40$, you can say the true value is very likely within $2 \times 0.0155 \approx 0.031$, i.e. roughly $0.40 \pm 0.03$. The smaller you want that margin, the larger $n$ has to be, and because of the $\sqrt{n}$, shrinking the margin by half costs four times the sample.
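The square-root relationship is easy to check numerically. The sketch below is an illustration only: the helper names `sd_phat` and `margin_of_error` are assumptions, and the default multiplier `z=2.0` follows this page's "$\pm 2$ SD" rule of thumb rather than the more precise $1.96$.

```python
from math import sqrt


def sd_phat(n, p):
    """Standard deviation of the sample proportion, sqrt(pq/n)."""
    return sqrt(p * (1 - p) / n)


def margin_of_error(n, p, z=2.0):
    """Half-width of the central band: z standard deviations of p-hat."""
    return z * sd_phat(n, p)


# Each fourfold increase in n should halve the standard deviation.
for n in (100, 400, 1600):
    print(f"n = {n:4d}: SD = {sd_phat(n, 0.5):.4f}")

# Margin of error for the n = 1000 poll with p = 0.4 discussed above.
print(f"margin of error: {margin_of_error(1000, 0.4):.3f}")
```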

Exact versus approximate

For small $n$, do not reach for the normal curve: the values of $\hat{p}$ are too few and chunky. Instead convert the question about $\hat{p}$ into the matching question about $X = n\hat{p}$ and sum binomial terms exactly. For large $n$ (both $np \ge 5$ and $nq \ge 5$), the normal approximation $N(p, pq/n)$ replaces a long sum with one or two $z$-lookups. A continuity correction of $\pm \dfrac{1}{2n}$ (half of the step $\dfrac1n$ between adjacent $\hat{p}$ values) can be applied, but for the large samples typical of polling it is negligible and is usually dropped, which is the convention this page follows.
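To get a feel for when the continuity correction matters, the exact binomial sum can be set beside the normal approximation with and without the $\pm \dfrac{1}{2n}$ adjustment. The sketch below is a hedged illustration: the function names and the test case ($n = 20$, $p = 0.5$, $\hat{p}$ between $0.4$ and $0.6$) are assumptions chosen to make the difference visible, not an example taken from the syllabus.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_prob(n, p, x_lo, x_hi):
    """Exact P(x_lo/n <= p-hat <= x_hi/n), summed as P(x_lo <= X <= x_hi)."""
    q = 1 - p
    return sum(comb(n, x) * p**x * q**(n - x) for x in range(x_lo, x_hi + 1))


def normal_prob(n, p, x_lo, x_hi, continuity=False):
    """Normal approximation to the same probability, optionally corrected."""
    sd = sqrt(p * (1 - p) / n)
    half_step = 1 / (2 * n) if continuity else 0.0  # half the gap between p-hat values
    z_lo = (x_lo / n - half_step - p) / sd
    z_hi = (x_hi / n + half_step - p) / sd
    standard = NormalDist()
    return standard.cdf(z_hi) - standard.cdf(z_lo)


# Small sample: p-hat between 0.4 and 0.6 means X between 8 and 12.
n, p, x_lo, x_hi = 20, 0.5, 8, 12
print("exact binomial:       ", round(exact_prob(n, p, x_lo, x_hi), 4))
print("normal, no correction:", round(normal_prob(n, p, x_lo, x_hi), 4))
print("normal, corrected:    ", round(normal_prob(n, p, x_lo, x_hi, continuity=True), 4))
```

Rerunning the last four lines with a larger sample, for instance `n, p, x_lo, x_hi = 400, 0.5, 180, 220`, shows the corrected and uncorrected values moving much closer together, which is why the correction is usually dropped for poll-sized samples.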

How exam questions ask about sample proportions

The wording is the tell. Map the phrase to the move:

  • "Write down / find the sample proportion": just compute $\hat{p} = X/n$ from the given count. A 1-mark opener.
  • "State the mean and standard deviation of $\hat{p}$": quote $E(\hat{p}) = p$ and $\text{SD}(\hat{p}) = \sqrt{pq/n}$ (note the square root, and that the SD is a small fraction, not $\sqrt{npq}$).
  • "Show that $E(\hat{p}) = p$ / $\operatorname{Var}(\hat{p}) = pq/n$": derive from $E(X) = np$, $\operatorname{Var}(X) = npq$ using $\hat{p} = X/n$ and the scaling rules $E(aX) = aE(X)$, $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$.
  • "Use the normal approximation to find $P(\hat{p} \ldots)$" or "estimate the probability the sample proportion is between ...": confirm $np \ge 5$ and $nq \ge 5$, write $\hat{p} \sim N(p, pq/n)$, standardise with $z = \dfrac{\hat{p} - p}{\sqrt{pq/n}}$, look up.
  • "within $0.0X$ of the true value": this is $P(|\hat{p} - p| \le 0.0X)$, a symmetric interval $p \pm 0.0X$; standardise both ends and use $2\,P(Z \le z) - 1$.
  • "What sample size is needed so the estimate is within ... with probability ...": set $z \sqrt{pq/n} \le \text{margin}$ and solve for $n$, using the worst case $p = 0.5$ if $p$ is unknown.
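The last move in this list, solving $z\sqrt{pq/n} \le \text{margin}$ for $n$, rearranges to $n \ge \dfrac{z^2 pq}{\text{margin}^2}$. A minimal sketch of that calculation follows; the function name `required_sample_size` and its defaults are assumptions for illustration, and it takes $z$ from `NormalDist().inv_cdf` rather than the rounded table value $1.96$, so results can occasionally differ by one from a hand calculation.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin, probability=0.95, p=0.5):
    """Smallest n with z * sqrt(pq/n) <= margin, for the central `probability` band.

    p defaults to the worst case 0.5, where pq is largest.
    """
    z = NormalDist().inv_cdf((1 + probability) / 2)  # about 1.96 for 0.95
    return ceil(z**2 * p * (1 - p) / margin**2)      # always round up


print(required_sample_size(0.02))           # within 0.02 of p, worst case
print(required_sample_size(0.03))           # a looser margin needs fewer people
print(required_sample_size(0.02, p=0.2))    # a known p away from 0.5 also needs fewer
```

Halving the margin multiplies the required $n$ by about four, the same $\sqrt{n}$ effect described earlier.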

Practice questions

Original practice questions graded from foundation to exam level, each with a full worked solution. Try them before revealing the solution.

foundation · 3 marks
A market-research firm surveys $50$ randomly chosen Sydney commuters and finds that $18$ used a train at least once last week. Write down the sample proportion $\hat{p}$. If the true population proportion is $p = 0.4$, state the mean, variance and standard deviation of $\hat{p}$ for a sample of this size.
Show worked solution →
Sample proportion
With $X = 18$ successes out of $n = 50$,
$$\hat{p} = \frac{X}{n} = \frac{18}{50} = 0.36.$$
Mean
$E(\hat{p}) = p = 0.4$.
Variance
With $q = 1 - p = 0.6$,
$$\operatorname{Var}(\hat{p}) = \frac{pq}{n} = \frac{0.4 \times 0.6}{50} = \frac{0.24}{50} = 0.0048.$$
Standard deviation
$\sqrt{0.0048} \approx 0.0693$.

So this one survey produced an estimate of $0.36$, and the estimator $\hat{p}$ is centred on the true value $0.4$ with a spread of about $0.069$.

foundation · 3 marks
A fair coin is tossed $4$ times and $\hat{p}$ is the proportion of heads. List the possible values of $\hat{p}$, then find its mean and standard deviation.
Show worked solution →
Possible values
$X$ can be $0, 1, 2, 3, 4$, so $\hat{p} = X/4$ takes the $5$ values
$$0,\ \tfrac14,\ \tfrac12,\ \tfrac34,\ 1 \quad\text{i.e.}\quad 0,\ 0.25,\ 0.5,\ 0.75,\ 1.$$
Mean
$E(\hat{p}) = p = 0.5$.
Standard deviation
With $p = q = 0.5$ and $n = 4$,
$$\operatorname{Var}(\hat{p}) = \frac{pq}{n} = \frac{0.25}{4} = 0.0625, \qquad \text{SD} = \sqrt{0.0625} = 0.25.$$

Note how few values $\hat{p}$ has and how large the spread is: with only $4$ trials a single survey tells you very little about $p$.

core · 4 marks
A streaming service estimates that a new show has a true national audience share of $p = 0.20$. A ratings panel of $n = 625$ households is sampled. Using the normal approximation to $\hat{p}$, estimate the probability that the panel's sample proportion lies between $0.18$ and $0.23$. Use $P(Z \le 1.88) \approx 0.9699$ and $P(Z \le 1.25) \approx 0.8944$.
Show worked solution →

Set up the model. Here $p = 0.20$, $q = 0.80$, $n = 625$. Check the conditions: $np = 125 \ge 5$ and $nq = 500 \ge 5$, so the normal approximation is valid.

Parameters of the approximating normal.

$$\text{mean} = p = 0.20, \qquad \operatorname{Var}(\hat{p}) = \frac{pq}{n} = \frac{0.16}{625} = 0.000256,$$

$$\text{SD} = \sqrt{0.000256} = 0.016.$$

So $\hat{p} \approx N(0.20,\, 0.000256)$.

Standardise both endpoints.

$$z_{\text{lower}} = \frac{0.18 - 0.20}{0.016} = -1.25, \qquad z_{\text{upper}} = \frac{0.23 - 0.20}{0.016} = 1.875 \approx 1.88.$$

Read the probability.

$$P(0.18 \le \hat{p} \le 0.23) \approx P(-1.25 \le Z \le 1.88) = P(Z \le 1.88) - P(Z \le -1.25).$$

The given values cover only positive $z$, so use symmetry for the lower tail: $P(Z \le -1.25) = 1 - P(Z \le 1.25) = 1 - 0.8944 = 0.1056$. Simply subtracting the two given values, $0.9699 - 0.8944 = 0.0755$, is a common slip: it uses $P(Z \le 1.25)$ where $P(Z \le -1.25)$ is needed. Hence

$$P(-1.25 \le Z \le 1.88) \approx 0.9699 - 0.1056 = 0.8643.$$

Answer. About $0.864$, so roughly an $86\%$ chance the panel's share lands in $[0.18, 0.23]$.

core · 4 marks
A bottling plant claims its true defective rate is $p = 0.02$. A quality inspector samples $n = 900$ bottles. Find the standard deviation of the sample proportion $\hat{p}$, then use the normal approximation to estimate the probability that the inspector observes a sample defective rate greater than $0.03$. Use $P(Z \le 2.14) \approx 0.9838$.
Show worked solution →
Conditions
$np = 900 \times 0.02 = 18 \ge 5$ and $nq = 900 \times 0.98 = 882 \ge 5$, so the approximation is valid.
Standard deviation
With $q = 0.98$,
$$\operatorname{Var}(\hat{p}) = \frac{pq}{n} = \frac{0.02 \times 0.98}{900} = \frac{0.0196}{900} \approx 0.0000218,$$

$$\text{SD} = \sqrt{0.0000218} \approx 0.004667.$$
Standardise
For $\hat{p} > 0.03$,
$$z = \frac{0.03 - 0.02}{0.004667} \approx 2.14.$$

Read the probability.

$$P(\hat{p} > 0.03) \approx P(Z > 2.14) = 1 - P(Z \le 2.14) \approx 1 - 0.9838 = 0.0162.$$

Answer. About $0.016$. So even though the claimed rate is only $2\%$, there is roughly a $1.6\%$ chance a clean batch shows a sample rate above $3\%$ purely by sampling variation, which is worth remembering before raising an alarm.

exam · 5 marks
A polling company wants to estimate the proportion $p$ of voters supporting a referendum. It requires a $95\%$ chance that its sample proportion $\hat{p}$ falls within $0.02$ of the true value $p$. Taking the worst case $p = 0.5$, and using $z = 1.96$ for the central $95\%$ of a normal distribution, find the smallest sample size $n$ the company should use.
Show worked solution →
Translate the requirement
"Within $0.02$ of $p$ with probability $0.95$" means the half-width of the central $95\%$ interval of $\hat{p}$ must be at most $0.02$:
$$1.96 \times \text{SD}(\hat{p}) \le 0.02, \qquad \text{SD}(\hat{p}) = \sqrt{\frac{pq}{n}}.$$
Substitute the worst case
The product $pq$ is largest at $p = 0.5$, where $pq = 0.25$. Using $p = 0.5$ gives the most demanding (largest) $n$, which is safe for any true $p$:
$$1.96 \sqrt{\frac{0.25}{n}} \le 0.02.$$
Solve for $n$
Square both sides:
$$1.96^2 \cdot \frac{0.25}{n} \le 0.02^2 \;\Longrightarrow\; n \ge \frac{1.96^2 \times 0.25}{0.02^2} = \frac{3.8416 \times 0.25}{0.0004} = 2401.$$
Answer
The company needs $n = 2401$ voters (round up to guarantee the bound). This is the textbook "margin of error $\pm 2\%$" sample size for a national poll, and it explains why such polls quote samples of roughly $2000$ to $2500$ people.

exam · 5 marks
Two opinion polls estimate the same true support level $p = 0.45$. Poll A samples $n_A = 400$ voters; Poll B samples $n_B = 1600$. (a) Compare the standard deviations of $\hat{p}$ for the two polls. (b) Using the normal approximation, estimate for each poll the probability that $\hat{p}$ falls within $0.03$ of $p$. Use $P(Z \le 1.21) \approx 0.8869$ and $P(Z \le 2.41) \approx 0.9920$.
Show worked solution →

(a) Standard deviations. With $p = 0.45$, $q = 0.55$ so $pq = 0.2475$.

$$\text{SD}_A = \sqrt{\frac{0.2475}{400}} \approx 0.02487, \qquad \text{SD}_B = \sqrt{\frac{0.2475}{1600}} \approx 0.01244.$$

Because $n_B = 4 n_A$ and the SD has $\sqrt{n}$ in the denominator, quadrupling the sample halves the standard deviation: $\text{SD}_A / \text{SD}_B = \sqrt{1600/400} = 2$.

(b) Probability within $0.03$ for Poll A.

$$z_A = \frac{0.03}{0.02487} \approx 1.21,$$

$$P(|\hat{p} - p| \le 0.03) \approx P(-1.21 \le Z \le 1.21) = 2 \times 0.8869 - 1 = 0.7738.$$

Probability within $0.03$ for Poll B.

$$z_B = \frac{0.03}{0.01244} \approx 2.41,$$

$$P(|\hat{p} - p| \le 0.03) \approx 2 \times 0.9920 - 1 = 0.9840.$$

Answer. Poll A has about a $77\%$ chance of landing within $0.03$ of the truth; Poll B about $98\%$. Halving the standard deviation sharply tightens the estimate, which is why larger samples are worth the cost.
