Key Idea: Before you can analyse data, you need to understand what kind of data you have and how it was collected. Topic 4.1 covers the vocabulary of statistics: population vs sample, types of data, and sampling methods. Getting these right matters because the method of collection affects the validity of any conclusions you draw.
✅ Types of data
✅ Population and sampling
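Reliability: Random sampling reduces bias because selection is left to chance rather than to the person collecting the data, and a larger sample reduces chance variation, so results become more reliable. Non-random methods such as convenience sampling are faster but more likely to be biased. Outlier impact: A single extreme value (outlier) can distort the mean significantly. Always identify outliers before drawing conclusions.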
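The outlier point can be seen with a quick calculation. The sketch below uses a made-up data set (the commute times are hypothetical, chosen only for illustration) and compares the mean before and after one extreme value is added.

```python
from statistics import mean

# Hypothetical data set: commute times in minutes (illustrative only).
times = [12, 15, 14, 13, 16, 15, 14]

# The same data with one extreme value (an outlier) added.
times_with_outlier = times + [90]

mean_before = mean(times)               # 99 / 7
mean_after = mean(times_with_outlier)   # 189 / 8

print(f"mean without outlier: {mean_before:.1f}")
print(f"mean with outlier:    {mean_after:.1f}")
```

With these numbers the mean rises from about 14 to well over 23, even though seven of the eight values are between 12 and 16.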
Paper 1: Questions often ask you to identify data type or explain why a sampling method is biased. Write a specific reason: 'convenience sampling means people who are easy to reach are over-represented' earns the mark; vague answers do not. Paper 2: You may need to calculate the sample size per stratum. Use: number from stratum = (stratum size / population size) × total sample size.
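As a minimal sketch of the stratum rule above, the function below applies it to every stratum at once. The function name and the school data are hypothetical, invented for illustration; only the formula comes from the notes.

```python
def stratum_sample_sizes(strata, total_sample):
    """Apply (stratum size / population size) * total sample size to each stratum."""
    population = sum(strata.values())
    # Multiply before dividing so whole-number answers come out exact.
    return {name: size * total_sample / population for name, size in strata.items()}

# Hypothetical school of 1200 students in three year groups (illustrative only).
school = {"Year 10": 500, "Year 11": 400, "Year 12": 300}
allocation = stratum_sample_sizes(school, total_sample=60)

for year, n in allocation.items():
    print(year, n)
```

If a result is not a whole number, round it and then check that the rounded values still add up to the total sample size.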
IB-style question [6 marks]
A town with 2000 adult residents is surveyed about a proposed cycle lane. The residents are grouped by age: 800 are under 30, 700 are aged 30 to 60, and 500 are over 60. A stratified sample of 80 residents is taken. (a) Explain why stratified sampling is more suitable here than simple random sampling. (b) Find the number of residents that should be selected from each age group. (c) State one type of data being collected (support for the cycle lane) and classify it.
Step by step:
(a) Stratified sampling guarantees every age group is represented in proportion to its size, so the views of older and younger residents are not under- or over-counted by chance.
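(b) Use the stratified-sample rule for each group. State it first: number selected = (group size / population size) × sample size.
Under 30: (800 / 2000) × 80 = 32.
Aged 30–60: (700 / 2000) × 80 = 28.
Over 60: (500 / 2000) × 80 = 20 (or 80 − 32 − 28 = 20). Check the total: 32 + 28 + 20 = 80.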
(c) 'Support for the cycle lane' (yes / no) is a category, not a number.
Answers: (a) It keeps each age group represented in proportion to its size, avoiding chance over- or under-representation. (b) 32 under 30, 28 aged 30–60, 20 over 60. (c) Support (yes/no) is qualitative (categorical) data.
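To see what part (a) means in practice, the sketch below draws one simple random sample of 80 from the town's 2000 residents and compares its age-group counts with the stratified counts. The seed is arbitrary and used only to make the run repeatable; the group labels are shorthand for the age bands in the question.

```python
import random
from collections import Counter

# Population from the question: 800 under 30, 700 aged 30 to 60, 500 over 60.
population = ["under 30"] * 800 + ["30 to 60"] * 700 + ["over 60"] * 500

rng = random.Random(1)  # arbitrary seed, only so the result is repeatable

# Simple random sample: every resident is equally likely to be chosen,
# so the number drawn from each age group is left to chance.
srs_counts = Counter(rng.sample(population, 80))

# Stratified sample: the number from each group is fixed in proportion
# to the group's size (integer division is exact for these numbers).
group_sizes = Counter(population)
stratified_counts = {g: size * 80 // len(population) for g, size in group_sizes.items()}

print("simple random:", dict(srs_counts))
print("stratified:   ", stratified_counts)
```

The simple random counts will usually be close to 32, 28 and 20 but are not guaranteed to match them, whereas the stratified counts always do.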