Alpha (significance level)
Your threshold, set before running the test. If p < alpha, you reject H0. It represents the maximum false-positive rate you accept.
alpha = 0.05 → you accept a 5% risk of falsely concluding a difference exists.
Alternative hypothesis (H1)
What you expect if H0 is false. Must be stated before data collection. Two-tailed: a difference in any direction. One-tailed: a specific direction.
"Design B will yield significantly higher SUS scores than design A."
Between-subjects design
Different participants are assigned to each condition or group. Each person only experiences one version of the interface or task. Results reflect how distinct groups compare.
Group A (n=15) tests Design A; a separate Group B (n=15) tests Design B. → Use Mann-Whitney U or Independent t-test.
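A minimal sketch of both test routes for this scenario, using made-up SUS scores and scipy (the numbers are illustrative, not real data):

```python
from scipy import stats

# Hypothetical SUS scores from two separate groups of 15 (made-up data).
group_a = [68, 72, 65, 70, 75, 60, 66, 71, 69, 63, 74, 67, 70, 64, 72]
group_b = [78, 82, 75, 80, 85, 70, 76, 81, 79, 73, 84, 77, 80, 74, 82]

# Non-parametric route: Mann-Whitney U (works on ranks, no normality assumption).
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Parametric route, if normality is plausible: independent t-test.
t_stat, p_t = stats.ttest_ind(group_a, group_b)

print(f"Mann-Whitney U = {u_stat}, p = {p_mw:.4f}")
print(f"Independent t = {t_stat:.2f}, p = {p_t:.4f}")
```

Both calls take the two groups as separate arguments because the samples are independent; no pairing between participants is assumed.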
Cohen's d
Effect size for t-tests. The difference between two means expressed in standard deviation units.
d = 0.6 → the groups differ by 0.6 standard deviations — a medium effect.
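A small helper for computing d from two independent groups, using the pooled standard deviation (a common variant; the function name and data are illustrative):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

print(cohens_d([2, 3, 4], [1, 2, 3]))  # → 1.0 (both SDs are 1, means differ by 1)
```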
Effect size
How large the difference is, independent of sample size. A result can be statistically significant but have a negligible effect. Always report alongside p.
Cohen's d: small=0.2, medium=0.5, large=0.8. Rank-biserial r: small=0.1, medium=0.3, large=0.5.
Formulating a hypothesis
Name both groups, the measure, and the expected relationship. H0 claims no difference. H1 states what you expect. Both must be written before collecting data.
H0: "Mean task time will not differ between A and B." H1: "Design B will result in lower task completion time than A."
Normal distribution (checking normality)
A bell-shaped distribution where most values cluster around the mean. Parametric tests (t-tests) assume data is approximately normally distributed. For small samples (<30) this matters more; with n ≥ 30 the Central Limit Theorem often compensates.
"SUS scores from 8 participants should not be assumed normal — use Wilcoxon instead."
How to check with descriptive stats:
Skewness: values between −1 and +1 suggest approximate symmetry. Values beyond ±2 indicate strong skew.
Kurtosis: values near 0 (excess kurtosis) suggest normal tails. Beyond ±2 is a concern.
Mean ≈ Median: if they are close, the distribution is roughly symmetric.
Range check: if min/max are near the theoretical bounds of your scale, data may be bounded — not normal.
In practice: for n < 15, always prefer non-parametric tests. For n = 15–30, check skewness and mean vs. median. For n > 30, parametric tests are generally robust.
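The rules of thumb above can be sketched as one helper. This is one plausible combination of the listed checks (skewness plus excess kurtosis for mid-size samples), not a standard function; thresholds are taken from this section:

```python
from scipy.stats import skew, kurtosis

def choose_test_family(values):
    """Pick parametric vs. non-parametric per the rules of thumb above."""
    n = len(values)
    if n < 15:
        return "non-parametric"   # too few points to verify normality
    if n <= 30:
        # Mid-size sample: check skewness (±1) and excess kurtosis (±2).
        symmetric = abs(skew(values)) <= 1 and abs(kurtosis(values)) <= 2
        return "parametric" if symmetric else "non-parametric"
    return "parametric"           # n > 30: parametric tests are generally robust

print(choose_test_family(list(range(10))))  # → non-parametric (n < 15)
```

scipy's `kurtosis` returns excess kurtosis by default (0 for a normal distribution), matching the convention used above.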
Non-parametric test
Does not assume normality. Works with ranks. Best for Likert scales, ordinal data, or small samples where normality cannot be verified.
SUS scores and task difficulty ratings (1–5) are ordinal — use Wilcoxon or Mann-Whitney.
Null hypothesis (H0)
Default assumption: no significant difference between groups. You are trying to find evidence strong enough to reject it.
"There is no significant difference in SUS scores between design A and design B."
One-tailed test
Tests whether one specific group scores higher. Only valid if the direction was predicted before data collection. Using it post-hoc inflates Type I error.
H1: "Design B will score higher than Design A on ease of use."
p-value
Probability of observing a result at least as extreme as yours if H0 were true. A small p-value means the data is unlikely under the null hypothesis.
p = 0.03 → 3% chance of seeing this difference (or larger) if no real effect exists.
Paired vs. independent
Paired: same participants in both conditions. Independent: different participants per group. Choosing the wrong structure invalidates the test.
Paired: 12 users rate both A and B. Independent: 12 users rate A, a different 12 rate B.
Rank-biserial r
Effect size for Wilcoxon and Mann-Whitney. Ranges from −1 to +1, usually reported as a magnitude between 0 and 1, indicating how consistently one group outranks the other.
r = 0.4 → moderate tendency for one group to score higher in pairwise comparisons.
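One way to compute r is by direct pair counting, which is equivalent to the U-based formula r = 2U/(n1·n2) − 1 (the helper name and data are illustrative):

```python
def rank_biserial(a, b):
    """Rank-biserial r: (favorable pairs - unfavorable pairs) / all pairs.
    Positive r means values in `a` tend to outrank values in `b`; ties count
    for neither side."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

print(rank_biserial([4, 5, 6], [1, 2, 3]))  # → 1.0 (a always outranks b)
```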
Two-tailed test
Tests whether groups differ in any direction. Use when you have no prior directional hypothesis.
H1: "The two interfaces differ in usability scores."
Within-subjects design
The same participants complete both conditions. Also called a repeated-measures design. Controls for individual differences, making the test more sensitive with fewer participants.
12 users rate Design A, then later rate Design B — same people, both conditions. → Use Wilcoxon signed-rank or Paired t-test.
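The paired version of the earlier between-subjects sketch, again with made-up ratings; both tests require the two lists to be aligned by participant:

```python
from scipy import stats

# Hypothetical paired ratings: the same 12 users rated both designs (made-up data).
design_a = [60, 65, 55, 70, 62, 58, 66, 61, 59, 64, 63, 57]
design_b = [72, 70, 68, 80, 75, 66, 74, 71, 69, 77, 73, 65]

# Non-parametric route: Wilcoxon signed-rank on the paired differences.
w_stat, p_w = stats.wilcoxon(design_a, design_b)

# Parametric route, if the differences look normal: paired t-test.
t_stat, p_t = stats.ttest_rel(design_a, design_b)

print(f"Wilcoxon W = {w_stat}, p = {p_w:.4f}")
print(f"Paired t = {t_stat:.2f}, p = {p_t:.4f}")
```

Note that `wilcoxon` and `ttest_rel` pair the i-th element of each list, which is why shuffling one list independently would invalidate the result.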