Alpha (significance level)
Your threshold, set before running the test. If p < alpha, you reject H0. It represents the maximum false-positive rate you accept.
alpha = 0.05 → you accept a 5% risk of falsely concluding a difference exists.
Alternative hypothesis (H1)
What you expect if H0 is false. Must be stated before data collection. Two-tailed: a difference in any direction. One-tailed: a specific direction.
"Design B will yield significantly higher SUS scores than design A."
Between-subjects design
Different participants are assigned to each condition or group. Each person only experiences one version of the interface or task. Results reflect how distinct groups compare.
Group A (n=15) tests Design A; a separate Group B (n=15) tests Design B. → Use Mann-Whitney U or Independent t-test.
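A minimal sketch of both test routes for this scenario, using made-up SUS scores and scipy (the numbers are illustrative, not real data):

```python
from scipy import stats

# Hypothetical SUS scores from two separate groups of 15 (made-up data).
group_a = [68, 72, 65, 70, 75, 60, 66, 71, 69, 63, 74, 67, 70, 64, 72]
group_b = [78, 82, 75, 80, 85, 70, 76, 81, 79, 73, 84, 77, 80, 74, 82]

# Non-parametric route: Mann-Whitney U (works on ranks, no normality assumption).
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Parametric route, if normality is plausible: independent t-test.
t_stat, p_t = stats.ttest_ind(group_a, group_b)

print(f"Mann-Whitney U = {u_stat}, p = {p_mw:.4f}")
print(f"Independent t = {t_stat:.2f}, p = {p_t:.4f}")
```

Both calls take the two groups as separate arguments because the samples are independent; no pairing between participants is assumed.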
Cohen's d
Effect size for t-tests. The difference between two means expressed in standard deviation units.
d = 0.6 → the groups differ by 0.6 standard deviations — a medium effect.
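A small helper for computing d from two independent groups, using the pooled standard deviation (a common variant; the function name and data are illustrative):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

print(cohens_d([2, 3, 4], [1, 2, 3]))  # → 1.0 (both SDs are 1, means differ by 1)
```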
Effect size
How large the difference is, independent of sample size. A result can be statistically significant but have a negligible effect. Always report alongside p.
Cohen's d: small=0.2, medium=0.5, large=0.8. Rank-biserial r: small=0.1, medium=0.3, large=0.5.
Formulating a hypothesis
Name both groups, the measure, and the expected relationship. H0 claims no difference. H1 states what you expect. Both must be written before collecting data.
H0: "Mean task time will not differ between A and B." H1: "Design B will result in lower task completion time than A."
Normal distribution (checking normality)
A bell-shaped distribution where most values cluster around the mean. Parametric tests (t-tests) assume data is approximately normally distributed. For small samples (<30) this matters more; with n ≥ 30 the Central Limit Theorem often compensates.
"SUS scores from 8 participants should not be assumed normal — use Wilcoxon instead."
How to check with descriptive stats:
Skewness: values between −1 and +1 suggest approximate symmetry. Values beyond ±2 indicate strong skew.
Kurtosis: values near 0 (excess kurtosis) suggest normal tails. Beyond ±2 is a concern.
Mean ≈ Median: if they are close, the distribution is roughly symmetric.
Range check: if min/max are near the theoretical bounds of your scale, data may be bounded — not normal.
In practice: for n < 15, always prefer non-parametric tests. For n = 15–30, check skewness and mean vs. median. For n > 30, parametric tests are generally robust.
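The rules of thumb above can be sketched as one helper. This is one plausible combination of the listed checks (skewness plus excess kurtosis for mid-size samples), not a standard function; thresholds are taken from this section:

```python
from scipy.stats import skew, kurtosis

def choose_test_family(values):
    """Pick parametric vs. non-parametric per the rules of thumb above."""
    n = len(values)
    if n < 15:
        return "non-parametric"   # too few points to verify normality
    if n <= 30:
        # Mid-size sample: check skewness (±1) and excess kurtosis (±2).
        symmetric = abs(skew(values)) <= 1 and abs(kurtosis(values)) <= 2
        return "parametric" if symmetric else "non-parametric"
    return "parametric"           # n > 30: parametric tests are generally robust

print(choose_test_family(list(range(10))))  # → non-parametric (n < 15)
```

scipy's `kurtosis` returns excess kurtosis by default (0 for a normal distribution), matching the convention used above.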
Non-parametric test
Does not assume normality. Works with ranks. Best for Likert scales, ordinal data, or small samples where normality cannot be verified.
SUS scores and task difficulty ratings (1–5) are ordinal — use Wilcoxon or Mann-Whitney.
Null hypothesis (H0)
Default assumption: no significant difference between groups. You are trying to find evidence strong enough to reject it.
"There is no significant difference in SUS scores between design A and design B."
One-tailed test
Tests whether one specific group scores higher. Only valid if the direction was predicted before data collection. Using it post-hoc inflates Type I error.
H1: "Design B will score higher than Design A on ease of use."
p-value
Probability of observing a result at least as extreme as yours if H0 were true. A small p-value means the data is unlikely under the null hypothesis.
p = 0.03 → 3% chance of seeing this difference (or larger) if no real effect exists.
Paired vs. independent
Paired: same participants in both conditions. Independent: different participants per group. Choosing the wrong structure invalidates the test.
Paired: 12 users rate both A and B. Independent: 12 users rate A, a different 12 rate B.
Rank-biserial r
Effect size for Wilcoxon and Mann-Whitney. Ranges from −1 to +1, usually reported as a magnitude between 0 and 1, indicating how consistently one group outranks the other.
r = 0.4 → moderate tendency for one group to score higher in pairwise comparisons.
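One way to compute r is by direct pair counting, which is equivalent to the U-based formula r = 2U/(n1·n2) − 1 (the helper name and data are illustrative):

```python
def rank_biserial(a, b):
    """Rank-biserial r: (favorable pairs - unfavorable pairs) / all pairs.
    Positive r means values in `a` tend to outrank values in `b`; ties count
    for neither side."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

print(rank_biserial([4, 5, 6], [1, 2, 3]))  # → 1.0 (a always outranks b)
```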
Two-tailed test
Tests whether groups differ in any direction. Use when you have no prior directional hypothesis.
H1: "The two interfaces differ in usability scores."
Within-subjects design
The same participants complete both conditions. Also called a repeated-measures design. Controls for individual differences, making the test more sensitive with fewer participants.
12 users rate Design A, then later rate Design B — same people, both conditions. → Use Wilcoxon signed-rank or Paired t-test.
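The paired version of the earlier between-subjects sketch, again with made-up ratings; both tests require the two lists to be aligned by participant:

```python
from scipy import stats

# Hypothetical paired ratings: the same 12 users rated both designs (made-up data).
design_a = [60, 65, 55, 70, 62, 58, 66, 61, 59, 64, 63, 57]
design_b = [72, 70, 68, 80, 75, 66, 74, 71, 69, 77, 73, 65]

# Non-parametric route: Wilcoxon signed-rank on the paired differences.
w_stat, p_w = stats.wilcoxon(design_a, design_b)

# Parametric route, if the differences look normal: paired t-test.
t_stat, p_t = stats.ttest_rel(design_a, design_b)

print(f"Wilcoxon W = {w_stat}, p = {p_w:.4f}")
print(f"Paired t = {t_stat:.2f}, p = {p_t:.4f}")
```

Note that `wilcoxon` and `ttest_rel` pair the i-th element of each list, which is why shuffling one list independently would invalidate the result.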