Datasaurus Dozen

December 30, 2023

If someone handed you a spreadsheet with two columns of numbers and told you the mean, standard deviation, and correlation, would you need to look further? Many analysts would say no. The Datasaurus Dozen shows why that is a mistake.

The original Datasaurus was created by Alberto Cairo and extended into a full set of datasets by Justin Matejka and George Fitzmaurice at Autodesk Research. It starts with a simple scatter plot in the shape of a dinosaur. Nothing about its summary statistics hints at that shape:

mean_x	mean_y	std_x	std_y	corr
54.26327	47.83225	16.76514	26.93540	-0.06447

Twelve other datasets match these statistics to at least two decimal places, yet each one looks completely different when plotted: lines, circles, a star, a bullseye. The numbers match; the shapes could not be more different.
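To make the idea concrete, here is a minimal sketch of how the five statistics can be computed, and of one very simple way two different point sets can share them. It is not the method used to build the Datasaurus Dozen. The function names and the toy points are invented for illustration, and the sample standard deviation (n - 1 denominator) is an assumption. Reflecting every point through the mean turns the picture upside down and back to front, but leaves all five numbers unchanged.

```python
import math

def summary(points):
    """Return (mean_x, mean_y, std_x, std_y, corr) for a list of (x, y) pairs."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Assumption: sample standard deviation (n - 1 denominator).
    std_x = math.sqrt(sxx / (n - 1))
    std_y = math.sqrt(syy / (n - 1))
    # Pearson correlation; the (n - 1) factors cancel.
    corr = sxy / math.sqrt(sxx * syy)
    return mean_x, mean_y, std_x, std_y, corr

def reflect_through_mean(points):
    """Rotate the point cloud 180 degrees around its own mean."""
    mean_x, mean_y, *_ = summary(points)
    return [(2 * mean_x - x, 2 * mean_y - y) for x, y in points]

# Toy points, invented for illustration only (not Datasaurus data).
toy = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (10.0, 4.0)]
flipped = reflect_through_mean(toy)

same_stats = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(summary(toy), summary(flipped))
)
same_points = sorted(toy) == sorted(flipped)
print(same_stats, same_points)  # True False
```

The statistics are identical while the points are not, which is the whole argument in miniature: the five numbers cannot distinguish a shape from its mirror image, let alone a dinosaur from a star.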

In the comparison below, the darker digits show where values agree between datasets. The lighter digits reveal where they quietly diverge. Summary statistics are a lossy compression of reality and can tell you almost nothing about the true shape of your data.


Comparing dino vs. away:

	dino	away
mean x	54.26327	54.26610
mean y	47.83225	47.83472
std x	16.76514	16.76982
std y	26.93540	26.93974
corr	-0.06447	-0.06413
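The agreement between the two columns above can be checked mechanically. The sketch below assumes that "agreeing digits" means the shared leading characters of the two formatted values; how the original page decides which digits to highlight is not known, and `shared_prefix` is a hypothetical helper. The values are copied from the table and kept as strings so that trailing zeros survive.

```python
def shared_prefix(a: str, b: str) -> str:
    """Return the leading characters on which two formatted numbers agree."""
    out = []
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        out.append(ca)
    return "".join(out)

# (dino, away) pairs copied from the comparison table.
rows = {
    "mean x": ("54.26327", "54.26610"),
    "mean y": ("47.83225", "47.83472"),
    "std x": ("16.76514", "16.76982"),
    "std y": ("26.93540", "26.93974"),
    "corr": ("-0.06447", "-0.06413"),
}

agreement = {name: shared_prefix(dino, away) for name, (dino, away) in rows.items()}
for name, prefix in agreement.items():
    print(f"{name}: {prefix}")
# mean x: 54.26
# mean y: 47.83
# std x: 16.76
# std y: 26.93
# corr: -0.064
```

Every statistic agrees through the second decimal place, and the correlation through the third, before the digits start to differ.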