Javier Cardenas — home

Datasaurus Dozen

December 30, 2023

If someone handed you a spreadsheet with two columns of numbers and told you each column's mean and standard deviation, plus the correlation between them, would you need to look further? Most analysts would say no. The Datasaurus Dozen shows why that's a mistake.

The original dinosaur was created by Alberto Cairo, and Justin Matejka and George Fitzmaurice at Autodesk Research extended it into the Datasaurus Dozen. It starts with a simple scatter plot that forms the shape of a dinosaur. Its summary statistics look unremarkable:

mean_x   54.26327
mean_y   47.83225
std_x    16.76514
std_y    26.93540
corr     -0.06447

Twelve other datasets match these statistics to two decimal places, yet each one looks completely different when plotted. Lines, circles, a star, a bullseye. The numbers match; the shapes could not be more different.
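To make the idea concrete, here is a minimal sketch in Python. It does not use the Datasaurus data; the two eight-point toy datasets (`X_SHAPE` and `PLUS_SHAPE`) and the helper `summary_stats` are hypothetical, built only for illustration. The sketch also assumes the sample standard deviation (dividing by n - 1), which may or may not be the convention behind the figures above.

```python
import statistics


def summary_stats(points, places=5):
    """Return (mean_x, mean_y, std_x, std_y, corr), rounded to `places` decimals.

    Assumption: sample standard deviation and sample covariance (n - 1).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    # Pearson correlation: sample covariance over the product of the deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / (len(points) - 1)
    corr = cov / (std_x * std_y)
    return tuple(round(v, places) for v in (mean_x, mean_y, std_x, std_y, corr))


# Hypothetical toy data: points along the two diagonals (an "X") ...
X_SHAPE = [(1, 1), (1, -1), (-1, -1), (-1, 1),
           (2, 2), (2, -2), (-2, -2), (-2, 2)]
# ... and points along the two axes (a "+").
PLUS_SHAPE = [(1, 0), (-1, 0), (0, 1), (0, -1),
              (3, 0), (-3, 0), (0, 3), (0, -3)]

if __name__ == "__main__":
    print("x   :", summary_stats(X_SHAPE))
    print("plus:", summary_stats(PLUS_SHAPE))
```

Both calls return the same five numbers even though one set of points draws an X and the other a plus sign. The Datasaurus Dozen makes the same point at a larger scale.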

In the comparison below, the darker digits show where values agree between datasets. The lighter digits reveal where they quietly diverge. Summary statistics are a lossy compression of reality and can tell you almost nothing about the true shape of your data.

          dino       away
mean x    54.26327   54.26610
mean y    47.83225   47.83472
std x     16.76514   16.76982
std y     26.93540   26.93974
corr      -0.06447   -0.06413
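The dark and light digits can be approximated in plain text by splitting each value into the part it shares with the dinosaur and the part where it diverges. The sketch below uses the dino and away values listed above. The helpers `shared_prefix` and `compare` are hypothetical names, and treating "agreement" as the shared leading characters of the five-decimal strings is an assumption about how the original comparison works.

```python
DINO = {"mean x": 54.26327, "mean y": 47.83225,
        "std x": 16.76514, "std y": 26.93540, "corr": -0.06447}
AWAY = {"mean x": 54.26610, "mean y": 47.83472,
        "std x": 16.76982, "std y": 26.93974, "corr": -0.06413}


def shared_prefix(a: str, b: str) -> str:
    """Return the leading characters that two strings have in common."""
    out = []
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        out.append(ca)
    return "".join(out)


def compare(ref, other, places=5):
    """Map each statistic to (agreeing part, diverging part) of `other`'s value."""
    rows = {}
    for name, ref_value in ref.items():
        a = f"{ref_value:.{places}f}"
        b = f"{other[name]:.{places}f}"
        same = shared_prefix(a, b)
        rows[name] = (same, b[len(same):])
    return rows


if __name__ == "__main__":
    for name, (same, rest) in compare(DINO, AWAY).items():
        # Brackets mark the digits where "away" departs from the dinosaur.
        print(f"{name:<7} {same}[{rest}]")
```

For these two datasets, every statistic agrees through the second decimal place and diverges within the next digit or two.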