If someone handed you a spreadsheet with two columns of numbers and told you the mean, standard deviation, and correlation, would you need to look further? Most analysts would say no. The Datasaurus Dozen shows why that's a mistake.
Created by Alberto Cairo and extended by the Autodesk Research team, it starts with a simple scatter plot that forms the shape of a dinosaur. Taken alone, its summary statistics look unremarkable.
Twelve other datasets share these statistics to two decimal places, yet each one looks completely different when plotted: lines, circles, a star, a bullseye. The numbers match; the shapes could not be more different.
In the comparison below, the leading digits of each value agree between datasets; the trailing digits are where they quietly diverge. Summary statistics are a lossy compression of reality and can tell you almost nothing about the true shape of your data.
| | dino | away |
|---|---|---|
| mean x | 54.26327 | 54.26610 |
| mean y | 47.83225 | 47.83472 |
| std x | 16.76514 | 16.76982 |
| std y | 26.93540 | 26.93974 |
| corr | -0.06447 | -0.06413 |
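
The "same to two decimal places" claim can be checked directly against the figures in the table. The sketch below is illustrative only: the helper names `truncate` and `agreement` are made up for this example, and it assumes "two decimal places" means cutting digits off rather than rounding.

```python
import math

# Values copied from the dino and away columns of the comparison table.
dino = {"mean x": 54.26327, "mean y": 47.83225, "std x": 16.76514,
        "std y": 26.93540, "corr": -0.06447}
away = {"mean x": 54.26610, "mean y": 47.83472, "std x": 16.76982,
        "std y": 26.93974, "corr": -0.06413}


def truncate(value, places=2):
    """Cut a number off after `places` decimals, without rounding."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor


def agreement(a, b, places=2):
    """Map each statistic name to True when both values truncate to the same digits."""
    return {name: truncate(a[name], places) == truncate(b[name], places)
            for name in a}


# Print each statistic, its truncated value in both datasets, and the verdict.
for name, same in agreement(dino, away).items():
    verdict = "match" if same else "differ"
    print(f"{name:7} {truncate(dino[name]):7.2f} {truncate(away[name]):7.2f} {verdict}")
```

With these values, all five statistics match at two places. Calling `agreement(dino, away, places=3)` separates every statistic except the correlation. The choice of truncation matters: rounding to two places would already split mean x into 54.26 and 54.27.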