The Datasaurus dozen comprises thirteen data sets whose simple descriptive statistics are nearly identical to two decimal places, yet whose distributions differ greatly and look very different when graphed.[1] It was inspired by the smaller Anscombe's quartet, which was created in 1973.
The following table contains summary statistics for all thirteen data sets.
Property | Value | Accuracy
---|---|---
Number of elements | 142 | exact
Mean of x | 54.26 | to 2 decimal places
Sample standard deviation of x: s_x | 16.76 | to 2 decimal places
Mean of y | 47.83 | to 2 decimal places
Sample standard deviation of y: s_y | 26.93 | to 2 decimal places
Correlation between x and y | −0.06 | to 2 decimal places
Linear regression line | y = 53 − 0.1x | to 0 and 1 decimal places, respectively
Coefficient of determination of the linear regression: R² | 0.004 | to 3 decimal places
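To make the properties in the table concrete, the following minimal Python sketch computes them for any paired sample. The function name `summary_stats` and the toy data are illustrative assumptions, not part of the Datasaurus data; the formulas are the standard sample (n − 1) definitions.

```python
from statistics import mean, stdev


def summary_stats(xs, ys):
    """Return the summary properties listed in the table for paired samples."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # Sample covariance, used for both the correlation and the regression slope.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (sx * sy)            # Pearson correlation
    slope = cov / (sx * sx)        # least-squares slope of y on x
    intercept = my - slope * mx
    return {
        "n": n,
        "mean_x": mx, "mean_y": my,
        "sd_x": sx, "sd_y": sy,
        "var_x": sx * sx, "var_y": sy * sy,
        "corr": r,
        "slope": slope, "intercept": intercept,
        "r_squared": r * r,
    }


# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
stats = summary_stats([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Two data sets count as matching in the sense of the table when each of these values agrees after rounding to the stated number of decimal places.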
Like Anscombe's quartet, the Datasaurus dozen was designed to illustrate the importance of looking at a data set graphically before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic data sets.[2] [3] [4] [5] [6]
The first data set, in the shape of a Tyrannosaurus, inspired the rest of the "datasaurus" collection and was constructed in 2016 by Alberto Cairo.[7] [8] Maarten Lambrechts proposed that this data set also be called "Anscombosaurus".
This data set was then accompanied by twelve other data sets created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike Anscombe's quartet, for which it is not known how the data sets were generated,[9] the authors used simulated annealing to make these data sets. They made small, random changes to each point, biased towards the desired shape. Each shape took 200,000 iterations of perturbation to complete.
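The perturbation process described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the target shape (a circle), the step size, the linear cooling schedule from 0.4 to 0.01, the use of rounding in `similar_enough`, and the small random seed data set are all hypothetical choices made for the example, and the outer loop is inferred from the variable descriptions.

```python
import math
import random
from statistics import mean, stdev


def summary(points):
    """Means, sample standard deviations and Pearson correlation of (x, y) points."""
    xs, ys = zip(*points)
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in points) / (len(points) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def similar_enough(a, b, places=2):
    """Assumed test: every statistic matches after rounding to `places` decimals."""
    return all(round(p, places) == round(q, places)
               for p, q in zip(summary(a), summary(b)))


def fit(points, centre=(50.0, 50.0), radius=30.0):
    """Higher is better: negative total distance of the points from a target circle."""
    return -sum(abs(math.hypot(x - centre[0], y - centre[1]) - radius)
                for x, y in points)


def move_random_points(points, rng, step=0.1):
    """Copy the data set and nudge one randomly chosen point by Gaussian noise."""
    moved = list(points)
    i = rng.randrange(len(moved))
    x, y = moved[i]
    moved[i] = (x + rng.gauss(0, step), y + rng.gauss(0, step))
    return moved


def perturb(ds, temp, rng):
    """Retry random moves until one improves the fit or the temperature allows it."""
    current_fit = fit(ds)
    while True:
        test = move_random_points(ds, rng)
        if fit(test) > current_fit or temp > rng.random():
            return test


def anneal(initial_ds, iterations, rng, t_start=0.4, t_end=0.01):
    """Keep a perturbed data set only if its statistics still match the seed's."""
    current_ds = list(initial_ds)
    for i in range(iterations):
        # Linear cooling schedule (an assumption made for this sketch).
        temp = t_start + (t_end - t_start) * i / iterations
        test_ds = perturb(current_ds, temp, rng)
        if similar_enough(test_ds, initial_ds):
            current_ds = test_ds
    return current_ds


rng = random.Random(0)
seed_ds = [(rng.uniform(20, 80), rng.uniform(20, 80)) for _ in range(40)]
result = anneal(seed_ds, iterations=2000, rng=rng)
```

Because a candidate is only kept when `similar_enough` holds against the seed data set, the returned points always share the seed's rounded statistics, however far they have drifted towards the target shape.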
The pseudocode for this algorithm is as follows:
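    current_ds ← initial_ds
    for x iterations, do:
        test_ds ← perturb(current_ds, temp)
        if similar_enough(test_ds, initial_ds):
            current_ds ← test_ds

    function perturb(ds, temp):
        loop:
            test ← move_random_points(ds)
            if fit(test) > fit(ds) or temp > random():
                return test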
where

- initial_ds is the seed data set
- current_ds is the latest version of the data set
- fit is a function used to check whether moving the points gets closer to the desired shape
- temp is the temperature of the simulated annealing algorithm
- similar_enough is a function that checks whether the statistics for the two given data sets are similar enough
- move_random_points is a function that randomly moves data points

Tufte, Edward (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN 0-9613921-4-2.