This is a list of important publications in data science, generally organized in the order they are used in a data analysis workflow. See the list of important publications in statistics for more research-based and fundamental publications; this list is more applied, business-oriented, and cross-disciplinary.
Articles are included based on general inclusion criteria and on the reasons a particular publication might be regarded as important.
When possible, a reference is used to validate the inclusion of the publication in this list.
Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)
Author: Leo Breiman
Publication data: [2]
Online version: https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.pdf
Description: Describes two cultures of statistical modeling: one assumes a parsimonious, generative stochastic model for the data, while the other uses an algorithmic model with no assumed mechanism for how the data are generated. Breiman argues that while statistics has traditionally favored the stochastic model, there is value in expanding the methods statisticians can use to study phenomena.
Importance: Influenced the philosophies of statisticians just before the rise of machine learning and deep learning methods. In a 20-year retrospective on this article, "Breiman's words are perhaps more relevant than ever".[3] Notable statisticians at the time wrote opinion pieces about the publication. Although critical of it overall, David Cox writes that the publication "contains enough truth and exposes enough weaknesses to be thought-provoking." Bradley Efron calls it a "stimulating paper". Emanuel Parzen comments that "Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling".
50 Years of Data Science
Author: David Donoho
Publication data: [4]
Online version: https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734
Description: Retrospective discussion paper on the history and origins of data science, with commentary from a number of notable statisticians.
Importance: Described as "the first in the field to present such a comprehensive and in-depth survey and overview",[5] it helps to define a field that has many competing definitions.
The Composable Data Management System Manifesto
Author: Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R Valluri, Mohamed Zait, Jacques Nadeau
Publication data: [6]
Online version: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
Description: A vision paper advocating a paradigm shift in how data management systems are designed: built from standard, composable, interoperable components rather than siloed software tools.
Importance: A paradigm-shifting view of how future data science software tools should be designed for more efficient workflows; its principles "will be especially crucial for addressing fragmentation, improving interoperability, and promoting user-centricity as data ecosystems grow increasingly complex".[7]
Tidy Data
Author: Hadley Wickham
Publication data: [8]
Online version: https://www.jstatsoft.org/article/view/v059i10/ https://vita.had.co.nz/papers/tidy-data.pdf
Description: Describes a framework for data cleaning summarized in the quote, "each variable is a column, each observation is a row, and each type of observational unit is a table". This provides a standard data structure around which data analysis tools can be consistently built.
Importance: Cited over 1,500 times, this effort for tidy data has been described by David Donoho as having "more impact on today’s practice of data analysis than many highly regarded theoretical statistics articles". In the context of data visualization, this publication is said to support "efficient exploration and prototyping because variables can be assigned different roles in the plot without modifying anything about the original dataset".[9]
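The tidy-data principle can be sketched by reshaping a small "wide" (messy) table into tidy form. This is a minimal illustration in pandas; the column names and values are invented, loosely echoing the paper's examples:

```python
import pandas as pd

# Messy "wide" layout: one row per person, one column per treatment.
# (Dataset is hypothetical, for illustration only.)
messy = pd.DataFrame({
    "person": ["John Smith", "Jane Doe"],
    "treatment_a": [2, 16],
    "treatment_b": [11, 1],
})

# Tidy layout: each variable is a column, each observation is a row.
tidy = messy.melt(id_vars="person", var_name="treatment", value_name="result")
print(tidy)
```

In the tidy result, "treatment" and "result" are ordinary columns, so grouping, filtering, and plotting tools can address them uniformly without knowing the original layout.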
Data Organization in Spreadsheets
Author: Karl W. Broman and Kara H. Woo
Publication data: [10]
Online version: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989
Description: Offers practical recommendations for organizing data in spreadsheet software such as Microsoft Excel and Google Sheets, to reduce errors caused by spreadsheet limitations and software quirks and to lower the barrier to later analysis.
Importance: Influences how both data and non-data practitioners are taught to create more analysis-friendly spreadsheets, and has been described as outlining "spreadsheet best practices".[11]
Quantitative Graphics in Statistics: A Brief History
Author: James R. Beniger and Dorothy L. Robyn
Publication data: [12]
Online version: https://www.jstor.org/stable/2683467
Description: Outlines history and evolution of quantitative graphics in statistics, going through spatial organization (17th and 18th centuries), discrete comparison (18th and 19th centuries), continuous distribution (19th century), and multivariate distribution and correlation (late 19th and 20th centuries).
Importance: Helps learning data practitioners appreciate how recently many commonly used graphics emerged. A later publication, "Graphical Methods in Statistics" by Stephen Fienberg (1979), "owes much to the work of Beniger and Robyn".[13]
Hidden Technical Debt in Machine Learning Systems
Author: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison
Publication data: [14]
Online version: https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
Description: This paper argues that it is "dangerous to think of [complex machine learning] quick wins as coming for free" and overviews risk factors to account for when implementing a machine learning system.
Importance: Written by authors then at Google and cited over 1,000 times,[15] the paper cautions practitioners against quickly implementing a machine learning tool without understanding its long-term maintenance costs.
A few useful things to know about machine learning
Author: Pedro Domingos
Publication data: [16]
Online version: https://dl.acm.org/doi/10.1145/2347736.2347755 https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Description: Distills otherwise inaccessible "folk knowledge" needed to implement machine learning projects effectively, because without it "machine learning projects take much longer than necessary or wind up producing less-than-ideal results".
Importance: Cited over 4,000 times,[17] it has shaped the common body of knowledge for data practitioners using machine learning.[18]
The Introductory Statistics Course: A Ptolemaic Curriculum
Author: George W. Cobb[19]
Publication data: [20]
Online version: https://escholarship.org/uc/item/6hb3k0nz
Description: Argues that teachers of statistics should restructure their introductory courses away from technical machinery based on the normal distribution and towards simpler alternative methods based on permutations carried out on computers.
Importance: Cited over 300 times,[21] this paper influenced 21st-century teachers of statistics to move beyond teaching the mere mechanics of statistics and to leverage computers to do more with less.
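The permutation-based alternative Cobb advocates can be sketched as a simple two-sample permutation test; the data values below are invented for illustration:

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Approximate p-value for a difference in group means by randomly
    reassigning pooled observations to groups, with no normal-theory
    machinery involved."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[: len(group_a)], pooled[len(group_a):]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            count += 1
    return count / n_perm

# Two hypothetical samples; the p-value estimates how often random
# relabeling produces a mean difference at least this large.
p = permutation_p_value([12, 15, 14, 16], [9, 10, 11, 8])
```

The entire inferential logic fits in a dozen lines of shuffling and counting, which is the pedagogical point: the computer replaces the normal-distribution approximations of the traditional curriculum.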