Data-driven model explained

Data-driven models are a class of computational models that rely primarily on historical data collected over the lifetime of a system or process to establish relationships between input, internal, and output variables. The term appears widely in the literature. Data-driven models evolved from earlier statistical models, overcoming the limitations imposed by strict assumptions about probability distributions. They have gained prominence in many fields, particularly in the era of big data, artificial intelligence, and machine learning, where they provide insights and predictions based on the available data.

Background

Data-driven models evolved from earlier statistical models, whose assumptions about probability distributions often proved overly restrictive.^[1] Their emergence in the 1950s and 1960s coincided with the development of digital computers, advances in artificial intelligence research, and the introduction of new approaches in non-behavioural modelling, such as pattern recognition and automatic classification.^[2]

Key Concepts

Data-driven models encompass a wide range of techniques and methodologies that aim to intelligently process and analyse large datasets. Examples include fuzzy logic, fuzzy and rough sets for handling uncertainty,^[3] neural networks for approximating functions,^[4] global optimization and evolutionary computing,^[5] statistical learning theory,^[6] and Bayesian methods.^[7] These models have found applications in various fields, including economics, customer relations management, financial services, medicine, and the military, among others.^[8]

Machine learning, a subfield of artificial intelligence, is closely related to data-driven modelling as it also focuses on using historical data to create models that can make predictions and identify patterns.^[9] In fact, many data-driven models incorporate machine learning techniques, such as regression, classification, and clustering algorithms, to process and analyse data.^[10]
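As an illustration of the regression case mentioned above, the following minimal sketch fits a straight line to a handful of historical input/output pairs by ordinary least squares and uses it to predict a new value. The data values and the function names `fit_linear` and `predict` are hypothetical and chosen only for this example.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical records (illustrative values only).
history_x = [1.0, 2.0, 3.0, 4.0, 5.0]
history_y = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_linear(history_x, history_y)
forecast = predict(model, 6.0)
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, forecast={forecast:.2f}")
```

The model here is derived entirely from the recorded pairs. No equation describing the underlying system is supplied beyond the assumption that a straight line is an adequate form.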

In recent years, the concept of data-driven models has gained considerable attention in the field of water resources, with numerous applications, academic courses, and scientific publications using the term as a generalization for models that rely on data rather than physics.^[11] This classification has been featured in various publications and has even spurred the development of hybrid models in the past decade. Hybrid models attempt to quantify the degree of physically based information used in hydrological models and determine whether the process of building the model is primarily driven by physics or purely data-based. As a result, data-driven models have become an essential topic of discussion and exploration within water resources management and research.^[12]
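One common way to combine the two sources of information, shown here only as a sketch under stated assumptions, is to let a simple physically motivated model produce a first estimate and then correct it with a term learned from observed data. The runoff coefficient, the data values, and the function names below are hypothetical and are not taken from the cited works.

```python
def conceptual_runoff(rainfall, coeff=0.5):
    """Stand-in physically motivated model: runoff as a fixed fraction of rainfall.

    The coefficient 0.5 is an assumption made for illustration.
    """
    return coeff * rainfall


def fit_bias(rainfall_obs, runoff_obs):
    """Data-driven part: mean residual between observations and the conceptual model."""
    residuals = [q - conceptual_runoff(p) for p, q in zip(rainfall_obs, runoff_obs)]
    return sum(residuals) / len(residuals)


def hybrid_runoff(rainfall, bias):
    """Hybrid estimate: conceptual model output plus the learned correction."""
    return conceptual_runoff(rainfall) + bias


# Hypothetical observations (illustrative values only).
rainfall_obs = [10.0, 20.0, 30.0, 40.0]
runoff_obs = [6.0, 11.0, 17.0, 22.0]

bias = fit_bias(rainfall_obs, runoff_obs)
estimate = hybrid_runoff(50.0, bias)
print(f"bias={bias}, hybrid estimate={estimate}")
```

In this toy case most of the estimate comes from the conceptual component and only the correction term is learned from data. Real hybrid models may weight the two components very differently.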

The term "data-driven modelling" (DDM) refers to the overarching paradigm of using historical data in conjunction with advanced computational techniques, including machine learning and artificial intelligence, to create models that can reveal underlying trends and patterns and, in some cases, make predictions.^[13] Data-driven models can be built with or without detailed knowledge of the underlying processes governing the system's behaviour, which makes them particularly useful when such knowledge is missing or fragmented.^[14]
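To illustrate a model built without any knowledge of the governing process, the sketch below uses a nearest-neighbour estimate: the output for a new input is the average of the outputs recorded for the most similar historical inputs. This is one simple technique chosen for illustration. The records and the function name `knn_predict` are hypothetical.

```python
def knn_predict(history, query, k=3):
    """Average the outputs of the k records whose inputs lie closest to the query."""
    if not 1 <= k <= len(history):
        raise ValueError("k must be between 1 and the number of records")
    nearest = sorted(history, key=lambda rec: abs(rec[0] - query))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical (input, output) records; the generating process is treated as unknown.
history = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0), (4.0, 16.0), (5.0, 25.0)]

estimate = knn_predict(history, 2.5, k=2)
print(estimate)
```

No functional form is assumed here. The estimate depends only on which past records resemble the query, so its quality is limited by how well the historical data cover the range of interest.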

References

Freedman, D. A. (2006). On The So-Called “Huber Sandwich Estimator” and “Robust Standard Errors”. The American Statistician, 60(4):299-302.
Duda, R. O., Hart, P. E. (1973). Pattern classification and scene analysis.
Goguen, J. A. (1973). Review of: Zadeh, L. A., Fuzzy sets, Information and Control, vol. 8 (1965), pp. 338–353; Zadeh, L. A., Similarity relations and fuzzy orderings, Information Sciences, vol. 3 (1971), pp. 177–200. Journal of Symbolic Logic, 38(4):656-657.
Haykin, S. (2009). Neural Networks and Learning Machines (3rd ed.).
Goldberg, D. E. (1988). Genetic algorithms in search, optimization, and machine learning. University of Alabama.
Vapnik, V. (1995). The nature of statistical learning theory. Springer.
Hewson, P. (2015). Review of: Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., Rubin, D. B., Bayesian Data Analysis (3rd ed.), Boca Raton: Chapman and Hall–CRC, 2013, 676 pp. Journal of the Royal Statistical Society, Series A (Statistics in Society), 178(1):301.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3):37-54.
Mitchell, T. M. (1997). Machine learning. McGraw Hill Series in Computer Science.
Alpaydin, E. (2020). Introduction to machine learning. MIT Press.
Abrahart, R. J., See, L. M., Solomatine, D. (2008). Practical hydroinformatics: computational intelligence and technological developments in water applications.
Corzo Perez, G. A. (2009). Hybrid models for Hydrological Forecasting: integration of data-driven and conceptual modelling techniques.
Provost, F., Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking.
Cheng, M., Fang, F., Pain, C. C., Navon, I. M. (2020). Data-driven modelling of nonlinear spatio-temporal fluid flows using a deep convolutional generative adversarial network. Computer Methods in Applied Mechanics and Engineering, 365:113000.