Concept drift explained

In predictive analytics, data science, machine learning and related fields, concept drift or drift is an evolution of data that invalidates the data model. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because predictions become less accurate as time passes. Drift detection and drift adaptation are therefore of paramount importance in fields that involve dynamically changing data and data models.

Predictive model decay

In machine learning and predictive analytics this drift phenomenon is called concept drift. A common element of a data model in machine learning is its statistical properties, such as the probability distribution of the actual data. If these deviate from the statistical properties of the training data set and the drift is not addressed, the learned predictions may become invalid.[1] [2] [3] [4]

Data configuration decay

Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" refers to changes in the meaning of data while the structure does not change. This often happens in complicated applications when many independent developers introduce changes without proper awareness of their effects on other areas of the software system.[5] [6]

For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., changes in the business model, system updates, or switching the platform on which the system operates.[6]

In cloud computing, infrastructure drift that affects applications running in the cloud may be caused by updates to the cloud software.[5]

Data drift can harm data fidelity in several ways. Data corrosion occurs when drifted data pass into the system undetected. Data loss happens when valid data are ignored because they do not conform to the applied schema. Squandering occurs when new data fields are introduced upstream in the data processing pipeline but are absent somewhere downstream.[6]
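
One way to keep structural drift from passing through a pipeline unnoticed is to compare each incoming record against the schema the pipeline expects. The following is a minimal sketch only; the function name `check_schema` and the example field names are hypothetical and not taken from any particular tool.

```python
def check_schema(record: dict, expected_fields: set) -> dict:
    """Report how a record's fields differ from the expected schema.

    'missing' lists expected fields the record lacks;
    'unexpected' lists fields the schema does not know about.
    """
    present = set(record)
    return {
        "missing": sorted(expected_fields - present),
        "unexpected": sorted(present - expected_fields),
    }


# Hypothetical schema and record, for illustration only.
expected = {"customer_id", "amount", "currency"}
incoming = {"customer_id": 7, "amount": 12.5, "loyalty_tier": "gold"}
report = check_schema(incoming, expected)
```

In this example `report` flags `currency` as missing and `loyalty_tier` as unexpected, so the difference can be logged or acted on rather than silently dropped.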

Inconsistent data

"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as due to errors during data input.[7]

"Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple way to detect such drift is to run checksums regularly. The remedy, however, may not be so easy.[8]
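
The checksum idea can be sketched as follows. This is an illustrative example under simplifying assumptions: each replica is represented as an in-memory dictionary mapping a primary key to a row tuple, and the helper names `row_checksum` and `find_drifted_keys` are hypothetical. A real deployment would compute checksums inside the database, typically over chunks of rows.

```python
import hashlib


def row_checksum(row: tuple) -> str:
    """Hash a row's values; a separator keeps field boundaries unambiguous."""
    payload = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_drifted_keys(primary: dict, replica: dict) -> list:
    """Return keys whose rows are missing on one side or differ by checksum."""
    drifted = []
    for key in sorted(primary.keys() | replica.keys()):
        if key not in primary or key not in replica:
            drifted.append(key)
        elif row_checksum(primary[key]) != row_checksum(replica[key]):
            drifted.append(key)
    return drifted


# Hypothetical replicas, for illustration only.
primary = {1: ("Ann", "Oslo"), 2: ("Bob", "Rome"), 3: ("Cy", "Lima")}
replica = {1: ("Ann", "Oslo"), 2: ("Bob", "Paris"), 4: ("Di", "Kyiv")}
drifted = find_drifted_keys(primary, replica)
```

Detection only identifies which rows disagree. Deciding which replica holds the correct value is the harder part of the remedy.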

Examples

The behavior of customers in an online shop may change over time. Suppose, for example, that weekly merchandise sales are to be predicted and that a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. For example, sales may be higher in the winter holiday season than during the summer. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target variable less accurately. Confounding variables may have emerged that one simply cannot account for, which causes the model accuracy to decrease progressively with time. It is generally advised to perform health checks as part of post-production analysis and to retrain the model with new assumptions upon signs of concept drift.

Possible remedies

To prevent deterioration in prediction accuracy because of concept drift, reactive and tracking solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test[9] [10] that explicitly detects concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up to date and must be replaced by a new one to restore prediction accuracy.[11] [12] A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples,[13] and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.[14]
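
A reactive trigger can be sketched as a monitor on the model's running error rate that raises an alarm when the rate climbs well above the best level seen so far. The sketch below is loosely modeled on the error-rate monitoring idea behind drift detection methods such as [11], but it is a simplification: the class name, the 30-sample warm-up and the three-standard-deviation threshold are assumptions chosen for illustration.

```python
import math


class ErrorRateDriftDetector:
    """Signal drift when the error rate rises far above its observed minimum."""

    def __init__(self, min_samples: int = 30, drift_sigmas: float = 3.0):
        self.min_samples = min_samples      # warm-up before testing (assumed)
        self.drift_sigmas = drift_sigmas    # alarm threshold (assumed)
        self.reset()

    def reset(self) -> None:
        self.n = 0
        self.errors = 0
        self.best_p = float("inf")  # lowest error rate level seen so far
        self.best_s = float("inf")  # its standard deviation

    def update(self, is_error: bool) -> bool:
        """Feed one prediction outcome; return True if drift is signalled."""
        self.n += 1
        self.errors += int(is_error)
        if self.n < self.min_samples:
            return False
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if p + s < self.best_p + self.best_s:
            self.best_p, self.best_s = p, s
        drift = p + s > self.best_p + self.drift_sigmas * self.best_s
        if drift:
            self.reset()  # statistics restart for the replacement model
        return drift


# Synthetic outcomes: 10% errors for 300 predictions, then every one wrong.
stream = [i % 10 == 0 for i in range(300)] + [True] * 100
detector = ErrorRateDriftDetector()
alarms = [i for i, err in enumerate(stream) if detector.update(err)]
```

On this synthetic stream the first alarm is raised shortly after position 300, where the error rate changes. The delay illustrates the shortcoming noted above: accuracy decays for a number of predictions before the change is detected and retraining can begin.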

Contextual information, when available, can be used to better explain the causes of concept drift. For instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. Providing information about the time of the year is likely to decrease the rate at which the model deteriorates, but concept drift is unlikely to be eliminated altogether, because actual shopping behavior does not follow any static, finite model. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
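
As an illustration of adding seasonal context, the time of year can be supplied to a model as a pair of cyclical features, so that late December and early January are treated as neighbours rather than as opposite ends of a scale. This is one common encoding, shown here as a sketch; the function name `seasonal_features` is hypothetical.

```python
import math
from datetime import date
from typing import Tuple


def seasonal_features(day: date) -> Tuple[float, float]:
    """Map the day of the year onto a unit circle as (sin, cos)."""
    day_of_year = day.timetuple().tm_yday
    angle = 2.0 * math.pi * (day_of_year - 1) / 365.25
    return math.sin(angle), math.cos(angle)


new_year = seasonal_features(date(2024, 1, 1))
year_end = seasonal_features(date(2023, 12, 31))
midsummer = seasonal_features(date(2023, 7, 1))
```

The two features would be appended to the model's other inputs, such as advertising spend. They capture a recurring seasonal pattern only, and do not help with new factors that emerge later.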

Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. Processes that arise from human activity, such as socioeconomic processes, as well as biological processes, are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.
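
Refreshing a model on recent data can be sketched with a sliding window that discards old observations as new ones arrive. The "model" below is deliberately trivial (it predicts the mean of the most recent observations), and the class name and window size are assumptions for illustration only.

```python
from collections import deque
from statistics import fmean


class SlidingWindowMeanModel:
    """Predict the mean of the most recent observations only."""

    def __init__(self, window_size: int = 50):
        # A bounded deque drops the oldest sample automatically.
        self.window = deque(maxlen=window_size)

    def predict(self) -> float:
        return fmean(self.window) if self.window else 0.0

    def update(self, observed: float) -> None:
        self.window.append(observed)


model = SlidingWindowMeanModel(window_size=50)
for value in [10.0] * 100 + [20.0] * 100:  # the level shifts halfway through
    model.update(value)
prediction = model.predict()
```

Once the window contains only observations made after the shift, the prediction reflects the new level. A smaller window adapts faster but gives noisier estimates.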

Notes and References

  1. Koggalahewa, Darshika; Xu, Yue; Foo, Ernest (2021). "A Drift Aware Hierarchical Test Based Approach for Combating Social Spammers in Online Social Networks". Data Mining. Communications in Computer and Information Science. Vol. 1504. pp. 47–61. doi:10.1007/978-981-16-8531-6_4. ISBN 978-981-16-8530-9. S2CID 245009299.
  2. Widmer, Gerhard; Kubat, Miroslav (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning. 23: 69–101. doi:10.1007/BF00116900 (free access). S2CID 206767784.
  3. Xia, Yuan; Zhao, Yunlong (2020). "A Drift Detection Method Based on Diversity Measure and McDiarmid's Inequality in Data Streams". Green, Pervasive, and Cloud Computing. Lecture Notes in Computer Science. Vol. 12398. pp. 115–122. doi:10.1007/978-3-030-64243-3_9. ISBN 978-3-030-64242-6. S2CID 227275380.
  4. Lu, Jie; Liu, Anjin; Dong, Fan; Gu, Feng; Gama, Joao; Zhang, Guangquan (2018). "Learning under Concept Drift: A Review". IEEE Transactions on Knowledge and Data Engineering. p. 1. arXiv:2004.05785. doi:10.1109/TKDE.2018.2876857. S2CID 69449458.
  5. "Driftctl and Terraform, they're two of a kind!". https://dev.to/stack-labs/driftctl-and-terraform-they-re-two-of-a-kind-22p1
  6. Girish Pancha, "Big Data's Hidden Scourge: Data Drift", CMSWire, April 8, 2016
  7. Matthew Magne, "Data Drift Happens: 7 Pesky Problems with People Data", InformationWeek, July 19, 2017
  8. Daniel Nichter, Efficient MySQL Performance, 2021, p. 299
  9. Basseville, Michele (1993). Detection of abrupt changes: theory and application. Prentice Hall. ISBN 0-13-126780-9. OCLC 876004326.
  10. Alippi, C.; Roveri, M. (2007). "Adaptive Classifiers in Stationary Conditions". 2007 International Joint Conference on Neural Networks. IEEE. pp. 1008–13. doi:10.1109/ijcnn.2007.4371096. ISBN 978-1-4244-1380-5. S2CID 16255206.
  11. Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. (2004). "Learning with Drift Detection". Advances in Artificial Intelligence – SBIA 2004. Springer. pp. 286–295. doi:10.1007/978-3-540-28645-5_29. ISBN 978-3-540-28645-5. S2CID 2606652.
  12. Alippi, C.; Boracchi, G.; Roveri, M. (2011). "A just-in-time adaptive classification system based on the intersection of confidence intervals rule". Neural Networks. 24 (8): 791–800. doi:10.1016/j.neunet.2011.05.012. PMID 21723706.
  13. Widmer, G.; Kubat, M. (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning. 23 (1): 69–101. doi:10.1007/bf00116900 (free access). S2CID 206767784.
  14. Elwell, R.; Polikar, R. (2011). "Incremental Learning of Concept Drift in Nonstationary Environments". IEEE Transactions on Neural Networks. 22 (10): 1517–31. doi:10.1109/tnn.2011.2160459. PMID 21824845. S2CID 9136731.
  15. Céspedes Sisniega, Jaime; López García, Álvaro (2024). "Frouros: An open-source Python library for drift detection in machine learning systems" (PDF). SoftwareX. Elsevier. 26: 101733. doi:10.1016/j.softx.2024.101733 (free access). hdl:10261/358367 (free access).