Data exhaust, or exhaust data, is the trail of data left by users of the Internet or other computer systems during their online activity, behavior, and transactions. It is part of a broader category of unconventional data[1] that includes geospatial, network, and time-series data and may be useful for predictive analytics. Every website visited, every link clicked, and even every hover of the mouse can be recorded, leaving behind a trail of data.[2] An enormous amount of data, much of it raw, is created in the form of cookies, temporary files, log files, storable choices, and more.[3] This information can help to improve the online experience, for example through customized content. It can also be used to track trends, and studying data exhaust can inform improvements to user interface and layout design. On the other hand, it can compromise privacy, as it offers valuable insight into the user's habits. For example, Google, the world's most popular website, uses this data exhaust to refine the predictive value of its products.[4]
Much of the data that companies collect does not seem immediately useful. Even if a company does not use the information right away, it can be stored for future use or sold to someone else who can use it. The data can help with quality control, performance, and revenue.[5] Unlike primary content, this data is not purposefully created by the user, who is often unaware of its very existence. A bank, for example, would treat the sums and parties of a transaction as primary data, whereas secondary data might include the percentage of transactions carried out at a cash machine rather than at a branch.[6]
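The bank example can be illustrated with a minimal sketch. The record layout, the field names (`amount`, `payer`, `payee`, `channel`), and the sample values below are all hypothetical and chosen only for illustration; the point is that the secondary figure is derived from metadata that accompanies each transaction rather than from the transaction's primary content.

```python
from collections import Counter

# Hypothetical transaction records. The amount and the parties are the
# primary data; the channel is incidental metadata (exhaust).
transactions = [
    {"amount": 120.00, "payer": "A", "payee": "B", "channel": "atm"},
    {"amount": 40.50, "payer": "C", "payee": "D", "channel": "branch"},
    {"amount": 15.25, "payer": "A", "payee": "E", "channel": "atm"},
    {"amount": 300.00, "payer": "F", "payee": "B", "channel": "atm"},
]


def channel_share(records, channel):
    """Return the percentage of records carried out through `channel`."""
    if not records:
        return 0.0
    counts = Counter(r["channel"] for r in records)
    return 100.0 * counts[channel] / len(records)


atm_percentage = channel_share(transactions, "atm")
print(f"{atm_percentage:.1f}% of transactions used a cash machine")
```

None of the amounts or parties are read by `channel_share`; the secondary statistic comes entirely from the by-product field.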
Most medical devices, including many pacemakers, dialysis machines, and cameras used during surgery, emit some form of exhaust data.[7] The majority of this data is never captured and is discarded once the surgery is completed or the device makes its next routine check. Concerns have arisen over the use of data captured by devices such as pacemakers, and these point to larger issues surrounding the use of exhaust data.[8] Using electronic health records (EHR) for research poses many challenges, the most prominent being the sheer volume of data. There is too much for people to sort through and analyze manually, which creates a need for algorithms.[9]
Although data exhaust is not a new concept, the ubiquity of internet-enabled devices has greatly expanded the scope and impact of users' passive digital trail. The collection and distribution of the data thus generated is not illegal, but steps must be taken to ensure that its use is ethical. To protect users' privacy, the information can be anonymized before it is sold. Users can also be given the opportunity to opt out of the sale of their information. Lastly, to build trust, websites can update their privacy policies to list all the data they will collect about the user.[10]
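Two of the steps above, honoring opt-outs and removing identifying information before data is shared, might be sketched as follows. This is an illustration under stated assumptions, not a complete anonymization scheme: the field names, the `opted_out` flag, and the secret key are hypothetical, and replacing an identifier with a keyed hash is strictly pseudonymization, since the remaining fields may still allow re-identification.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored and rotated securely.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical field names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "ip_address"}


def pseudonymize(record, key=SECRET_KEY):
    """Drop direct identifiers and replace the user ID with a keyed hash."""
    token = hmac.new(
        key, record["user_id"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {
        k: v
        for k, v in record.items()
        if k not in DIRECT_IDENTIFIERS and k not in ("user_id", "opted_out")
    }
    cleaned["user_token"] = token
    return cleaned


def prepare_for_sharing(records):
    """Exclude users who opted out, then pseudonymize what remains."""
    return [pseudonymize(r) for r in records if not r.get("opted_out", False)]


# Hypothetical sample records.
records = [
    {"user_id": "u1", "email": "a@example.com", "page": "/home", "opted_out": False},
    {"user_id": "u2", "email": "b@example.com", "page": "/cart", "opted_out": True},
]
shared = prepare_for_sharing(records)
```

Using a keyed hash (HMAC) rather than a plain hash means the tokens cannot be recomputed from known user IDs without the key, while the same user still maps to the same token across records.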