The data divide is the unequal relationship between those capable of collecting, storing, mining, and managing immense volumes of data and those whose data is collected.[1] Using the framework of the digital divide, the data divide posits that the evolving nature of data and big data has created divisions and inequalities in the ownership, access, analysis, collection, and manipulation of personal data generated by information and communications technologies (ICTs).
Early research on the digital divide concentrated on divisions in access to information and digital technologies, demonstrating a split between the "haves" and the "have-nots":[2] those able to access and use digital technologies versus those who are not. Divisions were found to occur along multiple lines of inequality, including education, income, race, and gender. The digital divide has several dimensions of access, including access to equipment or hardware, ownership, support networks, digital literacy, and the skill to navigate user interfaces, among others. The Ada Lovelace Institute notes that the digital divide has exacerbated a data divide.[3] As a result, the dimensions of access present within the digital divide persist in the data divide. The data divide additionally contrasts the "haves" who have access to large-scale datasets with the "have-nots" who have neither access to large-scale datasets nor the capability to navigate them.[4] For example, private companies, often social media companies, are the only ones who have access to extensive social data. Boyd and Crawford suggest that divisions are also reinforced through research and universities: well-funded universities can buy access to datasets, and their students are more likely to find work within the same social media companies, while less prestigious institutions are less likely to be able to offer their students the same opportunities.
The COVID-19 pandemic resulted in governments worldwide issuing stay-at-home orders, lockdowns, quarantines, restrictions, and closures. Interruptions to schooling, work, business, and other public service operations caused a massive shift of otherwise in-person activities online. Activities like doctor's visits, online schooling, shipping, and remote working require high-speed or broadband internet access and digital technologies.[5] This mass adoption of data-driven digital technologies is what the Ada Lovelace Institute describes as a digital surge. In a report with the Health Foundation, the Ada Lovelace Institute identified four key elements that emerged through a public attitudes survey: a data divide based on access to data-driven technologies, a data divide based on awareness and skill, a data divide based on comfort with using health-related tracking apps, and a data divide based on choosing not to use health-tracking apps. The Institute stressed that the data divide leaves some users unable to access data that may benefit them, and warned of the dangers of those users not being represented in efforts to address health inequalities.
General advances in technology, computing power, storage, and information management practices have enabled huge quantities of data to be both produced and analyzed.[6]
Tim Berners-Lee notes an increasing detachment between people and their own personal data: regular users of digital devices or other services do not have the same capabilities to utilize data that ICTs or data brokerage firms do.[7] Berners-Lee argues that if personal data has the potential to benefit users, then users should be able to access and utilize it. However, Mark Andrejevic notes that even if users were given access to their own data, they would not be able to put it to use as effectively as data collectors because user data is not collected in isolation. Instead, data collectors accumulate data within a broader environment. By connecting an individual's interests to the profiles of other users, collectors can filter through content and interest patterns to recommend the content they deem relevant to that individual. The capacities for storing, collecting, and analyzing data require the necessary technological infrastructures, datasets, software, and processing power. Being able to extract information from large datasets necessitates access to machines, databases, and advanced algorithms.
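Andrejevic's point about data not being useful in isolation can be illustrated with a minimal sketch of user-based collaborative filtering, one common family of recommendation techniques. This is an illustration only and does not describe the method of any particular company; the user names, topics, engagement scores, and function names below are all hypothetical.

```python
from math import sqrt

# Hypothetical interest profiles held by a data collector:
# user -> {topic: engagement score}. All values are invented.
PROFILES = {
    "alice": {"cycling": 5, "cooking": 3},
    "bob": {"cycling": 4, "cooking": 2, "camping": 5},
    "carol": {"cooking": 5, "baking": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse interest profiles."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, profiles):
    """Rank topics the user has not engaged with, weighting each topic
    by how similar the users who did engage with it are to this user."""
    own = profiles[user]
    scores = {}
    for other, interests in profiles.items():
        if other == user:
            continue
        weight = cosine_similarity(own, interests)
        if weight == 0:
            continue  # no overlap with this user, so nothing to infer
        for topic, score in interests.items():
            if topic not in own:
                scores[topic] = scores.get(topic, 0.0) + weight * score
    return sorted(scores, key=scores.get, reverse=True)


# With the pooled profiles, the collector can infer new interests.
print(recommend("alice", PROFILES))
# With only her own data, the same procedure yields nothing.
print(recommend("alice", {"alice": PROFILES["alice"]}))
```

In this sketch, the same procedure that produces recommendations from the pooled profiles returns an empty list when given a single user's data, which mirrors the argument that access to one's own data alone does not confer the collector's analytical capacity.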
Smart city infrastructures are emblematic of the ways that sensors facilitated by big data technologies capture information to manage issues in urban centers. Processing technologies may relate to issues around finance, trade, social welfare, and other areas. For example, municipal governments may identify and monitor citizens, companies, and organizations; update their records; map profiles; perform data analyses to spot trends or issues; and track services. Many smart city systems track data at a granular level, and while some governments have opted for open data approaches with dashboards and key performance indicators (KPIs) on display, others are not fully transparent and do not share their processes with the public. In this sense, the data divide is represented by governments capturing real-time data on citizens, whose data is used to further manage and govern urban centers.
Working with large-scale datasets and having the skills to navigate them require types of knowledge that tend to be available only to those who have access to advanced machines, databases, and algorithms. Databases enabled by big data are too large for any person or group of people to understand unaided, which is why companies use tools and technologies assisted by artificial intelligence and algorithms. This divide is not only a matter of those who have and those who do not. It also exists in categorical processes, ontological ways of thinking about data, and the application of data. Companies with access to data are able to engage in robust and complex means of sorting. Individual use of collected data simply cannot measure up, especially when it is introduced into new work environments or communities without proper knowledge sharing or training. For example, failures to adopt new technologies in key industries such as agriculture represent aspects of both the digital and data divides.[8] Lack of data literacy can lead to data deluges: the burden of having an overwhelming amount of data without the capability to extract any meaningful information from it.[9]
A lack of collected information can create disparities which can eventually lead to information poverty. Information poverty stems from a lack of data about a given concept, where the data poverty can have a cumulative effect. This can snowball from individuals to governments on a national scale. This may become especially problematic when considering that those with access to information can act on data and thereby influence the lives of others in ways that people may not be able to concretely see. For example, a 2007 report from the World Health Organization shows that health information is one of the six fundamental building blocks of a well-functioning health system.[10] Access to quality health data is essential to resolving outbreaks, sicknesses, or other disparities in health; however, many countries, particularly in the Global South, do not have access to the relevant data sources that would allow them to otherwise address health inequities.
Overcoming the digital divide itself may provide populations with the means to access information and address digital inequities; however, it would also further exacerbate the data divide.
Data activists and information professionals can help bridge the data divide through social action, including crowdsourcing, citizen science, data cooperatives, hackathons, and civic hacking. These efforts seek to disrupt and challenge the status quo by cooperating with citizens to better understand quality of access, raise awareness, and allow citizens to generate data for their own uses.