A choropleth map is a type of statistical thematic map that uses pseudocolor, meaning color corresponding with an aggregate summary of a geographic characteristic within spatial enumeration units, such as population density or per-capita income.[1] [2] [3]
Choropleth maps provide an easy way to visualize how a variable varies across a geographic area or show the level of variability within a region. A heat map or isarithmic map is similar but uses regions drawn according to the pattern of the variable, rather than the a priori geographic areas of choropleth maps. The choropleth is likely the most common type of thematic map because published statistical data (from government or other sources) is generally aggregated into well-known geographic units, such as countries, states, provinces, and counties, and thus choropleth maps are relatively easy to create using GIS, spreadsheets, or other software tools.
The earliest known choropleth map was created in 1826 by Baron Pierre Charles Dupin, depicting the availability of basic education in France by department.[4] More "cartes teintées" ("tinted maps") were soon produced in France to visualize other "moral statistics" on education, disease, crime, and living conditions.[5] Choropleth maps quickly gained popularity in several countries due to the increasing availability of demographic data compiled from national censuses, starting with a series of choropleth maps published in the official reports of the 1841 Census of Ireland.[6] When chromolithography became widely available after 1850, color was increasingly added to choropleth maps.
The term "choropleth map" was introduced in 1938 by the geographer John Kirtland Wright, and was in common usage among cartographers by the 1940s.[7] [8] Also in 1938, Glenn Trewartha reintroduced them as "ratio maps", but this term did not survive.[9]
A choropleth map brings together two datasets: spatial data representing a partition of geographic space into distinct districts, and statistical data representing a variable aggregated within each district. There are two common conceptual models of how these interact in a choropleth map: in one view, which may be called "district dominant", the districts (often existing governmental units) are the focus, in which a variety of attributes are collected, including the variable being mapped. In the other view, which may be called "variable dominant", the focus is on the variable as a geographic phenomenon (say, the Latino population), with a real-world distribution, and the partitioning of it into districts is merely a convenient measurement technique.[10]
In a choropleth map, the districts are usually previously defined entities such as governmental or administrative units (e.g., counties, provinces, countries), or districts created specifically for statistical aggregation (e.g., census tracts), and thus have no expectation of correlation with the geography of the variable. That is, boundaries of the colored districts may or may not coincide with the location of changes in the geographic distribution being studied. This is in direct contrast to chorochromatic and isarithmic maps, in which region boundaries are defined by patterns in the geographic distribution of the subject phenomenon.
Using pre-defined aggregation regions has a number of advantages, including: easier compilation and mapping of the variable (especially in the age of GIS and the Internet with its many sources of data), recognizability of the districts, and the applicability of the information to further inquiry and policy tied to the individual districts. A prime example of this would be elections, in which the vote total for each district determines its elected representative.
However, it can result in a number of issues, generally due to the fact that the constant color applied to each aggregation district makes it look homogeneous, masking an unknown degree of variation of the variable within the district. For example, a city may include neighborhoods of low, moderate, and high family income, but be colored with one constant "moderate" color. Thus, real-world spatial patterns may not conform to the regional unit symbolized.[11] Because of this, issues such as the ecological fallacy and the modifiable areal unit problem (MAUP) can lead to major misinterpretations of the data depicted, and other techniques are preferable if one can obtain the necessary data.[12] [13]
These issues can be somewhat mitigated by using smaller districts, because they show finer variations in the mapped variable, and their smaller visual size and increased number reduce the likelihood that the map user makes judgments about the variation within a single district. However, smaller districts can make the map overly complex, especially if there is not a meaningful geographic pattern in the variable (i.e., the map looks like randomly scattered colors). Although representing specific data in large regions can be misleading, the familiar district shapes can make the map clearer and easier to interpret and remember.[14] The choice of regions will ultimately depend on the map's intended audience and purpose. Alternatively, the dasymetric technique can sometimes be employed to refine the region boundaries to more closely match actual changes in the subject phenomenon.
Because of these issues, for many variables, one may prefer an isarithmic (for a quantitative variable) or chorochromatic map (for a qualitative variable), in which the region boundaries are based on the data itself. However, in many cases such detailed information is simply not available, and the choropleth map is the only feasible option.
The variable to be mapped may come from a wide variety of disciplines in the human or natural world, although human topics (e.g., demographics, economics, agriculture) are generally more common because of the role of governmental units in human activity, which often leads to the original collection of the statistical data. The variable can also be in any of Stevens' levels of measurement: nominal, ordinal, interval, or ratio, although quantitative (interval/ratio) variables are more commonly used in choropleth maps than qualitative (nominal/ordinal) variables. It is important to note that the level of measurement of the individual datum may be different from that of the aggregate summary statistic. For example, a census may ask each individual for his or her "primary spoken language" (nominal), but this may be summarized over all of the individuals in a county as "percent primarily speaking Spanish" (ratio) or as "predominant primary language" (nominal).
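The census example above can be sketched in a few lines of Python. This is only an illustration: the list of responses and the function names are hypothetical, not taken from any census product. It shows how the same nominal responses yield either a ratio-level or a nominal-level summary for one district.

```python
from collections import Counter

# Hypothetical individual responses for one county (nominal data).
responses = ["Spanish", "English", "Spanish", "Vietnamese", "Spanish", "English"]


def percent_speaking(answers, language):
    """Ratio-level summary: percentage of respondents reporting `language`."""
    if not answers:
        raise ValueError("no responses to summarize")
    matches = sum(1 for answer in answers if answer == language)
    return 100.0 * matches / len(answers)


def predominant_language(answers):
    """Nominal-level summary: the most frequently reported language."""
    if not answers:
        raise ValueError("no responses to summarize")
    return Counter(answers).most_common(1)[0][0]


pct_spanish = percent_speaking(responses, "Spanish")   # ratio variable
top_language = predominant_language(responses)         # nominal variable
```

Either summary could be mapped, but each calls for a different kind of color scheme because its level of measurement differs.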
Broadly speaking, a choropleth map may represent two types of variables, spatially extensive and spatially intensive, a distinction common to physics and chemistry as well as geostatistics and spatial analysis.
Normalization is the technique of deriving a spatially intensive variable from one or more spatially extensive variables, so that it can be appropriately used in a choropleth map.[3] It is similar, but not identical, to the technique of normalization or standardization in statistics. Typically, it is accomplished by computing the ratio between two spatially extensive variables.[17] Although any such ratio will result in an intensive variable, only a few are especially meaningful and commonly used in choropleth maps:
These are not equivalent, nor is one better than another. Rather, they convey different aspects of a geographic narrative. For example, a choropleth map of the population density of the Latino population in Texas visualizes a narrative about the spatial clustering and distribution of that group, while a map of the percent Latino visualizes a narrative of composition and predominance. Failure to employ proper normalization will lead to an inappropriate and potentially misleading map in almost all cases.[15] [18] [19] This is one of the most common mistakes in cartography, with one study finding that at one point, more than half of United States COVID-19 dashboards hosted by state governments were not applying normalization to their choropleth maps. This is one of many issues that contributed to the infodemic surrounding the COVID-19 pandemic, and "might also be a subtle facilitator of the extreme political polarization surrounding measures to combat COVID that has occurred in the United States".[20]
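As a minimal sketch of normalization, the following Python computes two intensive variables (a density and a percentage) as ratios of extensive totals. The county records, field names, and helper functions are invented for illustration.

```python
# Hypothetical county records; every field is a spatially extensive total.
counties = [
    {"name": "A", "latino_pop": 50_000, "total_pop": 200_000, "area_km2": 1_000.0},
    {"name": "B", "latino_pop": 9_000, "total_pop": 12_000, "area_km2": 4_500.0},
]


def density(count, area_km2):
    """Intensive variable: count per unit area."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return count / area_km2


def percent(part, whole):
    """Intensive variable: share of a total, in percent."""
    if whole <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / whole


normalized = [
    {
        "name": c["name"],
        "latino_per_km2": density(c["latino_pop"], c["area_km2"]),
        "percent_latino": percent(c["latino_pop"], c["total_pop"]),
    }
    for c in counties
]
```

In this made-up data, county A has the higher density (50 versus 2 people per square kilometre) while county B has the higher percentage (75 versus 25), which is the kind of difference in narrative described above.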
See main article: Data binning, Cluster analysis and Statistical classification. Every choropleth map has a strategy for mapping values to colors. A classified choropleth map separates the range of values into classes, with all of the districts in each class being assigned the same color. An unclassed map (sometimes called n-class) directly assigns a color proportional to the value of each district. Starting with Dupin's 1826 map, classified choropleth maps have been far more common.[2] It is likely that this was originally due to the greater simplicity of applying a limited set of tints; only in the age of computerized cartography have unclassed choropleth maps even been feasible, and until recently, they were still not easy to create in most mapping software.[21] [2] [22] [23] Waldo R. Tobler, in formally introducing the unclassed scheme in 1973, asserted that it was a more accurate depiction of the original data, and stated that the primary argument in favor of classification, that it is more readable, needed to be tested.[2] The debate and experiments that followed came to the general conclusion that the primary advantage of unclassed choropleth maps, in addition to Tobler's assertion of raw accuracy, was that they allowed readers to see subtle variations in the variable, without leading them to believe that the districts that fell into the same class had identical values. Thus, readers are better able to see the general patterns in the geographic phenomenon, but not the specific values.[24] [25] The primary argument in favor of classed choropleth maps is that they are easier for readers to process, due to the smaller number of distinct shades to recognize, which reduces cognitive load and allows readers to precisely match the colors in the map to the values listed in the legend.[2] [22] [23]
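One way to picture the unclassed strategy is as a linear interpolation between two end colors. The sketch below is an assumption-laden illustration, not a method taken from the cited sources: the function names and default colors are hypothetical, and interpolation in RGB is not perceptually uniform.

```python
def lerp_color(low_rgb, high_rgb, t):
    """Linearly interpolate between two RGB triples for t in [0, 1]."""
    return tuple(round(lo + (hi - lo) * t) for lo, hi in zip(low_rgb, high_rgb))


def unclassed_color(value, vmin, vmax,
                    low_rgb=(255, 255, 255), high_rgb=(0, 0, 139)):
    """Assign a color proportional to where `value` lies between vmin and vmax."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the stated range
    return lerp_color(low_rgb, high_rgb, t)


# Three hypothetical districts with values at the minimum, midpoint, and maximum.
lowest = unclassed_color(20_000, 20_000, 150_000)
middle = unclassed_color(85_000, 20_000, 150_000)
highest = unclassed_color(150_000, 20_000, 150_000)
```

Because every distinct value receives its own shade, no two districts share a color unless their values are equal, in contrast to a classed map.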
Classification is performed by establishing a classification rule, a series of thresholds that partitions the quantitative range of variable values into a series of ordered classes. For example, if a dataset of annual median income by U.S. county includes values between US$20,000 and $150,000, it could be broken into three classes at thresholds of $45,000 and $83,000. To avoid confusion, any classification rule should be mutually exclusive and collectively exhaustive, meaning that any possible value falls into exactly one class. For example, if a rule establishes a threshold at the value 6.5, it needs to be clear about whether a district with a value of exactly 6.5 will be classified into the lower or upper class (i.e., whether the definition of the lower class is <6.5 or ≤6.5 and whether the upper class is >6.5 or ≥6.5). A variety of types of classification rules have been developed for choropleth maps.[26]
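The threshold rule described above can be sketched with Python's standard `bisect` module. The function name is hypothetical, and the choice to send a value that exactly equals a threshold to the upper class is an assumption made for the example; the opposite convention (using `bisect_left`) is equally valid as long as it is stated.

```python
from bisect import bisect_right


def classify(value, thresholds):
    """Return the 0-based class index of `value`.

    `thresholds` must be strictly increasing. A value equal to a threshold
    is placed in the upper class, so every value falls in exactly one class.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    return bisect_right(thresholds, value)


# Thresholds from the median income example: three classes.
income_thresholds = [45_000, 83_000]
low_class = classify(20_000, income_thresholds)      # class 0: below 45,000
boundary_class = classify(45_000, income_thresholds)  # class 1: boundary goes up
high_class = classify(150_000, income_thresholds)     # class 2: 83,000 and above
```

Because the index is computed by a single binary search, the classes cannot overlap or leave gaps, which satisfies the mutually exclusive and collectively exhaustive requirement.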
Because calculated thresholds can often be at precise values that are not easily interpretable by map readers (e.g., $74,326.9734), it is common to create a modified classification rule by rounding threshold values to a similar simple number. A common example is a modified geometric progression that subdivides powers of ten, such as [1, 2.5, 5, 10, 25, 50, 100, ...] or [1, 3, 10, 30, 100, ...].
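Both ideas in this paragraph can be illustrated briefly. The sketch below rounds a calculated threshold to a chosen number of significant digits and generates a progression that subdivides powers of ten. The function names and the two-digit default are assumptions for the example, not an established rule.

```python
import math


def round_significant(value, digits=2):
    """Round `value` to the given number of significant digits."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def nice_progression(min_exp, max_exp, steps=(1.0, 2.5, 5.0)):
    """Subdivide each power of ten from 10**min_exp up to 10**max_exp."""
    values = [s * 10.0 ** k for k in range(min_exp, max_exp) for s in steps]
    values.append(10.0 ** max_exp)
    return values


rounded = round_significant(74_326.9734)          # a simpler nearby threshold
series_a = nice_progression(0, 2)                  # steps of 1, 2.5, 5
series_b = nice_progression(0, 2, steps=(1.0, 3.0))  # steps of 1, 3
```

Rounding changes which districts fall into each class near the boundary, so the rounded thresholds, not the calculated ones, are the ones that should be applied to the data and shown in the legend.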
See main article: Color scheme. The final element of a choropleth map is the set of colors used to represent the different values of the variable. There are a variety of different approaches to this task, but the primary principle is that any order in the variable (e.g., low to high quantitative values) should be reflected in the perceived order of the colors (e.g., light to dark), as this will allow map readers to intuitively make "more vs. less" judgements and see trends and patterns with minimal reference to the legend. A second general guideline, at least for classified maps, is that the colors should be easily distinguishable, so the colors on the map can be unambiguously matched to those in the legend to determine the represented values. This requirement limits the number of classes that can be included; tests have shown that when value alone is used (e.g., light to dark, whether gray or any single hue), it is difficult to practically use more than seven classes.[28] If differences in hue and/or saturation are incorporated, that limit increases significantly, to as many as 10 to 12 classes. The need for color discrimination is further affected by color vision deficiencies; for example, color schemes that use red and green to distinguish values will not be useful for a significant portion of the population.[29]
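As a rough check of the first principle, one can test whether a palette runs monotonically from light to dark. The sketch below uses the sRGB relative luminance formula as a proxy for perceived lightness; that choice, the function names, and the sample palettes are assumptions for illustration, and a perceptual color space would give a more faithful measure.

```python
def relative_luminance(rgb):
    """Relative luminance of an 8-bit sRGB color (0 = black, 1 = white)."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(ch) for ch in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_ordered_light_to_dark(palette):
    """True if each color is strictly darker than the one before it."""
    lums = [relative_luminance(color) for color in palette]
    return all(a > b for a, b in zip(lums, lums[1:]))


# A five-class single-hue blue ramp and a two-color red/green pair.
blue_ramp = [(239, 243, 255), (189, 215, 231), (107, 174, 214),
             (49, 130, 189), (8, 81, 156)]
red_green = [(255, 0, 0), (0, 255, 0)]

blue_ok = is_ordered_light_to_dark(blue_ramp)
red_green_ok = is_ordered_light_to_dark(red_green)
```

A check like this addresses only ordering; it says nothing about whether adjacent classes are distinguishable or whether the scheme works for readers with color vision deficiencies.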
The most common types of color progressions used in choropleth (and other thematic) maps include:[30] [31]
See main article: Bivariate map. It is possible to represent two (and sometimes three) variables simultaneously on a single choropleth map by representing each with a single-hue progression and blending the colors of each district. This technique was first published by the U.S. Census Bureau in the 1970s, and has been used many times since, to varying degrees of success.[35] It is generally used to visualize the correlation and contrast between two variables hypothesized to be closely related, such as educational attainment and income. Contrasting but not complementary colors are generally used, so that their combination is intuitively recognized as "between" the two original colors, such as red+blue=purple. The technique works best when the geography of the variable has a high degree of spatial autocorrelation, so that there are large regions of similar colors with gradual changes between them; otherwise the map can look like a confusing mix of random colors. Bivariate maps have been found to be easier to use if the map includes a carefully designed legend and an explanation of the technique.[36]
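The blending idea can be sketched as follows. Each variable is scaled to the range 0 to 1, mapped onto its own white-to-hue ramp, and the two colors are averaged channel by channel. Averaging in RGB is only one of several possible blending rules and is an assumption of this example, as are the function names and default hues.

```python
WHITE = (255, 255, 255)


def ramp(hue_rgb, t):
    """Single-hue progression from white (t = 0) to the full hue (t = 1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    return tuple(w + (h - w) * t for w, h in zip(WHITE, hue_rgb))


def bivariate_color(t_a, t_b, hue_a=(255, 0, 0), hue_b=(0, 0, 255)):
    """Blend two single-hue ramps by averaging their RGB channels."""
    color_a = ramp(hue_a, t_a)
    color_b = ramp(hue_b, t_b)
    return tuple(round((a + b) / 2) for a, b in zip(color_a, color_b))


both_low = bivariate_color(0.0, 0.0)    # white
both_high = bivariate_color(1.0, 1.0)   # purple, "between" red and blue
only_a_high = bivariate_color(1.0, 0.0)  # a light red
```

With this rule a district that is high on only one variable never reaches the fully saturated hue, which is one reason published bivariate schemes are usually hand-tuned rather than computed by a simple average.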
See also: Page layout (cartography). A choropleth map uses ad hoc symbols to represent the mapped variable. While the general strategy may be intuitive if a color progression is chosen that reflects the proper order, map readers cannot decipher the actual value of each district without a legend. A typical choropleth legend for a classed choropleth map includes a series of sample patches of the symbol for each class, with a text description of the corresponding range of values. On an unclassed choropleth map, it is common for the legend to show a smooth color gradient between the minimum and maximum values, with two or more points along it labeled with corresponding values.
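For a classed map, the text descriptions in the legend follow directly from the class thresholds. The sketch below generates one label per class; the wording of the labels, the function name, and the convention that a boundary value belongs to the upper class are assumptions for the example.

```python
def legend_labels(vmin, vmax, thresholds, fmt="{:,.0f}"):
    """Build one text label per class from the data range and thresholds."""
    edges = [vmin, *thresholds, vmax]
    if any(a >= b for a, b in zip(edges, edges[1:])):
        raise ValueError("vmin, thresholds, and vmax must be strictly increasing")
    labels = []
    last = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        if i == last:
            # The top class includes the maximum value itself.
            labels.append(f"{fmt.format(lo)} to {fmt.format(hi)}")
        else:
            # Other classes stop just short of the next threshold.
            labels.append(f"{fmt.format(lo)} to under {fmt.format(hi)}")
    return labels


labels = legend_labels(20_000, 150_000, [45_000, 83_000])
```

Each label would be paired with a sample patch of the corresponding class color when the legend is drawn.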
An alternative approach is the histogram legend, which includes a histogram showing the frequency distribution of the mapped variable (i.e., the number of districts in each class). Each class may be represented by a single bar with its width determined by its minimum and maximum threshold values and its height calculated such that the bar's area is proportional to the number of districts included, then colored with the map symbol used for that class. Alternatively, the histogram may be divided into a large number of bars, such that each class includes one or more bars, symbolized according to its symbol in the map.[37] This form of legend not only shows the threshold values for each class, but also gives some context for the source of those values, especially for endogenous classification rules that are based on the frequency distribution, such as quantiles. However, histogram legends are not currently supported in GIS and mapping software, and must typically be constructed manually.
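Since such legends usually have to be built by hand, the bar geometry for the one-bar-per-class variant can be computed as in the following sketch. The function name, the sample values, and the rule that boundary values go to the upper class are assumptions; the key step is dividing each count by the class width so that area, not height, is proportional to the number of districts.

```python
from bisect import bisect_right


def histogram_legend_bars(values, vmin, vmax, thresholds):
    """Return one bar per class with area proportional to its district count."""
    edges = [vmin, *thresholds, vmax]
    if any(a >= b for a, b in zip(edges, edges[1:])):
        raise ValueError("vmin, thresholds, and vmax must be strictly increasing")
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values outside [vmin, vmax] are counted in the first or last class.
        counts[bisect_right(thresholds, v)] += 1
    bars = []
    for i, count in enumerate(counts):
        width = edges[i + 1] - edges[i]
        bars.append({
            "left": edges[i],
            "width": width,
            "height": count / width,  # so that width * height == count
            "count": count,
        })
    return bars


# Hypothetical district values for the median income example.
district_values = [22_000, 30_000, 41_000, 50_000, 60_000, 90_000]
bars = histogram_legend_bars(district_values, 20_000, 150_000, [45_000, 83_000])
```

Each bar can then be drawn with any plotting tool and filled with the color used for its class on the map.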