The granularity-related inconsistency of means (GRIM) test is a simple statistical test used to identify inconsistencies in the analysis of data sets. The test relies on the fact that, given a dataset containing N integer values, the arithmetic mean (commonly called simply the average) is restricted to a few possible values: it must always be expressible as a fraction with an integer numerator and a denominator N. If the reported mean does not fit this description, there must be an error somewhere; the preferred term for such errors is "inconsistencies", to emphasise that their origin is, on first discovery, typically unknown. GRIM inconsistencies can result from inadvertent data-entry or typographical errors or from scientific fraud. The GRIM test is most useful in fields such as psychology where researchers typically use small groups and measurements are often integers. The GRIM test was proposed by Nick Brown and James Heathers in 2016, following increased awareness of the replication crisis in some fields of science.[1]
The GRIM test is straightforward to perform. For each reported mean in a paper, the sample size (N) is found, and all fractions with denominator N are calculated. The mean is then checked against this list, allowing for the fact that values may be rounded inconsistently: depending on the context, a mean of 1.125 may be reported as either 1.12 or 1.13. If the mean does not appear in the list, it is flagged as mathematically impossible.[2][3]
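The procedure above can be sketched in Python. This is an illustrative implementation, not a canonical one: the function name, string-based input, and tolerance handling are choices made here to accommodate rounding in either direction at the last decimal place.

```python
def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if a mean reported as a string (e.g. "3.48") could
    arise from averaging n integer values, allowing for rounding in
    either direction at the final decimal place."""
    decimals = len(reported_mean.split(".")[1]) if "." in reported_mean else 0
    mean = float(reported_mean)
    # The true mean must equal k/n for some integer sum k, so only the
    # sums whose quotients lie closest to the reported value need testing.
    nearest_sum = round(mean * n)
    # A value reported to `decimals` places can differ from the true mean
    # by at most half a unit in the last place; the boundary is inclusive
    # because 1.125 may legitimately be rounded to either 1.12 or 1.13.
    tolerance = 0.5 / 10**decimals + 1e-9
    return any(abs(k / n - mean) <= tolerance
               for k in (nearest_sum - 1, nearest_sum, nearest_sum + 1))
```

For example, `grim_consistent("3.45", 20)` returns True, while `grim_consistent("3.48", 20)` returns False, since no sum of 20 integers divided by 20 rounds to 3.48.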
Consider an experiment in which a fair die is rolled 20 times. Each roll produces a whole number between 1 and 6, and the hypothesized mean value is 3.5. The results of the rolls are averaged, and the mean is reported as 3.48. This is close to the expected value and appears to support the hypothesis. However, a GRIM test reveals that the reported mean is mathematically impossible: the result of dividing any whole number by 20, written to 2 decimal places, must be of the form X.X0 or X.X5; it is impossible to divide any integer by 20 and produce a result with an "8" in the second decimal place.[4]
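The impossibility of 3.48 can be verified by enumerating every achievable mean for 20 dice rolls; the short sketch below does this by brute force over all possible sums.

```python
# Sums of 20 dice rolls range from 20 (all ones) to 120 (all sixes),
# so the achievable means k/20 advance in steps of 0.05.
possible_means = sorted({round(k / 20, 2) for k in range(20, 121)})

assert 3.45 in possible_means      # consistent with 20 rolls
assert 3.48 not in possible_means  # no integer sum yields this mean
```

Every entry in the list ends in 0 or 5 at the second decimal place, confirming the reasoning above.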
A failed GRIM test is not automatically a sign of manipulation. Errors in the mean can come about innocently as a result of a mistake on the part of the tester, typographical errors, calculation and programming mistakes, or improper reporting of the sample size.[2] However, a failure can also be a sign that data have been improperly excluded or that the mean has been illegitimately fudged to make the results appear more significant. The location of failures can indicate the underlying cause: an isolated impossible mean may be caused by a simple error, multiple impossible values in the same row of a table may indicate a poor response rate, and multiple impossible values in the same column may indicate that the stated sample size is incorrect. Multiple errors scattered throughout a table can be a sign of deeper problems, and other statistical tests can be used to analyze the suspect data.[5]
The GRIM test works best with data sets in which the sample size is relatively small, the number of subcomponents in composite measures is also small, and the mean is reported to multiple decimal places.[2] In some cases, a valid mean may appear to fail the test if the input data are not discretized as expected – for example, if people are asked how many slices of pizza they ate at a buffet, some may respond with a fraction such as "three and a half" instead of the expected whole number.[5]
Brown and Heathers applied the test to 260 articles published in Psychological Science, Journal of Experimental Psychology: General, and Journal of Personality and Social Psychology. Of these articles, 71 were amenable to GRIM test analysis; 36 of these contained at least one impossible value and 16 contained multiple impossible values.[3]
GRIM testing also played a significant role in uncovering errors in publications by Cornell University's Food and Brand Lab under Brian Wansink. GRIM testing revealed that a series of articles on the effect of price on consumption at an all-you-can-eat pizza buffet contained many impossible means – deeper analysis of the raw data revealed that in many cases, sample sizes were incorrectly stated and values incorrectly calculated.[1][5]