Visual indexing theory, also known as FINST theory, is a theory of early visual perception developed by Zenon Pylyshyn in the 1980s. It proposes a pre-attentive mechanism (a ‘FINST’) whose function is to individuate salient elements of a visual scene, and track their locations across space and time. Developed in response to what Pylyshyn viewed as limitations of prominent theories of visual perception at the time, visual indexing theory is supported by several lines of empirical evidence.
'FINST' abbreviates ‘FINgers of INSTantiation’. Pylyshyn describes visual indexing theory in terms of this analogy.[1] Imagine, he proposes, placing your fingers on five separate objects in a scene. As those objects move about, your fingers stay in respective contact with each of them, allowing you to continually track their whereabouts and positions relative to one another. While you may not be able to discern in this way any detailed information about the items themselves, the presence of your fingers provides a reference via which you can access such information at any time, without having to relocate the objects within the scene. Furthermore, the objects' continuity over time is inherently maintained — you know the object referenced by your pinky finger at time t is the same object as that referenced by your pinky at t−1, regardless of any spatial transformations it has undergone, because your finger has remained in continuous contact with it.
Visual indexing theory holds that the visual perceptual system works in an analogous way. FINSTs behave like the fingers in the above scenario, pointing to and tracking the location of various objects in visual space. Like fingers, FINSTs are:
FINSTs operate pre-attentively — that is, before attention is drawn or directed to an object in the visual field. Their primary task is to individuate certain salient features in a scene, conceptually distinguishing these from other stimuli. Under visual indexing theory, FINSTing is a necessary precondition for higher level perceptual processing.
Pylyshyn suggests that what FINSTs operate upon in a direct sense is 'feature clusters' on the retina, though a precise set of criteria for FINST allocation has not been defined. "The question of how FINSTs are assigned in the first instance remains open, although it seems reasonable that they are assigned primarily in a stimulus-driven manner, perhaps by the activation of locally distinct properties of the stimulus – particularly by new features entering the visual field."[1]
FINSTs are subject to resource constraints. Up to around five FINSTs can be allocated at any given time, and these provide the visual system with information about the locations of FINSTed objects relative to one another.
Once an object has been individuated, its FINST continues to index that particular feature cluster as it moves across the retina. "Thus distal features which are currently projected onto the retina can be indexed through the FINST mechanism in a way that is transparent to their retinal location."[1] By continually tracking an object's whereabouts as it moves about, FINSTs perform the additional function of maintaining the continuity of objects over time.
Under visual indexing theory, an object cannot be attended to until it has first been indexed. Once it has been allocated a FINST, the index provides the visual system with rapid and preferential access to the object for further processing of features such as colour, texture and shape.
While in this sense FINSTs provide the means for higher-level processing to occur, FINSTs themselves are "opaque to the properties of the objects to which they refer."[1] FINSTs do not directly convey any information about an indexed object, beyond its position at a given instant. "Thus, on initial contact, objects are not interpreted as belonging to a certain type or having certain properties; in other words, objects are initially detected without being conceptualised."[2] Like the fingers described above, FINSTs' role in visual perception is purely an indexical one.
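The properties described above (a limited pool of indexes, binding to an individual rather than to a description, and opacity to the object's features) can be illustrated with a toy data structure. This is only a sketch for intuition, not Pylyshyn's formal model: the names `VisualObject` and `IndexPool`, the methods, and the fixed capacity of five are all assumptions introduced here.

```python
from dataclasses import dataclass, field

MAX_INDEXES = 5  # "around five" in the text; the exact value is an assumption


@dataclass
class VisualObject:
    """A hypothetical scene object: a position plus arbitrary properties."""
    position: tuple
    properties: dict = field(default_factory=dict)


class IndexPool:
    """Toy pool of FINST-like indexes (illustrative only)."""

    def __init__(self, capacity=MAX_INDEXES):
        self.capacity = capacity
        self._bound = {}  # index number -> the individual object it points at

    def assign(self, obj):
        """Bind the lowest free index to obj; return None if the pool is full."""
        for i in range(self.capacity):
            if i not in self._bound:
                self._bound[i] = obj
                return i
        return None

    def locate(self, index):
        """The only thing an index conveys directly: the current position."""
        return self._bound[index].position

    def attend(self, index):
        """Later processing follows the index to read colour, shape, etc."""
        return self._bound[index].properties

    def release(self, index):
        """Free an index so that it can be bound to another object."""
        self._bound.pop(index, None)


# Usage: six candidate objects compete for five indexes.
pool = IndexPool()
objects = [VisualObject((float(i), 0.0), {"colour": "red"}) for i in range(6)]
assigned = [pool.assign(o) for o in objects]  # the sixth object gets None

# The binding survives movement and property changes.
objects[0].position = (9.0, 9.0)
objects[0].properties["colour"] = "blue"
```

Because each index is bound to the object itself rather than to a stored description, `pool.locate(0)` follows the object to its new position and `pool.attend(0)` reflects the changed colour, mirroring the claim that identity is maintained independently of properties.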
Visual indexing theory was created partly in response to what Pylyshyn viewed as limitations of traditional theories of perception and cognition — in particular, the spotlight model of attention, and the descriptive view of visual representation.[1]
The traditional view of visual perception holds that attention is fundamental to visual processing. In terms of an analogy offered by Posner, Snyder and Davidson (1980): "Attention can be likened to a spotlight that enhances the efficiency of detection of events within its beam".[3] This spotlight can be controlled volitionally, or drawn involuntarily to salient elements of a scene,[4] but a key characteristic is that it can only be deployed to one location at a time. In 1986, Eriksen and St. James conducted a series of experiments which suggested that the spotlight of attention comes equipped with a zoom-lens. The zoom-lens allows the size of the area of attentional focus to be expanded (but due to a fixed limit on available attentional resources, only at the expense of processing efficiency).[5]
According to Pylyshyn, the spotlight/zoom-lens model cannot tell the complete story of visual perception. He argues that a pre-attentive mechanism is needed to individuate objects upon which a spotlight of attention could be directed in the first place. Furthermore, results of multiple object tracking studies (discussed below) are "incompatible with the proposal that items are accessed by moving around a single spotlight of attention." Visual indexing theory addresses these limitations.
According to the classical view of mental representation, we perceive objects according to the conceptual descriptions they fall under. It is these descriptions, and not the raw content of our visual perceptions, that allow us to construct meaningful representations of the world around us, and determine appropriate courses of action. In Pylyshyn's words, "it is not the bright spot in the sky that determines which way we set out when we are lost, but the fact that we see it (or represent it) as the North Star".[6] The method by which we come to match a percept to its appropriate description has been the subject of ongoing investigation (for example the way in which parts of objects are combined to represent their whole),[7] but there is a general consensus that descriptions are fundamental in this way to visual perception.[6]
Like the spotlight model of attention, Pylyshyn takes the descriptive model of visual representation to be incomplete. One issue is that the theory does not account for demonstrative, or indexical, references. "For example, in the presence of a visual stimulus, we can think thoughts such as 'that is red' where the term 'that' refers to something we have picked out in our field of view without reference to what category it falls under or what properties it may have."[6] Relatedly, the theory has problems accounting for how we are able to pick out a single token among several objects of the same type. For example, I may refer to a particular can of soup on a supermarket shelf sitting among a number of identical cans that answer to the same description. In both cases, a spatiotemporal reference is required in order to pick out the object within the scene, independently of any description that object may fall under. FINSTs, Pylyshyn suggests, provide just such a reference.
A deeper problem for this view, according to Pylyshyn, is that it cannot account for objects' continuity over time. "An individual remains the same individual when it moves about or when it changes any (or even all) of its visible properties."[6] If we refer to objects solely in terms of their conceptual descriptions, it is not clear how the visual system maintains an object's identity when those descriptions change. "The visual system needs to be able to pick out a particular individual regardless of what properties the individual happens to have at any instant of time."[6] Pylyshyn argues that FINSTs' detachment from the descriptions of the objects they reference overcomes this problem.
Three main types of experiments provide data that support visual indexing theory. Multiple object tracking studies demonstrate that more than one object can be tracked within the visual field simultaneously, subitizing studies suggest the existence of a mechanism that allows small numbers of objects to be efficiently enumerated, and subset selection studies show that certain elements of a visual scene can be processed independently of other items. In all three cases, FINSTs provide an explanation of the phenomenon observed.[8] [2]
Multiple object tracking describes the ability of human subjects to simultaneously track the movement of up to five target objects as they move across the visual field, usually in the presence of identical moving distractor objects of equal or greater number. The phenomenon was first demonstrated by Pylyshyn and Storm in 1988,[9] and their results have been widely replicated (see Pylyshyn, 2007, for a summary).[10]
Experimental setup
In a typical experiment, a number of identical objects (up to 10) are initially shown on a screen. Some subset of these objects (up to five) is then designated as targets, usually by flashing or changing colour momentarily, before returning to being indistinguishable from the non-target objects. All of the objects then move randomly around the screen for between 7 and 15 seconds. The subject's task is to identify, once the objects have stopped moving, which objects were the targets. Successful completion of the task thus requires subjects to continually track each of the target objects as they move, and ignore the distractors.
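The structure of such a trial can be sketched in a few lines of code. The following is a rough illustration only: the function names `make_trial` and `score_response`, the 100-by-100 display, the random-walk motion, and the proportion-correct score are assumptions made here and are not taken from any particular study.

```python
import random


def make_trial(n_objects=10, n_targets=5, n_steps=100, step_size=1.0, seed=0):
    """Generate one hypothetical tracking trial.

    Returns the set of target object numbers and a list of frames, where each
    frame is a list of (x, y) positions, one per object.
    """
    rng = random.Random(seed)
    # Identical objects scattered over an assumed 100 x 100 display.
    positions = [(rng.uniform(0, 100), rng.uniform(0, 100))
                 for _ in range(n_objects)]
    # A random subset is designated as targets (e.g. by flashing).
    targets = set(rng.sample(range(n_objects), n_targets))
    frames = [positions]
    for _ in range(n_steps):
        # Simple random walk; positions are not clipped to the display.
        positions = [(x + rng.uniform(-step_size, step_size),
                      y + rng.uniform(-step_size, step_size))
                     for x, y in positions]
        frames.append(positions)
    return targets, frames


def score_response(targets, response):
    """Proportion of true targets that the subject picked out at the end."""
    return len(set(targets) & set(response)) / len(targets)
```

For example, a subject who correctly reports three of five targets would score `score_response({1, 2, 3, 4, 5}, [1, 2, 3, 8, 9])`, which is 0.6.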
Results
Under such experimental conditions, it has been repeatedly found that subjects can track multiple moving objects simultaneously.[8] In addition to consistently observing a high rate of successful target tracking, researchers have shown that subjects can:
Two defining properties of FINSTs are their plurality, and their capacity to track indexed objects as they move around a visually cluttered scene. "Thus multiple-item tracking studies provide strong support for one of the more counterintuitive predictions of FINST theory — namely, that the identity of items can be maintained by the visual system even when the items are visually indiscriminable from their neighbors and when their locations are constantly changing."[8]
See also: Numerical cognition.
Subitizing refers to the rapid and accurate enumeration of small numbers of items. Numerous studies (dating back to Jevons in 1871)[19] have demonstrated that subjects can very quickly and accurately report the quantity of objects randomly presented on a display, when they number fewer than around five. While larger quantities require subjects to count or estimate, at great expense of time and accuracy, it seems that a different enumeration method is employed in these low-quantity cases. In 1949, Kaufman, Lord, Reese and Volkmann coined the term 'subitizing' to describe the phenomenon.[20]
In 2023, a study of single-neuron recordings in the medial temporal lobe of neurosurgical patients judging numbers reported evidence of two separate neural mechanisms, with a boundary in neuronal coding around the number 4 that correlates with the behavioural transition from subitizing to estimation. This supports the earlier observation of Jevons.[21] [22]
Experimental setup
In a typical experiment, subjects are briefly shown (for around 100 ms) a screen containing a number of randomly arranged objects. The subjects' task is to report the number of items shown, which can range between one and several hundred per trial.
Results
When the number of items to be enumerated is within the subitizing range, each additional item on the display adds around 40–120 ms to the total response time. Beyond the subitizing range, each additional item adds 250–350 ms, so that plotting reaction time against the number of items presented yields an 'elbow'-shaped curve. Researchers generally take this as evidence of (at least) two different enumeration methods at work: one for small numbers, and another for larger numbers.[23]
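The 'elbow' can be made concrete with a two-slope model of response time. The sketch below is illustrative only: the function name `predicted_rt_ms`, the baseline of 400 ms, and the subitizing limit of four items are assumptions, and the two slopes (80 ms and 300 ms per item) are simply values chosen from within the ranges quoted above.

```python
def predicted_rt_ms(n_items, subitizing_limit=4, base_ms=400.0,
                    slope_small_ms=80.0, slope_large_ms=300.0):
    """Two-slope (piecewise linear) response-time model, in milliseconds.

    Items up to `subitizing_limit` each add `slope_small_ms`; every item
    beyond the limit adds `slope_large_ms`. All defaults are assumptions.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    within = min(n_items, subitizing_limit)      # items in the subitizing range
    beyond = max(n_items - subitizing_limit, 0)  # items past the elbow
    return base_ms + slope_small_ms * within + slope_large_ms * beyond


# Predicted response times for displays of one to eight items.
curve = [predicted_rt_ms(n) for n in range(1, 9)]
```

Under these assumed parameters, the step between successive values of `curve` is 80 ms up to four items and 300 ms thereafter, which is the change of slope that the elbow describes.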
Trick and Pylyshyn (1993) argue that "subitizing can be explained only by virtue of a limited-capacity mechanism that operates after the spatially parallel processes of feature detection and grouping but before the serial processes of spatial attention."[23] In other words, by a mechanism such as a FINST.
A key assumption of visual indexing theory is that once an item entering the visual field has been indexed, that index provides the subject with rapid subsequent access to the object, which bypasses any higher level cognitive processes.[2] In order to test this hypothesis, Burkell and Pylyshyn (1997) designed a series of experiments to see whether subjects could effectively index a subset of items on a display, such that a search task could be undertaken with respect to only the selected items.[24]
Experimental setup
Burkell and Pylyshyn's experiments took advantage of a well-documented distinction between two types of visual search:
The experimental setup is similar to a typical conjunction search task: 15 items are presented on a screen, each of which has one of two colours, and one of two orientations. Three of these items are designated as the subset by late onset (appearing after the others). The subset contains the target item and two distractors.
The key independent variable in this experiment is the nature of the subset selected. In some cases, the subset comprises a feature search set — i.e. the target differs from the two distractors in one dimension only. In other cases, the subset is equivalent to a conjunction search, with the target differing from the distractors in both dimensions. Because the total display contains items that differ from the target in both dimensions, if subjects are quicker to respond to the feature search subsets, this would suggest they had taken advantage of the "pop out" method of target identification. This in turn would mean that they had applied their visual search to the subsetted items only.
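The contrast between the two subset types can be expressed as a small classification function. This is a sketch under one reading of the distinction, following common visual-search usage: a subset counts as a feature search set when some single dimension separates the target from every distractor in the subset, and as a conjunction set otherwise. The `Item` type, the function name, and the example colour and orientation values are assumptions introduced for illustration, and the function assumes that no distractor is identical to the target.

```python
from typing import NamedTuple


class Item(NamedTuple):
    """A hypothetical display item defined on two dimensions."""
    colour: str
    orientation: str


def subset_search_type(target, distractors):
    """Classify a subset as a 'feature' or 'conjunction' search set.

    Returns 'feature' if at least one dimension alone distinguishes the
    target from every distractor; otherwise returns 'conjunction'.
    """
    for dimension in Item._fields:
        if all(getattr(d, dimension) != getattr(target, dimension)
               for d in distractors):
            return "feature"
    return "conjunction"


target = Item("red", "left")

# Both distractors differ from the target in colour only.
feature_subset = [Item("green", "left"), Item("green", "left")]

# Each distractor shares one feature with the target, so neither
# dimension alone picks the target out.
conjunction_subset = [Item("green", "left"), Item("red", "right")]
```

Here `subset_search_type(target, feature_subset)` returns `"feature"` and `subset_search_type(target, conjunction_subset)` returns `"conjunction"`. The experimental question is whether response times follow the type of the three-item subset rather than that of the full 15-item display.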
Results
Burkell and Pylyshyn found that subjects were indeed quicker to identify the target object in the subset feature search condition than they were in the subset conjunction search condition, suggesting that the subsetted objects were successfully prioritised. In other words, the subsets "could, in a number of important ways, be accessed by the visual system as though they were the only items present".[8] Furthermore, the subsetted objects' particular positions within the display made no difference to subjects' ability to search across them — even when they were distally located.[24] Watson and Humphreys (1997) reported similar findings.[26] These results are consistent with the predictions of visual indexing theory: FINSTs provide a possible mechanism by which the subsets were prioritised.