Natural scene perception refers to the process by which an agent (such as a human being) visually takes in and interprets scenes that it typically encounters in natural modes of operation (e.g. busy streets, meadows, living rooms).[1] This process has been modeled in several different ways that are guided by different concepts.
One major dividing line between theories that explain natural scene perception is the role of attention. Some theories maintain the need for focused attention, while others claim that focused attention is not involved.
Focused attention played a partial role in early models of natural scene perception. Such models involved two stages of visual processing.[2] According to these models, the first stage is attention free and registers low-level features such as brightness gradients, motion and orientation in a parallel manner. The second stage, by contrast, requires focused attention: it registers high-level object descriptions, has limited capacity and operates serially. These models have been empirically informed by studies demonstrating change blindness, inattentional blindness and attentional blink. Such studies show that when one's visual focused attention is engaged by a task, significant changes in one's environment that are not directly pertinent to the task can escape awareness. It was generally thought that natural scene perception was similarly susceptible to change blindness, inattentional blindness and attentional blink, and that these psychological phenomena occurred because engaging in a task diverts attentional resources that would otherwise be used for natural scene perception.
The attention-free hypothesis soon emerged to challenge early models. The initial basis for the attention-free hypothesis was the finding that in visual search, basic visual features of objects immediately and automatically pop out to the person doing the visual search.[3] Further experiments seemed to support this: Potter (as cited by Evans & Treisman, 2005) showed that high-order representations can be accessed rapidly from natural scenes presented at rates of up to 10 per second. Additionally, Thorpe, Fize & Marlot (as cited by Evans & Treisman) discovered that humans and other primates can categorize natural images (e.g. images of animals in everyday indoor and outdoor scenes) rapidly and accurately even after brief exposures. The basic idea in these studies is that exposure to each individual scene is too brief for attentional processes to occur, yet human beings are able to interpret and categorize these scenes.
Weaker versions of the attention-free hypothesis have also been targeted at specific components of the natural scene perception process instead of the process as a whole. Kihara & Takeda (2012) limit their claim to saying that it is the integration of spatial frequency-based information in natural scenes (a sub-process of natural scene perception) that is attention free.[4] This claim is based on their study, which used attention-demanding tasks to examine participants' abilities to accurately categorize images that were filtered to have a wide range of spatial frequencies. The logic behind this experiment was that if integration of visual information across spatial frequencies (measured by the categorization task) is preattentive, then attention-demanding tasks should not affect performance in the categorization task. This was indeed found to be the case.
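The spatial-frequency filtering described above can be illustrated with a minimal sketch. The code below is not Kihara & Takeda's actual stimulus-generation procedure; the band cutoffs, image size and radial frequency definition are arbitrary choices for the example. It shows how an image can be decomposed into spatial-frequency bands (the kind of filtered stimuli whose integration the study probed) via the Fourier transform:

```python
import numpy as np

def bandpass(image, low, high):
    """Keep only spatial frequencies with radius in [low, high) cycles/image.

    Illustrative only: cutoffs and the circular band definition are
    arbitrary choices, not those of the cited study.
    """
    f = np.fft.fftshift(np.fft.fft2(image))          # centre the spectrum
    h, w = image.shape
    yy, xx = np.ogrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.hypot(yy, xx)                        # frequency radius
    mask = (radius >= low) & (radius < high)         # annular band
    return np.fft.ifft2(np.fft.ifftshift(f * mask)).real

rng = np.random.default_rng(1)
img = rng.random((64, 64))                           # stand-in "scene"
# A low- and a mid-frequency band that could be recombined or shown alone:
recombined = bandpass(img, 0, 8) + bandpass(img, 8, 32)
```

Passing a band wide enough to cover the whole spectrum returns the original image, which is the sense in which the bands carry complementary information to be integrated.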
A recent study by Cohen, Alvarez & Nakayama (2011) calls into question the validity of evidence supporting the attention-free hypothesis. They found that participants did display inattentional blindness while doing certain kinds of multiple-object tracking (MOT) and rapid serial visual presentation (RSVP) tasks.[5] Furthermore, Cohen et al. found that participants' natural scene perception was impaired under dual-task conditions, but that this dual-task impairment happened only when participants' primary task was sufficiently demanding. The authors concluded that previous studies showing the absence of a need for focused attention did not use tasks that were demanding enough to fully engage attention.
In the Cohen et al. study, the MOT task involved viewing eight black moving discs presented against a changing background that consisted of randomly colored checkerboard masks. Four of these discs were picked out and participants were instructed to track these four discs. The RSVP task involved viewing a stream of letters and digits presented against a series of changing checkerboards, and counting the number of times a digit was presented. In both experiments, the critical trial involved a natural scene suddenly replacing the second-to-last checkerboard; participants were immediately afterwards asked whether they had noticed anything different, and were presented with six questions to determine whether they had categorized the scene. The dual-task condition simply involved participants performing the MOT task mentioned above and a scene-classification task simultaneously. The authors varied the difficulty of the task (i.e. how demanding the task was) by increasing or decreasing the speed of the moving discs.
Several models have been proposed to explain natural scene perception.
Evans & Treisman (2005) proposed a hypothesis that humans rapidly detect disjunctive sets of unbound features of target categories in a parallel manner, and then use these features to discriminate between scenes that do or do not contain the target without necessarily fully identifying it. An example of such a feature would be outstretched wings that can be used to tell whether or not a bird is in a picture, even before the system has identified an object as a bird. Evans & Treisman propose that natural scene perception involves a first pass through the visual processing hierarchy up to the nodes in a visual identification network, and then optional revisiting of earlier levels for more detailed analysis. During the 'first pass' stage, the system forms a global representation of the natural scene that includes the layout of global boundaries and potential objects. During the 'revisiting' stage, focused attention is employed to select local objects of interest in a serial manner, and then bind their features to their representations.
This hypothesis is consistent with the results of their study in which participants were instructed to detect animal targets in RSVP sequences, and then report their identities and locations. While participants were able to detect the targets in most trials, they were often subsequently unable to identify or localize them. Furthermore, when two targets were presented in quick succession, participants displayed a significant attentional blink when required to identify the targets, but the attentional blink was largely eliminated among participants required only to detect them. Evans & Treisman explain these results with the hypothesis that the attentional blink occurs because the identification stage requires attentional resources, while the detection stage does not.
Ultra-rapid visual categorization is a model proposing an automatic feedforward mechanism that forms high-level object representations in parallel without focused attention. In this model, the mechanism cannot be sped up by training. Evidence for a feedforward mechanism can be found in studies that have shown that many neurons are already highly selective at the beginning of a visual response, thus suggesting that feedback mechanisms are not required for response selectivity to increase.[6] Furthermore, recent fMRI and ERP studies have shown that masked visual stimuli that participants do not consciously perceive can significantly modulate activity in the motor system, thus suggesting somewhat sophisticated visual processing.[7] VanRullen (2006) ran simulations showing that the feedforward propagation of one wave of spikes through high-level neurons, generated in response to a stimulus, could be enough for crude recognition and categorization that occurs in 150 ms or less.[8]
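The single-wave idea in VanRullen's simulations can be sketched in a few lines. This is not VanRullen's model itself; the layer sizes, random weights and threshold below are invented for illustration. The point it captures is purely architectural: activity propagates forward exactly once, each unit fires at most one spike, and a category readout is available with no feedback or iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative network: input layer -> two hidden layers -> 2
# category units. Weights are random; a trained network would differ.
layer_sizes = [64, 32, 16, 2]
weights = [rng.normal(0, 1 / np.sqrt(m), size=(n, m))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def feedforward_wave(stimulus, threshold=0.5):
    """Propagate one wave of spikes forward; no feedback, no second pass."""
    spikes = (stimulus > threshold).astype(float)   # at most one spike/unit
    for w in weights:
        # A unit spikes once if its summed input crosses threshold.
        spikes = (w @ spikes > threshold).astype(float)
    return spikes  # binary activity of the two category units

stimulus = rng.random(64)          # stand-in for a retinal input pattern
decision = feedforward_wave(stimulus)
```

Because the readout is computed in a single forward sweep, the latency of such a mechanism is bounded by the number of synaptic stages rather than by any iterative settling, which is the property used to argue that ~150 ms categorization is feasible.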
Xu & Chun (2009) propose the neural-object file theory, which posits that the human visual system initially selects a fixed number of roughly four objects from a crowded scene based on their spatial information (object individuation) before encoding their details (object identification).[9] Under this framework, object individuation is generally controlled by the inferior intra-parietal sulcus (IPS), while object identification involves the superior IPS and higher-level visual areas. At the object individuation stage, object representations are coarse and contain minimal feature information. However, once these object representations (or object-files, to use the theory's language) have been 'set up' during the object individuation stage they can be elaborated on over time during the object identification stage, during which additional featural and identity information is received.
The neural-object file theory deals with the issue of attention by proposing two different processing systems. One of them tracks the overall hierarchical structure of the visual display and is attention-free, while the other processes current objects of attentional selection. The current hypothesis is that the parahippocampal place area (PPA) plays a role in shifting visual attention to different parts of a scene and incorporating information from multiple frames in order to form an integrated representation of the scene.
The separation between object individuation and identification in the neural object-file theory is supported by evidence such as that from Xu and Chun's fMRI study (as cited in Xu & Chun, 2009). In this study, they examined posterior brain mechanisms that supported visual short-term memory (VSTM). The fMRI showed that representations in the inferior IPS were fixed to roughly four objects regardless of object complexity, but representations in the superior IPS and lateral occipital complex (LOC) varied according to complexity.[10]
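The two-stage structure of the theory can be expressed as a minimal sketch. The data layout, the `ObjectFile` class and the fixed capacity of 4 are illustrative assumptions standing in for the theory's claims, not an implementation of any published model. Stage 1 selects a small, complexity-independent set of objects by location; stage 2 fills in their features afterwards:

```python
from dataclasses import dataclass, field

CAPACITY = 4  # approximate individuation limit posited by the theory

@dataclass
class ObjectFile:
    location: tuple                     # coarse spatial index, set at stage 1
    features: dict = field(default_factory=dict)  # elaborated at stage 2

def individuate(scene_objects):
    """Stage 1: select up to ~4 objects by location, ignoring complexity."""
    return [ObjectFile(location=obj["location"])
            for obj in scene_objects[:CAPACITY]]

def identify(object_files, scene_objects):
    """Stage 2: elaborate the selected object files with feature detail."""
    by_location = {obj["location"]: obj for obj in scene_objects}
    for f in object_files:
        f.features.update(by_location[f.location].get("features", {}))
    return object_files

# A crowded "scene" of eight objects: only four object files get set up,
# mirroring the fixed inferior-IPS capacity, but each file's feature
# content can later grow with object complexity.
scene = [{"location": (x, 0), "features": {"color": "red", "shape": "disc"}}
         for x in range(8)]
files = identify(individuate(scene), scene)
```

The design choice worth noting is that capacity constrains only stage 1: the number of files is fixed, while the richness of each file's `features` dictionary is unbounded, paralleling the contrast between the fixed inferior IPS response and the complexity-dependent superior IPS/LOC response.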
See main article: Natural scene statistics.