2 Theoretical Background
2.1 Mind, Recreated: Simulations in Imagery, Memory, and Language
2.1.1 Mental imagery as a simulation
Mental imagery is the ability to construct mental representations in the absence of external sensory stimulation. Thus, it is a quasi-phenomenal experience (N. J. T. Thomas, 1999, 2018a). That is, it resembles the actual perceptual experience but occurs when the appropriate external stimulus is not there. The ability to see with the “mind’s eye” without any sensory stimulation is a remarkable feature of the human mind. Mental imagery underlies our ability to think, plan, re-analyse past events or even fantasise about events that may never happen (Pearson & Kosslyn, 2013). Accordingly, mental images involve, alter or even replace the core operations of human cognition such as memory (Albers, Kok, Toni, Dijkerman, & De Lange, 2013; Keogh & Pearson, 2011; Tong, 2013), problem-solving (Kozhevnikov, Motes, & Hegarty, 2007), decision-making (Tuan Pham, Meyvis, & Zhou, 2001), counter-factual thinking (Kulakova, Aichhorn, Schurz, Kronbichler, & Perner, 2013), reasoning (Hegarty, 2004; Knauff, Fangmeier, Ruff, & Johnson-Laird, 2003), numerical cognition (Dehaene, Bossini, & Giraux, 1993) and creativity (LeBoutillier & Marks, 2003; Palmiero et al., 2016). The human mind can mentally “visualise” not only visual but also nonvisual perceptions (Lacey & Lawson, 2013), such as auditory mental imagery (e.g., imagining the voice of a friend or a song) (Lima et al., 2015) or motor mental imagery (e.g., mentally rehearsing a movement before actualising it) (Hanakawa, 2016). Mental images can arise from nonvisual modalities (particularly auditory or haptic) in congenitally blind individuals (Cattaneo et al., 2008). However, the literature on mental imagery is largely dedicated to mental imagery that is specifically visual (Tye, 1991). How do we imagine? By extension, what does a mental image look like? The format of mental images was extensively debated in the 1970s and 1980s between two camps: pictorial (depictivism) and propositional (descriptivism) accounts of imagery. The pictorial position (Kosslyn, 1973) holds that mental images are like pictures and that there are spatial relations between the imagined objects. On the other hand, the propositional view (Pylyshyn, 1973) is that mental images are more like linguistic descriptions of visual scenes based on tacit knowledge about the world (i.e., implicit knowledge that is difficult to express explicitly, such as the ability to ride a bike). Mental imagery under the treatment of descriptivism is more of an amodal, formal system. Depictivism, by contrast, offers a picture of mental imagery that appears more compatible with the mechanics of grounded-embodied cognition. However, neither of these approaches captures the true essence of grounded-embodied cognition because both of them depend on an information processing approach. In both cases, perceptual data flows inward to a passive cognitive agent (N. J. T. Thomas, 1999). Grounded-embodied theories of cognition, on the other hand, conceive of mental imagery as based on active perceptions and actions. Mental images are considered mental representations reactivated from previous perceptions (Ballard et al., 1997). Consequently, mental imagery is itself a simulation (Barsalou, 1999). As a matter of fact, mental imagery is assumed to be the most typical example of the simulation mechanism in that there are certain similarities between the properties of a sensorimotor simulation and mental imagery (Markman, Klein, & Suhr, 2008): First, mental images arise from perceptual representations.
They are formed in the absence of the original perceptual stimulation. And lastly, a mental image is not an exact copy of the percept but rather a partial recreation (Kosslyn, 1980). Within this view, the primary function of mental imagery is to simulate reality “at will” in order to access previous knowledge and predict the future (i.e., mental emulation) (Moulton & Kosslyn, 2009). In order to verify that mental imagery is sensorimotor simulation, evidence showing similarities between perception and imagery is needed. This is indeed what the literature on mental imagery within the framework of grounded-embodied cognition indicates. For instance, an overwhelming body of neuroimaging evidence shows that similar brain regions are activated during perception and imagery (Cichy, Heinzle, & Haynes, 2012; Ganis, Thompson, & Kosslyn, 2004; Ishai & Sagi, 1995; Kosslyn, Thompson, & Alpert, 1997; O’Craven & Kanwisher, 2000). Behavioural studies further reveal the nature of the link between perception and imagery. As early as 1910, the psychologist Cheves Perky showed that visual mental images can unconsciously suppress the perception of real visual targets (i.e., the Perky effect) (Craver-Lemley & Reeves, 1992; Perky, 1910). In the original experiment, participants were asked to fixate a point on a white screen and visually imagine certain objects there, such as a tomato, a book or a pencil. After a few trials, a real but faint image (i.e., in soft focus) of the object in question was projected onto the screen. Participants failed to distinguish between their imagined projections and the real percepts. In short, the real images intermingled with the mental images. For instance, some participants reported their surprise when they “imagined” an upright banana rather than the horizontally oriented one they were attempting to imagine (N. J. T. Thomas, 2018b). The Perky effect indicates that mental imagery and visual perception draw on the same resources (see also Finke, 1980). In a similar study (Lloyd-Jones & Vernon, 2003), participants saw a word (e.g., “dog”) accompanied by a line drawing of that object in the perception phase. In the imagery phase, participants made spatial judgements about the previously shown picture. Simultaneously, a picture distractor appeared on the screen during mental imagery. The picture distractor was either unrelated to the mental image of the previously shown object (e.g., dog - strawberry) or conceptually related (e.g., dog - cat). Response times in the judgement task were longer when participants generated a mental picture while perceiving a conceptually related picture, but not a conceptually unrelated one. These findings suggest that imagery and visual perception share the same semantic representations. Mental images are also processed in ways similar to actual images. In Borst and Kosslyn (2008), participants scanned a pattern of dots and then an arrow was shown on the screen. Participants then decided whether the arrow pointed at a location that had been previously occupied by one of the dots. Results showed that the time to scan during imagery increased linearly as the distance between the arrow and the dots increased in perception. Further, participants who were better at scanning distances perceptually were also better at scanning distances across a mental image, suggesting a functional role of perception in mental imagery.
Finally, eye movement studies have given considerable support to the simulation account of mental imagery with two key findings (see Laeng, Bloem, D’Ascenzo, & Tommasi, 2014 for a review): First, eye movements during perception are similar to those during imagery (Brandt & Stark, 1997; Johansson, Holsanova, & Holmqvist, 2006). Second, the amount of overlap between eye movements during perception and imagery predicts the performance in imagery-related tasks (Laeng & Teodorescu, 2002). Eye movements in mental imagery are further elaborated in Chapter 2.2.3.
2.1.2 Memory retrieval as a simulation
A simulation account of memory views memory retrieval as a partial recreation of the past that often includes sensorimotor and contextual details of the original episode (see Buckner & Wheeler, 2001; Christophel, Klink, Spitzer, Roelfsema, & Haynes, 2017; Danker & Anderson, 2010; Kent & Lamberts, 2008; Pasternak & Greenlee, 2005; Rugg, Johnson, Park, & Uncapher, 2008; Xue, 2018 for comprehensive reviews and De Brigard, 2014; Mahr & Csibra, 2018; Marr, 1971 for theoretical discussions). Hence, memory retrieval can be thought of as a simulation of encoding in much the same way that mental imagery is a simulation of perception. Indeed, mental imagery and memory are known to operate on similar machinery as long as their perceptual modalities match (e.g., visual mental imagery - visual memory). In the original model of working memory, Baddeley and Hitch (1974) assumed that one function of the visuospatial sketchpad (i.e., the component of working memory responsible for the manipulation of visual information) is manipulating visual mental images. In support of this assumption, Baddeley and Andrade (2000) showed that visual and auditory mental imagery tasks disrupted the visual and auditory components of working memory respectively, that is, the visuospatial sketchpad and the phonological loop (i.e., the component of working memory responsible for the manipulation of auditory information). Keogh and Pearson (2014) showed that individuals with stronger visual mental imagery also have greater visual working memory capacity, but not greater verbal memory capacity (see also Keogh & Pearson, 2011). Grounded-embodied cognition takes the link between mental imagery and memory one step further: memory not only involves mental imagery, memory is mental imagery itself. On this view, encoding corresponds to perception and retrieval corresponds to imagery. In this respect, Albers et al. (2013) presented strong evidence that working memory and mental imagery share representations in the early visual cortex (V1 - V3). Further, as Buckner and Wheeler (2001) noted, “assessments of visual mental imagery ability in patients with damage to visual cortex support the possibility that brain regions involved in perception are also used during imagery and remembering” (De Renzi & Spinnler, 1967; D. N. Levine, Warach, & Farah, 1985). Mental time travel is a striking example of the role of imagery in memory (Corballis, 2009; Schacter, Addis, & Buckner, 2007; Suddendorf & Corballis, 2007; Szpunar, 2010). Mental time travel is the cognitive ability to engage in episodic memory (i.e., conscious and explicit recollection of past events) and episodic future thinking through mental imagery. Thus, a mental time traveller can mentally project herself backwards in time to re-live (i.e., reconstruct) past events and forwards in time to pre-live (i.e., predict) possible future events (Suddendorf & Corballis, 2007). In this respect, mental time travel can be considered an intertemporal simulation (Shanton & Goldman, 2010). Growing evidence has shown that episodic memory and simulation of the future through mental imagery share a core neural network (i.e., the default network) (see Schacter et al., 2012 for a review), suggesting that memory, mental imagery and thinking about the future rest on similar neural mechanisms. As in mental imagery, a simulation approach to memory underlines the correspondence between encoding and retrieval (Kent & Lamberts, 2008).
Mounting evidence illustrates that common neural systems are activated both in encoding and retrieval (Nyberg, Habib, McIntosh, & Tulving, 2000; Wheeler, Petersen, & Buckner, 2000). Crucially, the similarity between neural patterns during encoding and retrieval is often predictive of how well an experience is subsequently remembered (see Brewer, Zhao, Glover, & Gabrieli, 1998; Wagner et al., 1998 for reviews). There is much evidence indicating that reinstated neural activations are specific to perceptual modality (visual vs. auditory), domain (memory for what vs. where) and feature (colour, motion or spatial location) (see Slotnick, 2004 for a review). For example, Wheeler, Petersen and Buckner (2000) gave participants a set of picture and sound items to study, followed by a recall test during which participants vividly remembered these items. Results demonstrated that regions of auditory and visual cortex were differentially activated during the retrieval of sounds and pictures. In a similar fashion, Goldberg, Perfetti and Schneider (2006b) asked participants whether a concrete word possesses a property from one of four sensory modalities: colour (e.g., green), sound (e.g., loud), touch (e.g., soft) or taste (e.g., sweet). Retrieval from semantic memory involving flavour knowledge, as in the word “sweet”, increased specific activation in the left orbitofrontal cortex, which is known to process semantic comparisons among edible items (Goldberg, Perfetti, & Schneider, 2006a). A number of studies have supported a simulation account of memory, with retrieval dependent on perception, by showing temporal overlaps between encoding and retrieval (Kent & Lamberts, 2008). As far as ERP evidence shows, there is no strict temporal regularity between retrieval and encoding (Allan, Robb, & Rugg, 2000). However, better memory performance was found in serial recall when retrieval direction (forward vs. backward) matched the order in which the words were encoded in the first place (J. G. Thomas, Milner, & Hanerlandt, 2003). More direct evidence for temporal similarity between encoding and retrieval comes from Kent and Lamberts (2006). Participants were instructed to retrieve different dimensions of faces such as eye colour, nose shape and mouth expression. Results revealed that features that were quickly perceived were also quickly retrieved. In addition to the findings from the abovementioned research areas, the long-established phenomena of state-dependent and context-dependent memory show that memory retrieval is a simulation of the original event. An overlap between the internal state (e.g., mood, state of consciousness) or external context of the individual during encoding and retrieval leads to higher retrieval efficiency (S. M. Smith & Vela, 2001; Ucros, 1989). In one such study, Dijkstra, Kaschak and Zwaan (2007) documented faster retrieval when body positions and actions during the retrieval of autobiographical events were similar to the body positions and actions in the original events than when they were incongruent. For example, participants were faster to remember how old they were at a concert if they were instructed to sit up straight in the chair and clap their hands several times during retrieval. In another intriguing study (Casasanto & Dijkstra, 2010), participants were instructed to recount autobiographical memories of either positive or negative valence while moving marbles either upward or downward, an apparently meaningless action.
However, retrieval was faster when the direction of movement was metaphorically congruent with the valence of the emotional memory (i.e., upward for positive and downward for negative memories). Lastly, eye movements provide plentiful evidence that retrieval is a perceptual recreation of encoding (D. C. Richardson & Spivey, 2000; Spivey & Geng, 2000) and, further, that these simulations usually predict the success of retrieval (Johansson & Johansson, 2014; Scholz et al., 2016, 2018). Eye movements in memory simulations are further elaborated in Chapter 2.2.3.
2.1.3 Simulations in language
Language is one of the most influential domains in showing the centrality of simulations in human cognition. The claim of the simulation view of language is simple: “Meaning centrally involves the activation of perceptual, motor, social, and affective knowledge that characterizes the content of utterances” (Bergen, 2007, pp. 277-278). Thus, a simulation mechanism is essential for comprehending and remembering language. Switch-cost effects are a clear demonstration of perceptual and affective (re)activation in language. In this paradigm, participants are asked to verify whether a concept (e.g., “blender”) has a property in a particular target modality (e.g., “loud” in the auditory modality). The effect is that participants are slower to verify a property in one perceptual modality (e.g., “blender” can be loud - auditory modality) after verifying a property in a different modality (e.g., “cranberries” can be tart - gustatory modality) than after verifying a property in the same modality (e.g., “leaves” can rustle - auditory modality) (Pecher, Zeelenberg, & Barsalou, 2003). A switch cost also occurs between properties with positive and negative valence (e.g., “couple” can be happy, and “orphan” can be hopeless) (Vermeulen, Niedenthal, & Luminet, 2007) and at the sentence level (e.g., “A cellar is dark” in the visual modality - “A mitten is soft” in the tactile modality) (Hald, Marshall, Janssen, & Garnham, 2011). Similar switching costs occur when participants switch between actual modalities in perceptual tasks (Masson, 2015). Thus, the findings reviewed above support the claim that language is rooted in perception and that language comprehension can activate these perceptions. Importantly, the same priming effect was not elicited when participants verified semantically associated properties (e.g., “sheet” can be spotless, and “air” can be clean) as opposed to unassociated properties (e.g., “sheet” can be spotless, and “meal” can be cheap) (Pecher et al., 2003). This finding rules out the alternative, computational hypothesis that properties across all modalities are stored together in a single, amodal system of knowledge. Rather, it supports the perceptual roots of language processing and language-based simulations.
2.1.3.1 Mental simulations and situation models
Simulations triggered by language are slightly different from the sensorimotor simulations that have been covered so far. Sensorimotor simulations in mental imagery and memory rely on actual sensorimotor experiences (e.g., playing a piano or perceptually encoding an episode). They take place in an offline manner, that is, when the agent “needs” to access perceptual/conceptual information in the absence of the original stimulus. Language-based simulations, in contrast, are activated upon perceiving linguistic stimuli in an online manner. The subject (re)creates perceptual, motor, affective, introspective and bodily states not by actually experiencing them but through linguistic descriptions. Further, language can give rise to simulations of several abstract conceptualisations that go beyond these states. This type of simulation is usually referred to as a mental simulation (Zwaan, 1999). Mental simulations can extend into and affect subsequent perceptual/conceptual processing and memory retrieval (discussed below). It is reasonable to assume that online mental simulation evoked by language and offline simulation in memory and mental imagery share some common architecture. After all, both types of simulations originate from perceptual, motor, affective, introspective and bodily states. That said, the substantial difference between offline and online simulation is conscious effort. Mental simulations based on language are assumed to be inherently involved in language comprehension and are thus triggered automatically and unconsciously (Zwaan & Pecher, 2012). Offline sensorimotor simulation in memory and mental imagery, by contrast, is often a consequence of effortful, resource-consuming and conscious processes, as are memory and mental imagery themselves. In line with this, there is little to no evidence that mental simulation is correlated with the strength of mental imagery (Zwaan & Pecher, 2012). The idea of mental simulation via language stems from the discovery of mirror neurons (Caggiano et al., 1996; Gallese et al., 1996). Mirror neurons are activated in motor regions of the brain by merely observing others executing motor actions (Hari et al., 1998). In a similar fashion, neural correlates were found between the content of what is being read and the activated areas in the brain (see Hauk & Tschentscher, 2013; Binkofski, 2010; Pulvermüller, 2005 for exhaustive reviews and Jirak, Menz, Buccino, & Borghi for a meta-analysis). In a pioneering study (Hauk, Johnsrude, & Pulvermüller, 2004), participants saw action words referring to the face, arm and leg (e.g., lick, pick and kick) in a passive reading task and then moved the corresponding extremities (i.e., left or right foot, left or right index finger, or tongue). Results showed that reading action verbs activates somatotopic brain regions (i.e., regions corresponding to specific parts of the body) that are involved in the actual movements (see also Buccino et al., 2005). For example, reading the word “kick” or “pick” invokes activation in the specific regions of motor and premotor cortex that control the execution of leg and arm movements respectively. Critically, several fMRI (functional magnetic resonance imaging) studies showed that not only concrete words but also idiomatic expressions involving action words (e.g., “John grasped the idea” or “Pablo kicked the habit”) (see Yang & Shu, 2016 for a review) and counterfactual statements (e.g., “if Mary had cleaned the room, she would have moved the sofa”) (de Vega et al., 2014) elicit similar somatotopic activation in the brain.
In addition to action words, words in different perceptual modalities activate brain regions associated with the concerned modalities as well. For example, reading odour-related words such as “cinnamon”, “garlic” or “jasmine” triggers activations in the primary olfactory cortex, the brain region involved in the sensation of smells (González et al., 2006). Language-based simulations go beyond the recreation of perceptual and motor experiences. It is well documented that reading narratives can form situation models (mental models) in the minds of readers (e.g., Speer, Reynolds, Swallow, & Zacks, 2009). Situation models are integrated, situational mental representations of the characters, objects and events that are described in a narrative (Johnson‐Laird, 1983; Kintsch & van Dijk, 1978). They allow readers to imagine themselves in the story by taking the perspective of the protagonist (e.g., Avraamides, 2003). Consequently, situation models give rise to simulations of perceptual, motor and affective states and also of abstract structures such as time, speed, space, goals and causation (Speed & Vigliocco, 2016; Zwaan, 1999; Zwaan & Radvansky, 1998). For instance, Zwaan, Stanfield and Yaxley (2002) showed that language comprehenders simulate what the objects described by language look like. In their study, participants read sentences describing an animal or an object in a certain location (e.g., an egg in a carton vs. an egg in a pan). The shape of the objects thus changed as a function of their location, but the shape was only implied by the sentences (e.g., “The egg is in the carton.” implies a whole egg). Even so, a line drawing of the object matching the shape implied by the previous sentence (e.g., a drawing of a whole egg) improved participants’ performance in retrieving the sentences. Similar results were demonstrated for sentences that imply orientation (e.g., vertical - horizontal) (D. C. Richardson, Spivey, Barsalou, & McRae, 2003), rotation (Wassenburg & Zwaan, 2010), size (de Koning, Wassenburg, Bos, & Van der Schoot, 2017), colour (Zwaan & Pecher, 2012), visibility (Yaxley & Zwaan, 2007), distance (Vukovic & Williams, 2014) and number (Patson, George, & Warren, 2014). Language can activate simulations of more abstract structures in the same manner. Simulation of time, in particular, is well documented. For instance, a longer chronological distance between two consecutively narrated story events, denoted with “an hour later” as compared to “a moment later”, leads to longer reading times (Zwaan, 1996). Reading times measured with eye movements were also shown to be longer when reading “slow” verbs (e.g., amble) than “fast” verbs (e.g., dash) (Speed & Vigliocco, 2014). Similarly, Coll-Florit and Gennari (2011) found that judging the sensicality of sentences describing durative states (e.g., “to admire a famous writer”) took longer than for non-durative states (e.g., “to run into a famous writer”). Several other abstractions can be mentally simulated in the reader’s mind. In one experiment, participants could access the concept of “cake” more easily when they had previously read a sentence in which a cake was actually present (“Mary baked cookies and cake”) than when it was not (“Mary baked cookies but no cake”) (MacDonald & Just, 1989). In another experiment, participants simulated the protagonist’s thoughts and remembered and forgot what the character in the story remembered and forgot (Gunraj, Upadhyay, Houghton, Westerman, & Klin, 2017).
In Scherer, Banse, Wallbott and Goldbeck (1991), participants simulated the intended emotions that were cued in characters’ voices. Mental simulations via language and situation models play important roles in numerous cognitive tasks transcending language comprehension. Most importantly, simulations are involved in memory for language. Johansson, Oren and Holmqvist (2018) reported that eye movements on a blank screen while participants were remembering a narrative reflected the layout of the scenes described in the text rather than the layout of the text itself. Zwaan and Radvansky (1998) assumed that successful retrieval of what is comprehended would necessarily involve the retrieval of simulations. In accordance with this assumption, there is evidence that the ability to restructure situation models has beneficial effects on memory performance (Garnham, 1981; Magliano, Radvansky, & Copeland, 2012).
2.1.3.2 Simulation of space with language
Space has a privileged status in human cognition. Coslett (1999) argues that the representation of space in the mind has a fundamental evolutionary advantage because information about the location of objects in the environment is essential for sustenance and avoiding danger. A large body of evidence indicates that young children show sensitivity to spatial concepts and properties starting in infancy (e.g., Aguiar & Baillargeon, 2002; Casasola, 2008; Frick & Möhring, 2013; Hespos & Rochat, 1997; McKenzie, Slater, Tremellen, & McAlpin, 1993; Örnkloo & Von Hofsten, 2007; Wishart & Bower, 1982). There is also evidence suggesting that the development of spatial cognition forms the foundation for subsequent cognitive structures such as mathematical aptitude (Lauer & Lourenco, 2016), creativity (Kell, Lubinski, Benbow, & Steiger, 2013) and, notably, language (Levinson, 1992; Piaget & Inhelder, 1969). As a result, there is good reason to assume that language and space are inherently interconnected through the course of cognitive development (e.g., Casasola, 2005; Haun, Rapold, Janzen, & Levinson, 2011; Hespos & Spelke, 2004). People use language when describing space, and spatial language schematises space by selecting certain aspects of a scene while ignoring other aspects (Talmy, 1983). For instance, “across” conveys the information that the thing doing the crossing is smaller than the thing that is being crossed (Tversky & Lee, 1998). However, it does not contain any information about the distance between these things or their shapes. Thereby, language forms spatial representations in the mind (H. A. Taylor & Tversky, 1992). On the other hand, space provides a rich canvas for representing abstraction. Many abstract conceptualisations such as time (Boroditsky & Ramscar, 2002), valence (Meyer & Robinson, 2004), power (Zanolie et al., 2012), numerical magnitude (Dehaene et al., 1993), happiness (Damjanovic & Santiago, 2016), divinity (Chasteen, Burdzy, & Pratt, 2010), health (Leitan, Williams, & Murray, 2015) and self-esteem (J. E. T. Taylor, Lam, Chasteen, & Pratt, 2015) are understood with reference to space (e.g., “powerful is up”, “more is up”, “happy is up” etc.). Further, space constrains the use of language with gestures and in sign language (Emmorey, 2001; Emmorey, Tversky, & Taylor, 2000). In support of this, both brain imaging (Carpenter, Just, Keller, Eddy, & Thulborn, 1999) and behavioural (Hayward & Tarr, 1995) evidence indicate that there are similarities between spatial and linguistic representations. Given the central position of space in the human mind, as briefly discussed above, and the intrinsic links between language and space, spatial simulations in language deserve particular attention. Reading narratives can activate simulations of the spatial descriptions in a text through situation models. For instance, objects that are described as close to a protagonist in a narrative are accessed faster than objects described as more distant (Glenberg, Meyer, & Lindem, 1987; Morrow, Greenspan, & Bower, 1987). In seminal work, Franklin and Tversky (1990) showed that situation models of space derived from text are similar to representations of spatial experiences in the real world and, notably, have bodily constraints. Participants in the study read descriptions of scenes and the objects in them. They were then asked to remember and locate certain objects in a three-dimensional environment.
Results showed that objects on the vertical (i.e., head-feet) axis were retrieved faster than objects on the horizontal (i.e., left-right) and sagittal (i.e., front-back) axes. The findings indicate that space in language is simulated from an ego-centric perspective rather than an allocentric (i.e., object-centred) or a mental transformation perspective. If the participants had taken an allocentric perspective, as in inspecting a picture (in which the subject is not immersed in the environment), all directions would have been equally accessible. On the other hand, if they had mentally transformed the described environments, response times would have varied as a function of the mental movement needed to inspect each location. Accordingly, response times would have been shortest for objects in front of the subject, and accessibility would have decreased with angular disparity from the front. Objects behind the subject, for example, would have been the most difficult to access. The bias for objects on the vertical dimension suggests that the simulation of space with language is body-based. As Franklin and Tversky (1990, p. 64) discuss, the dominant position of a person interacting with the environment is upright for a number of reasons: First, the perceptual world of the observer can be described by one vertical and two horizontal dimensions (i.e., left/right and front/back). Second, the vertical dimension is correlated with gravity, which is an important asymmetric factor in perceiving spatial relations. Thus, vertical spatial relations generally remain constant with respect to the observer. Third, the ground and the sky present stationary reference points on the vertical axis. Horizontal spatial relations, on the other hand, change frequently. Thus, the horizontal dimension depends on more arbitrary reference points, such as the prominent dimensions of the observer’s own body. In another experiment using a similar methodology (Avraamides, 2003), it was demonstrated that simulated ego-centric positions are not static but can be automatically updated whenever the reader/protagonist moves in the text, suggesting a motor basis of language. In a recognition memory task, Levine and Klin (2001) showed that a story character’s current location was more active in the reader’s memory than her/his previous location (see Gunraj et al., 2017). Further, such spatial simulations remained highly accessible even several sentences after the last mention, indicating their robustness. There are stable representational mappings between language and space at the sentence level as well. Richardson, Spivey, Edelman and Naples (2001) asked participants to read sentences involving concrete and abstract action verbs (e.g., lifted, offended). They were then asked to associate diagrams illustrating motion along the horizontal (left-right) and the vertical (up-down) axis with the sentences depicting motion events. Substantial agreement was found between participants in their preferences for diagrams for both concrete and abstract verbs within action sentences. For example, participants tended to attach a horizontal image schema to “push” and a vertical image schema to “respect”. In a later study, it was shown that the spatial simulation triggered by a verb affects other forms of spatial processing along the same axis, both in a visual discrimination and a picture memory task (D. C. Richardson et al., 2003).
Spatial simulations interfered with visual discrimination on the congruent axis and impaired performance; however, memory performance was facilitated when the picture to be remembered and the simulated orientation matched (see “Effects of mental simulations” below). The effect was shown for both concrete and abstract verbs. Not only orientation but also upward and downward motion on the vertical axis is simulated via language. In one study (Bergen, Lindsay, Matlock, & Narayanan, 2007), subject nouns and main verbs related to up and down locations interfered with visual processing in the same location. However, the effect was shown in literal sentences implying real space (e.g., “The ceiling cracked” – an upward location for the subject noun, “The mule climbed” – upward movement for the main verb) but not in sentences implying metaphorical space (e.g., “The prices rose”). Bergen et al. (2007) argue that the comprehension of the sentence as a whole, and not simply lexical associations, yields spatial simulations. However, there is evidence that single words can also trigger simulations of space. Several abstract nouns such as “tyrant” (up) and “slave” (down) invoke simulations of metaphorical spatial locations (e.g., Giessner & Schubert, 2007). There are numerous common nouns in language, such as “bird” (up) and “worm” (down), which are associated with actual spatial locations (i.e., spatial iconicity). Words denoting spatial locations simulate perceptions of these locations in space. In Zwaan and Yaxley (2003), participants were presented with word pairs with spatial associations (e.g., “attic” - “basement”) and asked to decide whether the words were semantically related. Results showed that word pairs in a reverse-iconic arrangement (i.e., “basement” above “attic”) were judged more slowly than word pairs in an iconic arrangement (i.e., “attic” above “basement”). In a similar fashion, it was shown that reading words referring to objects that occur in higher or lower positions in the visual field (e.g., head and foot) hinders the identification of visual targets at the top or bottom of the display (Estes, Verges, & Barsalou, 2008).
2.1.3.3 Effects of mental simulations
Simulation-based language understanding leads to two main effects on simultaneous or subsequent visual/conceptual processing: compatibility and interference (see Fischer & Zwaan, 2008 for a review). The underlying idea is that if understanding an utterance involves the activation of perceptual, affective and motor representations, then perceptions, emotions and actions that are congruent with the content of the utterance should facilitate visual/conceptual processing, and vice versa (Bergen, 2007). For example, the action-sentence compatibility effect demonstrates compatibility/interference resulting from motor simulations in language. In the study introducing the effect for the first time (Glenberg & Kaschak, 2002), participants were presented with sensible and non-sensible sentences (e.g., “Boil the air”) and were asked to judge whether the sentences made sense. Sensible sentences implied actions either toward the body (e.g., “Open the drawer”) or away from the body (e.g., “Close the drawer”). The response button for identifying the sentence as sensible (i.e., the yes button) was either near to or far from the participants’ bodies. Results showed that when the implied direction of the sentence and the actual action to press the button matched, participants were faster to judge the sensibility of the sentences. For example, the sentence “Open the drawer” was processed faster when participants reached for the yes button near them, an action that is comparable to opening a drawer. The effect was found not only for imperatives but also for descriptive sentences (“Andy delivered the pizza to you” - toward sentence / “You delivered the pizza to Andy” - away sentence). Notably, sentences describing abstract transfers (“Liz told you the story” - toward sentence / “You told Liz the story” - away sentence) elicited an action-sentence compatibility effect as well. The action-sentence compatibility effect extends to sign language, suggesting that the motor system is involved in the comprehension of a visual-manual language as well (Secora & Emmorey, 2015). Notably, the congruency effect was found relative to the verb’s semantics (e.g., “You throw a ball” - away), not relative to the actual motion executed by the signer and perceived by the participant (e.g., “You throw a ball” - toward). That said, there are meta-analyses and experimental studies arguing that the action-sentence compatibility effect is generally weak (Papesh, 2015; but see Zwaan, van der Stoep, Guadalupe, & Bouwmeester, 2012) or highly task-dependent (Borreggine & Kaschak, 2006). In sum, the current status of the literature suggests that the factors modulating the action-sentence compatibility effect and, more generally, the effects of language-based simulations should be further specified. Simulations can also interfere with language comprehension, which results in a “mismatch advantage”. For example, Kaschak et al. (2005) demonstrated that participants judge the feasibility of motion sentences (e.g., “The horse ran away from you”) faster when they simultaneously view visual displays depicting motion in the direction opposite to the action described in the sentence (e.g., a spiral moving towards the centre). They concluded that visual processing and action simulation during language comprehension engage the same neural circuits, which in turn leads to a mismatch advantage. Connell (2007) demonstrated a mismatch advantage in the simulation of colour with language.
Participants read sentences involving an object which can occur in different colours (e.g., meat can be red when raw and brown when cooked). They were then presented with pictures of objects and had to decide whether the pictured object had appeared in the preceding sentence. The colour of the objects sometimes matched the descriptions in the sentences (e.g., “John looked at a steak in the butcher’s window” - red steak) and sometimes did not (e.g., “John looked at a steak in the butcher’s window” - brown steak). Responses were faster when the colour of the object mismatched the colour implied by the previous sentence. Why do some studies show a congruency advantage and others an incongruency advantage? This is an important question within the context of the present thesis (see Chapter 6). Kaschak et al. (2005) argue that there are two factors determining a match or mismatch advantage in language-based simulations: (1) the temporal distance between the perceptual stimulus and the verbal stimulus to be processed, and (2) the extent to which the perceptual stimulus can be integrated into the simulation activated by the content of the sentence. In support of the temporal distance assumption, Borreggine and Kaschak (2006) found that the action-sentence compatibility effect arises only when individuals have enough time to plan their motor response as they process the sentence. According to Kaschak et al. (2005), if the verbal information must be processed simultaneously with the perceptual information, a congruency or incongruency advantage may occur, depending on whether the linguistic information and the perceptual stimulus can be integrated. To be more specific, a congruency advantage is expected if the linguistic and visual stimuli are comparable, such as reading the sentence “The egg is in the carton” and seeing a line drawing of a whole egg (Zwaan et al., 2002). However, distinct perceptual and linguistic stimuli, such as reading the sentence “The horse ran away from you” and seeing a spiral moving towards the centre or away from it (Kaschak et al., 2005), result in an incongruency advantage (see also Meteyard, Zokaei, Bahrami, & Vigliocco, 2008).
2.2 I Look, Therefore I Remember: Eye Movements and Memory
2.2.1 Eye movements and eye tracking
The eyes do not move in a smooth fashion when engaged in visual tasks (Huey, 1908). If you were able to see your gaze on the page or on the digital screen right now, you would notice that your eyes shift from one word to the next as you read this sentence. Known as saccades, these “jumps” are rapid, short and repeated ballistic (i.e., jerk-like) movements which occur approximately three to four times every second. Saccades abruptly change the point of fixation; fixations are the periods of eye immobility in which visual or semantic information is acquired and processed (Purves, Augustine, & Fitzpatrick, 2001; D. C. Richardson & Spivey, 2004). In simple terms, individuals internalise the visual world during fixations that are executed between saccades (Bridgeman, Van der Heijden, & Velichkovsky, 1994; Simons & Rensink, 2005). Eye movements are fundamental to visual perception because the visual system cannot process the huge amount of available information in the visual world at once. The execution of eye movements thus allows us to see the world as a seamless whole, although we can only see one region at a time (Buswell, 1936; Yarbus, 1967) due to anatomical limitations (i.e., the total visual field that the human eye covers) and limited processing resources (Levi, Klein, & Aitsebaomo, 1985; D. C. Richardson, Dale, & Spivey, 2007). Fixations have two elemental measures: location and duration. Both measures are highly informative of ongoing cognitive operations. We can see a stimulus clearly only when it falls on the most sensitive area of the retina (i.e., the fovea) (~2° or 3 to 6 letter spaces), which is specialised for high-acuity visual perception (Mast & Kosslyn, 2002; Yarbus, 1967). Thus, eye position (i.e., fixation location) gives valuable information about the location of the attentional “spotlight” (Posner, Snyder, & Davidson, 1980). In other words, fixation location corresponds to the spatial locus of cognitive processing. Fixation duration, on the other hand, corresponds to the duration of cognitive processing of the material located at fixation (Irwin, 2004, p. 2). Longer fixations suggest higher cognitive load or higher attentional processing demands required by a material or task (Irwin, 2004). The idea underlying the link between cognition and fixation is known as the eye-mind assumption (Just & Carpenter, 1980), which simply posits that the “direction of our eyes indicates the content of the mind” (Underwood & Everatt, 1992). Based on the location and duration of fixations, cognitive processes can be measured and evaluated objectively and precisely while the process in question unfolds. There is now a broad consensus on the value of eye movements and eye tracking as a methodology in the investigation of the human mind (e.g., Hyona, Radach, & Deubel, 2003; Just & Carpenter, 1980; Rayner, 1998; Rayner, Pollatsek, Ashby, & Clifton, 2012; Reichle, Pollatsek, Fisher, & Rayner, 1998; Theeuwes, Belopolsky, & Olivers, 2009; Van der Stigchel et al., 2006). Eye tracking methodology provides detailed measures of the temporal order of fixations and saccades, gaze direction, pupil size and time spent on pre-defined regions of the scene. Fixation duration in a certain location relative to other locations is used as the main measure of looking behaviour in the present thesis. Eye movements can be monitored in various ways.
The pupil-corneal reflection technique, which is based on high-speed cameras and near-infrared light, is currently the most advanced remote and non-intrusive eye tracking method. An illuminator shines dispersed infrared light onto one or both eyes. A high-speed video camera captures the infrared reflections coming from the pupil and the cornea (i.e., the outer layer of the eye) and transforms them into high-resolution images and patterns pertaining to the position of the eye(s) at any given millisecond. Such an infrared eye tracker can record eye movements quite precisely. The precision offered by an eye tracker is indicated by its temporal resolution (i.e., sampling rate) and spatial resolution. The sampling rate is the frequency with which a tracker samples and determines the position of the eye. For example, the eye tracker used in the present thesis (i.e., SR EyeLink 1000) operates at a sampling rate of 1000 Hz, which means that the position of the eye is measured 1000 times every second. Put differently, it produces one sample of the eye position per millisecond. Spatial resolution refers to the angular distance between successive samples of eye position. Thus, an eye tracker with a higher spatial resolution can detect even the smallest eye movements in a certain interest area. The SR EyeLink 1000 has a spatial resolution of 0.25°-0.50°, which means that it can detect and sample eye movements within an angular distance of 0.25°-0.50°. There is generally a spatial difference between the calculated location of a fixation and the actual one. This difference is expressed in degrees of visual angle and reflects the accuracy of eye tracking. If you draw a straight line from the eye to the actual fixation point on the screen and another line to the computed one, the angle between these lines gives the accuracy. Thus, a smaller difference means higher accuracy. Accuracy depends on the screen size and the distance between the participant and the screen. Visual angle is also used to specify the size of the experimental stimulus, as it refers to the perceived size rather than the physical size. These measures of data quality are reported in the methods section of each experiment in accordance with eye tracking standards and good practices in the literature (Blignaut & Wium, 2014; Holmqvist, Nyström, & Mulvey, 2012; D. C. Richardson & Spivey, 2004).
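To make the geometry behind these quantities concrete, the conversion between on-screen distances and degrees of visual angle can be sketched as follows. This is a minimal illustration rather than the EyeLink software’s own computation; the display width, resolution, viewing distance and pixel offsets are hypothetical example values.

```python
import math

def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by an on-screen extent of size_cm
    when viewed from distance_cm away."""
    return math.degrees(2 * math.atan((size_cm / 2) / distance_cm))

def cm_per_pixel(screen_width_cm: float, screen_width_px: int) -> float:
    """Physical width of one pixel, assuming square pixels."""
    return screen_width_cm / screen_width_px

# Hypothetical setup: a 53 cm wide display, 1920 px across, viewed from 60 cm.
px_size = cm_per_pixel(53.0, 1920)

# Accuracy: angular offset between the reported and the true fixation location,
# here a hypothetical 15-pixel error on the screen.
error_px = 15
accuracy = visual_angle_deg(error_px * px_size, 60.0)
print(f"A calibration error of {error_px} px corresponds to about {accuracy:.2f} degrees")

# Stimulus size: how many pixels are needed for a stimulus to subtend 2 degrees
# (roughly the foveal region mentioned above)?
target_deg = 2.0
target_cm = 2 * 60.0 * math.tan(math.radians(target_deg / 2))
print(f"A 2-degree stimulus is about {target_cm / px_size:.0f} px wide at 60 cm")
```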
2.2.2 Investigating memory with eye movements
The role of eye movements in evidently visual tasks and processes such as visual perception (Noton & Stark, 1971), reading (Rayner, 1998), visuospatial memory (Irwin & Zelinsky, 2002), visual search (Rayner, 2009) and visuospatial attention (Van der Stigchel et al., 2006) has been widely investigated for many decades and is very well documented. Eye movements have recently emerged as an alternative means in memory research, complementing behavioural, end-state measures (e.g., hit rate, hit latency) (Lockhart, 2000) and brain-imaging studies (Fiser et al., 2016; Gabrieli, 1998; Rugg & Yonelinas, 2003). It has been known for a long time that the previous experience and knowledge of the observer can govern eye movements in addition to the physical properties of the scene and stimulus. For example, many early studies reported that human observers tend to look at areas of a picture which are relatively more informative to them. Importantly, the informativeness of a region is modulated by the participants’ prior knowledge in long-term memory (Antes, 1974; Kaufman & Richards, 1969; Mackworth & Morandi, 1967; Parker, 1978; Zusne & Michels, 1964). Similarly, Althoff and Cohen (1999) reported that previous exposure to a face changes viewing behaviour and thus eye movements. In their study, different patterns of eye movements emerged when participants viewed famous versus non-famous faces in recognition, fame rating and emotion labelling tasks. Participants made fewer fixations and fixation durations were shorter when viewing famous faces (now known as a repetition effect), which suggests lower cognitive load in processing previously experienced stimuli that can be retrieved from memory. Ryan, Althoff, Whitlow and Cohen (2000) took a similar approach: Participants viewed a set of real-world images under three conditions: novel (i.e., seen once during the experiment), repeated (i.e., seen once in each block of the experiment) or manipulated (i.e., seen once in original form in the first two blocks and then seen in a slightly changed form in the final block). Participants made fewer fixations and sampled fewer regions when viewing repeated and manipulated scenes compared to novel scenes (i.e., the repetition effect). The repetition effect speaks to the link between the stability of mental representations and memory-guided eye movements. To illustrate, in Heisz and Shore (2008), the number of fixations gradually decreased with the number of exposures to unfamiliar faces during a task. There was also evidence for another memory-driven eye movement behaviour known as the relational manipulation effect: a higher proportion of total fixation time was dedicated to the manipulated regions in the scenes compared with repeated or novel scenes. Further, participants made more transitions into and out of the changed regions of the manipulated scenes than in unchanged (matched) regions of the repeated scenes. Similar paradigms based on eye movements were also used to study memory in non-human primates (Sobotka, Nowicka, & Ringo, 1997), infants (Richmond, Zhao, & Burns, 2015; Richmond & Nelson, 2009) and special populations. For example, Ryan et al. (2000, Experiment 4) did not observe any difference in looking patterns between amnesic patients with severe memory deficits and a control group when both were viewing the repeated images.
However, amnesic patients did not look longer at the altered regions when viewing manipulated images, suggesting that amnesia disrupts relational memory, i.e., memory for the relations among the constituent elements of an experience. Likewise, in Niendam, Carter and Ragland (2010), patients with schizophrenia failed to detect image manipulations; this impairment was evident in their eye movements even though it was not apparent in behavioural results. The studies reviewed above suggest the relevance of eye movements to memory and, importantly, the advantages of eye tracking over behavioural, response-based methodologies. (1) Memory-guided eye movements are mostly obligatory, that is, they cannot be controlled. For instance, the repetition effect reviewed above occurs regardless of the instruction (i.e., whether participants are told just to study all items for later, are explicitly told to pick out the familiar item, or are told to avoid looking at the familiar item) (Ryan, Hannula, & Cohen, 2007, pp. 522-523). (2) Individuals launch memory-guided eye movements whether exposure comes from short-term memory (i.e., within the experiment) or from long-term memory (i.e., prior to the experiment). (3) Memory-guided eye movements precede conscious recall. As stated by Hannula et al. (2010), “eye movements can reveal memory for elements of previous experience without appealing to verbal reports and without requiring conscious recollection” (see Spering & Carrasco, 2015 for a comprehensive review; but see Smith, Hopkins, & Squire, 2006). For instance, the repetition effect occurs as early as the very first fixation on the item and thus prior to the behavioural recognition response (Ryan et al., 2007). Similarly, in Henderson and Hollingworth (2003), gaze durations were reliably longer for manipulated scenes although participants failed to detect the changes explicitly. To conclude, studies making use of eye movements are highly promising as a methodology. They can provide unique information about memory processes, which complements overt behavioural measures and brain imaging (e.g., Hannula & Ranganath, 2009). In fact, eye movements are so representative of memory that mathematical models are able to predict the task a person is engaged in (e.g., scene memorisation) from their eye movements using pattern classification (Henderson, Shinkareva, Wang, Luke, & Olejarczyk, 2013). It should also be noted that eye movement measures of memory are not limited to fixation measures or saccadic trajectories. Variation in pupil size (e.g., pupil dilation) and blinks have been used to probe the ongoing processes during retrieval (Goldinger & Papesh, 2012; Heaver & Hutton, 2011; Kahneman & Beatty, 1966; Mill, O’Connor, & Dobbins, 2016; Otero, Weekes, & Hutton, 2011; Siegle, Ichikawa, & Steinhauer, 2008; Van Gerven, Paas, Van Merriënboer, & Schmidt, 2004; Vo et al., 2008). A well-established finding is that the pupil dilates as retrieval becomes cognitively challenging (Goldinger & Papesh, 2012; Kucewicz et al., 2018; Laeng, Sirois, & Gredeback, 2012).
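As an illustration of the fixation-based measures discussed above (e.g., the proportion of total viewing time dedicated to a manipulated or otherwise critical region), the computation reduces to summing fixation durations per interest area and normalising by the total fixation time. The following sketch uses made-up fixation records and rectangular interest areas; it is not tied to the output format of any particular eye tracker or to the analyses of the studies cited here.

```python
# A minimal sketch of dwell-time proportions, assuming hypothetical fixation
# records (x, y in pixels, duration in ms) and rectangular interest areas.
from typing import Dict, List, Tuple

Fixation = Tuple[float, float, float]        # (x, y, duration_ms)
Rect = Tuple[float, float, float, float]     # (left, top, right, bottom)

def dwell_proportions(fixations: List[Fixation],
                      areas: Dict[str, Rect]) -> Dict[str, float]:
    """Proportion of total fixation time falling inside each interest area."""
    totals = {name: 0.0 for name in areas}
    grand_total = 0.0
    for x, y, dur in fixations:
        grand_total += dur
        for name, (left, top, right, bottom) in areas.items():
            if left <= x <= right and top <= y <= bottom:
                totals[name] += dur
                break  # interest areas assumed to be non-overlapping
    return {name: (t / grand_total if grand_total else 0.0)
            for name, t in totals.items()}

# Hypothetical trial: did the viewer dwell longer on the manipulated region?
areas = {"manipulated": (0, 0, 400, 300), "unchanged": (400, 0, 800, 300)}
fixations = [(120, 150, 310), (250, 200, 280), (600, 120, 190), (90, 80, 220)]
print(dwell_proportions(fixations, areas))
# With these made-up numbers: {'manipulated': 0.81, 'unchanged': 0.19}
```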
2.2.3 Eye movements in mental imagery and memory simulations
As discussed in Chapters 2.1.1 and 2.1.2, there is mounting evidence showing the neural and behavioural similarities between memory and mental imagery (Albers et al., 2013; Keogh & Pearson, 2011). Concordantly, simulation theories of memory within grounded-embodied cognition highlight the connection between memory and mental imagery in that both processes are simulations in essence. That is, memory retrieval/mental imagery is a neural, perceptual and/or motor reinstatement of perception (Borst & Kosslyn, 2008; Buckner & Wheeler, 2001; De Brigard, 2014; Ganis et al., 2004; Kent & Lamberts, 2008; Mahr & Csibra, 2018; Michaelian, 2016b; Norman & O’Reilly, 2003; Pasternak & Greenlee, 2005; Shanton & Goldman, 2010). Eye movements play a crucial role in the simulation thesis of memory and mental imagery because they can illustrate the behavioural reinstatements between perception/encoding and imagery/retrieval. The essential idea behind this imagery - eye movements - memory network holds that eye movements are stored in memory along with the visual representations of previously inspected images and are re-enacted during memory retrieval and visual imagery (Mast & Kosslyn, 2002). Long before the idea was demonstrated empirically, many researchers hinted at a possible similarity in saccades between visual perception and imagery (Hebb, 1968; Hochberg, 1968; Neisser, 1967; Schulman, 1983). Hebb (1968) was probably the first researcher who explicitly argued that “if the mental image is a reinstatement of the perceptual process, it should include the eye movements” (p. 470). Brandt and Stark (1997) provided direct empirical evidence for this argument by showing that people do move their eyes during mental imagery and that the scanpaths (i.e., the sequential order of fixations and saccades, not only their spatial positions) are not random (see also Noton & Stark, 1971 for more on scanpath theory). Instead, they bear striking similarities to the scanpaths during the perception of the original image (Foulsham & Underwood, 2008; Underwood, Foulsham, & Humphrey, 2009). The correspondence between eye movements in perception and imagery was so robust that it was observed both for auditory stimuli (retelling a story) and visual stimuli (describing a picture), and even when participants were in complete darkness and thus without any visual information at all during imagery (Johansson et al., 2006). It seems reasonable to assume that the spatiotemporal characteristics of visual perception are similar to those of mental imagery, as eye movements reflect the mental processes during visual inspection. Memory-guided eye movements are also informative about the grounding of abstract concepts such as time. In Martarelli, Mast and Hartmann (2017), participants launched more rightward saccades during encoding, free recall and recognition of future items compared to past items (see also Hartmann, Martarelli, Mast, & Stocker, 2014; Stocker, Hartmann, Martarelli, & Mast, 2015). The majority of studies investigating ocular motility in mental imagery and memory have revolved around the role and functionality of eye movements. Whether these eye movements are merely epiphenomenal (i.e., an involuntary by-product of the imagery process) or play an important role and affect imagery/retrieval performance is an important issue in that it directly taps into the primary question of nonvisual gaze patterns: Why do people move their eyes when forming mental images in the first place?
Early studies (Kosslyn, 1980) discussed a potential advantage in vividness if non-random eye movements are systematically employed during mental imagery; yet they failed to provide experimental evidence, which led to a premature conclusion: oculomotor movements during imagery were regarded as a mere reflection of the visual buffer (Kosslyn, 1980, 1987). The visual buffer is a hypothetical unit responsible for holding visual information for a limited time. Nonvisual eye movements in mental imagery were assumed to be an additional mechanism for representing complex scenes in the visual buffer without overloading its capacity (Brandt & Stark, 1997). Thus, eye movements were viewed as passively mirroring the attentional window over the target image during encoding to provide a solution for the cognitive load problem (Irwin & Gordon, 1998). There is now increasing evidence that eye movements have a relatively more direct role in mental imagery and memory (Bochynska & Laeng, 2015; Hollingworth & Henderson, 2002; Laeng et al., 2014; Mäntylä & Holm, 2006; Stark & Ellis, 1981; Underwood et al., 2009; Valuch, Becker, & Ansorge, 2013). For example, in Laeng and Teodorescu (2002), participants viewed an irregular checkerboard, similar to the one used by Brandt and Stark (1997), or a coloured picture. Then, they were asked to mentally imagine the visual stimuli while looking at a blank screen. Percentages of fixation time on certain interest areas and the order of scanning during the perceptual phase (i.e., original image) and the imagery phase (i.e., blank screen) were highly correlated. More importantly, the strength of relatedness between scanpaths predicted the vividness of mental imagery. More recent evidence indicates that what is perceptually simulated in memory retrieval or mental imagery is not the order of eye movements (i.e., scanpaths) but rather the locations inspected during perception. In a visual memory experiment, Johansson, Holsanova, Dewhurst and Holmqvist (2012) found no literal re-enactment during retrieval, although suppression of eye movements hindered retrieval accuracy (cf. Bochynska & Laeng, 2015). Challenging scanpath theory, they deduced that eye movements during retrieval are functional but are not a one-to-one reactivation of the oculomotor activity produced during perception/encoding (see also Foulsham & Kingstone, 2012 for similar results). Also, in Laeng and Teodorescu (2002), the participants who were not allowed to scan freely during the imagery phase (i.e., the fixed gaze condition) did worse when they were asked to recall the original pattern, recall being calculated as the number of squares that corresponded to the location of a black square in the grid. Using a similar paradigm in visuospatial memory, Johansson and Johansson (2014) asked participants to view objects distributed across four quadrants during the encoding phase. Participants then listened to statements about the direction of the objects (e.g., “The car was facing left”) and were asked to decide whether the statements were true or false. Results showed that participants who were free to look at a blank screen during retrieval showed superior retrieval performance compared to participants whose eye movements were constrained to a central fixation point. Further, participants whose eye movements were constrained to the previous locations of the objects were more accurate and faster than participants whose eye movements were constrained to a location diagonal to the previous location of the object in question.
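One simple way to quantify the kind of encoding-retrieval gaze correspondence measured in these studies is to correlate the distribution of fixation time over interest areas (e.g., screen quadrants) in the perception/encoding phase with the distribution in the imagery/retrieval phase. The sketch below is a schematic illustration with hypothetical proportions; it is not the analysis pipeline of any of the cited studies.

```python
# A schematic sketch: correlate the proportion of looking time per quadrant
# during perception/encoding with the proportions during imagery/retrieval.
# The numbers are hypothetical; a higher correlation would indicate a closer
# re-enactment of the original viewing pattern.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Proportion of fixation time in four quadrants (upper-left, upper-right,
# lower-left, lower-right) while viewing the scene ...
encoding = [0.40, 0.25, 0.20, 0.15]
# ... and while recalling it in front of a blank screen.
retrieval = [0.35, 0.30, 0.15, 0.20]

print(f"Encoding-retrieval gaze overlap: r = {pearson_r(encoding, retrieval):.2f}")
```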
The studies reviewed above suggest that the human mind encodes eye movements not as they are but in the form of spatial indices, seemingly invisible spatial pointers in space (D. C. Richardson & Kirkham, 2004; D. C. Richardson & Spivey, 2000). Spatial indices link internal representations to objects in the visual world by tapping into space-time information and, in turn, trigger eye movements to blank locations during retrieval to reduce working memory demands (Ballard et al., 1997). Therefore, there is no need for a literal recapitulation of gaze patterns because eye movements function as a scaffolding structure, with the network of spatial indices, for the generation of a detailed image. In other words, spatial indices in the environment which are internalised via eye movements complete the representations “in the head”, resulting in a detailed mental image (Ferreira et al., 2008). In an alternative model, O’Regan and Noë (2001) put forward the idea that seeing is a way of acting and that eye movements are themselves visual representations, in a nod to ecological psychology (Gibson, 1979). To sum up, current evidence shows that oculomotor activity during memory and mental imagery is not limited to the reconstruction of the original: it is essential for generating mental images. Further, it seems that the role of eye movements goes beyond an automatic and involuntary distribution of limited cognitive resources between oculomotor activity and memory to alleviate mental load. Rather, eye movements might serve as an optional, situational strategy in situations where employing them could make a difference for solving the task (Hayhoe et al., 1998; Laeng et al., 2014; J. T. E. Richardson, 1979). In support of this assumption, many task-oriented vision studies have suggested that “the eyes are positioned at a point that is not the most visually salient but is the best for the spatio-temporal demands of the job that needs to be done” (Hayhoe & Ballard, 2005, p. 189). Furthermore, there is also intriguing evidence that these strategic, opportunistic eye movements in goal-directed behaviour are guided by a dopamine-based reward system (Glimcher, 2003; Hikosaka, Takikawa, & Kawagoe, 2000). Thus, eye movements during imagery and memory can be situational and adaptive according to the task demands. For example, in Laeng, Bloem, D’Ascenzo and Tommasi (2014), eye movements during mental imagery concentrated on the salient, information-rich parts of the original image (e.g., the head region of an animal picture in the study). Here, it is important to underline that the difficulty of the task seems to be the decisive factor. For instance, memory tasks requiring relatively low cognitive load do not need a detailed mental image of the original scene to be solved; thus, retrieval should be challenging in order to observe any memory advantage (Hollingworth & Henderson, 2002; Laeng et al., 2014).