A compressed hierarchy for visual form processing in the tree shrew

August 29, 2025


The ability to recognize objects is fundamental to the survival of visual animals. The primate ventral stream has long served as a model for studying how objects are processed in the brain^10,11. One defining feature of the primate ventral stream is hierarchical organization¹², which is mirrored by deep neural networks (DNNs) trained on object recognition^8,13. This parallel raises an important question: is hierarchical representation necessary and, if so, can it be found across all highly visual mammalian species? Investigating visual processing across different mammalian species promises to provide a deeper understanding of general principles for object vision.

Over a decade ago, the mouse visual system began to attract strong interest, driven by the wealth of tools available for mouse neural circuit dissection^14,15. However, the mouse’s low visual acuity and limited cortical territory dedicated to vision¹⁶ make it a non-ideal organism for studying hierarchical brain mechanisms underlying object recognition. The tree shrew has attracted growing interest as a model to study visual processing¹⁷ owing to its high visual acuity (more than ten times that of rodents)¹⁸, greatly expanded visual cortex¹⁹ and excellent ability to perform visually guided behavioural tasks compared with the mouse^20,21. The tree shrew visual system includes at least nine distinct anatomical visual cortical areas¹⁹. The primary visual area (V1) shows a high degree of functional specialization, including an orderly arrangement of orientation-selective columns^22,23. The tree shrew also has a prominent second visual area (V2), albeit with a large-scale topographic organization that differs from that of primates²⁴. Lesion studies suggest a rough correspondence between tree shrew extrastriate areas anterior to V2 and primate IT cortex: ablations of large portions of the temporal lobe produce deficits in pattern discrimination and object vision similar to the effects of inferotemporal (IT) lesions in primates^19,25,26. However, to our knowledge, there have been no electrophysiological studies of the functional properties of tree shrew extrastriate visual areas beyond V2.

Here we aim to identify the cortical organization and coding principles that underlie visual object representation across the entire tree shrew ventral stream. Using large-scale electrophysiological recordings with several Neuropixels probes, we surveyed five tree shrew ventral visual areas as well as the pulvinar. We confirmed hallmarks of hierarchical organization found in primates, including increased receptive field size and response latency²⁷ as well as increased selectivity for naturalistic textures compared with spectrally matched noise⁵. We found that area V2 in the tree shrew performs key functions associated with primate IT cortex. This includes a full representation of high-level object space, accurate object identity decoding and reconstruction, and the presence of strongly face-selective cells. Overall, the results indicate a compressed, multi-stage hierarchy in the tree shrew in which representations previously observed in the primate are realized at a much earlier stage of visual processing.

We targeted a set of areas spanning the tree shrew ventral stream to investigate hierarchical visual processing (Fig. 1a). We included primary (V1) and secondary (V2) visual areas as architectonically distinct regions involved in early stages of visual processing^28,29. As a potential intermediate node along the ventral visual processing stream, we selected the temporal posterior (TP) area. At the anterior end, we focused on three subregions that may be homologous to macaque IT cortex: temporal-inferior (TI), temporal intermediate (ITi) and inferotemporal rostral (ITr) areas. Lesions to TI and ITi cause drastic impairments in visual form detection²⁵. ITr receives inputs from both visual and auditory cortex¹⁹, but its visual functional properties have never been explored, to our knowledge. Owing to the difficulty in distinguishing the border between TI and ITi, we grouped them together and refer to this region as TI-ITi. Because many temporal areas receive direct input from the thalamus³⁰, we also included the dorsal visual portion of the pulvinar (Pulv) in our recordings. To guide electrode targeting, we performed retrograde tracing experiments (Extended Data Fig. 1a,b).

**Fig. 1: High-throughput electrophysiological recordings along the tree shrew ventral visual pathway reveal a functional hierarchy.**

To characterize the visual responses of neurons across V1, V2, TP, TI-ITi, ITr and Pulv, we performed electrophysiological recordings using Neuropixels probes in awake tree shrews (Fig. 1b). During each experiment, animals were head-fixed in front of a monitor and presented with a battery of visual stimuli, including local sparse noise, static gratings, naturalistic textures and noise, and images of faces and other objects. At the conclusion of each session, probe locations were marked with DiI (DiIC18(3), a fluorescent dye) and targeting was confirmed with histology (Fig. 1c). We classified a cell as visually responsive if it responded to any of the classes of visual stimuli we tested (Methods). We found many well-isolated single units in each area (Fig. 1d), with some inter-area differences in the fractions of cells that responded to visual stimuli (ANOVA, F_5,18 = 5.362, P = 0.003; Fig. 1e). In particular, significantly fewer TI-ITi cells were visually responsive compared with V2 cells.

We began by mapping the receptive fields of neurons along the tree shrew ventral pathway using a locally sparse noise stimulus (Methods). For each neuron, we estimated the receptive field by fitting a Gaussian distribution to the two-dimensional (2D) matrix of spike counts across visual field locations; ON and OFF receptive fields were computed separately using responses to white and black squares, respectively. Cells with ON and/or OFF receptive fields were clearly present in all areas except TP (Fig. 1f). This included the two most anterior areas TI-ITi and ITr; this contrasts with the anterior temporal lobe in primates, where neurons typically show spatially invariant responses^31,32.
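As a minimal sketch of the receptive field estimate described above, the following assumes a spike-count map over a grid of stimulus locations and summarizes it with a Gaussian via weighted moments. This is a simplification swapped in for illustration: the fitting procedure in the Methods is not reproduced here, and the function name `fit_gaussian_rf` and the median-based baseline are assumptions. ON and OFF maps would each be passed through separately.

```python
import numpy as np

def fit_gaussian_rf(spike_counts, baseline=None):
    """Moment-based Gaussian summary of a 2D spike-count map.

    Returns (centre, cov), with centre given as (x, y) in grid units and
    cov the 2x2 covariance of the fitted Gaussian, or None if the map has
    no response above baseline. Hypothetical helper, not the paper's code.
    """
    counts = np.asarray(spike_counts, dtype=float)
    if baseline is None:
        # Assumption: the median of the map approximates the spontaneous rate.
        baseline = np.median(counts)
    weights = np.clip(counts - baseline, 0.0, None)
    total = weights.sum()
    if total == 0:
        return None
    rows, cols = np.indices(counts.shape)
    centre = np.array([(weights * cols).sum(), (weights * rows).sum()]) / total
    dx = cols - centre[0]
    dy = rows - centre[1]
    cov = np.array([
        [(weights * dx * dx).sum(), (weights * dx * dy).sum()],
        [(weights * dx * dy).sum(), (weights * dy * dy).sum()],
    ]) / total
    return centre, cov
```

The square roots of the diagonal entries of `cov` give one possible measure of receptive field size along each axis; a least-squares fit of a parametric Gaussian would be a reasonable alternative when maps are noisy.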

Within individual recordings, receptive field positions were clustered in a small portion of the visual field, corresponding to the retinotopic region represented by the cortical site targeted with the electrode. Figure 1g shows receptive fields of all recorded cells in a representative session for each area. This clustering was evident across all areas studied, including the most anterior areas, TI-ITi and ITr. This finding suggests that, despite their position at the anterior end of the ventral stream, these areas preserve retinotopic organization.

To assess the hierarchical relationships between the recorded areas, we first examined two classic metrics of hierarchical level: receptive field size and visually evoked response latency. Receptive field sizes increased systematically from posterior to anterior (Fig. 1h). We also calculated the half-peak latencies for each unit in each area and found that latencies increased from V1 to V2 to ITr (Fig. 1i and Methods). The hierarchy predicted by the increase in receptive field sizes was broadly consistent with the hierarchy predicted by the increase in latencies (Fig. 1j).
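The half-peak latency mentioned above can be illustrated with a short sketch. It assumes a trial-averaged firing-rate trace (PSTH) aligned to stimulus onset at time zero and defines latency as the first post-onset time at which the baseline-subtracted response reaches half of its peak. The baseline window and the function name `half_peak_latency` are assumptions; the exact definition is given in the Methods, which are not reproduced here.

```python
import numpy as np

def half_peak_latency(psth, times, baseline_window=(-0.1, 0.0)):
    """First post-onset time at which the evoked response reaches half its peak.

    psth  : firing rate per time bin
    times : bin times in seconds, with stimulus onset at 0
    Returns None when there is no response above baseline.
    """
    psth = np.asarray(psth, dtype=float)
    times = np.asarray(times, dtype=float)
    in_baseline = (times >= baseline_window[0]) & (times < baseline_window[1])
    baseline = psth[in_baseline].mean()
    post = times >= 0
    evoked = psth[post] - baseline
    peak_idx = int(np.argmax(evoked))
    peak = evoked[peak_idx]
    if peak <= 0:
        return None
    # Search only up to the peak so that later rebounds are ignored.
    above = np.nonzero(evoked[:peak_idx + 1] >= 0.5 * peak)[0]
    return float(times[post][above[0]])
```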

In the primate visual cortex, early visual areas are strongly tuned to low-level features such as orientation and spatial frequency, whereas later areas are tuned to more complex object features^7,33,34,35. To examine whether a similar progression exists in the tree shrew, we assessed tuning to orientation and spatial frequency across ventral visual areas using static gratings (Fig. 2a). We found that the proportion of visually responsive neurons (see Fig. 1e) that responded to gratings was the highest in V1 and V2 (roughly 55% and 65%, respectively) and lowest in TI-ITi (Fig. 2b). Tuning to orientation, spatial frequency and phase of example cells from V2 and ITr illustrates the diverse tuning we observed to these variables across tree shrew visual areas (Fig. 2c). Overall, orientation tuning was most prevalent in V1 and V2 (Tukey analysis after ANOVA, F_5,1106 = 26.791, P < 10⁻²⁴; Fig. 2d), whereas spatial frequency tuning was also prevalent in ITr (Tukey analysis after ANOVA, F_5,1106 = 20.514, P < 10⁻¹⁸; Fig. 2e). These findings are roughly consistent with those found in the primate and rodent ventral stream, where orientation tuning is especially prominent in early visual areas and then sharply decreases in later areas^36,37,38,39.

**Fig. 2: Encoding of orientation, spatial frequency, and texture across tree shrew ventral visual areas.**

Thus far, V2 responses seemed largely similar to those in V1, raising the question whether V2 performs any distinct computational function. In macaques, sensitivity to higher-order statistical dependencies in naturalistic textures has been identified as a distinguishing feature of area V2 (ref. ⁵). We therefore asked whether tree shrew extrastriate areas show a similar specialization for naturalistic texture processing. To test this, we recorded neural activity across all six visual areas while presenting naturalistic textures and spectrally matched synthetic noise images (Fig. 2f and Methods). Among all areas, V2 contained the highest proportion of cells that responded to the texture and/or noise stimuli (Fig. 2g). Population response dynamics revealed the strongest differential activity between naturalistic textures and noise in V2, followed by V1, ITr and TI-ITi, with minimal or no modulation in the remaining areas (Fig. 2h). In V2, the difference persisted for the duration of the stimulus. Although responses in V1 commenced well before those in V2 (Fig. 1i), the divergence between texture and noise responses occurred later in V1 (at 90 ms) than in V2 (at 45 ms), suggesting that the texture modulation in V1 may arise through feedback from V2. This interpretation is further supported by the finding that V2 encoded texture family identity earlier than V1 (Fig. 2i).

A central function of the visual hierarchy is to recognize and categorize objects to guide vital behaviours such as navigation, foraging or mating. To investigate high-level object representations in the tree shrew ventral stream, we presented a rich stimulus set consisting of 1,593 images of animals, body parts, faces and everyday objects (Methods). This same stimulus set has previously been used to characterize tuning in macaque inferotemporal (IT) cortex, enabling direct comparisons between object recognition mechanisms in primates and tree shrews⁸. Stimuli were adjusted to match the receptive field location of recorded neurons (Methods). Response rasters from example cells showed diversity in object selectivity across different neurons in the tree shrew ventral stream (Fig. 3a). Among the six areas recorded, a similar proportion of visually responsive cells responded to object stimuli across V2, TP, TI-ITi and Pulv (Fig. 3b). Notably, a much larger fraction of visually responsive cells in TI-ITi responded to object stimuli compared with gratings (Fig. 2b), consistent with temporal areas occupying a higher level in the visual hierarchy. To quantify the reliability of object-driven responses, we computed the ‘explainable variance’—the portion of neural response variance attributable to stimulus identity rather than trial-to-trial variability (Methods). After V2, the explainable variance in responses to these complex object stimuli decreased notably (Fig. 3c), indicating that responses in more anterior areas were less consistent across trials. To determine whether the explainable variance could be accounted for by low-level visual features, we analysed the contributions of luminance, contrast and spatial frequency; in each area, only a small fraction of the variance could be explained by such features (Fig. 3c and Extended Data Fig. 2).
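To make the notion of explainable variance concrete, here is a minimal sketch of one common estimator, assuming a matrix of single-trial responses with one row per stimulus and one column per repeat. The variance of the trial-averaged responses is corrected for the noise that survives averaging, and the result is expressed as a fraction of total variance. This estimator and the name `explainable_variance` are assumptions for illustration; the paper's exact computation is described in its Methods.

```python
import numpy as np

def explainable_variance(responses):
    """Fraction of response variance attributable to stimulus identity.

    responses : array of shape (n_stimuli, n_trials), one cell.
    """
    r = np.asarray(responses, dtype=float)
    n_trials = r.shape[1]
    # Trial-to-trial (noise) variance, averaged over stimuli.
    noise_var = r.var(axis=1, ddof=1).mean()
    # Variance of the stimulus means still contains noise_var / n_trials.
    mean_var = r.mean(axis=1).var(ddof=1)
    signal_var = max(mean_var - noise_var / n_trials, 0.0)
    total = signal_var + noise_var
    return signal_var / total if total > 0 else 0.0
```

Under this definition, a cell whose responses are identical on every repeat of a stimulus scores 1, and a cell whose responses are unrelated to stimulus identity scores close to 0.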

**Fig. 3: Objects are encoded across tree shrew ventral visual areas through axis coding.**

To better understand the nature of the neural code used by each area, we modelled neural responses using AlexNet⁴⁰, an eight-layer DNN trained on object recognition (Fig. 3d). In macaques, single IT neurons are well described by an ‘axis model’, in which each cell linearly projects incoming stimuli onto a preferred axis in a DNN-derived feature space^8,13. In these models, the preferred axes span a relatively low-dimensional basis—such that, for example, just 50 dimensions are sufficient for accurate reconstructions of faces from macaque face patches⁴¹. To test whether this principle also applies in the tree shrew, we computed the preferred axis of each neuron across six recorded areas using the first 50 principal components from AlexNet layer FC6. We focused on FC6 to clarify whether tree shrew cortex represents a high-level object space, as observed in macaque IT cortex⁸. Consistent with axis-based coding, neurons in all six areas showed ramp-shaped tuning along their preferred axes (Fig. 3e and Methods). Moreover, cells showed flat tuning along their principal orthogonal axis (that is, longest axis orthogonal to the preferred axis; Fig. 3f and Methods).
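The axis model can be sketched as follows, assuming a feature matrix (images by principal components, for example the 50 FC6 components) and a vector of mean responses for one cell. The preferred axis is taken as the normalized least-squares regression weights, and the principal orthogonal axis as the direction of greatest feature variance after the preferred axis has been projected out. Function names are hypothetical, and the paper's Methods may estimate the axis differently.

```python
import numpy as np

def preferred_axis(features, responses):
    """Unit vector along which the cell's response varies linearly."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(responses, dtype=float)
    X = X - X.mean(axis=0)
    y = y - y.mean()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)

def principal_orthogonal_axis(features, axis):
    """Longest axis of the feature cloud that is orthogonal to `axis`."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)
    # Remove each image's component along the preferred axis.
    X_perp = X - np.outer(X @ axis, axis)
    _, _, vt = np.linalg.svd(X_perp, full_matrices=False)
    return vt[0]
```

Ramp-shaped tuning would then appear as a monotonic relationship between the responses and `features @ preferred_axis(...)`, and flat tuning as the absence of such a relationship for the projection onto the orthogonal axis.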

Previous studies in primates have shown that early layers of AlexNet and other DNNs best explain neuronal activity in early retinotopic visual areas, whereas later layers best explain responses in IT cortex^8,13. We asked whether a similar pattern holds across the tree shrew ventral stream. To test this, we regressed single-cell firing rates against the first 50 principal components of each layer in AlexNet (Methods) and identified the layer that best explained the variance in each cell’s response. For one representative cell in V2, AlexNet layer Conv4 best explained its responses (Fig. 4a). Across the V2 population, we found that intermediate layers—specifically Conv4 and Conv5—consistently had greater explanatory power than either early or late layers (Fig. 4b).
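The layer-wise comparison can be sketched in a few lines, assuming a dictionary that maps layer names to activation matrices (images by units) and a vector of firing rates for one cell. Each layer is reduced to its leading principal components, the rates are regressed on those components, and the layer with the highest cross-validated explained variance is selected. The use of ordinary least squares, five-fold cross-validation and all function names are assumptions for illustration rather than the procedure in the Methods.

```python
import numpy as np

def top_pcs(activations, n_components=50):
    """Scores of the images on the leading principal components of a layer."""
    X = np.asarray(activations, dtype=float)
    X = X - X.mean(axis=0)
    u, s, _ = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, s.size)
    return u[:, :k] * s[:k]

def cv_explained_variance(features, rates, n_folds=5, seed=0):
    """Cross-validated R^2 of a linear fit from features to firing rates."""
    rates = np.asarray(rates, dtype=float)
    n = rates.size
    order = np.random.default_rng(seed).permutation(n)
    pred = np.empty(n)
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        X_train = np.column_stack([features[train_idx], np.ones(train_idx.size)])
        coef, *_ = np.linalg.lstsq(X_train, rates[train_idx], rcond=None)
        X_test = np.column_stack([features[test_idx], np.ones(test_idx.size)])
        pred[test_idx] = X_test @ coef
    return 1.0 - np.sum((rates - pred) ** 2) / np.sum((rates - rates.mean()) ** 2)

def best_layer(layer_activations, rates, n_components=50):
    """Name of the best-explaining layer and the score of every layer."""
    scores = {
        name: cv_explained_variance(top_pcs(act, n_components), rates)
        for name, act in layer_activations.items()
    }
    return max(scores, key=scores.get), scores
```

Note that the principal components here are computed on all images before cross-validation; because that step does not use the neural responses, it should not inflate the scores, but a stricter pipeline would refit it within each fold.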

**Fig. 4: Neural representation of object stimuli in tree shrew ventral visual areas reveals optimal feature decoding in area V2.**

To compare the explanatory power of different AlexNet layers across brain areas, we calculated the sum across cells within each area of the variance explained by the various AlexNet layers, and normalized these sums by the sum across cells of their explainable variance (Methods). This analysis revealed that early visual areas V1 and V2 were best explained by intermediate layers (specifically Conv3 to Conv5), whereas anterior areas TI-ITi and ITr were best explained by the high-level FC6 layer (Fig. 4c). However, the absolute variance explained by AlexNet was lower in these higher cortical areas (Extended Data Fig. 3a,b), consistent with the reduced trial-to-trial reliability of responses to object identity observed in anterior regions (Fig. 3c). One possible explanation is that AlexNet may lack the expressive capacity to fully capture response properties of anterior tree shrew regions, which have been proposed to be multimodal and not exclusively visual¹⁹.
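The area-level normalization described above amounts to a ratio of sums, which can be written out as a short sketch. It assumes, for one brain area, a matrix of variance explained per layer and cell and a vector of explainable variance per cell; the function name is hypothetical.

```python
import numpy as np

def normalized_explained_variance(explained, explainable):
    """Per-layer explained variance, pooled over cells and normalized.

    explained   : array (n_layers, n_cells), variance explained by each layer
    explainable : array (n_cells,), explainable variance of each cell
    """
    explained = np.asarray(explained, dtype=float)
    explainable = np.asarray(explainable, dtype=float)
    # Sum over cells first, then divide: cells with more explainable
    # variance carry more weight than in a mean of per-cell ratios.
    return explained.sum(axis=1) / explainable.sum()
```

Summing before dividing keeps cells with very little explainable variance from dominating the estimate, which a mean of per-cell ratios would not.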

To investigate which feature axes accounted for the most variance in neural responses across areas, we examined how much variance was explained by individual feature principal components from AlexNet layer FC6. In general, earlier principal components explained the greatest proportion of variance in neural responses, with some variability across areas (Fig. 4d). We also analysed how well specific FC6 features could be decoded from population activity in each visual area (Fig. 4e). Again, early principal components were most strongly represented, with decoding performance peaking in V2, substantially higher than in any other region. This finding aligns with the observation that FC6 features explained more variance in V2 than in other areas (Extended Data Fig. 3c). Thus, even though V2 was best explained by Conv4 and Conv5 features, whereas TI-ITi and ITr were best explained by FC6 features, FC6 features were nevertheless better represented in V2 than in these more anterior areas.

Given the strong performance of V2 in decoding AlexNet FC6 features, we next asked whether activity in V2 might be sufficient to reconstruct objects using small neural populations, as has previously been shown in monkey IT cortex⁸. To test this, we used a large auxiliary dataset of 15,901 images, each passed through AlexNet to extract FC6 activations. From activity in each area, we reconstructed an FC6 activation vector and identified the image whose FC6 features were closest to the reconstruction (Extended Data Fig. 3d). To control for cell number, we performed reconstructions using 100 randomly selected cells from each area. Consistent with our results on parameter decoding (Fig. 4e), which were optimal in V2, images reconstructed from V2 closely resembled the original images, whereas images reconstructed from V1 or TI-ITi were notably less accurate (Fig. 4f). To quantitatively compare reconstruction accuracy across areas, we computed the distance between the reconstructed and actual FC6 activation vectors for each image, normalized by the theoretical best decoding distance (Methods). This analysis revealed that V2 had the smallest normalized decoding distances of all areas, indicating the most accurate reconstructions, and matched the performance obtained when pooling neurons from all areas combined (Tukey analysis after ANOVA, F_6,11144 = 151.248, P < 10⁻¹⁸⁴; Fig. 4g). These results further underscore the rich yet compact object representation present in tree shrew V2.
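The reconstruction procedure can be sketched in two steps, assuming population responses (images by cells) and the corresponding feature vectors (images by feature dimensions): a linear decoder maps population activity to a feature vector, and the reconstruction is the auxiliary image whose features lie closest to the decoded vector. The least-squares decoder and the function names are assumptions; the normalization by the theoretical best decoding distance is defined in the Methods and is not reproduced here.

```python
import numpy as np

def fit_linear_decoder(population, features):
    """Least-squares map from population activity to feature vectors."""
    population = np.asarray(population, dtype=float)
    X = np.column_stack([population, np.ones(len(population))])
    W, *_ = np.linalg.lstsq(X, np.asarray(features, dtype=float), rcond=None)
    return W

def decode_features(W, population):
    """Predicted feature vector for each row of population activity."""
    population = np.asarray(population, dtype=float)
    X = np.column_stack([population, np.ones(len(population))])
    return X @ W

def retrieve_nearest(decoded, library):
    """Index of the library image closest (Euclidean) to each decoded vector."""
    diff = decoded[:, None, :] - library[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)
```

For a library as large as the 15,901-image auxiliary set, the pairwise distance step would be better computed in batches to limit memory use.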

Primate IT cortex contains regions composed of neurons that respond maximally to images from specific categories, for example, faces^42,43,44. Such category-selective regions can be explained by a normative framework in which IT cortex encodes a general object space—a representational space defined by the first two principal components of the AlexNet FC6 features^8,45. Within this space, different sectors correspond to distinct object categories, such as faces, fruits and animals (Fig. 5a).

**Fig. 5: Single cells across the tree shrew ventral stream show selectivity for different sectors of object space including faces.**

Does the tree shrew visual cortex, like the primate IT cortex, contain regions specialized for representing distinct sectors of object space? To address this question, we projected the preferred axes of all recorded cells onto the same 2D object space (Fig. 5b). In V2, preferred axes were distributed across all four quadrants, whereas in other areas, they were largely confined to quadrants I and III. Given that different object categories are localized to distinct regions of this space, we predicted that individual tree shrew neurons would show selectivity for specific categories. Indeed, analysis of response rasters confirmed that neurons with preferred axes in the face sector were strongly face selective (Fig. 5c). Some face cells also responded to other round shapes, whereas others showed strong selectivity only for faces. In addition to face cells, we identified neurons selective for spiky, elongated objects (quadrant I), round inanimate objects (quadrant II) and spiky animate objects (quadrant IV) (Fig. 5d and Extended Data Fig. 4a,b). However, unlike the modular organization seen in primate IT, we found no evidence for topographic clustering of category-selective neurons within tree shrew visual areas (Extended Data Fig. 4c).

Faces—particularly human faces, which comprised all our face stimuli—are not known to hold special behavioural importance for tree shrews⁴⁶. To confirm that the cells were genuinely face selective, we computed a face selectivity index, defined as the difference between responses to faces and all other objects, for each individual cell (Methods). This confirmed small populations of highly face-selective cells (t ≥ 15) in most areas starting in area V2, with the highest percentages in TI-ITi and Pulv (Fig. 5e).

Primate IT cortex is highly specialized for object recognition and has long served as a foundation for studying visual form processing. To enable direct comparisons with our tree shrew dataset, we performed large-scale recordings in macaque monkeys using NHP Neuropixels probes. We presented the same 1,593 object stimuli while recording from V2, posterior IT (IT_post) and anterior IT (IT_ant) from two monkeys per area (Fig. 6a–c). We found that the explainable variance in responses to complex object stimuli increased along the primate visual hierarchy, from primate V2 to IT_ant (Fig. 6d), whereas in tree shrew visual cortex, it peaked in V2 (Fig. 3c). Similarly, image reconstruction performance improved along the primate hierarchy (Fig. 6e), whereas in tree shrews, it was most accurate in V2 (Fig. 4g). In contrast to tree shrews (Fig. 5e), we did not observe strongly face-selective cells in primate V2 (Fig. 6f). As expected, the number of face cells in primate IT_post and IT_ant was much higher. Notably, in one of the IT_post recordings, the probe partially targeted a known face patch, resulting in a higher proportion of face cells.

**Fig. 6: Comparison of object responses between primate and tree shrew ventral visual areas.**

Last, we asked how well neural populations in the primate and tree shrew visual systems could decode individual face or object identity. To test this, we trained classifiers to decode the identity of either 100 faces or 100 general objects using neural activity from randomly sampled subpopulations within each area (Fig. 6g and Methods). In tree shrews, all areas except TP showed above-chance decoding performance for both faces and objects. When we restricted the analysis to only face-selective cells, face identity decoding improved further. Decoding performance in tree shrew V2 exceeded that in all other tree shrew areas for both face and object identity. By contrast, primate V2 showed substantially lower decoding performance compared with tree shrew V2 (Fig. 6g). Indeed, decoding using tree shrew V2 activity was similar to that of primate posterior IT. As expected, primate anterior IT, which sits at the apex of the primate ventral visual hierarchy, showed the highest decoding accuracy.
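As an illustration of identity decoding from randomly sampled subpopulations, the sketch below assumes a response array of shape identities by trials by cells and uses a nearest-centroid classifier: centroids are built from half of the trials and the remaining trials are assigned to the closest centroid. The classifier type, the trial split and the function name are assumptions, since the classifiers used in the paper are specified only in its Methods.

```python
import numpy as np

def identity_decoding_accuracy(responses, n_cells, n_repeats=20, seed=0):
    """Mean nearest-centroid accuracy over random subpopulations.

    responses : array (n_identities, n_trials, n_cells_total)
    n_cells   : number of cells drawn at random for each repeat
    """
    responses = np.asarray(responses, dtype=float)
    rng = np.random.default_rng(seed)
    n_ids, n_trials, n_total = responses.shape
    half = n_trials // 2
    labels = np.arange(n_ids)[:, None]
    accuracies = []
    for _ in range(n_repeats):
        cells = rng.choice(n_total, size=n_cells, replace=False)
        trials = rng.permutation(n_trials)
        # Centroids from the training trials: (n_ids, n_cells).
        train = responses[:, trials[:half]][:, :, cells].mean(axis=1)
        # Held-out single trials: (n_ids, n_test, n_cells).
        test = responses[:, trials[half:]][:, :, cells]
        d2 = ((test[:, :, None, :] - train[None, None, :, :]) ** 2).sum(axis=3)
        predicted = d2.argmin(axis=2)
        accuracies.append(np.mean(predicted == labels))
    return float(np.mean(accuracies))
```

With 100 identities, chance performance is 1%, so accuracies can be compared across areas at a fixed `n_cells` to control for population size.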

A hallmark of the primate ventral stream is the gradual emergence of view invariance, raising the question of whether a similar progression exists in tree shrews^8,31,47. Using DNN models, we computed a predicted view invariance index (Methods) based on responses to the 1,593 object images. In macaques, this predicted index was positively correlated with the empirically measured view invariance index, and both increased along the ventral hierarchy (Extended Data Fig. 5a–c). Applying the same approach to tree shrews, we found no such trend in the model-predicted responses (Extended Data Fig. 5d,e). This absence of a clear progression suggests that view invariance may not emerge in the same hierarchical manner (or may not be captured by current models) in the tree shrew ventral stream. However, direct empirical testing within each area is needed to determine whether view invariance is a core organizing principle of the tree shrew visual pathway, as it is in primates³² and rodents^48,49. Taken together, our findings show how visual processing along a series of interconnected areas in tree shrews compares with that in primates^5,7,8,9,13,32 and rodents^50,51,52, highlighting both important similarities and differences (Fig. 6h).
