
Aligning machine and human visual representations across abstraction levels

Soft alignment

This section is organized as follows. We start by describing how we transform model representations into a space that matches human similarity judgements about coarse-grained semantic object relations. We introduce an affine transformation that matches human similarity judgements and injects the uncertainties that humans assign to their triplet odd-one-out choices into a model's representation space, creating a surrogate teacher model. Using the teacher model's human-aligned representations, we sample triplets of ImageNet38 images, not uniformly at random but by clustering the representations into superordinate categories and using those clusters for data partitioning. We pseudo-label these triplets with human-aligned judgement distributions from the surrogate teacher model. Finally, after having created the AligNet triplets, we fine-tune student models with a triplet loss objective function.

Representational alignment

Data

To increase the degree of alignment between human and neural network similarity spaces, we begin from the publicly available THINGS dataset, which is a large behavioural dataset of 4.7 million unique triplet responses from 12,340 human participants for m = 1,854 natural object images51 from the public THINGS object concept and image database26. The THINGS dataset can formally be defined as \(D:=\{(\{i_s,j_s,k_s\},\{a_s,b_s\})\}_{s=1}^{n}\), which denotes a dataset of n object triplets and corresponding human odd-one-out responses, where \(\{a_s,b_s\}\subset \{i_s,j_s,k_s\}\) and \(\{a_s,b_s\}\) is the object pair that was chosen by a human participant among the s-th triplet to have the highest similarity. Let \(\mathbf{X}\in \mathbb{R}^{m\times p}\) be the teacher model representations for the m = 1,854 objects in the THINGS dataset, where p is the dimension of the image-representation vector. It is noted that each category in the THINGS dataset is represented by one object image. From X we can construct a similarity matrix for all object pairs, \(\mathbf{S}:=\mathbf{X}\mathbf{X}^{\top}\in \mathbb{R}^{m\times m}\), where \(S_{i,j}=\mathbf{x}_i^{\top}\mathbf{x}_j\) is the representational similarity for objects i and j, \(\top\) denotes the matrix transpose, and \(\mathbf{x}_i\) refers to the i-th row of X.

Odd-one-out accuracy

The triplet odd-one-out task is frequently used in the cognitive sciences to measure human notions of object similarity52,53,54,55. To measure the degree of alignment between human and neural network similarity judgements in the THINGS triplet task, we embed the m = 1,854 THINGS images into the representation space of a neural network, giving \(\mathbf{X}\in \mathbb{R}^{m\times p}\). Given vector representations x1, x2 and x3 of the three images in a triplet, we first construct a similarity matrix \(\mathbf{S}\in \mathbb{R}^{3\times 3}\), where \(S_{i,j}:=\mathbf{x}_i^{\top}\mathbf{x}_j\) is the dot product between a pair of image representations. We identify the closest pair of images in the triplet as \(\arg\max_{i,j>i}S_{i,j}\), with the remaining image being the odd one out. We define odd-one-out accuracy as the fraction of triplets for which the odd one out chosen by a model is identical to the human odd-one-out choice. Thus, our goal is to learn an affine transformation into the THINGS human object similarity space of the form \(\mathbf{x}^{\prime}=\mathbf{W}\mathbf{x}+\mathbf{b}\). Here, \(\mathbf{W}\in \mathbb{R}^{p\times p}\) is a learned transformation matrix, \(\mathbf{b}\in \mathbb{R}^{p}\) is a bias and \(\mathbf{x}\in \mathbb{R}^{p}\) is the neural network representation of a single object image in the THINGS dataset. We learn the affine transformation for the image encoder representation space of the teacher model (see the 'Surrogate teacher model' section for details about the teacher model). Using this affine transformation, an entry of the pairwise similarity matrix S′ (which represents the similarity between two object images i and j) can now be written as \(S^{\prime}_{i,j}:=(\mathbf{W}\mathbf{x}_i+\mathbf{b})^{\top}(\mathbf{W}\mathbf{x}_j+\mathbf{b})\).
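To make the decision rule concrete, the odd-one-out accuracy can be computed as sketched below; this is a minimal NumPy illustration with our own variable names and array shapes, not the authors' implementation.

```python
import numpy as np

def odd_one_out_accuracy(reps, human_pairs):
    """Fraction of triplets for which the model's most-similar pair matches the human choice.

    reps: array of shape (n_triplets, 3, p) holding the representations of the
          three images in each triplet.
    human_pairs: iterable of index pairs (into {0, 1, 2}) chosen by humans as most similar.
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    correct = 0
    for x, human_pair in zip(reps, human_pairs):
        S = x @ x.T                                    # 3 x 3 dot-product similarity matrix
        model_pair = max(pairs, key=lambda ij: S[ij])  # arg max_{i, j > i} S_ij
        correct += set(model_pair) == set(human_pair)
    return correct / len(reps)

# Toy usage with random representations and arbitrary "human" responses.
rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 3, 16))
human_pairs = [tuple(rng.choice(3, size=2, replace=False)) for _ in range(100)]
print(odd_one_out_accuracy(reps, human_pairs))
```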

Hard-alignment loss

Given a similarity matrix of neural network representations S and a triplet i, j, k, the likelihood of a particular pair, \(\{a,b\}\subset \{i,j,k\}\), being the most similar pair (equivalently, of the remaining object being the odd one out) is modelled by the softmax of the object similarities,

$$\sigma(\mathbf{S},\tau):=\frac{\exp(S_{a,b}/\tau)}{\exp(S_{i,j}/\tau)+\exp(S_{i,k}/\tau)+\exp(S_{j,k}/\tau)}$$

(1)

We can then define the probability that the neural network model chooses the most similar pair (according to the human participants) as \(q(\{a,b\}\mid \{i,j,k\},\mathbf{S}):=\sigma(\mathbf{S},\tau)\) with a temperature parameter τ = 1. For n triplet responses, the discrete negative log-likelihood is defined as follows

$$L_{\text{hard-align}}(\mathbf{S}^{\prime}):=-\frac{1}{n}\sum_{s=1}^{n}\log q(\{a_s,b_s\}\mid \{i_s,j_s,k_s\},\mathbf{S}^{\prime})$$
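A minimal NumPy sketch of equation (1) and the hard-alignment negative log-likelihood follows; the function and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def pair_probabilities(S, tau=1.0):
    """Softmax over the three pairwise similarities of a triplet (equation (1)).

    S: 3 x 3 similarity matrix of one triplet.
    Returns probabilities for the pairs (i, j), (i, k) and (j, k).
    """
    sims = np.array([S[0, 1], S[0, 2], S[1, 2]]) / tau
    sims -= sims.max()            # subtract the maximum for numerical stability
    exp = np.exp(sims)
    return exp / exp.sum()

def hard_alignment_loss(S_triplets, human_pair_idx, tau=1.0):
    """Negative log-likelihood of the human-chosen pairs under the model softmax.

    S_triplets: array of shape (n, 3, 3) with one similarity matrix per triplet.
    human_pair_idx: for each triplet, the index (0, 1 or 2) of the human-chosen
                    pair among [(i, j), (i, k), (j, k)].
    """
    nll = 0.0
    for S, a in zip(S_triplets, human_pair_idx):
        q = pair_probabilities(S, tau)
        nll -= np.log(q[a])
    return nll / len(S_triplets)
```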

Modelling human uncertainties

As each triplet response is a discrete choice, we do not have direct access to the uncertainties of a human participant over the objects in a triplet. Thus, the above loss function optimizes a transform to match the human choice but does not take into account the uncertainties over the three odd-one-out alternatives. However, it is possible to model these uncertainties using variational interpretable concept embeddings (VICE55), a recently proposed, approximate Bayesian inference method for learning an interpretable object concept space from human similarity judgements. VICE has shown remarkable performance in predicting the (dis-)agreement in human similarity judgements for multiple similarity judgement datasets, including THINGS55.

We train a VICE model on the official THINGS train triplet dataset using the (default) hyperparameters recommended by the authors. To capture the uncertainties in human triplet responses, VICE learns a mean, \(\mu\in \mathbb{R}^{m\times d}\), and a variance, \(\sigma\in \mathbb{R}^{m\times d}\), for each of the m object images and each of the d object dimensions. The set of VICE parameters is therefore defined as \(\theta=\{\mu,\sigma\}\). VICE uses the reparameterization trick56,57 to generate an embedding matrix \(\mathbf{Y}\in \mathbb{R}^{m\times d}\), \(\mathbf{Y}_{\theta,\varepsilon}=\mu+\sigma\odot\varepsilon\), where \(\varepsilon\in \mathbb{R}^{m\times d}\) is entrywise N(0, 1) and ⊙ denotes the Hadamard (element-wise) product.

After convergence, we can use a VICE model to obtain a posterior probability distribution for each triplet in the data. We approximate the probability distribution using a Monte Carlo estimate58,59,60 from R samples \(\mathbf{Y}^{(r)}=\mathbf{Y}_{\hat{\theta},\varepsilon^{(r)}}\) for r = 1, …, R, yielding

$$\hat{p}(\{y_s,z_s\}\mid \{i_s,j_s,k_s\}):=\underbrace{\frac{1}{R}\sum_{r=1}^{R}p(\{y_s,z_s\}\mid \{i_s,j_s,k_s\},\mathbf{Y}^{(r)})}_{\text{Monte-Carlo estimate}}$$

where we set R = 50 because we found it to yield the best predictive performance on the official THINGS validation set. This gives a representative probability estimate for each of the three pairs in a triplet to be selected as the most similar pair.
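A hedged sketch of this Monte Carlo estimate, assuming the VICE posterior parameters (μ, σ) have already been fitted; the NumPy formulation and all names below are our own.

```python
import numpy as np

def vice_triplet_probabilities(mu, sigma, triplet, R=50, tau=1.0, seed=0):
    """Monte Carlo estimate of the pair-choice distribution for one triplet.

    mu, sigma: VICE posterior location and scale parameters, shape (m, d).
    triplet:   object indices (i, j, k).
    Returns the probabilities of the pairs (i, j), (i, k) and (j, k), averaged
    over R reparameterized embedding samples.
    """
    rng = np.random.default_rng(seed)
    i, j, k = triplet
    probs = np.zeros(3)
    for _ in range(R):
        eps = rng.standard_normal(mu.shape)
        Y = mu + sigma * eps                # reparameterization trick (element-wise scale)
        y_i, y_j, y_k = Y[i], Y[j], Y[k]
        sims = np.array([y_i @ y_j, y_i @ y_k, y_j @ y_k]) / tau
        sims -= sims.max()
        e = np.exp(sims)
        probs += e / e.sum()
    return probs / R
```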

Soft-alignment loss

Using the posterior probability estimates obtained from VICE, we transform the original THINGS triplet dataset of discrete triplet choices into a triplet dataset of probability distributions that reflect the human uncertainties over the triplet alternatives. Let \(D^{\dagger}:=\{p_s^{\ast}(\{i_s,j_s,k_s\})\}_{s=1}^{n}\) be the transformed triplet dataset, where

$$p_s^{\ast}(\{i_s,j_s,k_s\}):=\hat{p}(\{y_s,z_s\}\mid \{i_s,j_s,k_s\})\quad\forall\,\{y,z\}\subset \{i,j,k\}.$$

Now, for n triplet responses we can define the negative log-likelihood for the soft alignment loss as

$$L_{\text{soft-align}}(\mathbf{S}^{\prime}):=\frac{1}{n}\sum_{s=1}^{n}\left[p_s^{\ast}(\{i_s,j_s,k_s\})\log p_s^{\ast}(\{i_s,j_s,k_s\})-p_s^{\ast}(\{i_s,j_s,k_s\})\log q_s(\{i_s,j_s,k_s\},\mathbf{S}^{\prime})\right]$$

(3)

where \(q_s(\{i_s,j_s,k_s\},\mathbf{S}):=q(\{y_s,z_s\}\mid \{i_s,j_s,k_s\},\mathbf{S})\ \forall\,\{y,z\}\subset \{i,j,k\}\).
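In code, equation (3) is the KL divergence between the VICE pair distribution p* and the model pair distribution q, averaged over triplets; a minimal NumPy sketch with illustrative names:

```python
import numpy as np

def soft_alignment_loss(p_star, q, eps=1e-12):
    """Soft-alignment loss of equation (3): mean KL divergence between the
    human-derived pair distributions and the model pair distributions.

    p_star: array of shape (n, 3) with VICE probabilities for each triplet's pairs.
    q:      array of shape (n, 3) with model probabilities from the transformed similarities.
    """
    p_safe = np.clip(p_star, eps, 1.0)   # avoid log(0)
    q_safe = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p_star * np.log(p_safe) - p_star * np.log(q_safe), axis=1)))
```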

Uncertainty distillation

We mainly follow the optimization process introduced in ref. 61. However, we modify their approach by injecting uncertainty measures about human odd-one-out responses into the representation space of the teacher, using a recent approximate Bayesian inference method for learning object concepts from human behaviour55. Thus, we replace the negative log-likelihood of the discrete human odd-one-out choices—which we refer to as hard alignment—with the negative log-likelihood of the probabilities for the pairwise triplet similarities obtained from the Bayesian inference model—referred to as soft alignment. The final objective for learning the uncertainty distillation transformation is thus defined as

$$\arg\min_{\mathbf{W},\mathbf{b}}\ L_{\text{soft-align}}(\mathbf{X},\mathbf{W},\mathbf{b})+\lambda\left\Vert \mathbf{W}-\Bigl(\sum_{j=1}^{p}W_{jj}/p\Bigr)\mathbf{I}\right\Vert_{F}^{2},$$

(4)

where \(\mathbf{I}\in \mathbb{R}^{p\times p}\) is the identity matrix and \(\Vert\cdot\Vert_{F}^{2}\) denotes the squared Frobenius norm. The second term of the above objective is an ℓ2-regularization whose aim is to preserve the nearest-neighbour information (or, equivalently, the local similarity structure) of the pretrained representations while learning an affine transformation into the THINGS human object similarity space. The objective is minimized using standard stochastic gradient descent.
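A simplified sketch of this optimization in PyTorch (the framework, hyperparameter values and names are our assumptions); the constant entropy term of p* is dropped because it does not affect the gradients with respect to W and b.

```python
import torch

def fit_uncertainty_distillation(X, p_star, triplets, lam=1e-2, lr=1e-3, steps=1000, tau=1.0):
    """Learn the affine transform (W, b) of equation (4) with plain SGD.

    X:        tensor of shape (m, p) with frozen teacher representations.
    p_star:   tensor of shape (n, 3) with VICE pair probabilities per triplet.
    triplets: long tensor of shape (n, 3) with object indices (i, j, k).
    """
    m, p = X.shape
    W = torch.eye(p, requires_grad=True)
    b = torch.zeros(p, requires_grad=True)
    opt = torch.optim.SGD([W, b], lr=lr)
    for _ in range(steps):
        Xt = X @ W.T + b                                   # transformed representations
        xi, xj, xk = Xt[triplets[:, 0]], Xt[triplets[:, 1]], Xt[triplets[:, 2]]
        sims = torch.stack([(xi * xj).sum(-1),
                            (xi * xk).sum(-1),
                            (xj * xk).sum(-1)], dim=-1) / tau
        log_q = torch.log_softmax(sims, dim=-1)
        loss = -(p_star * log_q).sum(-1).mean()            # cross-entropy part of L_soft-align
        reg = lam * torch.norm(W - torch.diagonal(W).mean() * torch.eye(p)) ** 2
        (loss + reg).backward()
        opt.step()
        opt.zero_grad()
    return W.detach(), b.detach()
```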

Although this expression is similar to the global transform defined in ref. 61, we find it to yield equally strong downstream task performance as the gLocal transform proposed in ref. 61 while predicting human uncertainties better than the global transform. There appears to be barely any trade-off between representational alignment and downstream task performance when using uncertainty distillation, whereas ref. 61 found that the global transform yields slightly better human alignment but worse downstream task performance compared with the gLocal transform. We therefore use the uncertainty distillation transformation to generate human-like similarity judgements by transforming a model's representation space.

Data generation

In this section, we describe the AligNet data-generation process. We start by introducing the data that we use for constructing the triplets. We continue with a detailed description of the different sampling strategies that we consider in our analyses. Finally, we explain how we collect model responses using transformed representations and define the objective function for fine-tuning models on AligNet.

Image data

For creating AligNet, we use the publicly available ImageNet database38. ImageNet is a natural image dataset with approximately 1.3 million training data points and 1,000 image categories28. The categories are almost equally distributed in the data, with small variations in the number of images between the different classes. Hence, ImageNet can be considered a highly balanced dataset. ImageNet was the dominant image dataset for training large computer vision models until the advent of image/text multimodal training a few years ago. Although larger image datasets now exist, ImageNet is still one of the largest open-source and most widely used image datasets in the field of computer vision.

Triplet sampling

For generating triplets of images, we use three different sampling strategies: random, class-boundary and cluster-boundary sampling. Let m′ be the number of images in the data, where m′ = 1,281,167, and C be the number of classes, with C = 1,000. Let \(D_{\text{image}}:=\{(x_i,y_i)\}_{i=1}^{m^{\prime}}\) be the ImageNet dataset of m′ image–label pairs.

Random

Uniform random sampling is the vanilla sampling approach that was used to create the THINGS dataset (see above). In random sampling, three images are chosen uniformly at random without replacement from all of the m′ images in the data to create a triplet. As there are C = 1,000 classes and each class has approximately 1,000 images, most of the triplets generated with this approach contain three images from three different classes; the number of triplets that do not is negligible. It is noted that this is the same sampling approach that was used to generate the THINGS triplets54. A triplet generated via random sampling can be defined as the set \(\mathbf{S}:=\{x_i,x_j,x_k\}\) with the constraint \(x_i\ne x_j\ne x_k\).

Class boundary

Another way to sample image triplets is to exploit the label information associated with each data point. Instead of three random images from three distinct classes, class-boundary triplets contain two images from the same class and one image from a different class. This is similar to the approach introduced in ref. 62, where each odd-k-out set of images contains a majority class and k odd class singletons. This sampling approach allows models to learn class boundaries similar to the standard supervised learning setting. A triplet generated via class-boundary sampling can be defined as the set \(\mathbf{S}:=\{x_i,x_j,x_k\}\) with the constraint \((y_i=y_j\ne y_k)\vee(y_i\ne y_j=y_k)\vee(y_i=y_k\ne y_j)\), where the labels are used for data partitioning.

Cluster boundary

As we want to introduce a general approach that does not rely on label information, we use a third sampling strategy that is, in principle, similar to the class-boundary approach but does not require labels. Let \(\mathbf{Z}\in \mathbb{R}^{m^{\prime}\times p}\) be the stacked representations of a neural network model for every image in \(D_{\text{image}}\). The representations can essentially be computed for any layer of a model. Here we use the image encoder for image/text models and the CLS token representation of the penultimate layer for any other model (as we only use ViT-based models). We then apply k-means clustering to the encoded image representations Z and to \(\mathbf{Z}^{\prime}:=(\mathbf{W}\mathbf{Z}^{\top}+(\mathbf{b}_1,\ldots,\mathbf{b}_{m^{\prime}}))^{\top}\), respectively (where the transformation variables W and b are computed via uncertainty distillation optimization using equation (4)), partitioning them into c representation clusters, where c can be regarded as analogous to C, the number of labels in the original dataset. We use the elbow criterion to select c. For all of our main experiments, we set c = 500. Hence, the ImageNet dataset is transformed into a dataset of image and cluster pairs. After the clustering, we apply the same sampling method as for class-boundary triplets: for each triplet, we choose uniformly at random two images without replacement from one cluster and one image from a different cluster. Thus, a triplet generated via cluster-boundary sampling can be defined as the set \(\mathbf{S}:=\{x_i,x_j,x_k\}\) with the constraint \((y_i=y_j\ne y_k)\vee(y_i\ne y_j=y_k)\vee(y_i=y_k\ne y_j)\), where instead of the original labels we use the cluster labels for partitioning the data.
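A minimal sketch of cluster-boundary sampling, using scikit-learn's k-means as one possible implementation choice (the library and all names are our assumptions); class-boundary sampling is obtained by replacing the cluster assignments with the original ImageNet labels.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_boundary_triplets(Z, n_triplets, c=500, seed=0):
    """Sample triplets with two images from one k-means cluster and one from another.

    Z: array of shape (m', p) with (optionally human-aligned) image representations.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=c, random_state=seed).fit_predict(Z)   # cluster pseudo-labels
    members = {cl: np.where(labels == cl)[0] for cl in range(c)}
    triplets = []
    while len(triplets) < n_triplets:
        a, b = rng.choice(c, size=2, replace=False)       # two distinct clusters
        if len(members[a]) < 2 or len(members[b]) < 1:
            continue
        i, j = rng.choice(members[a], size=2, replace=False)
        k = rng.choice(members[b])
        triplets.append((i, j, k))
    return np.asarray(triplets)
```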

Triplet-response generation

We use the responses of a surrogate teacher model (see below) to simulate a dataset of human-aligned triplet odd-one-out responses. More formally, let \(D_{\text{triplets}}:=\{\{x_i,x_j,x_k\}\}_{s=1}^{n^{\prime}}\) be the dataset of sampled ImageNet triplets for which we want to collect responses using transformed model representations. It is noted that we can sample an arbitrary number of triplets (upper-bounded by the binomial coefficient \(\binom{m^{\prime}}{k}\) with k = 3) and can thus set n′ to essentially any natural number. For the experiments that we report in the main text, we set n′ = 10^7 because we found a larger n′ not to yield any downstream task improvements. For now, we regard our surrogate model as a black-box model with transformed ImageNet representations \(\mathbf{Z}^{\prime}:=(\mathbf{W}\mathbf{Z}^{\top}+(\mathbf{b}_1,\ldots,\mathbf{b}_{m^{\prime}}))^{\top}\in \mathbb{R}^{m^{\prime}\times p}\), where the affine transformation was found via uncertainty distillation optimization (equation (4)). Given transformed representations z1′, z2′ and z3′ of the three images in a triplet, we can construct a similarity matrix \(\mathbf{S}^{\prime}\in \mathbb{R}^{3\times 3}\), where \(S^{\prime}_{i,j}:=\mathbf{z}_i^{\prime\top}\mathbf{z}_j^{\prime}\) is the dot product between a pair of representations. Similarly to how we do this for learning the uncertainty distillation transformation (see above), we identify the closest pair of images in a triplet as \(\arg\max_{i,j>i}S^{\prime}_{i,j}\), with the remaining image being the odd one out. Let \(D_{\text{align}}:=\{\{x_a,x_b\}_s\}_{s=1}^{n^{\prime}}\) then constitute the final AligNet dataset of ImageNet triplets and corresponding model responses, where \(\{x_a,x_b\}\subset \{x_i,x_j,x_k\}\) and \(\{x_a,x_b\}\) is the image pair that was chosen by the transformed model representations to have the highest pairwise similarity. Owing to the uncertainty distillation transformation, the model choices are the closest approximation to the human choices.

It is noted that the dataset includes not only the discrete model choices but also the exact relationships among all pairwise similarities in a triplet obtained from the probability space of the teacher model. Thus, we have access to soft distributions over the labels for use in distillation.
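A sketch of this pseudo-labelling step, assuming the transformed representations have already been computed; shapes and names are illustrative.

```python
import numpy as np

def pseudo_label_triplets(Z_aligned, triplets):
    """Label sampled triplets with the teacher's most-similar pair and the raw
    pairwise similarities that serve as soft labels for distillation.

    Z_aligned: array of shape (m', p) with affine-transformed (human-aligned) representations.
    triplets:  integer array of shape (n', 3) with image indices (i, j, k).
    Returns, per triplet, the index (0, 1 or 2) of the chosen pair among
    [(i, j), (i, k), (j, k)] and the corresponding pairwise similarities.
    """
    zi = Z_aligned[triplets[:, 0]]
    zj = Z_aligned[triplets[:, 1]]
    zk = Z_aligned[triplets[:, 2]]
    sims = np.stack([(zi * zj).sum(-1),
                     (zi * zk).sum(-1),
                     (zj * zk).sum(-1)], axis=-1)
    return sims.argmax(axis=-1), sims
```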

Objective function

Let fθ be a neural network function parameterized by θ, the set of its weights and biases. For every input image x, the function yields a representation fθ(x) = z. Here, z refers to the image encoder representation for image/text models or the CLS token representation before the final linear layer for other model types. From the representations of the three images in a triplet, we can again construct a similarity matrix \(\mathbf{S}^{\dagger}\in \mathbb{R}^{3\times 3}\), where \(S^{\dagger}_{i,j}:=\mathbf{z}_i^{\top}\mathbf{z}_j\) is the dot product between a pair of image representations. The AligNet loss function is defined as the following KL divergence between teacher and student triplet probabilities,

$$L_{\text{alignet}}(\mathbf{S}^{\prime},\mathbf{S}^{\dagger}):=\frac{1}{B}\sum_{s=1}^{B}\left[\sigma([S^{\prime}_{i,j},S^{\prime}_{i,k},S^{\prime}_{j,k}],\tau^{\prime})_s\log\sigma([S^{\prime}_{i,j},S^{\prime}_{i,k},S^{\prime}_{j,k}],\tau^{\prime})_s-\sigma([S^{\prime}_{i,j},S^{\prime}_{i,k},S^{\prime}_{j,k}],\tau^{\prime})_s\log\sigma([S^{\dagger}_{i,j},S^{\dagger}_{i,k},S^{\dagger}_{j,k}],\tau^{\dagger})_s\right],$$

(5)

where τ′ = 1 and τ† > 1 and B is the batch size. We find τ† via grid search and set it to τ† = 100 for all of our experiments. Recall that σ is a softmax function that models the probabilities over the three image similarity pairs (equation (1)). The final AligNet objective is defined as the following minimization problem

$$\arg\min_{\theta}\ L_{\text{alignet}}(f_{\theta})+\lambda\Vert\theta^{\ast}-\theta^{\dagger}\Vert_{2}^{2}$$

(6)

where θ* are the parameters of the pretrained base student model and θ† are the parameters of the fine-tuned student model. This ℓ2-regularization, which we refer to as weight decay to initialization, encourages the fine-tuned set of parameters to stay close to its base during training. It is similar to the regularization used for learning the uncertainty distillation transformation (equation (4)) but adapted to the set of all model parameters rather than a linear transform.
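A sketch of equations (5) and (6) in PyTorch (our framework choice); the temperature values follow the text above, whereas the helper names and the explicit weight-decay-to-initialization term are illustrative.

```python
import torch
import torch.nn.functional as F

def alignet_loss(student_sims, teacher_sims, tau_student=100.0, tau_teacher=1.0):
    """KL divergence between teacher and student pair distributions (equation (5)).

    student_sims, teacher_sims: tensors of shape (B, 3) holding the pairwise
    similarities [S_ij, S_ik, S_jk] of each triplet in the batch.
    """
    p_teacher = F.softmax(teacher_sims / tau_teacher, dim=-1)          # tau' = 1
    log_q_student = F.log_softmax(student_sims / tau_student, dim=-1)  # tau_dagger = 100
    return F.kl_div(log_q_student, p_teacher, reduction='batchmean')

def weight_decay_to_init(model, init_params, lam):
    """L2 penalty of equation (6) keeping the fine-tuned parameters close to the
    pretrained base parameters (a list of tensors captured before fine-tuning)."""
    return lam * sum(((p - p0) ** 2).sum()
                     for p, p0 in zip(model.parameters(), init_params))
```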

Surrogate teacher model

Reference 25 showed that image/text models and models trained on large, diverse datasets are better aligned with human similarity judgements than vision models trained with a self-supervised learning objective or supervised models trained on ImageNet. Thus, we use the best-performing image/text model according to various computer vision benchmarks at the time of writing this paper, SigLIP63, as our teacher model. SigLIP, similar to contrastive language–image pretraining (CLIP)64 and ALIGN65, is trained via contrastive language–image pretraining on millions of image and text pairs. The difference between CLIP and SigLIP is that the latter uses a paired sigmoid loss instead of the standard softmax function usually used for pretraining image/text models via cross-entropy. Image and text pretraining allows the model to learn an aligned representation space for images and text, thereby adding more semantic information about the objects in an image to the model representations.

We use the SigLIP-So400m variant of SigLIP as our teacher model. This variant uses an optimized ViT backbone whose performance is similar to that of one of the largest ViTs, ViT-g/1466, while having substantially fewer parameters. The number of parameters of SoViT-400m/14 lies between that of ViT-L/16 and that of ViT-g/14. The output dimensionality of the image and text encoder representations of SoViT-400m/14 is p = 1,152 each. We align the image encoder representations with human odd-one-out choices using the uncertainty distillation optimization outlined in equation (4). This increases the triplet odd-one-out accuracy of SigLIP-So400m from 44.24% to 61.7% (rightmost column in Supplementary Table 1), which is close to the human noise ceiling of 66.67% for THINGS (compare ref. 54) and thus among the best human-aligned models without AligNet fine-tuning (compare ref. 25). It is noted that this is a relative increase in performance of 39.47%. Throughout this paper, we use the human-aligned version of SigLIP-So400m as the surrogate teacher model for generating human-aligned similarity judgements and distilling human-like similarity structure into student vision foundation models (VFMs). We select a diverse and representative set of student VFMs.

Student models

As previous research has demonstrated that a model's architecture has no significant impact on the degree of alignment with human similarity judgements25,61, we use the same architecture for all student models that we fine-tune on AligNet. Specifically, we use the ViT8 as the backbone of each student model. We use the ViT rather than a convolutional-neural-network-based model because ViTs have recently emerged as the dominant neural network architecture for computer vision applications and VFMs. Every large VFM used in practice is based on the ViT30,63,67,68. Unless otherwise mentioned, we use the base model size, that is, ViT-B. ViT-B has 12 attention layers and an internal (hidden) representation size of p = 768. It has been shown that both the training data and the objective function have a substantial impact on the degree of alignment with human behaviour. Thus, we use student models that were trained on different pretraining tasks with different training data and objective functions.

Supervised pretraining is still the prevailing mode of training computer vision models. Therefore, we trained ViT-B on the popular ImageNet dataset consisting of 1.4 million natural images38. To examine how model performance changes as a function of the model size, we train ViT instances of three different sizes on ImageNet: ViT-S/16, ViT-B/16 and ViT-L/16. The image patch size is the same for each of those models. To evaluate the effect of AligNet on self-supervised pretraining, we use pretrained DINOv169 and DINOv229 models, of which DINOv1 was pretrained on ImageNet and DINOv2 on a different, larger image dataset, as denoted below. In addition, we investigate multimodal training of vision models that adds textual information, in the form of image captioning via the CapPa model30 and contrastive language–image pretraining via SigLIP63. The latter model is considered state of the art for many downstream computer vision applications and is used as the image embedding model in modern large vision–language models67,68. The full list of student models that we consider in our analyses is as follows:

  • ViT-S,B,L

    • Training data: ImageNet38

    • Objective: supervised learning

  • SigLIP (ViT-B,SO400M)

    • Training data: WebLI63

    • Objective: CLIP64

  • DINOv1 (ViT-B)

    • Training data: ImageNet38

    • Objective: self-supervised image pretraining69

  • DINOv2 (ViT-B)

    • Training data: DINOv2 data (see ref. 29 for details)

    • Objective: self-supervised teacher-student distillation29

  • CapPa (ViT-B)

    • Training data: JFT-3B (Google proprietary dataset)

    • Objective: multimodal image captioning30

Representational similarity analysis

Representational similarity analysis is a well-established method for comparing neural network representations (extracted from an arbitrary layer of the model) to representations obtained from human behaviour44. In representational similarity analysis, one first obtains representational similarity matrices (RSMs) for the human behavioural judgements and for the neural network representations (more specific details can be found in the Supplementary Information). These RSMs measure the similarity between pairs of examples according to each source. As in previous work23,25,54,61,71, we flatten the upper triangle of the human and model RSMs, respectively, and quantify their similarity using the Spearman rank correlation coefficient. In contrast to the Pearson correlation, the Spearman rank correlation is invariant to monotonic transformations of the similarities and is thus better suited to measuring the similarity of judgements obtained from different sources.

Multi-arrangement task

Human similarity judgements for refs. 23,71 were obtained by using a multi-arrangement task. In a multi-arrangement task, participants are presented with a computer screen showing images of several different objects. The participants are asked to arrange the images into semantically meaningful clusters, given the instruction that images of objects that lie close together are considered more similar. From this arrangement, one can infer pairwise (dis-)similarities of the objects and average those across all participants to obtain a representative (dis-)similarity matrix.

Likert scale

In refs. 24,72, pairwise similarity judgements were obtained by asking human participants to rate the similarity of pairs of objects on an ordinal scale that ranges from 0 (‘not similar at all’) to 10 (‘very similar’). The pairwise similarity ratings can be averaged across the different participants, which in turn yields a matrix of similarities between pairs of objects.

Neural network representations

RSMs for neural network representations are obtained by first embedding the same set of images that were presented to the human participants in the p-dimensional latent space of a model. The latent space could be any layer of a neural network. For the base models, we use the representations of the image encoder for SigLIP and the CLS token of the penultimate layer for CapPa, DINOv2 and ViT-B. We do this because previous work has shown that the image encoder space of image/text models and the penultimate layer of other models yield the highest similarity to human behaviour24,25,73. After embedding the images into the neural network's latent space, we obtain a representation matrix \(\mathbf{X}\in \mathbb{R}^{n\times p}\) for the n images in the data. Instead of simply computing the dot-product similarity matrix \(\mathbf{S}:=\mathbf{X}\mathbf{X}^{\top}\), in representational similarity analysis one typically uses either a cosine similarity or a Pearson correlation kernel to compute the affinity matrix

$$\cos(\mathbf{x}_i,\mathbf{x}_j):=\frac{\mathbf{x}_i^{\top}\mathbf{x}_j}{\Vert\mathbf{x}_i\Vert_2\Vert\mathbf{x}_j\Vert_2};\qquad \phi(\mathbf{x}_i,\mathbf{x}_j):=\frac{(\mathbf{x}_i-\bar{\mathbf{x}}_i)^{\top}(\mathbf{x}_j-\bar{\mathbf{x}}_j)}{\Vert\mathbf{x}_i-\bar{\mathbf{x}}_i\Vert_2\Vert\mathbf{x}_j-\bar{\mathbf{x}}_j\Vert_2},$$

where the cosine similarity kernel function cos(xi, xj) or the Pearson correlation kernel function ϕ(xi, xj) is applied to every (xi, xj) vector pair of the matrix X to obtain the final RSM \(\mathbf{S}^{\prime}\in \mathbb{R}^{n\times n}\). Here we use the Pearson correlation kernel function ϕ(xi, xj) to obtain a neural network's RSM. Pearson correlation is the centred version of cosine similarity: the ranking of the obtained similarities does not differ between the two kernel functions, but Pearson correlation first centres the vectors to have zero mean and is therefore a more robust measure. For obtaining RSMs with transformed representations, the transforms are first applied to X before computing S′.
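A minimal sketch of the RSM construction with the Pearson correlation kernel and the subsequent Spearman comparison; NumPy and SciPy are our library choices, and the names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def pearson_rsm(X):
    """RSM with the Pearson correlation kernel; rows of X are per-image representations."""
    Xc = X - X.mean(axis=1, keepdims=True)             # centre each representation vector
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)    # normalize to unit length
    return Xc @ Xc.T

def rsa_spearman(rsm_model, rsm_human):
    """Spearman rank correlation between the upper triangles of two RSMs."""
    iu = np.triu_indices_from(rsm_model, k=1)
    rho, _ = spearmanr(rsm_model[iu], rsm_human[iu])
    return rho
```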

Alignment with conceptual hierarchy

When analysing alignment with the conceptual hierarchy, we use the original ImageNet category labels for the images38. ImageNet is structured by the WordNet hierarchy, from which we extract basic and superordinate categories consistent with previous cognitive work. Within and across categories, we measure the change in representation distance relative to other changes (by z-scoring across all representation distances for the given model checkpoint), because relative distances are more meaningful than absolute ones (for example, scaling all representations by two would change absolute distances but not relative ones), and absolute scales of all representations tend to increase during training. We quantify changes with mixed-effects linear regressions that account for the non-independence of representational changes across the different clusters (see Supplementary Information section 3.2 for details).
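A small sketch of the relative (z-scored) distance measure; the Euclidean distance metric and all names below are our assumptions, and the mixed-effects regressions themselves are not shown.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def relative_category_distances(X, labels):
    """z-score all pairwise representation distances for one model checkpoint and
    return the mean relative distance within and across categories."""
    labels = np.asarray(labels)
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    iu = np.triu_indices_from(D, k=1)
    d = D[iu]
    z = (d - d.mean()) / d.std()              # relative (z-scored) distances
    same = labels[iu[0]] == labels[iu[1]]
    return z[same].mean(), z[~same].mean()
```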

Levels data

We collected a new multi-level similarity judgement dataset from N = 473 human participants, which we named Levels. The dataset contains odd-one-out judgements on three different types of triplet: coarse-grained semantic, which requires deciding on the odd one out among broadly different categories; fine-grained semantic, which involves discerning subtle within-category distinctions; and class boundary, which tests for category-boundary detection. Consistent selection of the same odd-one-out image (for example, i) by multiple participants indicated that the remaining two images (for example, j and k) were closer to each other in the participants' concept space than either was to the odd one out (see Supplementary Information for details about the data collection). Levels allowed us to evaluate model–human alignment for the same set of stimuli at various levels of abstraction, and to assess how well the models capture the inherent uncertainty in human judgements, inferred from response latencies.

Participants

We recruited N = 508 participants (209 female, 289 male, 3 diverse, N = 7 missing demographic information owing to revocation of study consent; mean age 31.75 ± s.d. = 8.04 years) online via Prolific Academic (https://www.prolific.ac). The eligibility criteria were that participants had to be between 18 and 50 years old, fluent in English, have normal or corrected-to-normal vision, no colourblindness, and have a minimum approval rating of 95% on Prolific. Participants provided informed consent before starting the experiment. The experiment lasted approximately 45 minutes. Participants were reimbursed with £7.70 for completing the experiment and received an additional bonus payment of £0.77. Partial payments were made if the experiment was not completed owing to technical issues (N = 6) or early termination by the participant (N = 1). Participants performing below 90% correct on catch trials (N = 19, 3 female, 16 male), or failing to respond in the allotted time window (15 s) in more than 10 trials (N = 9, 4 female, 4 male, 1 diverse) were excluded. Thus, N = 473 participants remained in the dataset (202 female, 269 male, 2 diverse; mean age 31.82 ± s.d. = 8.03 years). Of these participants, N = 448 were each tested with a different selection of triplets, while ensuring that each triplet was presented N = 5 times across the entire sample of participants (see information on stimuli sampling below). Owing to a server glitch during trial assignment, the remaining N = 25 participants shared their exact triplet selection with one other participant in the sample. These N = 25 participants were excluded from the response times and uncertainty estimation (see ‘Alignment at multiple levels of abstraction’ section) to restrict analysis to participants with different sets of triplets. The experiment was approved by the internal review board of the Max Planck Institute for Human Development.

Stimuli

The experimental stimuli were images taken from the ImageNet dataset38. Another nine images were used for instructions only and depicted natural objects selected from the Bank of Standardized Stimuli (BOSS)74, available at https://drive.google.com/drive/folders/1FpnEFkbqe_huRwfsCf7gs5R1zuc1ZOkn. We grouped the visual stimuli presented in the triplets according to different levels of abstraction: coarse-grained semantic, which comprised three images from three different categories; fine-grained semantic, showing three images from the same category; and class boundary, where two images were from the same and one from a different category.

Instead of randomly sampling triplets (which would reproduce dataset biases), we stratified sampling by superclasses. ImageNet classes follow the WordNet hierarchy28,38, which includes higher-level classes. For instance, all dog breeds can be summarized under the dog superclass. To avoid presenting dogs, birds and other fine-grained classes that are overrepresented in ImageNet more frequently to the participants than other categories, we grouped the ImageNet classes into 717 coarse-grained WordNet superclasses. We sampled images uniformly at random from those 717 superclasses to construct the different kinds of triplets. It is noted that for all superclasses with more than one class, we chose one subclass uniformly at random and then sampled one image, two images (without replacement) or three images (without replacement) uniformly at random from that subclass, depending on the triplet type. For most superclasses, which comprised a single subclass only (that is, a one-to-one mapping), we could skip the subclass sampling step. Triplet sampling resulted in N = 450 predefined experiment trial sets, of which N = 448 were used for testing. Across these, each triplet was presented within N = 5 different experiment files. This sampling process ensured a balanced distribution of triplets across the sample, and the repetition of each triplet for five different participants allowed for the calculation of an uncertainty distribution for each triplet.

The triplet odd-one-out task

On each trial, participants were presented with a triplet of images (i, j, k). Participants were asked to select the image that was the most different from the other two, that is, the odd one out. During the instructions, participants saw different triplets with increasing ambiguity regarding which image would likely be picked as the odd one out. Participants were given explanations for potential odd-one-out choices, clarifying that decisions could be based on different criteria, such as semantic or perceptual features of the shown images.

Procedure

The experiment was run online using jsPsych v7.3.3 (www.jspsych.org/7.3/) and custom plugins. Participants were asked to provide demographic information, including their age and gender. Thereafter, they viewed written instructions about the task and performed six practice trials (two trials per triplet level of abstraction). Participants were free to repeat the instructions until they felt confident to perform the experiment. The experiment proper comprised N = 330 experiment trials. Each trial started with a fixation cross (1 s), followed by the presentation of a triplet (maximum 15 s). Participants were asked to select the ‘odd one out’ using the right-, left- or downwards-facing arrow keys on their keyboard. Responses could be entered between 1 s and 15 s after triplet onset, after which the next trial started. Trials in which participants failed to submit a response were rare (M = 0.27% of trials; minimum 0.00%, maximum 6.06%). The serial order of triplet types (for example, fine-grained or coarse-grained semantic) and ImageNet classes (for example, dogs or birds) was counterbalanced across the experiment. We additionally counterbalanced the serial position of trial types across participants using a Latin-square design75. Participants could take short breaks (self-paced for up to 2 min) after N = 50, 150 and 200 experiment trials. Experimental trials were interleaved with N = 16 catch trials (class-border triplets), which were predefined based on low model uncertainty and 100% agreement among participants on these specific triplets during piloting. Catch trial performance was used as an indicator of adequate task engagement (see participant inclusion criteria above).

Preprocessing of human response times and uncertainty estimation

Descriptive statistics on response times and uncertainty estimation (see ‘Alignment at multiple levels of abstraction’ section) were calculated based on participants with unique experimental trial sets (N = 448). The response-time data were log transformed (log(RT)), in accordance with current best practices for response-time analysis. Trials with response times longer than 10 s were excluded from analysis (on average M = 2.64% of trials per participant). As responses could be given no earlier than 1 s after triplet onset (see ‘Procedure’ above), no lower bound was set for response-time exclusion. To estimate uncertainty (in terms of the level of (dis-)agreement among observers) for each triplet, we used the discrete (Shannon) entropy of the response distribution across participants.
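For illustration, the per-triplet uncertainty can be computed as the Shannon entropy of the odd-one-out response counts; the log base (bits) used in this sketch is our assumption.

```python
import numpy as np

def triplet_entropy(counts):
    """Discrete Shannon entropy of the odd-one-out response distribution of a triplet.

    counts: how often each of the three images was chosen as the odd one out,
            for example (3, 1, 1) for five participants.
    """
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())     # entropy in bits (base-2 log is our choice)

print(triplet_entropy((5, 0, 0)))  # 0.0 bits: full agreement
print(triplet_entropy((3, 1, 1)))  # ~1.37 bits: substantial disagreement
```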

Human-to-human alignment

We computed the human noise ceiling for each abstraction setting in Levels using a leave-one-out cross-validation approach. In leave-one-out, the agreement level for a triplet is computed as the average match rate between a held-out participant's response and the majority response of the remaining population. Thus, for a triplet that was presented to five participants, on each leave-one-out iteration one participant's response is held out and the remaining four comprise the population. The human-to-human reliability score is then calculated as the average agreement level across all triplets in the dataset.
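A sketch of the leave-one-out computation for triplets with five responses each; ties in the majority vote are broken arbitrarily here, which is our simplification.

```python
import numpy as np

def human_noise_ceiling(responses):
    """Leave-one-out human-to-human reliability.

    responses: integer array of shape (n_triplets, 5) with each participant's
               odd-one-out choice (0, 1 or 2) for the five repetitions of a triplet.
    """
    matches = []
    for r in responses:
        for held_out in range(len(r)):
            rest = np.delete(r, held_out)
            majority = np.bincount(rest, minlength=3).argmax()   # majority of the remaining four
            matches.append(r[held_out] == majority)
    return float(np.mean(matches))
```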

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
