Mapping the neuronal building blocks of human language with language models

June 18, 2026

24

Single-neuronal recordings

Participants

All aspects of the study were approved and performed in strict accordance with guidelines set by the Massachusetts General Hospital Institutional Review Board (IRB) and Harvard Medical School for the ethical conduct of research. All of the participants included in the study were scheduled to undergo epilepsy monitoring^30,33,34,70. The decision for surgery was made by a multidisciplinary team that commonly included neurologists, neurosurgeons, radiologists and neuropsychologists. Operative planning was made independently by the surgical team and without consideration of study participation. Once the decision to undergo surgery was made, the participant’s research candidacy was reviewed. The participants were enrolled only if the surgical plan was for cortical surface grid recordings for seizure monitoring. The patients were aged at least 18 years, and they had intact language function and were able to provide informed consent. A total of eight participants was included in the study (3 female and 5 male, mean age of 40 years; range, 27 to 52 years). All of the participants that were consented were free to withdraw from study participation at any time without impact on clinical care.

Microelectrode recordings

For each participant, multi-electrode microarrays (Utah arrays; Blackrock Microsystems) were implanted into an area planned for surgical resection. These micro-arrays (used for research) were placed within the cortex in tandem with the surface cortical grids (used for clinical monitoring) and were confirmed to be positioned in areas planned for resection based on intraoperative cortical mapping, as described previously^30,71,72,73. The micro-arrays are configured in a 10 × 10 electrode channel mapping separated by 400 μm (96 recording electrodes covering an approximate area of 4 × 4 mm). For placement, the electrode arrays were implanted using a semi-automated impactor and were connected to a temporarily implanted female port. A total of eight arrays was implanted into eight participants. On the left hemisphere, two were in the anterior temporal cortex, one was in the posterior temporal cortex and one was in the prefrontal cortex. On the right hemisphere, one was in the anterior temporal cortex, two were in the posterior temporal cortex and one was in the prefrontal cortex. Together, these recordings lay in regions that have been shown to reliably engage in language production and sentence construction^{41,42,43,65,66} and to display robust language-selective responses^{41,42,44,65,74,75,76} and activation by validated language localizers on imaging studies^28,41,45 (Fig. 1a). Each participant had a single array. The voltages were recorded and amplified via a 96-multichannel acquisition processor, and a band-pass filter was applied to the neuronal signals (150Hz to 8 kHz; 1-pole low-cut and 3-pole high-cut with 1,000× gain). The signals were then digitized at 30 kHz and stored offline. Audio recordings were tracked and aligned through a synchronization trigger that was integrated into both the neuronal and audio recording systems (Natus Medical). After clinical monitoring, all arrays were removed.

Single-unit isolation and LFP processing

To select and cluster APs of putative neurons, an automatic waveform detection was commonly performed using Kilosort 3.0 (https://github.com/MouseLand/Kilosort). The selected APs were then manually curated using Phy (https://phy.readthedocs.io/en/latest/, version 2.0b6). Candidate units were clustered based on their voltage waveform morphologies, spiking rates and their distribution profiles of inter-spike intervals, auto-correlation and cross-correlation profiles. We excluded units that did not demonstrate waveform stability over the course of recordings or that did not have inter-spike-interval profiles or waveform morphologies consistent with those of cortical neurons. Units from different sessions were recorded at separate time periods and across different linguistic contexts and were therefore sorted separately^77,78,79. Finally, to confirm the quality of recordings, we used SpikeInterface to extract the measures of the single-unit sorting (v.0.102.3). Here the signal-to-noise ratio was defined as the ratio of the maximum amplitude of the mean spike waveform to the s.d. of the background noise on the same channel. Noise was computed through the median absolute deviation of the peak channel. Isolation distances were used to quantify the separation of putative units from their neighbours in principal component space. Here, after projecting spike waveforms into a lower-dimensional principal component space (across 5 principal components), the isolation distances were calculated using the Mahalanobis distance. Lastly, the inter-spike intervals for each unit were defined by the time intervals between consecutive spikes. Any putative units that had a high percentage of ISIs lower than 2 ms or firing rates lower than 0.1 Hz were excluded. Across all the single units, the mean signal-to-noise ratios were 2.5 ± 0.16 (mean ± s.d.), the mean isolation distances were 37.6 ± 2.7 and the mean peak inter-spike intervals (that is, the mean times that the inter-spike interval distributions peaked) were 8.7 ± 0.4 ms and were largely consistent with previous literature using analogous recording techniques^32,80,81.

LFPs were obtained from the same electrode contacts as our single-neuronal recordings. Here, for each electrode of the Utah array (96 active channels in total per array), the raw voltages were processed through a series of notch filters to eliminate the AC powerline at 60 Hz and its higher-order harmonics. Then, a fourth-order Butterworth low-pass filter was applied with a cut-off frequency of 300 Hz to capture LFPs. Finally, the filtered signals were downsampled to 1 kHz for downstream analysis.

Language production and audio recordings

We tracked the activities of neurons as part of real-time conversations to study their responses during natural language production^{42,43,46,47,48,76,82}. Here the participants responded freely to questions and prompts that were given ad hoc, therefore allowing them to produce phrases and sentences in a manner that reflected natural speech (Extended Data Table 1). The sentences were also constructed de novo (for example, rather than being simply read or repeated) and therefore enabled neuronal responses to be evaluated independently of explicit sensory cues. Thus, the participant may be prompted with a question such as “Do you want a phone right now?”, to which the participant may respond with the compound sentence “Yes, keep it there just in case, and then just bring that back over”. Therefore, in alignment with growing efforts in the field to study natural language^{21,42,43,46,47,48,76,82,83,84,85,86,87}, the sentences produced by the participants varied naturalistically in structure and content, were constructed de novo and were representative of natural language production.

Together, the questions heard by the participants could be divided into 20 main topics and covered wide-ranging fields that included: dates and times, feelings and emotions, spatial locations and orientations, personal descriptions, comfort and reassurances, personal thoughts, conditions and situations, thoughts about specific events, general commentary, health, disagreement and personal conflicts, general activities, opinions, validation and agreement, uncertainty about events, passive questions, active questions, binary questions, visual attributes and social greetings. Quantifying the diversity of question topics across the participants, we also calculated the Shannon entropy of the topics for each participant as: $H(X)=-\sum _x\in Xp(x)\log \,p(x)$, where x labels the topics (n = 20), and p(x) is the frequency of prompts belonging to each category. Given that each participant received an average of 14.1 ± 4.9 topics and with a theoretical maximum entropy of 3.0, using an even distribution across 20 topics, we find that the entropy ranged from 1.4 to 2.3, with an average of 2.0 ± 0.3 (mean ± s.d.), meaning that the topics were distributed broadly across the participants.

To align each word to the AP activities and LFPs, each word was transcribed and timestamped from the audio data in an automated manner using either Vosk API (16 kHz 16 bit mono; vosk-model-en-us-aspire-0.2) or WhisperX (https://github.com/m-bain/whisperX). After automatic transcription, each word and its timestamp were then manually validated and corrected by an independent scriber (Audacity; v.2.3). Words containing participant-identifying information were replaced by a set of common names that were independent of the participants.

Single-neuronal and local field responses to linguistic features

Sentence parsing and linguistic feature labelling

A sentence parsing approach was used to identify linguistic features on a word-by-word level as produced by the participants during natural speech. This approach, based on natural language processing models, allows for the basic linguistic properties of words (for example, their parts of speech) to be reliably labelled and tracked^1,2,3,49,88. Here we used two complementary parsing techniques.

Constituency parsing for each sentence was obtained from the AllenNLP package^49,88. Here the word’s part of speech (for example, adverb) was labelled following the Penn Tree Bank, and similar labels were grouped (Extended Data Table 1b; although note that different parsers have also been shown to display differences in labelling)⁸⁹. The word’s constituent label was determined on the basis of which constituent the word belonged to, and was obtained by traversing the parse tree upward from the word to the next highest hierarchical level. The constituency depth of each word in the constituency tree was determined by the word’s hierarchical distance to the sentence root, and the word’s ordinal position was its order within the sentence. Finally, the closing level was determined by the number of finished phrases after the word was produced (Extended Data Fig. 1a). Words that were not located at the rightmost branch of the parse tree were assigned a closing level of 0, as they did not close any syntactic structures. Levels with a single child node were not counted as an extra level.

Dependency parsing for each sentence was similarly obtained from the AllenNLP package^49,88 that followed the model of Deep Biaffine Attention for Neural Dependency Parsing⁹⁰. Built on the verb of a sentence as the root, the dependency tree grows hierarchically with child nodes showing direct dependency on their parent node. Here the grammatical relationship between words (for example, object of a preposition) was determined based on their dependency. To allow comparison across features, the top 10 most frequent categories for each feature were selected for neuronal analysis. For depth, closing level and position, a maximum of 8 was capped on features with excessive layers/number of words with word position then shifted the range to 1–9 (Extended Data Table 1b).

Single-neuronal analysis

Selectivity of neuronal responses

To evaluate selectivity of the neurons to specific linguistic features (for example, parts of speech), we compared their firing rates for words of the target category (such as adverbs) against all words in other categories (for example, all non-adverbs; Extended Data Table 1b, d). This classification was performed for each linguistic category (such as nouns, verbs) and for each neuron. Statistical significance was determined using the one-sided Mann–Whitney U-test (α = 0.05), with P values adjusted using the FDR method for multiple comparisons across categories. To illustrate the specificity of neuronal responses within the population, we combined the selectivity of all neurons and performed a dimension reduction algorithm using UMAP transformation⁹¹. Here to characterize the tuning properties of the neurons during the canonical word planning period²⁷, we used a window from −400 to −100 ms aligned to word onset. However, the precise window was found to have little effect on the proportions of selective neurons identified (−400 ms or −1,100 ms before word onset; χ² test, P > 0.25 for all syntactic features).

Modulation of neuronal responses

While the selectivity of a neuron describes whether the neuron responds preferentially to a specific linguistic feature (for example, adverbs), its neuronal modulation quantifies the magnitude of this response. We calculated the baseline-normalized z scores for individual neuronal firing rates as:

$$Z_i=\frac\mu _i-\mu _\mathrmother\sigma _\mathrmother$$

where the index i labels the firing rates for the linguistic feature to which the neuron is selective, ‘other’ labels the firing rates for all other features; μ labels the mean and σ labels the s.d. To assign a z score to each linguistic feature (for example, parts of speech), we identified the category within that feature that yielded the most significant response (the lowest P value) and selected its corresponding z score.

Controls for generalizability of neuronal responses

To ensure that the selectivity of the neurons was not biased by the particular sets of words, we performed an additional bootstrapping procedure in which we randomly subsampled the words until reaching 80%, 85%, 90% and 95% of the number of words in the original dataset. We then repeated this procedure 100 times to ensure that we broadly sampled across the available words (Extended Data Fig. 1c).

Controls for collinearity with multiple regression analysis

To confirm the selectivity of the neurons and to further confirm that their responses were not influenced by potential collinearities between the features, we performed a multiple regression analysis. Here we used Lasso regression (regularization α = 0.01) to predict neuronal firing rates from one-hot embeddings of linguistic features, thus accounting for the potential effects of multicollinearity between variables^51,92. This yielded standardized coefficients (β) for each feature. Significance was then determined by randomly shuffling the neural firing rates 10,000 times to generate a null distribution of β values. Finally, P values were derived by comparing observed coefficients against this distribution, with a Bonferroni correction applied to account for multiple comparisons. Second, we evaluated for the potential effect of other related properties such as word frequency, length and surprisal (accounting for 7.9%, 3.6% and 11.9% of the neurons, respectively), but found little effect on the proportion of selective neurons (Extended Data Table 1c). Here, word-level surprisal was obtained from the Vicuna-7B model whereby input text for each sentence was tokenized, and token-level surprisal was computed as the negative log-probability given its preceding context (derived from the model’s output logits)⁹³. Word-level surprisal was defined as the sum of surprisals across all subword tokens comprising that word given their sentence context, whereas word frequency was calculated more simply based on the occurrence of each unique word across all the recorded sentences and sessions. Word length was calculated as the number of letters in each word and was top coded at eight characters to reduce the influence of outliers⁹⁴. Word surprisal, frequency and length were all discretized to allow for comparison with the other linguistic features. Finally, we separately identified neurons that maintained consistent selectivity using a sequential feature selection approach that removes correlated features or features that are uninformative^51,95 and again found little effect on the proportion of selective neurons (Extended Data Table 1c reports on the percentages of neurons that were consistently selected in both approaches).

Controls for specificity and overlap between neurons

To further evaluate whether and to what degree the tuning properties of the neurons overlapped, we calculated the percentage of cells that demonstrated significant selectivity for two or more distinct features (for example, parts of speech and dependencies). Here, for every feature pair, we first constructed a neuron × feature binary matrix indicating selectivity for each linguistic feature (Fig. 2a,c). Meanwhile, we independently shuffled the selectivity assignments for each feature across neurons. This preserved the marginal selectivity rates of neurons for each individual feature, while breaking only the joint structure—the association between a neuron being selective for one feature and also being selective for another feature. This permuted data, in turn, is the null distribution needed to test whether co-selectivity is different from expected under independence. Next, we calculated the co-selectivity for each pair of features using the shuffled data. These steps were repeated for 10⁵ iterations to generate a null distribution of co-selectivity values for each feature pair^51,96. The observed overlap for each feature pair was then compared against this null distribution to determine whether it differed significantly from chance-level independence (P < 0.05, two-sided permutation test, Bonferroni corrected for the number of pairs tested). Additional confirmatory analysis of independence included a χ² test and Fisher’s exact test that compare the observed to expected overlapping frequencies of neurons for all possible pairs (P < 0.05, Bonferroni corrected for the number of pairs tested). A χ² test and a Fisher’s exact test were also performed to confirm the robustness of the results. To further examine the relationship between neural activity (the neurons’ firing rates) and feature pairs (parts of speech and dependencies), we first identified the subset of neurons that exhibited significant selectivity to both parts of speech and dependencies. For these neurons, we then calculated their z-scored firing rates by comparing the features to which they were selective versus other categories. Finally, we performed a linear regression analysis in which we predicted the z scores for one feature (dependencies) using the z scores of the other (parts of speech), using the neurons as observations. Here, the P value reflected whether the slope of the linear regression was significantly different from zero.

Controls for lower-order word properties

To evaluate neuronal responses in relation to low-order word properties, we quantified the frequency (pitch) of each uttered word (Parselmouth interface for Praat, v.0.4.3)⁹⁷. The mean pitch was extracted across the duration of each word and aligned to the word production planning period.

Comparing neuronal activities across hemispheres and cortical areas

To allow for comparison across cortical areas (anterior temporal cortices, posterior temporal cortices and posterior frontal cortices) and hemispheres (left and right), we evaluated differences in the proportions of neurons that displayed selectivity using a χ² test. We also evaluate the degree of modulation using Wilcoxon rank-sum test for hemispheres and a Kruskal–Wallis H-test for cortical areas. We next used a one-sided permutation test to further confirm differences in modulation across areas and hemispheres. Finally, two-way ANOVA was used to confirm these findings and to further test potential interaction effects.

Comparing single-neuronal to local field activities

In addition to the APs, we evaluated the LFP activities recorded from each site^50,98,99,100. We calculated the selectivity and modulation of the LFPs that were recorded across all channels. To further allow for a comparison between single neuronal activities and LFPs^50,101, we spatially aligned the activity of each neuron with the specific LFP channels from which they were recorded. We then sorted and visualized the single-neuronal z scores and applied this same sorting order to the LFP z scores (Fig. 5). A χ² contingency test was used to evaluate the proportions of neurons and LFP sites that displayed selectivity and a one-sided permutation test was used to evaluate the overlap in linguistic selectivities (n = 1,000). A two-sided permutation test was used to compare their degrees of modulation (n = 10,000).

Neuronal decoding analysis

Decoding linguistic features from neural activity

Decoding model

A multinomial logistic regression classifier (Scikit-learn package v.0.21.3) was used to decode the linguistic features from neuronal activity with L2 regularization to minimize the loss function as

$$\mathop\textmin\limits_\bfw,c\frac12\bfw^T\bfw+C\mathop\sum \limits_i=1^n\textlog(\textexp(-y_i(\bfx_i^T\bfw+c))+1)$$

where C is the inverse of the regularization strength; vector w are coefficients and c is the bias; x_i represents the feature vector of neuronal firing rates, $y_i\in [-\mathrm1,1]$ labels the actual category for word index $i\in [1..n]$, which labels the number of observations (words). We used balanced class weights, with a regularization strength of 0.0001. The mean decoding accuracies were obtained from the independent observations on the test dataset.

Neural population decoding

As described previously^{27,102,103,104,105}, to perform the population-level decoding analysis, we aggregated neuronal activity across all participants and sessions. Here we performed a stratified sampling to split the training and testing datasets^51,106. Specifically, for a given feature (for example, parts of speech), we randomly selected 70% of trials belonging to a specific category (such as nouns) for the training set and an independent 30% of trials for the testing set. These were repeated 1,000 times with independently drawn random partitions on each iteration for the non-shuffled data. The neural-feature matrix $X_\mathrmcombined=[X^(1)|X^(2)|\cdots X^(S)]\epsilon {\rm\mathbbR}^W\times N_\mathrmtotal$, where $N_\mathrmtotal=\sum _s=1^SN_s$ is the total number of significant neurons pooled across S sessions, and W labels the number of words. $X^(s)$ is composed by $x_w,n^(s)$, which is the firing rate of neuron n in session s for word w. The mean decoding accuracy was obtained as the average decoding performance across the sampling iterations.

Thus, whereas the individual words may differ across sessions, their categories (for example, words that are nouns) could be directly decoded and compared. The neural activity vectors for these subsampled words were then horizontally concatenated across all selective neurons from all participants and sessions to create a unified training and testing matrix, in which the rows represent the aggregated feature observations and the columns represent the combined pool of relevant neurons from all sessions. This process was then repeated for every category (such as nouns) per linguistic feature (parts of speech), with the resulting matrices vertically concatenated to form the final datasets. Finally, to assess statistical significance and estimate chance-level performance, we conducted a permutation test by randomly shuffling the feature labels and repeating the entire modelling procedure to generate a null distribution (n = 1,000). The P value for the permutation test was then derived by calculating the frequency with which the shuffled accuracies exceeded the actual decoding accuracy (P < 0.05; Fig. 2d).

Single and bootstrapped observations

All primary decoding results in the main text, including those shown in Fig. 2, are provided for individual words. However, to further account for the sparse spiking of neurons per word and to ensure high fidelity of decoding^{27,107,108,109,110}, we also generated new observations within the same category by randomly selecting and averaging the neural firing rates across ten observations, and then averaged the firing rates to create a new observation. This procedure therefore further accounts for the neurons’ sparse spiking per word while importantly preserving the linguistic features being decoded. To account for the difference of word numbers in each section, we used this bootstrapping with replacement to standardize the number of observations per category across session. Here the above process was repeated until we created the same number of observations (500 new observations per unique category for the training dataset and 200 for the testing set). This procedure, including the randomized data partitioning and bootstrapping, was repeated for 1,000 iterations to ensure robust estimation and to account for variability in the sampling process. Finally, we confirmed the consistency of observation-averaged decoding by repeating the bootstrapping procedure but varying the number of bootstrapped observations between 10, 20 and 30 (Extended Data Fig. 2a) as well as by selecting neurons that specifically displayed selectivity on model training (Extended Data Fig. 2b) and by selecting only neurons that displayed selectivity for individual participants (Extended Data Fig. 2e). Unless specified, all decoding analyses were performed using observation-averaged activity.

Generalizability of linguistic representations

To examine the generalizability of the linguistic representations across contexts, we used Vicuna model embeddings to cluster the sentences into topics based on their model embeddings, as described previously^111,112. This step clusters the sentences into distinct topics based not only on the lexical meanings of the words but also on their sentence contents and contexts. We used the top-most layer of the model’s hidden embeddings for all words per sentence, and categorize into three or four equal clusters (k-means; Extended Data Fig. 2f).

To further confirm the generalizability of linguistic representations across the participants, we performed a bootstrapping procedure with 5 observations, then trained a classifier on a random selection of 50 observations for each category, and tested on randomly selected 20 for each category. This process was then repeated with 50 repetitions for each session. We then applied a Friedman test to examine whether dropping one participant significantly impacts the observation-averaged decoding accuracy. We also confirmed the relationship between the decoding performance and the number of neurons recorded^27,113 by randomly sampling cells from across the participants and showing that the number of cells correlated with the decoding performance (Extended Data Fig. 2c). Finally, for further comparison, we repeated these analyses but now evaluated decoding performances across individual cortical areas and hemispheres.

Language models

Syntactic and semantic embedding models

Embedding models were used to capture the combination of linguistic features that described the specific words being produced. For example, even though the word ‘dog’ in the sentence ‘The dog ate the bone’ and ‘The man walked the dog’ is the same, the models distinguish its varying syntactic roles—such as nominal subject versus object—as well as the distinct semantic representations between different nouns like ‘dog’ and ‘bone’^1,2,3,49,88. To first examine the combination of syntactic features that defined each word, we constructed embedding models that tracked all features obtained from the constituency and dependency parsers by representing each category (for example, nouns within parts of speech) with a one-hot vector of zeros and ones, and concatenating them for all syntactic features. This combination of features therefore provided a comprehensive representation of all tested features per word (Fig. 3a). Here, the embedding vector $\bfE(w)$ was written as

$$\bfE_\mathrmsyn(w)=\bfv_1(w)\oplus \bfv_2(w)\oplus \cdots \oplus \bfv_N(w)$$

where w is the word for which the embedding is being constructed and v_i(w) is one-hot vector representation of the specific value of the ith feature category for word w.

Second, we examined the lexical semantic properties of each word²⁷ using a pretrained semantic embedding model (word2vec)¹¹⁴ to obtain a vectoral representation for each word. This model maps the semantic meaning of words onto an embedding space in which words that have different meanings also have different vectors. Embeddings were obtained from the word2vec model trained on the Google dataset (about 100 billion words, gensim, v.4.3.2). Out-of-vocabulary words (approximately 7% of all words) were excluded. We also verified that zero-vector padding of these out-of-vocabulary words had little effect on predictivity results (two-sided permutation test, n = 10⁵; P = 0.37 and P = 0.60 for semantic only and combined embeddings; see Language model predictivity).

Finally, to examine the words’ semantic and syntactic properties, we combined the syntactic and semantic vectors to form a shared embedding model which was defined here as

$$\bfE_\mathrmcomb(w)=\bfE_\mathrmsyn(w)\oplus \bfE_\mathrmsem(w)$$

where E_sem is the semantic embedding vector for word w.

Higher-order contextual embedding models

While the above embedding models enabled us to examine how combinations of features are encoded by neurons on a word-by-word level, they did not determine whether the neurons also reflected their specific contexts (for example, the word ‘bark’ within ‘The dog has a loud bark’ versus ‘The tree has thick bark’)^{6,115,116,117,118}. Therefore, to further address this question, we also used contextual models that were capable of further capturing these contextual variations^22,23. Here we used Vicuna-7B (v.1.1), a large language model that was fine-tuned from the LLaMA architecture to serve as a contextual framework for generating high-dimensional vector embeddings. Vicuna-7B is an open-source autoregressive language model based on a transformer architecture and consists of 32 transformer layers and 4,096 embedding dimensions and was further tuned on large-scale, user-shared conversations.

Here the contextual model embeddings were obtained by inputting words from each sentence produced by the participants, such that each sentence was associated with corresponding hidden embeddings for each layer^23,42. Next, to extract word embeddings across varying contextual lengths for each sentence (Fig. 3c), the word strings were generated by sliding a window of size N across the preceding word sequence. These word strings were then processed through the Vicuna-7B model to extract hidden state representations for all layers:

$$\bfE_\mathrmcontextual(w_i,L)=\bfH_L(w)$$

where H_L(w) is the hidden state vector extracted from layer L of the model for word sequence $\w_1,w_2,\cdots ,w_N\$. In combination, this approach therefore allowed us to systematically manipulate the amount of preceding context provided to the model, ranging from two words (that is, N = 2) to extended ten-word sequences. The hidden state activations across all layers were then extracted at the last token and concatenated, thus providing contextual information for each word in relation to its specific sentence context and its varying context lengths. Further validation experiments include using a diverse suite of large language models (Extended Data Fig. 3 and Extended Data Table 2). All models and tokenizers were loaded through Huggingface in Python¹¹⁹ and run with a NVIDIA GeForce RTX 4090 GPU.

Predictivities of neurons across language models

Language model predictivity

We examined the degree to which embeddings from each of the language models (syntactic, semantic and higher-order contextual models) were predictive of the neurons’ activity patterns. First, to allow for comparison across models and to limit overfitting, we first performed principal component analysis (optimal dimensionality was determined as 30 principal components; Extended Data Fig. 4a) from each model layer (constrained within a range of [−10,000, 10,000] to mitigate the influence of outliers). These matrices were then temporally assigned using non-overlapping 100 ms bins using embeddings for the bins that occupied each word. Similarly, we calculated the neural firing rates for individual neurons by binning their APs into 100 ms intervals. The overall population neural activity was calculated from the combined firing rate patterns of all neurons and were smoothed by a three-bin moving average.

Next, we calculated predictivity¹²⁰, which is the Pearson correlation coefficient between the language model’s predicted neuronal activities and the observed neuronal activity:

$$\textPredictivity\,=\,\frac\mathrmcov(y,\haty)\sqrt\mathrmvar(y)\times \mathrmvar(\haty)$$

where y labels the actual neuronal firing rates, and $\haty$ labels the predicted neuronal firing rates based on the model’s embeddings. cov($\cdot $) labels the covariance matrix, and $\mathrmvar(\cdot )$ labels the variance.

Finally, to predict neuronal activity from the models’ embeddings, we performed a Ridge regression (Python scikit-learn package; v.0.21.3) using the dimensionally reduced embeddings from each contextual model and layer to minimize the cost function:

$$\mathop\textmin\limits_\bfw,c\alpha \,\bfw^T\bfw+\mathop\sum \limits_i=1^n(y_i-(\boldsymbolx_i^T\bfw+c))^2$$

where y_i is the neuronal activity for observation $i\in [1..n]$; vector w are coefficients and c is the bias; x_i represents the feature vector of language model embeddings; α is the regularization strength that was set as 0.01. The predictivities were then calculated separately for each session. This approach therefore enabled us not only to obtain an aggregate account of the predictivities of all recorded neurons but also to evaluate their consistency across participants and sessions. The regression weights were trained using data from all but one sentence within each session, and were then used to predict neuronal responses for the withheld sentence. By repeating this process across all sentences, we obtain a full set of predicted activities for each layer of the model.

For single-neuronal predictivities, we calculated the predictivity from each individual neuron’s firing rates and then averaged their predictivities (Fig. 3e (right) and 3f), while the mean population predictivities were obtained by calculating the predictivity from the mean firing rates from all neuronal firing rates in a session and then averaging across sessions (weighted by the number of neurons; Fig. 3e (left)). Here predictivities in Fig. 3b were calculated from the averages during the pre-utterance period and were evaluated with fixed temporal offsets that aligned neuronal activity (obtained before utterance).

To establish a baseline chance performance for predictivity, we generated a null distribution by randomly permuting the average neuronal activities and recalculating the predictivity on this shuffled dataset. This randomization procedure was conducted for the same number of iterations as the primary modelling analysis. We then performed a one-sided permutation test to determine whether the predictivity of the actual data was significantly higher than that of the shuffled control. For comparison of predictivities across hemispheres and brain areas, we used a rank-sum test and a Kruskal–Wallis test, respectively.

Neural predictivity controls

To confirm that the predictivities reflected information specifically captured by the language models, we performed an additional control analysis using randomly generated non-meaningful stimuli as inputs to the contextual model. Here the tokenizer was replaced with a random drawing of a number (index of a token) to ensure that no linguistically meaningful information was input into the model. Furthermore, to confirm that the predictivity reflects the contextual information specific to each sentence and cannot be simply explained by the order of words, we also performed the swap sentence control. We first grouped the sentences produced by the participants based on their word length to ensure matched comparisons. Within these word-length-matched groups, we then randomly shuffled the sentences. These swapped-sentence embeddings were then used to calculate the predictivity of neuronal population activity. To allow for consistency with our contextual predictivity analyses above, both of these controls were performed at the population level, and both results were reported at −1,000 ms before utterance.

Neural predictivity dynamics and mapping

To characterize the temporal evolution of neural predictivity, we systematically shifted the temporal alignment between neuronal firing rates and contextual model embeddings by introducing a series of lead and lag offsets in 100 ms increments. This approach enabled us to identify the specific latency at which neural activity most strongly reflects the linguistic information encoded in the contextual model embeddings across a wide temporal window, spanning from 2,000 ms before word onset (pre-articulatory planning) to 2,000 ms after onset (post-articulatory processing). This was applied to both the population’s firing rates (Fig. 3e (left)) and the individual neuronal firing rates (Fig. 3e (right)).

Finally, to illustrate the relationship between the neurons’ peak predictivity and their linguistic feature selectivity, we calculated the neuron’s average predictivity for each model layer and time point. This approach therefore enabled us to map the layer and temporal lag at which neuronal predictivity peaked for each neuron based on its encoding properties (Fig. 3f and Extended Data Fig. 4). For all contextual predictivity results in Fig. 3c,d and Extended Data Figs. 3 and 4, we used the −1,000 ms timepoint relative to the embeddings, which was determined as the approximate predictivity peak observed in our time-resolved population analysis. However, the exact timing used for analyses was not critical as the contextual model significantly outperformed the other syntactic, semantic and combined models (one-sided permutation test, P < 0.004 at −1,000 ms across all comparisons; P < 0.042 at −400 ms to −100 ms interval across all comparisons; n = 10⁵) and exceeded chance for both tested intervals (one-sided permutation test, P < 10⁻⁵, n = 10⁵).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.