In this section, we detail the methods used in all parts of our analyses, including our observational comparisons of gender and age bias in online images, videos and texts, as well as our Google Image search experiment and our resume audit of ChatGPT. The pre-registration for our online image search experiment is available at https://osf.io/x9scm. This experiment was a successful replication of a previous study with a nearly identical design; the pre-registration of this previous study is available at https://osf.io/2b58d. This study was approved by the ethics review board at the University of California, Berkeley, where the research was conducted.
Observational methods
In what follows, we describe our observational methodology for collecting and analysing large bodies of images, videos and text online. With regard to the crowdsourcing methods used to analyse our main Google and Wikipedia image datasets, many of the methods described below (including the procedure and demographic details of the coder population) were reproduced from the original data collection summary provided as part of the first publication of these datasets6. In addition to this reproduced description, we include information on how age classifications of these images were collected, because this feature was not explored or discussed as part of the original publication of these datasets6.
Image data collection procedure
Our crowdsourcing methodology consisted of four steps. We began by identifying all social categories in WordNet69, a canonical lexical database of English. WordNet captures 3,495 social categories, including occupations (such as doctor) and generic social roles (such as neighbour). We then gathered online images associated with each social category from Google and Wikipedia. Next, we applied the OpenCV deep learning module in Python to automatically extract the faces from each image. Cropping faces helped us ensure that each face in each image was separately classified in a standardized manner while avoiding subjective biases in coders’ decisions for which face to focus on and categorize in each image. Finally, we hired 6,392 human coders from Amazon’s Mechanical Turk to manually classify the gender and age of the faces. Each face was classified by three unique annotators (as per established methodology6,70,71), so that the gender of each face (‘male’ or ‘female’) could be identified on the basis of the majority (modal) gender classification across three coders. (We also gave coders the option of labelling the gender of faces as ‘non-binary’, but this option was chosen in only 2% of cases. Therefore, we excluded these data from our main analyses and recollected all classifications until each face was associated with three unique coders using either the male or female label.) Although coders were asked to label the gender of the faces presented, our measure was agnostic to which features the coders used to determine their gender classifications. They may have used facial features as well as features relating to the aesthetics of expressed gender, such as hair or accessories. In terms of age, each face was classified as belonging to one of the following age bins (the ordinal ranking of each bin is indicated in parentheses): (1) 0–11, (2) 12–17, (3) 18–24, (4) 25–34, (5) 35–54, (6) 55–74 and (7) 75+. Because the greater number of classification options for age led to fewer images associated with a majority-preferred age classification, we identified the age of each face by taking the average of the ordinal age bin judgements across the three coders. Each search was implemented from a fresh Google account with no previous history. Searches were run in August 2020 by ten distinct data servers in New York City. This study was approved by the institutional review board at the University of California, Berkeley, where this part of the study was conducted. All participants provided informed consent.
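To make the face-extraction step concrete, the following is a minimal sketch using OpenCV’s cv2.dnn module. The specific detector, model files and confidence threshold shown here (the commonly used res10 SSD face detector, with hypothetical file paths) are illustrative assumptions rather than the study’s exact settings.

```python
# Minimal sketch of face cropping with OpenCV's deep learning (dnn) module.
# The detector, model files and threshold below are assumptions for illustration.
import cv2

net = cv2.dnn.readNetFromCaffe("deploy.prototxt",                           # hypothetical paths to the
                               "res10_300x300_ssd_iter_140000.caffemodel")  # res10 SSD face detector

def extract_faces(image_path, conf_threshold=0.5):
    image = cv2.imread(image_path)
    h, w = image.shape[:2]
    # Resize to the detector's expected input size and convert to a blob
    blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()
    faces = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence >= conf_threshold:
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
            faces.append(image[y1:y2, x1:x2])  # one cropped face per detection
    return faces
```

Each cropped face from such a routine would then be presented to coders separately, consistent with the classification procedure described above.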
To collect images from Google, we followed a previous study by retrieving the top 100 images that appeared when using each of the 3,495 categories to search for images using the public Google Images search engine. (Google provides roughly 100 images for its initial search results.) Across the non-gendered and gendered searches, 3,489 categories could be associated with images containing faces in the Google Image search engine (specifically, 3,434 categories for the non-gendered searches and 2,960 for the gendered searches). To collect images from Wikipedia, we identified the images associated with each social category in the 2021 Wikipedia-based Image Text (WIT) Dataset72. WIT maps all images across Wikipedia to textual descriptions on the basis of the title, content and metadata of the active Wikipedia articles in which they appear. We were able to associate 1,251 social categories from WordNet with images in WIT (across all English articles) that supported stable classification as human faces with detectable ages, according to our coders. The coders identified 18% of images as not containing human faces, and these were removed from our analyses. We also asked all annotators to complete an attention check, which involved providing the correct answer to the common-sense question (What is the opposite of the word ‘down’?) using the following options: ‘fish’, ‘up’, ‘monk’ and ‘apple’. We removed the data from all annotators who failed an attention check (15%) and continued collecting classifications until each image was associated with the judgements of three unique coders, all of whom passed the attention check.
Demographics of human coders
The human coders were all US-based adults fluent in English. Supplementary Table 3 indicates that our main results are robust to controlling for the demographic composition of our human coders. Among our coders, 44.2% identified as female, 50.6% as male, 3.2% as non-binary and the remainder preferred not to disclose. In terms of age (in years), 42.6% identified as 18–24, 22.9% as 25–34, 32.5% as 35–54, 1.6% as 55–74 and less than 1% as over 75. In terms of race, 46.8% identified as Caucasian, 11.6% as African American, 17% as Asian, 9% as Hispanic, 10.3% as Native American and the remainder as either mixed race or preferred not to disclose. In terms of political ideology, 37.2% identified as conservative, 33.8% as liberal, 20.3% as independent, 3.9% as other and the remainder preferred not to disclose. In terms of annual income, 14.3% reported making less than US$10,000, 33.4% reported US$10,000–50,000, 22.7% reported US$50,000–75,000, 14.9% reported US$75,000–100,000, 10.5% reported US$100,000–150,000, 2.8% reported US$150,000–250,000, less than 1% reported more than US$250,000 and the remainder preferred not to disclose. In terms of the highest level of education acquired by the annotators, 2.7% selected ‘below high school’, 17.5% selected ‘high school’, 29.2% selected ‘technical/community college’, 34.5% selected ‘undergraduate degree’, 14.8% selected ‘master’s degree’, less than 1% selected ‘doctorate degree’ and the remainder preferred not to disclose.
Image and video datasets
To measure age-related gender bias in online images and videos, we analysed a range of open-source datasets collected either for social science research or for training face recognition algorithms, none of which examined or reported correlations between the gender and age of the people depicted. In total, we examined more than one million images from five main online sources: Google, Wikipedia, IMDb, Flickr and YouTube, as well as the Common Crawl (created by randomly scraping content from across the world-wide web), each with distinct ways of sourcing and aggregating data. We measured gender and age using a variety of techniques, including human judgements, machine learning and ground truth data on the self-reported gender and true time-stamped age of the people depicted. Our statistical analyses did not control for multiple comparisons because all tests were theoretically guided and did not involve an agnostic permutation over a set of pairwise comparisons. Although we examined many datasets, our main analyses examined a single correlation (between gender and age) within each dataset separately. We now describe each of these datasets.
First, we used large-scale crowdsourcing to identify age-related gender bias in a new dataset of images from Google and Wikipedia (which was originally collected for a recently published study that did not examine age-related classifications)6. This dataset6 contains the top 100 Google images associated with each of the 3,495 social categories contained within WordNet69, a lexical ontology that maps the taxonomic structure of the English language. These categories include occupations (such as ‘physicist’) and generic social roles (such as ‘colleague’). For each category, this dataset contains the top 100 images that appear in Google Images when searching for (1) the category on its own (such as ‘doctor’); (2) the female version of the category (such as ‘female doctor’); and (3) the male version of the category (such as ‘male doctor’). The gendered searches were completed only for the 2,960 categories that were not already gendered (for example, the searches did not include ‘male aunt’). Altogether, this yielded 657,035 unique images containing faces from Google. Searches were run from ten distinct data servers in New York City. Because Google is known to customize search results on the basis of the location from which the search is run72, we show that our results are robust to replicating this data collection pipeline while collecting Google Images from six distinct cities around the world (Supplementary Fig. 3).
This dataset also leveraged human coders to classify the age and gender of faces in Wikipedia images associated with as many WordNet social categories as possible in the 2021 WIT Dataset68. WIT maps all images across Wikipedia to textual descriptions on the basis of the title, content and metadata of the active Wikipedia articles in which they appear. WIT includes images of 1,251 social categories from WordNet across all English Wikipedia articles, in total yielding 14,709 faces.
We hired 6,392 human annotators from Amazon’s Mechanical Turk to classify the gender and age of the faces in these images. Each face was classified by three unique annotators6,70,71 so that the gender of each face (male or female) could be identified on the basis of the majority gender classification across three coders. (We also gave coders the option of identifying the gender of faces as non-binary, but this option was chosen in less than 2% of cases. Therefore, we excluded these data from our main analyses.) In terms of age, each face was classified as belonging to one of the following age bins (in years): (1) 0–11, (2) 12–17, (3) 18–24, (4) 25–34, (5) 35–54, (6) 55–74 and (7) 75+. Because the greater number of classification options for age led to fewer images with a majority-preferred classification, we identified the age of each face by taking the average of the ordinal age bin judgements across the three coders (our main results hold when using the modal age judgement; Supplementary Fig. 4). Our findings continued to hold when controlling for annotator demographics and intercoder agreement, which was high in our sample (Supplementary Fig. 5 and Supplementary Table 3). We also conducted a separate validation task, in which the true gender and age of the faces being classified were known. The results indicate that our coders exhibited reliable and accurate gender and age judgements, with no biases as a function of gender (Supplementary Tables 4 and 5). Sensitivity tests further showed that even if our coders were hypothetically biased in their ability to estimate age as a function of gender, this would not disrupt the statistical significance or directionality of our findings (Supplementary Fig. 6).
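As a brief illustration (with hypothetical column names and toy values), the per-face aggregation of the three coders’ judgements can be expressed as a modal gender label and a mean ordinal age bin:

```python
# Toy sketch of aggregating three coders' judgements per face.
import pandas as pd

# Each row is one coder's judgement of one face (values are illustrative).
ratings = pd.DataFrame({
    "face_id": ["f1", "f1", "f1", "f2", "f2", "f2"],
    "gender":  ["female", "female", "male", "male", "male", "male"],
    "age_bin": [4, 5, 5, 3, 4, 4],   # ordinal bins: 1 = 0-11, ..., 7 = 75+
})

aggregated = ratings.groupby("face_id").agg(
    gender=("gender", lambda s: s.mode().iloc[0]),  # majority (modal) gender label
    age=("age_bin", "mean"),                        # mean of the ordinal age bins
)
print(aggregated)
```

Averaging the ordinal bins avoids discarding faces for which the three coders split across adjacent age bins, which is why the mean rather than the mode was used for age.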
We extended our findings by examining age-related gender bias in two large corpora of online images collected from three main websites (IMDb, Wikipedia and Google) for which the self-identified gender and true age of the faces were objectively inferred. This extension allowed us to examine whether women are objectively younger than men in online images, without depending on age predictions from human coders or machine learning algorithms. The first corpus was the 2018 IMDb–Wiki dataset43, which consisted of more than half a million images of celebrities from IMDb and Wikipedia, drawn from the people depicted in the top 100,000 most visited IMDb pages. Each image in this dataset was time-stamped for when the photograph was taken, allowing the age of each face to be inferred on the basis of the celebrity’s date of birth, which is publicly available through their open profile on IMDb and Wikipedia. This dataset yielded 451,570 images from IMDb and 57,932 images from Wikipedia. The second corpus was the 2014 CACD44, which consisted of 163,446 images collected from the Google Image search engine depicting 2,000 celebrities, comprising the top 50 most popular celebrities each year from 1951 to 1990. The creators of CACD collected time-stamped images by using Google Image search to retrieve images associated with each celebrity from 2004 to 2013 (for example, by searching ‘Emma Watson 2004’ through ‘Emma Watson 2013’). We merged the CACD and IMDb–Wiki dataset43 to identify the gender of 1,825 celebrities in the CACD (50% are female celebrities). All images from both datasets with ages below 0 or above 100 were removed to maximize data quality. Each dataset identified the exact age of each celebrity at the time each photograph was taken by retrieving the celebrity’s date of birth and gender from their public IMDb and Wikipedia pages and comparing the date of birth with the time stamp of the photograph.
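The ground-truth age computation in these corpora reduces to simple date arithmetic; a minimal sketch (with assumed field names and toy values) is shown below.

```python
# Illustrative age-at-photo computation from a public date of birth and a
# time-stamped photo date (field names and values are hypothetical).
from datetime import date

def age_at_photo(date_of_birth: date, photo_date: date) -> int:
    years = photo_date.year - date_of_birth.year
    # Subtract one year if the birthday had not yet occurred by the photo date
    if (photo_date.month, photo_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

print(age_at_photo(date(1990, 4, 15), date(2012, 3, 1)))  # -> 21
```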
Finally, we examined images from four publicly available training datasets widely used to train automated face recognition algorithms. In these canonical datasets, the gender and age classifications were based on a combination of automated machine learning classifications and verification through human annotation. These include the 2017 UTK dataset46, consisting of 20,000 images scraped randomly from across the world-wide web using search engines and public repositories; the 2014 Adience dataset47, consisting of 26,580 images randomly sampled from Flickr, a public image-based social media platform; and the 2008 LFW48 dataset, consisting of 13,233 images randomly scraped from online news websites. Finally, we examined images of faces extracted from screenshots of YouTube videos using two datasets. The first was the 2011 YouTube Faces dataset50, consisting of 3,425 YouTube videos and 3,645 images of celebrities. The second was the 2022 CelebV-HQ51 dataset, consisting of 35,666 images formed by identifying public lists of celebrities on Wikipedia and automatically collecting the top 10 YouTube videos associated with each celebrity.
Comparing online images with the census
We were able to match 867 social categories from our main Google image dataset (Fig. 1a) to occupational categories in the US census. The US Bureau of Labor Statistics recently released a breakdown of the median age of each gender, from 2019 to 2023, across five industries: sales; services; natural resources and construction; production and transportation; and management. The census assigns each occupation to one of these industries, allowing those occupations matched in our Google image dataset to be assigned a census industry. We estimated the relationship between gender and age at the industry level by averaging the age associations in Google Images across all occupations within a given industry (averaged within each occupation and then across occupations at the industry level). The census age groupings are highly similar to the age groupings the coders used when judging faces. Supplementary Tables 8 and 9 present the robustness of our results to a range of statistical controls.
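The two-step averaging can be sketched as follows (column names and values are hypothetical); the resulting industry-by-gender means are then compared with the census median ages.

```python
# Sketch of averaging coder-estimated ages within occupations, then across
# occupations within each census industry, separately by depicted gender.
import pandas as pd

faces = pd.DataFrame({
    "occupation": ["nurse", "nurse", "engineer", "engineer"],
    "industry":   ["services", "services", "production and transportation",
                   "production and transportation"],
    "gender":     ["female", "male", "female", "male"],
    "age":        [4.1, 4.6, 3.9, 4.8],   # mean ordinal age bin per face (toy values)
})

# Step 1: average within each occupation (by industry and gender)
occupation_level = (faces.groupby(["industry", "occupation", "gender"])["age"]
                         .mean().reset_index())
# Step 2: average across occupations within each industry
industry_level = (occupation_level.groupby(["industry", "gender"])["age"]
                                  .mean().unstack("gender"))
print(industry_level)  # compare against the census median age of each gender by industry
```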
Collecting judgements of occupational status
We collected a nationally representative sample of 1,002 US-based participants who provided subjective evaluations of the status and prestige of occupations. Each participant evaluated 20 randomly sampled occupations from a broader set of 867 WordNet social categories that could be matched with corresponding occupations in the US census. Through randomization, each category was evaluated by 27 unique participants on average (minimum of 15 participants). For each occupation, the participants rated (1) its status using the following scale (How would you rate the social status of someone belonging to this occupation? −2, very negative; −1, negative; 0, neutral; 1, positive; 2, very positive) and (2) its prestige using the following scale (To what extent do you agree that it is prestigious to belong to this occupation? −2, strongly disagree; −1, disagree; 0, neutral; 1, agree; 2, strongly agree). We also asked the participants to rate the status/prestige through the standard question from the General Social Survey, which asked them to place occupations on a ladder containing 10 rungs, where the bottom rung indicates occupations with very low status, income, education and prestige, whereas the highest rung indicates occupations with very high status, income, education and prestige (Supplementary Fig. 11). The participants’ answers across all three questions were highly correlated (all paired Pearson’s correlations above 0.85; Supplementary Fig. 9). In our main results shown in Extended Data Fig. 4, we first averaged all participants’ judgements of each occupation across the (1) status and (2) prestige questions and then assigned each occupation a single status score by taking the mean of its average status and prestige scores. In the Supplementary Information, we show that all of our results hold when examining each question separately and when examining the participants’ judgements using the standard social status question from the General Social Survey (GSS) (Supplementary Fig. 11 and Supplementary Tables 10–13). Note that Prolific’s nationally representative sample of the US population allows for a maximum of 800 participants. However, this sample size was not large enough to gain sufficiently powered judgements across all 867 occupational categories; therefore, an extra sample of US participants was recruited until all occupations reached a minimum of 15 evaluations from independent participants. All results are robust to a range of statistical controls (Supplementary Tables 10–13).
Measuring age and gender in online text
To measure age-related gender bias in large bodies of internet text, we leveraged word embedding models trained on massive amounts of internet data. These models were designed to construct a high-dimensional vector space on the basis of the co-occurrence of words (for example, whether two words appear in the same sentence), such that words with similar meanings are closer in this vector space. Technically, these embedding spaces also capture higher-order similarities on the basis of whether words co-occur in similar linguistic contexts (that is, in association with related sets of words), without requiring words to directly appear together. We harnessed recent advances in natural language processing to extract demographic dimensions in word embedding models that capture the extent to which existing demographics underlie the cultural connotations of categories. We identified both gender and age dimensions. We briefly describe this methodology below.
Word embedding models leverage the frequency of word co-occurrences in text to position words in an n-dimensional space such that words that frequently co-occur are located more closely in this n-dimensional space. The ‘embedding’ for a given word identifies the specific position of this word in this n-dimensional space. The cosine distance between word embeddings in this n-dimensional space provides a robust measure of semantic similarity that captures the similarity of the semantic contexts in which words appear6. To extract a gender dimension in word embedding space, we harnessed the ‘geometry of culture’ method of Kozlowski et al.73. This method was originally developed for static embedding models such as Word2Vec and GloVe; therefore, we incorporated key adjustments that enable its application to contextualized embeddings through generative transformer models such as GPT-2 Large. We identified two clustered regions in the word embedding space corresponding to conventional representations of females and males. Specifically, the female cluster consisted of ‘woman’, ‘her’, ‘she’, ‘female’ and ‘girl’, whereas the male cluster consisted of ‘man’, ‘his’, ‘he’, ‘male’ and ‘boy’. For each social category in WordNet, we calculated the average cosine distance between this category and both the female and male clusters. Each category was associated with two numbers: its cosine distance with the female cluster (averaged across its cosine distance with each term in the female cluster) and its cosine distance with the male cluster (averaged across its cosine distance with each term in the male cluster). Taking the difference between the cosine distances of a category from the female and male clusters allowed each category to be positioned along a −1 (female) to 1 (male) scale in the embedding space. Although we recognize that gender is fundamentally non-binary, we built upon a previous study that leveraged this binary framework73 to identify biases in the extent to which people associate concepts with men or women.
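For the static-model case, the construction of the gender dimension can be sketched as follows. The pre-trained vectors loaded here (a GloVe model from the gensim downloader) are an assumption for illustration; because cosine distance equals one minus cosine similarity, the difference of distances used in the text is computed below as an equivalent difference of similarities.

```python
# Sketch of the static-embedding gender dimension (geometry-of-culture style).
# The pre-trained model used here is an illustrative assumption.
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-300")   # any Word2Vec/GloVe KeyedVectors works

female_terms = ["woman", "her", "she", "female", "girl"]
male_terms   = ["man", "his", "he", "male", "boy"]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_association(category):
    vec = kv[category]
    female_sim = np.mean([cosine(vec, kv[t]) for t in female_terms])
    male_sim   = np.mean([cosine(vec, kv[t]) for t in male_terms])
    # Positive values: more male-associated; negative values: more female-associated
    return male_sim - female_sim

print(gender_association("nurse"), gender_association("carpenter"))
```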
The issue with applying this approach to contextualized embeddings is that the embedding associated with an individual word can be sharply different from the embedding associated with this word within a larger context, for example, within a surrounding sentence. For this reason, we modified the geometry of culture method by creating male and female poles consisting of many parallel sentences that vary only in whether they mention the corresponding male or female version of a pronoun. For example, the male pole consists of sentences such as ‘he is a boy’ and ‘his hobbies are very masculine’, whereas the analogues of these sentences in the female pole are ‘she is a girl’ and ‘her hobbies are very feminine’. Fifty sentences were used to form each pole. All sentences used are provided in Supplementary Tables 14 and 15. We conducted key robustness tests to verify the validity of our methods and the robustness of our results to the use of different sentences along the gender pole (Supplementary Fig. 13). In our supplementary analyses involving static embedding models, we used the original geometry of culture approach.
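A sketch of the contextualized version with GPT-2 Large is given below, using the Hugging Face transformers library. The pooling choice (mean over token hidden states) and the two example sentences per pole are illustrative assumptions; the study used 50 parallel sentences per pole (Supplementary Tables 14 and 15).

```python
# Sketch of sentence-level gender poles with GPT-2 Large (pooling is an assumption).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModel.from_pretrained("gpt2-large")
model.eval()

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, n_tokens, hidden_size)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool over tokens

male_sentences   = ["he is a boy", "his hobbies are very masculine"]    # 50 used in the study
female_sentences = ["she is a girl", "her hobbies are very feminine"]   # parallel counterparts

male_pole   = torch.stack([embed(s) for s in male_sentences]).mean(dim=0)
female_pole = torch.stack([embed(s) for s in female_sentences]).mean(dim=0)

def gender_score(category):
    vec = embed(category)
    cos = torch.nn.functional.cosine_similarity
    # Positive values: closer to the male pole; negative values: closer to the female pole
    return (cos(vec, male_pole, dim=0) - cos(vec, female_pole, dim=0)).item()

print(gender_score("nurse"), gender_score("carpenter"))
```

The same construction, with parallel young and old sentence poles, yields the contextualized age dimension described below.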
We used this same approach to construct an age dimension in word embedding models. For static embedding models, we identified two clustered regions in the word embedding space corresponding to younger and older ages. Specifically, the younger cluster consisted of the words ‘child’, ‘teenager’ and ‘adolescent’, whereas the older cluster consisted of the words ‘adult’, ‘senior’ and ‘elder’. All results are highly robust to increasing the number of words used to construct this age dimension. For example, our results replicate when defining the younger cluster using the words ‘young’, ‘youth’, ‘childhood’, ‘child’, ‘baby’, ‘infant’, ‘teen’, ‘teenager’ and ‘adolescent’, as well as when defining the older cluster using the words ‘old’, ‘elder’, ‘elderly’, ‘adulthood’, ‘adult’, ‘senior’, ‘parent’, ‘retired’ and ‘aged’. We used the same technique to sort categories along a −1 (young) to 1 (old) scale in the embedding space. Similarly, to examine age associations in contextualized embedding models, we generated 50 sentences that hold everything constant while varying whether the age term involved indicates a young or old age (see Supplementary Table 15 for a full list of the age sentences used to create the contextualized age pole).
In all cases examining static models, to compute the distances between the vectors of social categories represented by bigrams (such as ‘professional dancer’), we used the Phrases class in the gensim Python package, which provided a pre-built function for identifying and calculating distances for bigram embeddings. This method works by identifying the n-dimensional vector at the middle position between the vectors corresponding separately to each word in the bigram (for example, ‘professional’ and ‘dancer’). This middle vector is then treated as the single vector corresponding to the bigram ‘professional dancer’ and is used to calculate distances from other category vectors. This method is not necessary in contextual language models, which provide unique embeddings for n-grams as distinct from their component words.
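The midpoint idea for bigram categories can be illustrated directly with a small helper; this is an illustration of the middle-position vector, not a reproduction of gensim’s internal implementation, and the pre-trained model is again an assumption.

```python
# Illustration of representing a bigram category by the midpoint of its
# component word vectors in a static embedding model.
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-300")   # illustrative pre-trained vectors

def bigram_vector(bigram):
    w1, w2 = bigram.split()
    return np.mean([kv[w1], kv[w2]], axis=0)   # midpoint between the two word vectors

# The resulting vector is then compared with the gender and age poles exactly
# like a single-word category vector.
vec = bigram_vector("professional dancer")
print(vec.shape)   # (300,)
```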
Once the corresponding demographic dimensions were constructed for each model, we evaluated the correlation between gender and age associations across 3,495 social categories from WordNet (the same categories examined in our image analyses above). To simplify the presentation of how these gender and age dimensions are correlated, we used min-max normalization to convert the gender dimension into a 0 (female) to 1 (male) association, which, in effect, represents the extent to which each category carries male associations relative to all other categories. We applied the same approach to produce a normalized 0 (young) to 1 (old) dimension, which captures the extent to which each category is associated with older ages relative to all other categories. Supplementary analyses show that our results are highly robust to varying our technique for constructing the age and gender dimensions (Supplementary Fig. 13 and Supplementary Tables 14 and 15).
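The final step is a min-max rescaling of the raw gender and age associations followed by a correlation across categories; a short sketch (with hypothetical raw scores) is shown below. Because min-max normalization is a positive linear transformation, it changes the presentation but not the Pearson correlation itself.

```python
# Sketch of min-max normalization and the gender-age correlation across categories.
import numpy as np
from scipy.stats import pearsonr

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())   # 0 = most female/young, 1 = most male/old

gender_raw = np.array([-0.12, 0.03, 0.08, -0.02])   # hypothetical per-category scores
age_raw    = np.array([-0.05, 0.01, 0.09, -0.03])

r, p = pearsonr(min_max(gender_raw), min_max(age_raw))
print(r, p)
```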
In the main text, we present our results while analysing the largest open-source large language model from OpenAI (GPT-2 Large56), for which word embeddings can be robustly and transparently extracted and examined. GPT-2 Large is one of the largest and most popular open-source language models, trained on billions of words from the 2019 WebText dataset, which primarily comprises Reddit data and the diverse web content (including articles and books) to which these Reddit data are linked. In the supplementary analyses, we showed that these results replicate when examining a wide range of models, including Word2Vec, GloVe, BERT, FastText, RoBERTa and GPT-4, all of which vary in their dimensionality and data sources, as well as the year in which their training data were collected, ranging from 2013 to 2023. We focus our main results on GPT-2 Large, not only because of its scale and popularity, but also because its open-source nature allows us to transparently access and analyse its word embeddings. GPT-4, by contrast, is a closed-source model that relies on using OpenAI’s private application programming interface, which limits the interpretability of our method. Nevertheless, supplementary analyses showed that our results replicate when examining this closed-source model (Supplementary Figs. 14 and 15).
Experimental methods with human participants
Participant pool
We invited a nationally representative sample of participants (n = 500) from Prolific. Prolific is a popular online panel for social science research that provides prescreening functionality specifically for recruiting a nationally representative sample of the USA along the dimensions of sex, age and ethnicity. The participants were invited to partake in the study only if they were based in the USA, were fluent English speakers and were over 18 years old. A total of 52% of participants were female (no participants identified as non-binary). The average age of participants was 45.2 (45.9 for women; 44.6 for men). Our sample size was selected to emulate the sample size of a recent experiment with a highly similar design, which effectively measured statistically powered outcomes6. There was an attrition rate of 9.2% of participants (which is within the common range of attrition for online experiments), such that 459 participants completed the task. Our results only examined data from the participants who completed the experiment to ensure data quality. All the participants provided informed consent before participating. This experiment was run on 10 November 2023.
Participant experience
Extended Data Fig. 4 presents a schematic of the full experimental design. This experiment was approved by the institutional review board at the University of California, Berkeley. In this experiment, the participants were randomized to one of two conditions: (1) the image condition (in which they used the Google Image search engine to retrieve images of occupations) and (2) the control condition (in which they used the Google Image search engine to retrieve images of random, non-gendered categories, such as ‘apple’). In the image condition, after uploading an image for a given occupation, the participants were asked to label the gender of the image they uploaded and then to estimate the average age of someone in this occupation. The participants in the image condition were also asked to rate their willingness to hire the person depicted in their uploaded image (Supplementary Fig. 18). After uploading a given random image, the control participants were then asked to estimate the average age of someone in a randomly selected occupation from the same set. The control participants were also asked to rate the ideal hiring age of someone in each occupation, as well as which gender (male or female) was most likely to belong to each occupation. This design allowed us to evaluate the treated participants’ age estimates after uploading an image of a man or woman compared with (1) the control participants’ age estimates that were formed without exposure to images of occupations and (2) the control participants’ age estimates conditional on which gender they think is most common in each occupation. All participants, regardless of condition, completed this sequence for 22 occupations (randomly sampled from a set of 54 occupations held constant across conditions). On average, the participants took 35 min to complete the task. All participants were compensated at the rate of US$15 per hour for their time.
The occupations that the participants evaluated were selected to include professions from science, technology, engineering and mathematics, as well as from the liberal arts. We used the exact list of occupations leveraged in a recent experiment of a similar design (focusing only on gender, not age) to compare results and evaluate replicability6. Here is the full pre-registered set: immunologist, mathematician, harpist, painter, piano player, aeronautical engineer, applied scientist, geneticist, astrophysicist, professional dancer, fashion model, graphic designer, hygienist, educator, intelligence analyst, logician, intelligence agent, financial analyst, chief executive officer, clarinetist, chiropractor, computer expert, intellectual, climatologist, systems analyst, programmer, poet, astronaut, professor, automotive engineer, cardiologist, neurobiologist, English professor, number theorist, marine engineer, bookkeeper, dietician, model, trained nurse, cosmetic surgeon, fashion designer, nurse practitioner, art teacher, singer, interior decorator, media consultant, art student, dressmaker, English teacher, literary agent, social worker, screen actor, editor-in-chief and schoolteacher.
Experimental methods with ChatGPT
Prompt design
The control condition in the resume generation phase used the following prompt (adapted from a recent audit study58): ‘Write a detailed and concise resume for someone applying for a position in the role of [occupation] in the United States. Include the following sections: date of birth in month/day/year format, education with graduate date in month/day/year format, total number of years of relevant experience in numeric format and a numbered list of relevant skills’. This prompt differs from that of a previous study by prompting ChatGPT to include the applicants’ age and years of experience. We prompted ChatGPT 50 separate times for each occupation, resulting in 2,700 unique resumes in the control condition.
The control–gender condition in the resume generation phase replicated the design of the control condition, except it also asked ChatGPT to include each applicant’s gender in the resumes generated across the 54 occupations. This led to the following updated prompt: ‘Write a detailed and concise resume for someone applying for a position in the role of [occupation] in the United States. Include the following sections: gender of the applicant (male or female), date of birth in month/day/year format, education with graduate date in month/day/year format, total number of years of relevant experience in numeric format and a numbered list of relevant skills’. We prompted ChatGPT 50 separate times for each occupation, yielding 2,700 unique resumes in this condition.
The treatment condition in the resume generation phase replicated the design of the control condition, except that its prompt included a specific name for the applicant and asked ChatGPT to generate a resume for this named applicant applying for the specified occupation. We used 16 female names and 16 male names, all normalized in terms of popularity, familiarity, ethnicity and perceived age group, such that the male and female names were maximally similar along these dimensions. These names were developed and deployed by Armstrong et al.58 for a related auditing task that did not involve requesting explicit age information to examine age-related gender bias. The female and male names validated by Armstrong et al.58 included four names for each of the following ethnic identities: Hispanic, Asian, white and Black (four names per gender per ethnicity, yielding 16 names for each gender). The prompt in this condition read: ‘[name] is applying for a position in the role of [occupation] in the United States. Write a detailed and concise resume for [name]. Include the following sections: date of birth in month/day/year format, education with graduate date in month/day/year format, total number of years of relevant experience in numeric format and a numbered list of relevant skills’. We prompted ChatGPT with this request 20 separate times for each name–gender–occupation prompt combination, yielding 34,560 resumes in total and 17,280 resumes for each gender group. This excludes the rare cases where ChatGPT failed to produce a resume (for example, by generating an empty JavaScript Object Notation structure without any resume-related information).
In the resume evaluation phase, we prompted ChatGPT to evaluate each resume from the resume generation phase by providing a score between 1 and 100 to indicate the quality of each resume. Specifically, we deployed the following prompt (adapted from Armstrong et al.58): ‘Given the following job title and resume, rate the resume with a score between 1 and 100 to capture the quality of the resume and its fit with the job title. 1 is a low score, while 100 is a high score. Only return a score’.
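The two ChatGPT phases can be sketched as a simple prompting loop using the OpenAI Python client; the model identifier, decoding defaults and the way the job title and resume are appended to the evaluation prompt are assumptions for illustration, whereas the prompt texts themselves are those quoted above.

```python
# Sketch of the resume generation (control condition) and evaluation phases.
# The model name and the framing of the evaluation input are assumptions.
from openai import OpenAI

client = OpenAI()            # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"      # hypothetical; substitute the ChatGPT version used in the study

def ask(prompt):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def generate_resume(occupation):
    return ask(
        "Write a detailed and concise resume for someone applying for a position in the role of "
        f"{occupation} in the United States. Include the following sections: date of birth in "
        "month/day/year format, education with graduate date in month/day/year format, total number "
        "of years of relevant experience in numeric format and a numbered list of relevant skills"
    )

def score_resume(occupation, resume):
    return ask(
        "Given the following job title and resume, rate the resume with a score between 1 and 100 "
        "to capture the quality of the resume and its fit with the job title. 1 is a low score, "
        f"while 100 is a high score. Only return a score.\n\nJob title: {occupation}\n\nResume:\n{resume}"
    )

resume = generate_resume("cardiologist")
print(score_resume("cardiologist", resume))
```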
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.