STEM Stereotypes Through the Lens of Generative Artificial Intelligence

Seth Orvin

University of Central Arkansas

Academically, commercially, and societally, attention towards artificial intelligence has increased exponentially in the past decade. Between 2010 and 2019, academic papers about artificial intelligence increased 20-fold to about 20,000 publications per year, with the most popular subcategory of artificial intelligence being machine learning—the technology behind popular generative AI tools (i.e., those AI tools designed to generate textual, visual, and/or auditory content for human consumption) (Russell & Norvig, 2022). The sudden launch of these tools into the mainstream has raised ethical concerns about the output that they produce. Numerous studies have identified that generative AI tools are prone to producing content that either mirrors or worsens stereotypes of race and gender, bringing into question the algorithms and training data utilized by such tools (Ali et al., 2024; Dehouche, 2021; Nicoletti & Bass, 2023). In the context of these findings, this study investigates the extent to which text-to-image AI tools produce output that reflects stereotypes that particularly affect STEM fields. 

Literature Review

Though “artificial intelligence” lacks a universally recognized definition, Ali et al. argue that all forms of AI throughout its history have “shared logics—especially managerial, military, industrial and computational—[that] cut across them, often in ways that reinforce oppressive racial and gender hierarchies” (Ali et al., 2023, p. 8). The propensity for machine learning-based models to replicate societal biases is well documented. Researchers found that—when prompted to generate “visualizations of surgeons across 8 surgical specialties”—two leading text-to-image AI models (Midjourney 5.1 and Stable Diffusion 2.1) “[depicted] over 98% of surgeons as white and male” (Ali et al., 2024, p. 87-88). Similarly, the generation and classification of 10,000 portrait photographs through a Contrastive Language-Image Pretraining (CLIP) model created “a strong positive correlation … between labels Female and Attractive, Male and Rich, as well as White Person and Attractive” (Dehouche, 2021, p. 167936). 

In an adjacent study, Bloomberg also found that generative AI models “[amplify] stereotypes about race and gender” (Nicoletti & Bass, 2023). Using Stable Diffusion (a free, open-source text-to-image generative AI tool), Bloomberg “generated thousands of images related to job titles and crime” and analyzed the skin colors and perceived genders of the people in the images. Their prompts specified seven high-paying occupations (architect, lawyer, politician, doctor, CEO, judge, and engineer) and seven low-paying occupations (janitor, dishwasher, fast-food worker, cashier, teacher, social worker, and housekeeper). They found that the images generated for high-paying occupations disproportionately featured men with lighter skin tones, while the subjects in images generated for low-paying occupations were largely women with darker skin tones; images for the “housekeeper” occupation did not show a single perceived man across 300 generated images (Nicoletti & Bass, 2023). The images generated for criminals and terrorists disproportionately included men with darker skin tones, many of whom were rendered as stereotypes of Muslim men.

Bloomberg’s findings generate interest in the extent to which artificial intelligence tools might amplify stereotypes related to STEM fields, which are notoriously impacted by gendered and racial biases. Such stereotypes are especially common within computer science and engineering fields, with “children as young as age six … and adolescents across multiple racial/ethnic and gender intersections … [endorsing] stereotypes that girls are less interested than boys in computer science and engineering” (Master et al., 2021, p. 1). Additionally, throughout STEM, the underrepresentation of Black and Latinx people is “falsely attributed to personal characteristics such as inferior intelligence, weak work ethic, and deficiencies in mathematics,” whereas the representation of Asian students “is explained by such stereotypes as superior intelligence, strong work ethic, or excelling in math, all of which are a part of the model minority concept” (Lee et al., 2020, p. 3). Many occupation-specific stereotypes also exist, such as the notion that “if you aren’t white, and you aren’t Asian, and you aren’t Indian, you aren’t an engineer” (Lee et al., 2020, p. 4). 

Stereotypes and biases that favor men throughout STEM pervade higher education. Researchers found that when university faculty were given implicit bias training, the “men were [still] more likely than women to explicitly endorse stereotypes about women in STEM … and these attitudes did not change as a result of diversity training” (Jackson et al., 2014, p. 419). Additional research has studied how biology and physics professors evaluate identical curriculum vitae (CVs) for postdoctoral positions when the CVs differ only by candidate names that are selected “to manipulate race (Asian, Black, Latinx, and white) and gender,” finding that women—especially Black and Latina women—are more likely to be disfavored as less competent and less hirable for postdoctoral physics and biology positions (Eaton et al., 2019, p. 127). 

While it is evident that generative AI can worsen existing stereotypes and that STEM fields are notorious for gendered and racial stereotypes, research has not thoroughly studied the intersection of these phenomena. To investigate the influence of STEM stereotypes on generative AI, this study prompts multiple AI tools to generate professional headshots of workers from a list of STEM occupations and systematically analyzes the resulting images to discern whether the distributions of skin color and perceived gender that arise for each occupation align with stereotypical STEM identities. 

Methods

Three text-to-image generative AI tools were selected for study: Stable Diffusion, ImageFX from Google, and Adobe Firefly. These tools were selected for their free pricing options and accessibility. Both ImageFX and Adobe Firefly were accessed through a web browser. For Adobe Firefly, the Firefly Image 3 (preview) model was used. Firefly allows users to provide reference images for ‘style’ and ‘structure’; no reference images were used. Additionally, Firefly allows users to set the “visual intensity” of generated images; the intensity value was set to zero for the purposes of this study. The version of Stable Diffusion used was the Stable Diffusion XL 1.0 (SDXL 1.0) base model, without its associated refinement model. While a text-to-image generator for SDXL 1.0 is accessible through a web browser via the Hugging Face website, this study instead implemented a Python program that queried Hugging Face servers to generate images that could be conveniently downloaded in batches. 
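The exact script is not central to the findings, but a minimal sketch of the batch-generation step is shown below, assuming the public Hugging Face Inference API endpoint for the stabilityai/stable-diffusion-xl-base-1.0 model, a personal access token stored in an environment variable, and illustrative function and directory names; the prompt template is the one described under the image-collection procedure below.

```python
import os
import requests

# Sketch of batch generation against the Hugging Face Inference API.
# The endpoint, token handling, and file layout here are assumptions for illustration.
API_URL = "https://api-inference.huggingface.co/models/stabilityai/stable-diffusion-xl-base-1.0"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}  # personal access token

def generate_batch(occupation: str, count: int, out_dir: str) -> None:
    """Request `count` candidate images for one occupation and save them as PNG files."""
    prompt = f"A color photograph of a {occupation}, centered headshot, high-quality"
    os.makedirs(out_dir, exist_ok=True)
    for i in range(count):
        response = requests.post(API_URL, headers=HEADERS, json={"inputs": prompt})
        response.raise_for_status()
        path = os.path.join(out_dir, f"{occupation.replace(' ', '_')}_{i:02d}.png")
        with open(path, "wb") as f:
            f.write(response.content)

if __name__ == "__main__":
    # Example: request ten candidates for one occupation; more candidates than needed
    # may be requested, since some are later rejected against the image criteria below.
    generate_batch("civil engineer", 10, "images/sdxl/civil_engineer")
```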

SDXL 1.0, ImageFX, and Adobe Firefly were also selected for the range and realism of their output. Each of these models is capable of producing convincing images of humans. Notably, though, the images produced by SDXL 1.0 may have noise or a slight blurriness because the SDXL 1.0 refinement model was not used (due to local system software limitations and convenience, the base model was considered adequate). AI tools that were considered but rejected included Image Creator from Microsoft Designer and Stable Diffusion 2.0. Each of these tools regularly generated images that either did not meet the image criteria or quality standards for this study. 

Ten STEM occupations representing a range of average incomes were selected: software developer, statistician, civil engineer, physicist, architect, chemist, geoscientist, environmental scientist, nutritionist, and biologist. For each occupation, each AI tool was used to generate ten artificial images of workers using the prompt “A color photograph of a ___, centered headshot, high-quality”, yielding 300 images in total. The images were generated in April 2024. Images were required to be professional headshots featuring minimal content outside of the subject, though some occupation-associated settings or themes were allowed, such as the outdoors for environmental scientists and geoscientists. Images were most often rejected for the following reasons: multiple subjects being shown; the subject being shown in side profile rather than a front-on portrait; the subject appearing nearly identical to the subject of a previously generated image (common with ImageFX); the image being in greyscale; the subject’s face being out of frame or blocked by an object or accessory (e.g., a medical mask); or the image otherwise having a composition or subject positioning that deviates from a professional headshot. The first ten valid images generated for each occupation and AI tool were selected. 

After the images were collected, the subject of each image was manually labeled as either a “man” or a “woman”. For each AI tool, the percentage of women present in each occupation was calculated, as was the overall percentage of women across all of that tool’s generated images. Using OpenCV (an open-source computer vision library), a Python program was then implemented that automatically crops each image around the subject’s face, identifies which pixels are skin, and determines the specific hexadecimal color that appears most frequently across all skin pixels. For each subject, this color was selected as their typical skin color. Some automatic crops included parts of the subject’s hair or the image’s background, which produced inaccurate skin tone selections; in such cases, the affected images were manually cropped prior to skin tone analysis to avoid interference from non-skin regions. Feeding all generated images through this program yielded a distribution of typical skin colors for each occupation and AI tool. 
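As a rough illustration of this analysis step, the sketch below uses OpenCV’s bundled Haar cascade face detector and a conventional YCrCb skin mask; the particular detector, color space, and threshold values are assumptions for illustration rather than a record of the study’s actual program.

```python
import collections

import cv2

# Sketch of the skin-tone pipeline: crop to the largest detected face, mask
# likely skin pixels, and report the most frequent skin color as a hex code.
# The Haar cascade and YCrCb thresholds below are illustrative assumptions.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def typical_skin_color(image_path: str) -> str:
    """Return the modal skin color of the subject's face as a hex string."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        raise ValueError(f"No face detected in {image_path}; crop manually.")
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    face = image[y:y + h, x:x + w]
    # Classic YCrCb skin segmentation; pixels outside the range are discarded.
    ycrcb = cv2.cvtColor(face, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))
    skin_pixels = face[mask > 0]  # (N, 3) array of BGR values
    counts = collections.Counter(map(tuple, skin_pixels))
    b, g, r = (int(v) for v in counts.most_common(1)[0][0])
    return f"#{r:02x}{g:02x}{b:02x}"
```

Because the modal color is taken over exact pixel values, any non-skin pixels that slip past the crop or the mask can dominate the count, which is why poorly cropped images were re-cropped by hand.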

Results

For almost all of the STEM occupations tested, the majority of the subjects in the images generated by ImageFX and SDXL 1.0 were men. For four of the occupations (software developer, civil engineer, physicist, and geoscientist), neither ImageFX nor SDXL 1.0 generated any women (see Tables A1 and B1). For five of the six remaining occupations (statistician, architect, chemist, environmental scientist, and biologist), ImageFX generated only one woman per occupation; for the same occupations, SDXL 1.0 generated two, two, zero, three, and five women, respectively. Nutritionist was the only occupation for which ImageFX and SDXL 1.0 generated a majority of subjects as women: all of the nutritionists generated by SDXL 1.0 and 90% of the nutritionists generated by ImageFX were women.

ImageFX and SDXL 1.0 each predominantly generated images of people with lighter skin tones, with SDXL 1.0 especially generating paler skin tones (see Tables A2 and B2). The vast majority of the people generated by both ImageFX and SDXL appeared racially white, broadly ethnically Asian, or otherwise racially ambiguous, with minimal perceived racial or ethnic representation outside of those groups. For each of these two models, architects, biologists, and environmental scientists mostly appeared white. The models also agreed on output for civil engineers and statisticians, who mostly appeared Asian. They disagreed on output for software developers, with ImageFX’s subjects mostly appearing white and SDXL 1.0’s subjects mostly appearing Asian. 

Images produced by Adobe Firefly differed greatly from those produced by the other two models. While ImageFX and SDXL 1.0 generated women in 14% and 22% of their images, respectively (see Tables A1 and B1), Firefly generated a woman in 54% of its images (see Table C1). And while the meager overall percentages for ImageFX and SDXL 1.0 are inflated by the nutritionist outlier, Firefly represented women in every occupation, with only three occupations falling below 50% women: software developers (40%), physicists (30%), and chemists (40%). While most of the subjects in images generated by Firefly continued to appear white or otherwise have lighter skin tones, the subjects were more racially diverse overall, and the distributions of perceived race and skin tone appeared less associated with particular occupations. Firefly also generated more images of Black subjects, who are mostly absent from the output of ImageFX and SDXL 1.0.

Limitations

Though multiple AI tools were studied to account for variation between them, this study is limited in that only free generative AI services were used. There is evidence suggesting that OpenAI’s paid DALL-E models might not amplify stereotypes of race or gender, similarly to Adobe Firefly (which is not fully free, as it requires a free trial or a subscription after a certain number of “credits” are used). In the same study that found Midjourney and Stable Diffusion to depict surgeons almost entirely as white men, DALL-E 2 “depicted comparable demographic characteristics to real attending surgeons” and “was able to generate images of Black surgeons without specific prompting” (Ali et al., 2024, pp. 93-94). Furthermore, a study examining the generation of science education imagery with AI found that DALL-E 3 “[showcased] diverse classrooms, featuring individuals of different genders, cultural, linguistic, and historical backgrounds” (Cooper & Tang, 2024, p. 10). 

The number of images produced per occupation per AI tool also represents a limitation of this study. Though hundreds of images were generated in total, only ten images were kept per AI-occupation pair. The effects of this relatively small sample were noticeable during data collection: some images that might have altered the demographic distribution of several AI-occupation pairs were ultimately rejected for not satisfying the study’s image criteria. Additionally, the image criteria should have been more stringent about lighting, the warmth and intensity of which can affect measured skin tone. The prompts fed to the AI should have specified that subjects have well-lit faces under neutral, indoor lighting. 

The methods for representing subjects’ genders and skin tones also present limitations. Both gender and race are socially constructed identities that cannot be fully conveyed by an image of an artificial person. So, though broad observations are made about subjects’ perceived races or ethnicities, this study does not attempt to systematically classify each subject by race or ethnicity, instead using skin color as a proxy. This study takes a subject’s typical skin tone to be the specific hexadecimal-coded color that appears most often across all of the pixels comprising skin on the subject’s face, an approach that is only effective because the generated images are high resolution. An alternative method might have averaged the color channel values across all of each subject’s skin pixels. Regarding gender, a best effort was made to classify subjects as “man” or “woman”, classifications that are assuredly influenced by Western markers of masculinity and femininity. For any given image, other researchers might have identified the subject differently (including as androgynous or ambiguous, categories that the Bloomberg study includes). Genderqueer identities were not included due to the lack of a consistent system for classifying such identities based solely on physical characteristics, especially for artificial, computer-generated subjects dressed in traditional, inexpressive professional attire. 
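For illustration, the averaging alternative might look like the following sketch, which takes the per-channel mean over the masked skin pixels rather than the single most frequent color; the function name and input format are hypothetical, not part of this study’s pipeline.

```python
import numpy as np

def mean_skin_color(skin_pixels: np.ndarray) -> str:
    """Hypothetical alternative: per-channel mean over an (N, 3) array of BGR skin pixels."""
    b, g, r = (int(v) for v in skin_pixels.mean(axis=0).round())
    return f"#{r:02x}{g:02x}{b:02x}"
```

Averaging smooths over shading and lighting variation within the face, but it can yield a blended color that no pixel actually contains, whereas the modal color used here is always an observed pixel value.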

This study also does not measure all of the stereotypes with the potential to influence the output of AI models. It is possible that stereotypes relating to socioeconomic status influenced the generated images. In their study of AI-generated science education imagery, Cooper and Tang found that, across images of classrooms generated for chemistry, physics, biology, and planetary science, “the resource-rich settings and formal attire in some images [suggested] a middle to upper SES background, reflecting their habitus—the collection of dispositions, attitudes, and experiences that guide their educational interactions” (Cooper & Tang, 2024, p. 10). 

Discussion

The images generated by ImageFX and SDXL 1.0 appear to amplify stereotypes about people in STEM primarily being men who are white or Asian. The outlier of nutritionists being nearly entirely represented as women by ImageFX and SDXL 1.0 is grounded in the actual makeup of that occupation. Per the U.S. Bureau of Labor Statistics, 86.3% of all dietitians and nutritionists in 2023 were women. For reference, the most male-dominated occupation tested in this study was civil engineers, 83.1% of whom were men in 2023. 

The results from Adobe Firefly suggest that its model is either trained on different data, does not give the same weight to data that might correlate race and gender with occupations, or otherwise uses algorithms deliberately tuned to generate images of people at rates more proportional to the actual demographics of the United States. If the latter is the case, it might be the most effective and optically ideal means of addressing the influence of social biases on generative AI. Due to the sensitivity of some machine learning models, though, some companies might fear overcorrection following the controversy, backlash, and temporary suspension of Google’s Gemini text-to-image tool, which unintentionally depicted the founding fathers as Black women and “ancient Greek warriors as Asian women and men” (Shamim, 2024). 

For the STEM fields, the findings of this study reinforce the need to encourage diversity within STEM while dismantling identity stereotypes. Otherwise, generative AI has the dangerous potential to create a positive feedback loop in which real-world stereotypes and disparities of gender and race in STEM become reinforced by their exacerbated presence in mass-produced, stereotype-prone, AI-generated content. This is especially a concern as more advanced generative AI technologies begin to emerge (e.g., high-quality, short-form video) and as AI becomes increasingly commercialized, with 30% of marketing content promoted by big companies expected to be produced using generative AI tools by 2025 (Nicoletti & Bass, 2023). Artificial intelligence must not be allowed to automate inequality. 

Appendix A

AI Image Analysis, ImageFX from Google

Table A1. Perceived genders of subjects in images produced by ImageFX from Google

Table A2. Typical skin colors of subjects in images produced by ImageFX from Google

Appendix B

AI Image Analysis, Stable Diffusion XL 1.0 (base version)

Table B1. Perceived genders of subjects in images generated by SDXL 1.0 (base version)

Table B2. Typical skin colors of subjects in images generated by SDXL 1.0 (base version)

Appendix C

AI Image Analysis, Adobe Firefly

Table C1. Perceived genders of subjects in images generated by Adobe Firefly

Table C2. Typical skin colors of subjects in images generated by Adobe Firefly

References

Ali, R., Tang, O. Y., Connolly, I. D., Abdulrazeq, H. F., Mirza, F. N., Lim, R. K., Johnston, B. R., Groff, M. W., Williamson, T., Svokos, K., Libby, T. J., Shin, J. H., Gokaslan, Z. L., Doberstein, C. E., Zou, J., & Asaad, W. F. (2024). Demographic Representation in 3 Leading Artificial Intelligence Text-to-Image Generators. JAMA Surgery, 159(1), 87–95. https://doi.org/10.1001/jamasurg.2023.5695

Ali, S. M., Dick, S., Dillon, S., Jones, M. L., Penn, J., & Staley, R. (2023). Histories of artificial intelligence: a genealogy of power. BJHS Themes, 8, 1–18. https://doi.org/10.1017/bjt.2023.15

Cooper, G., & Tang, K.-S. (2024). Pixels and Pedagogy: Examining Science Education Imagery by Generative Artificial Intelligence. Journal of Science Education and Technology, 1–13. https://doi.org/10.1007/s10956-024-10104-0

Dehouche, N. (2021). Implicit Stereotypes in Pre-Trained Classifiers. IEEE Access, 9. https://doi.org/10.1109/ACCESS.2021.3136898

Eaton, A. A., Saunders, J. F., & Jacobson, R. K. (2019). How Gender and Race Stereotypes Impact the Advancement of Scholars in STEM: Professors’ Biased Evaluations of Physics and Biology Post-Doctoral Candidates. Sex Roles, 82(3-4), 127–141. https://doi.org/10.1007/s11199-019-01052-w

Jackson, S. M., Hillard, A. L., & Schneider, T. R. (2014). Using implicit bias training to improve attitudes toward women in stem. Social Psychology of Education: An International Journal, 17(3), 419–438. https://doi.org/10.1007/s11218-014-9259-5

Lee, M. J., Collins, J. D., Harwood, S. A., Mendenhall, R., & Huntt, M. B. (2020). “If You Aren’t White, Asian or Indian, You Aren’t an Engineer”: Racial Microaggressions in STEM Education. International Journal of STEM Education, 7. https://doi.org/10.1186/s40594-020-00241-4

Master, A., Meltzoff, A. N., & Cheryan, S. (2021). Gender stereotypes about interests start early and cause gender disparities in computer science and engineering. Proceedings of the National Academy of Sciences of the United States of America, 118(48), 1–7. https://doi.org/10.1073/pnas.2100030118

Nicoletti, L., & Bass, D. (2023, June 9). Humans are Biased. Generative AI is Even Worse. Bloomberg. https://www.bloomberg.com/graphics/2023-generative-ai-bias/ 

Russell, S. J., & Norvig, P. (2022). Artificial Intelligence: A Modern Approach (4th ed.). Pearson Education. 

Shamim, S. (2024, March 9). Why Google’s AI tool was slammed for showing images of people of colour. Al Jazeera. https://www.aljazeera.com/news/2024/3/9/why-google-gemini-wont-show-you-white-people 

U.S. Bureau of Labor Statistics. (2023, January 25). Employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity. U.S. Bureau of Labor Statistics. https://www.bls.gov/cps/cpsaat11.htm
