Articles

Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline or to find specific workflow applications you may be considering.

Each entry below lists the article's title, followed by the type of resource, date recorded, open-science status, use of LLM, research discipline(s), and a description of the resource.
Diminished Diversity-of-Thought in a Standard Large Language Model
Type of Resource: Research Article | Date Recorded: September 14, 2023 | Open Science: Preprint | Use of LLM: Data Generation | Research Discipline(s): Computer Science
Description: We test whether Large Language Models (LLMs) can be used to simulate human participants in social-science studies. To do this, we run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model, colloquially known as GPT3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the "correct answer" effect. Different runs of GPT3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses: with the supposedly "correct answer." In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt. In another, we found that most but not all "correct answers" were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey results, where we found GPT3.5 identifying as a political conservative in 99.6% of the cases, and as a liberal in 99.3% of the cases in the reverse-order condition. However, both self-reported 'GPT conservatives' and 'GPT liberals' showed right-leaning moral foundations. Our results cast doubt on the validity of using LLMs as a general replacement for human participants in the social sciences. Our results also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity-of-thought.
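For readers who want to see what this repeated-sampling setup looks like in practice, here is a minimal sketch using the current OpenAI Python client (text-davinci-003 has since been deprecated); the survey item, model name, and sample size are illustrative assumptions, not the authors' materials.

```python
# Probe response variation across repeated runs of one survey item.
# Prompt, model, and answer scale are illustrative placeholders.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a respondent completing a survey.\n"
    "On a scale of 1 (strongly disagree) to 5 (strongly agree), "
    "how much do you agree that taxes on the wealthy should increase? "
    "Answer with a single digit."
)

answers = []
for _ in range(50):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # sampling noise on; zero spread would still appear
        max_tokens=1,
    )
    answers.append(resp.choices[0].message.content.strip())

# Near-zero spread across runs is the "correct answer" effect described above.
print(Counter(answers))
```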
Using OpenAI models as a new tool for text analysis in political leaders’ unstructured discourse
Type of Resource: Research Article | Date Recorded: August 17, 2023 | Open Science: Preprint | Use of LLM: Data Analysis | Research Discipline(s): Political Science
Description: This study explores the application of Large Language Models (LLMs) and Automatic Speech Recognition (ASR) models in the analysis of right-wing unstructured political discourse in Peru, focusing on how the concept of freedom is framed. Three types of freedom are identified: personal autonomy, economic freedom, and civil liberties. Using transcriptions from OpenAI’s Whisper ASR model together with the GPT-3.5 and GPT-4 models, interviews with three Peruvian right-wing political leaders are analyzed: Rafael López Aliaga, Hernando de Soto, and Keiko Fujimori. The results show that GPT-4 outperforms GPT-3.5 in identifying dimensions of freedom, although there are discrepancies compared to human coding. Despite challenges in classifying abstract and ambiguous concepts, the findings demonstrate GPT-4's ability to classify complexities within political discourse at comparatively low cost and with easy access. The research suggests the need for additional refinement, ethical consideration, and ongoing exploration in the analysis of political speeches through AI.
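A hedged sketch of the transcribe-then-classify pipeline this entry describes, using OpenAI's hosted Whisper and chat endpoints; the file name, label set, and prompt wording are assumptions, not the authors' code.

```python
# Transcribe an interview with Whisper, then ask GPT-4 to label how
# "freedom" is framed. Labels mirror the three dimensions in the abstract.
from openai import OpenAI

client = OpenAI()

with open("interview_audio.mp3", "rb") as audio:  # hypothetical file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio
    ).text

labels = ["personal autonomy", "economic freedom", "civil liberties", "none"]
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Classify how the concept of freedom is framed in this passage. "
            f"Answer with exactly one of {labels}.\n\n{transcript[:4000]}"
        ),
    }],
    temperature=0,
)
print(resp.choices[0].message.content)
```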
Extracting protest events from newspaper articles with ChatGPT
Type of Resource: Research Article | Date Recorded: August 12, 2023 | Open Science: Preprint | Use of LLM: Data Collection | Research Discipline(s): Other
Description: This research note examines the abilities of a large language model (LLM), ChatGPT, to extract structured data on protest events from media accounts. Based on our analysis of 500 articles on Black Lives Matter protests, after an iterative process of prompt improvement on a training dataset, ChatGPT can produce data comparable to or better than a hand-coding method with an enormous reduction in time and minimal cost. While the technique has limitations, LLMs show promise and deserve further study for their use in protest event analysis.
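The core of this kind of workflow is a structured-extraction prompt; a minimal sketch follows, in which the schema (date, city, estimated_size, claims) and file name are illustrative stand-ins for the authors' codebook, not their actual field set.

```python
# Prompt an OpenAI chat model to return protest-event fields as JSON.
import json

from openai import OpenAI

client = OpenAI()

with open("blm_article_017.txt", encoding="utf-8") as f:  # hypothetical input
    article = f.read()

prompt = (
    "From the news article below, extract each protest event. Return a JSON "
    "object with one key, 'events', whose value is a list of objects with "
    "keys: date, city, estimated_size, claims. Use null for unreported "
    "fields.\n\n" + article
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # JSON mode requires a recent model snapshot
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    response_format={"type": "json_object"},
)
events = json.loads(resp.choices[0].message.content)["events"]
print(events)
```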
CoAIcoder: Examining the Effectiveness of AI-assisted Human-to-Human Collaboration in Qualitative Analysis
Type of Resource: Research Article | Date Recorded: July 26, 2023 | Open Science: Preprint | Use of LLM: Data Analysis | Research Discipline(s): Computer Science
Description: While AI-assisted individual qualitative analysis has been substantially studied, AI-assisted collaborative qualitative analysis (CQA)-a process that involves multiple researchers working together to interpret data-remains relatively unexplored. After identifying CQA practices and design opportunities through formative interviews, we designed and implemented CoAIcoder, a tool leveraging AI to enhance human-to-human collaboration within CQA through four distinct collaboration methods. With a between-subject design, we evaluated CoAIcoder with 32 pairs of CQA-trained participants across common CQA phases under each collaboration method. Our findings suggest that while using a shared AI model as a mediator among coders could improve CQA efficiency and foster agreement more quickly in the early coding stage, it might affect the final code diversity. We also emphasize the need to consider the independence level when using AI to assist human-to-human collaboration in various CQA scenarios. Lastly, we suggest design implications for future AI-assisted CQA systems.
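As a rough illustration of the "shared AI model as a mediator" idea, the sketch below has both coders draw suggestions from one accumulating pool of confirmed codes, so early decisions steer later suggestions toward agreement; the design and prompt are simplified assumptions, not CoAIcoder's implementation.

```python
# Two coders request code suggestions; in the shared condition, both prompts
# include the same growing pool of already-confirmed example codes.
from openai import OpenAI

client = OpenAI()
shared_examples: list[str] = []  # grows as either coder confirms a code

def suggest_code(excerpt: str) -> str:
    context = "\n".join(shared_examples[-10:])  # most recent confirmed codes
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Suggest a short qualitative code for this excerpt, "
                f"consistent with prior codes:\n{context}\n\nExcerpt: {excerpt}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

code = suggest_code("I only share my location with apps I trust.")
shared_examples.append(f"{code}: I only share my location with apps I trust.")
```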
LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding
Type of Resource: Research Article | Date Recorded: July 24, 2023 | Open Science: Preprint | Use of LLM: Data Analysis | Research Discipline(s): Computer Science, Other
Description: Deductive coding is a widely used qualitative research method for determining the prevalence of themes across documents. While useful, deductive coding is often burdensome and time-consuming since it requires researchers to read, interpret, and reliably categorize a large body of unstructured text documents. Large language models (LLMs), like ChatGPT, are a class of quickly evolving AI tools that can perform a range of natural language processing and reasoning tasks. In this study, we explore the use of LLMs to reduce the time it takes for deductive coding while retaining the flexibility of a traditional content analysis. We outline the proposed approach, called LLM-assisted content analysis (LACA), along with an in-depth case study using GPT-3.5 for LACA on a publicly available deductive coding data set. Additionally, we conduct an empirical benchmark using LACA on 4 publicly available data sets to assess the broader question of how well GPT-3.5 performs across a range of deductive coding tasks. Overall, we find that GPT-3.5 can often perform deductive coding at levels of agreement comparable to human coders. Additionally, we demonstrate that LACA can help refine prompts for deductive coding, identify codes for which an LLM is randomly guessing, and help assess when to use LLMs vs. human coders for deductive coding. We conclude with several implications for future practice of deductive coding and related research methods.
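A sketch of the kind of agreement check this evaluation implies, assuming the LLM and human labels have already been collected; Cohen's kappa via scikit-learn is one standard choice of metric, not necessarily the paper's exact one.

```python
# Compare LLM deductive codes against a human coder with Cohen's kappa,
# and flag near-chance agreement where the model may be guessing.
from sklearn.metrics import cohen_kappa_score

# Placeholder labels for six documents under one codebook.
human = ["theme_a", "theme_b", "theme_a", "none", "theme_b", "theme_a"]
llm   = ["theme_a", "theme_b", "theme_b", "none", "theme_b", "theme_a"]

kappa = cohen_kappa_score(human, llm)
print(f"kappa = {kappa:.2f}")
if kappa < 0.2:
    print("Agreement near chance: the LLM may be randomly guessing this code.")
```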
Utilizing Machine Learning Algorithms Trained on AI-generated Synthetic Participant Recent Music-Listening Activity in Predicting Big Five Personality Traits
Type of Resource: Research Article | Date Recorded: July 13, 2023 | Open Science: Preprint, Open Source | Use of LLM: Data Generation | Research Discipline(s): Engineering, Psychology
Description: The recent rise of publicly available artificial intelligence (AI) tools such as ChatGPT has raised a plethora of questions among users and skeptics alike. One major question asks, "Has AI gained the ability to indistinguishably mimic the psychology of its organic, human counterpart?" Since music has been known to be a positive predictor of personality traits due to the individuality of personal preference, in this paper we use machine learning (ML) algorithms to analyze the predictability of AI-generated or 'synthetic' participants' Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) using their recent music-listening activity and motivations for listening to music. Recent music-listening history for synthetic participants is generated using ChatGPT, and the corresponding audio features for the songs are derived via the Spotify Application Programming Interface (beats per minute, danceability, instrumentals, happiness, etc.). This study also administers the Uses of Music Inventory to account for synthetic participants’ motivations for listening to music: emotional, cognitive, and background. The dataset is trained and tested on scaler-model combinations to identify the predictions with the least mean absolute error, using ML models such as Random Forest, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Support Vector Machine. Both regression (continuous numeric value) and classification (Likert-scale option) prediction methods are used. An Exploratory Factor Analysis (EFA) is conducted on the audio features to find a latent representation of the dataset on which the machine learning models are also trained and tested. A full literature review showed this is the first study to use both Spotify API data, rather than self-reported music preference, and machine learning, in addition to traditional statistical tests and regression models, to predict the personality of a synthetic college-student demographic. The findings of this study show ChatGPT struggles to mimic the diverse and complex nature of human personality psychology and music taste. This paper is a pilot study for a broader ongoing investigation in which the findings from synthetic participants are compared to those of real college students using the same inventories, for which data collection is ongoing.
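The prediction step might look roughly like the following sketch; the feature columns mirror Spotify's audio features, and the random placeholder data stands in for the ChatGPT-generated listening histories and Big Five scores.

```python
# Predict a Big Five trait score from audio features of recently played tracks.
# Random placeholder data; real inputs would come from ChatGPT-generated
# listening histories joined with Spotify audio features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 4))      # e.g., tempo, danceability, instrumentalness, valence
y = rng.uniform(1, 5, 200)    # e.g., Openness on a 1-5 scale

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```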
Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Type of Resource: Research Article | Date Recorded: July 11, 2023 | Open Science: Preprint | Use of LLM: Data Generation | Research Discipline(s): Computer Science, Economics, Psychology, Sociology
Description: We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
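A minimal sketch of one such Turing Experiment, an Ultimatum Game with the responder's surname varied across simulated participants; the names, stakes, and prompt wording are illustrative assumptions, not the paper's materials.

```python
# Simulate Ultimatum Game responders by varying the surname in the prompt,
# then read off accept/reject decisions across the simulated sample.
from openai import OpenAI

client = OpenAI()

decisions = {}
for name in ["Garcia", "Nguyen", "Smith", "Okafor"]:  # illustrative names
    prompt = (
        f"{name} is the responder in an Ultimatum Game. The proposer offers "
        f"$2 out of $10. Does {name} accept or reject? Answer 'accept' or 'reject'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    decisions[name] = resp.choices[0].message.content.strip().lower()

print(decisions)  # a distribution across simulated participants, not one individual
```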
Can Large Language Models Help Augment English Psycholinguistic Datasets?
Type of Resource: Research Article | Date Recorded: July 8, 2023 | Open Science: Preprint | Use of LLM: Data Generation | Research Discipline(s): Languages
Description: Research on language and cognition relies extensively on large psycholinguistic datasets, sometimes called “norms”. These datasets contain judgments of lexical properties like concreteness and age of acquisition, and can be used to norm experimental stimuli, discover empirical relationships in the lexicon, and stress-test computational models. However, collecting human judgments at scale is both time-consuming and expensive. This issue of scale is made more difficult for norms containing multiple semantic dimensions and especially for norms that incorporate linguistic context. In the current work, I explore whether advances in Large Language Models (LLMs) can be leveraged to augment the creation of large, psycholinguistic datasets in English. I use GPT-4 to collect multiple kinds of semantic judgments (e.g., word similarity, contextualized sensorimotor associations, iconicity) for English words and compare these judgments against the human “gold standard”. For each dataset, I find that GPT-4’s judgments are positively correlated with human judgments, in some cases rivaling or even exceeding the average inter-annotator agreement displayed by humans. I then explore whether and how LLM-generated norms differ from human-generated norms systematically. I also perform several “substitution analyses”, which demonstrate that replacing human-generated norms with LLM-generated norms in a statistical model does not change the sign of parameter estimates (though in select cases, there are significant changes to their magnitude). Finally, I conclude by discussing the limitations of this approach and under what conditions LLM-generated norms could be useful to researchers.
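Comparing LLM-generated norms to human gold standards reduces, at its core, to correlating two rating vectors; a small sketch with placeholder values follows (the paper's datasets and exact statistics may differ).

```python
# Correlate LLM-generated lexical judgments with human norms.
# The ratings below are placeholders for, e.g., concreteness judgments.
from scipy.stats import pearsonr, spearmanr

human_norms = [4.8, 1.9, 3.2, 4.1, 2.5]   # human "gold standard" ratings
gpt_norms   = [4.6, 2.2, 3.0, 4.4, 2.1]   # GPT-4 ratings for the same words

r, _ = pearsonr(human_norms, gpt_norms)
rho, _ = spearmanr(human_norms, gpt_norms)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```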
Friend or Foe? Exploring the Implications of Large Language Models on the Science System
Type of Resource: Research Article | Date Recorded: June 20, 2023 | Open Science: Preprint, Open Source | Use of LLM: Other | Research Discipline(s): Other
Description: The advent of ChatGPT by OpenAI has prompted extensive discourse on its potential implications for science and higher education. While the impact on education has been a primary focus, there is limited empirical research on the effects of large language models (LLMs) and LLM-based chatbots on science and scientific practice. To investigate this further, we conducted a Delphi study involving 72 experts specialising in research and AI. The study focused on applications and limitations of LLMs, their effects on the science system, ethical and legal considerations, and the required competencies for their effective use. Our findings highlight the transformative potential of LLMs in science, particularly in administrative, creative, and analytical tasks. However, risks related to bias, misinformation, and quality assurance need to be addressed through proactive regulation and science education. This research contributes to informed discussions on the impact of generative AI in science and helps identify areas for future action.
Taking Advice from ChatGPT
Type of Resource: Research Article | Date Recorded: May 23, 2023 | Open Science: Preprint | Use of LLM: Data Collection, Data Generation | Research Discipline(s): Psychology
Description: A growing literature studies how humans incorporate advice from algorithms. This study examines an algorithm with millions of daily users: ChatGPT. We conduct a lab experiment in which 118 student participants answer 2,828 multiple-choice questions across 25 academic subjects. We present participants with answers from a GPT model and allow them to update their initial responses. We find that the advisor's identity ("AI chatbot" versus a human "expert"), presence of written justification, and advice correctness do not significantly affect weight on advice. Instead, we show that participants weigh advice more heavily if they (1) are unfamiliar with the topic, (2) used ChatGPT in the past, or (3) received more accurate advice previously. These three effects—task difficulty, algorithm familiarity, and experience, respectively—appear to be stronger with an AI chatbot as the advisor. Moreover, we find that participants are able to place greater weight on correct advice only when written justifications are provided. In a parallel analysis, we find that the student participants are miscalibrated and significantly underestimate the accuracy of ChatGPT on 10/25 topics. Students under-weigh advice by over 50% and would have scored better if they trusted ChatGPT more.
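In this literature, weight on advice (WOA) is conventionally computed as the shift from a participant's initial to final answer relative to the advice received; a worked example under that standard definition follows (the paper's exact operationalization may differ).

```python
# Weight on advice (WOA): how far a participant moves toward the advisor.
# WOA = (final - initial) / (advice - initial); 0 = ignored, 1 = fully adopted.
def weight_on_advice(initial: float, advice: float, final: float) -> float:
    if advice == initial:
        raise ValueError("WOA is undefined when advice equals the initial answer")
    return (final - initial) / (advice - initial)

# A participant initially at 40 (e.g., % confidence) moves to 55 after advice of 70.
print(weight_on_advice(40, 70, 55))  # 0.5 -> half weight placed on the advice
```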