Articles

Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or for specific workflow applications you may be considering.

TitleType of ResourceLink to ResourceDate RecordedOpen ScienceUse of LLMResearch Discipline(s)Description of Resource
Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies Research Article July 11, 2023 Preprint Data Generation Computer Science, Economics, Psychology, Sociology We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
Can Large Language Models Help Augment English Psycholinguistic Datasets? Research Article July 8, 2023 Preprint Data Generation Languages Research on language and cognition relies extensively on large, psycholinguistic datasets —sometimes called “norms”. These datasets contain judgments of lexical properties like concreteness and age of acquisition, and can be used to norm experimental stimuli, discover empirical relationships in the lexicon, and stress-test computational models. However, collecting human judgments at scale is both time-consuming and expensive. This issue of scale is made more difficult for norms containing multiple semantic dimensions and especially for norms that incorporate linguistic context. In the current work, I explore whether advances in Large Language Models (LLMs) can be leveraged to augment the creation of large, psycholinguistic datasets in English. I use GPT-4 to collect multiple kinds of semantic judgments (e.g., word similarity, contextualized sensorimotor associations, iconicity) for English words and compare these judgments against the human “gold standard”. For each dataset, I find that GPT-4’s judgments are positively correlated with human judgments, in some cases rivaling or even exceeding the average inter-annotator agreement displayed by humans. I then explore whether and how LLM-generated norms differ from human-generated norms systematically. I also perform several “substitution analyses”, which demonstrate that replacing human-generated norms with LLM-generated norms in a statistical model does not change the sign of parameter estimates (though in select cases, there are significant changes to their magnitude). Finally, I conclude by discussing the limitations of this approach and under what conditions LLM-generated norms could be useful to researchers.
Friend or Foe? Exploring the Implications of Large Language Models on the Science System Research Article June 20, 2023 Preprint, Open Source Other Other The advent of ChatGPT by OpenAI has prompted extensive discourse on its potential implications for science and higher education. While the impact on education has been a primary focus, there is limited empirical research on the effects of large language models (LLMs) and LLM-based chatbots on science and scientific practice. To investigate this further, we conducted a Delphi study involving 72 experts specialising in research and AI. The study focused on applications and limitations of LLMs, their effects on the science system, ethical and legal considerations, and the required competencies for their effective use. Our findings highlight the transformative potential of LLMs in science, particularly in administrative, creative, and analytical tasks. However, risks related to bias, misinformation, and quality assurance need to be addressed through proactive regulation and science education. This research contributes to informed discussions on the impact of generative AI in science and helps identify areas for future action.
Taking Advice from ChatGPT Research Article May 23, 2023 Preprint Data Collection, Data Generation Psychology A growing literature studies how humans incorporate advice from algorithms. This study examines an algorithm with millions of daily users: ChatGPT. We conduct a lab experiment in which 118 student participants answer 2,828 multiple-choice questions across 25 academic subjects. We present participants with answers from a GPT model and allow them to update their initial responses. We find that the advisor’s identity (``AI chatbot'' versus a human ``expert''), presence of written justification, and advice correctness do not significant affect weight on advice. Instead, we show that participants weigh advice more heavily if they (1) are unfamiliar with the topic, (2) used ChatGPT in the past, or (3) received more accurate advice previously. These three effects—task difficulty, algorithm familiarity, and experience, respectively—appear to be stronger with an AI chatbot as the advisor. Moreover, we find that participants are able to place greater weight on correct advice only when written justifications are provided. In a parallel analysis, we find that the student participants are miscalibrated and significantly underestimate the accuracy of ChatGPT on 10/25 topics. Students under-weigh advice by over 50% and would have scored better if they trusted ChatGPT more.
Out of One, Many: Using Language Models to Simulate Human Samples Research Article May 13, 2023 Open Source Data Collection Political Science We propose and explore the possibility that language models can be studied as effective proxies for specific human subpopulations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the “algorithmic bias” within one such tool—the GPT-3 language model—is instead both fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property algorithmic fidelity and explore its extent in GPT-3. We create “silicon samples” by conditioning the model on thousands of sociodemographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and sociocultural context that characterize human attitudes. We suggest that language models with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.
Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding Research Article April 24, 2023 Preprint Data Analysis Computer Science Qualitative analysis of textual contents unpacks rich and valuable information by assigning labels to the data. However, this process is often labor-intensive, particularly when working with large datasets. While recent AI-based tools demonstrate utility, researchers may not have readily available AI resources and expertise, let alone be challenged by the limited generalizability of those task-specific models. In this study, we explored the use of large language models (LLMs) in supporting deductive coding, a major category of qualitative analysis where researchers use pre-determined codebooks to label the data into a fixed set of codes. Instead of training task-specific models, a pre-trained LLM could be used directly for various tasks without fine-tuning through prompt learning. Using a curiosity-driven questions coding task as a case study, we found, by combining GPT-3 with expert-drafted codebooks, our proposed approach achieved fair to substantial agreements with expert-coded results. We lay out challenges and opportunities in using LLMs to support qualitative coding and beyond.
Surveying Generative AI’s Economic Expectations Research Article May 8, 2023 Preprint, Open Source Data Generation, Data Analysis Economics I introduce a survey of economic expectations formed by querying a large language model (LLM)’s expectations of various financial and macroeconomic variables based on a sample of news articles from the Wall Street Journal between 1984 and 2021. I find the resulting expectations closely match existing surveys including the Survey of Professional Forecasters (SPF), the American Association of Individual Investors, and the Duke CFO Survey. Importantly, I document that LLM based expectations match many of the deviations from full-information rational expectations exhibited in these existing survey series. The LLM’s macroeconomic expectations exhibit under-reaction commonly found in consensus SPF forecasts. Additionally, its return expectations are extrapolative, disconnected from objective measures of expected returns, and negatively correlated with future realized returns. Finally, using a sample of articles outside of the LLM’s training period I find that the correlation with existing survey measures persists – indicating these results do not reflect memorization but generalization on the part of the LLM. My results provide evidence for the potential of LLMs to help us better understand human beliefs and navigate possible models of nonrational expectations.
Generative Agents: Interactive Simulacra of Human Behavior Research Article May 13, 2023 Preprint Data Generation Computer Science Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
Machine vs. Human, who makes better judgment? Take Large Language Model GPT-4 For Example Machine vs. Human, who makes better judgment? Take Large Language Model GPT-4 For Example Research Article April 18, 2023 Preprint Data Collection This essay explores the topic of human decision-making and the concept of noise, which refers to random and irrelevant factors that can affect decision-making. The essay argues that while humans are prone to noise in their decision-making processes, artificial intelligence (AI) can make less noise due to its ability to process large amounts of data and apply logical algorithms to make decisions. The essay examines examples and studies to demonstrate the impact of noise on human decision-making, including business ideas. Additionally, the essay highlights the potential that machine intuition is doing better than humans.
Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability Analysis against Human Performance Research Article April 14, 2023 Preprint Other Education ChatGPT and Bard are AI chatbots based on Large Language Models (LLM) that are slated to promise different applications in diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property that these tools must have to assist or replace humans in assessment is high reliability in terms of agreement between AI scores and human raters. In this paper, we measure the reliability of OpenAI ChatGP and Google Bard LLMs tools against experienced and trained humans in perceiving and rating the complexity of writing prompts. Intraclass correlation (ICC) as a performance metric showed that the inter-reliability of both the OpenAI ChatGPT and the Google Bard were low against the gold standard of human ratings.