Articles

Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or examples of specific workflow applications you may be considering.

Each entry below lists the following fields: Title, Type of Resource, Description of Resource, Link to Resource, Open Science, Use of LLM, and Research Discipline(s).

Title: Utilizing Machine Learning Algorithms Trained on AI-generated Synthetic Participant Recent Music-Listening Activity in Predicting Big Five Personality Traits
Type of Resource: Research Article
Description of Resource: The recent rise of publicly available artificial intelligence (AI) tools such as ChatGPT has raised a plethora of questions among users and skeptics alike. One major question asks, "Has AI gained the ability to indistinguishably mimic the psychology of its organic, human counterpart?" Since music has been known to be a positive predictor of personality traits due to the individuality of personal preference, in this paper we use machine learning (ML) algorithms to analyze the predictability of AI-generated or 'synthetic' participants' Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) using their recent music listening activity and motivations for listening to music. Recent music listening history for synthetic participants is generated using ChatGPT, and the corresponding audio features for the songs (Beats per minute, Danceability, Instrumentals, Happiness, etc.) are derived via the Spotify Application Programming Interface. This study will also administer the Uses of Music Inventory to account for synthetic participants' motivations for listening to music: emotional, cognitive, and background. The dataset will be trained and tested on scaler-model combinations to identify the predictions with the least mean absolute error, using ML models such as Random Forest, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Support Vector Machine. Both regression (continuous numeric value) and classification (Likert scale option) prediction methods will be used. An Exploratory Factor Analysis (EFA) will be conducted on the audio features to find a latent representation of the dataset on which the machine learning models are also trained and tested. A full literature review showed this is the first study to use both Spotify API data, rather than self-reported music preference, and machine learning, in addition to traditional statistical tests and regression models, to predict the personality of a synthetic college student demographic. The findings of this study show ChatGPT struggles to mimic the diverse and complex nature of human personality psychology and music taste. This paper is a pilot study for a broader ongoing investigation in which the findings from synthetic participants are compared with those of real college students using the same inventories; data collection for that comparison is ongoing.
Open Science: Preprint, Open Source
Use of LLM: Data Generation
Research Discipline(s): Engineering, Psychology

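The modeling workflow this abstract describes (scaler-model combinations scored by mean absolute error) can be illustrated with a short, hedged sketch. This is not the authors' code: the feature matrix X, target vector y, and the specific scalers shown are assumptions, and only the regression side of the study is sketched.

```python
# Hedged sketch: rank scaler-model combinations by cross-validated mean
# absolute error, as described in the abstract for the regression setting.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def rank_scaler_model_combos(X, y):
    """Return (scaler, model, MAE) triples sorted from best to worst."""
    scalers = [StandardScaler(), MinMaxScaler()]
    models = [RandomForestRegressor(), DecisionTreeRegressor(),
              KNeighborsRegressor(), SVR()]
    results = []
    for scaler in scalers:
        for model in models:
            pipe = make_pipeline(scaler, model)
            mae = -cross_val_score(pipe, X, y, cv=5,
                                   scoring="neg_mean_absolute_error").mean()
            results.append((type(scaler).__name__, type(model).__name__, mae))
    return sorted(results, key=lambda r: r[2])
```
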
Title: Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Type of Resource: Research Article
Description of Resource: We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
Open Science: Preprint
Use of LLM: Data Generation
Research Discipline(s): Computer Science, Economics, Psychology, Sociology

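As a rough illustration of how a Turing Experiment might simulate a sample of participants, here is a hedged sketch for the Ultimatum Game. It is not the paper's code: query_llm(prompt) is an assumed helper around whatever model is being evaluated, and the prompt wording, names, and answer parsing are placeholders.

```python
# Hedged sketch: simulate many Ultimatum Game responders by prompting a
# language model once per simulated participant and aggregating responses.
def ultimatum_game_te(query_llm, names, offer, total=10):
    """Return the fraction of simulated responders who accept an offer of
    `offer` dollars out of a `total`-dollar pot."""
    accepts = 0
    for name in names:
        prompt = (
            f"{name} is playing the Ultimatum Game. Another player proposes to "
            f"split ${total}, keeping ${total - offer} and giving {name} ${offer}. "
            f"Does {name} accept or reject the offer? Answer with one word."
        )
        reply = query_llm(prompt).strip().lower()
        accepts += reply.startswith("accept")
    return accepts / len(names)
```
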
Title: Can Large Language Models Help Augment English Psycholinguistic Datasets?
Type of Resource: Research Article
Description of Resource: Research on language and cognition relies extensively on large, psycholinguistic datasets, sometimes called "norms". These datasets contain judgments of lexical properties like concreteness and age of acquisition, and can be used to norm experimental stimuli, discover empirical relationships in the lexicon, and stress-test computational models. However, collecting human judgments at scale is both time-consuming and expensive. This issue of scale is made more difficult for norms containing multiple semantic dimensions and especially for norms that incorporate linguistic context. In the current work, I explore whether advances in Large Language Models (LLMs) can be leveraged to augment the creation of large, psycholinguistic datasets in English. I use GPT-4 to collect multiple kinds of semantic judgments (e.g., word similarity, contextualized sensorimotor associations, iconicity) for English words and compare these judgments against the human "gold standard". For each dataset, I find that GPT-4's judgments are positively correlated with human judgments, in some cases rivaling or even exceeding the average inter-annotator agreement displayed by humans. I then explore whether and how LLM-generated norms differ from human-generated norms systematically. I also perform several "substitution analyses", which demonstrate that replacing human-generated norms with LLM-generated norms in a statistical model does not change the sign of parameter estimates (though in select cases, there are significant changes to their magnitude). Finally, I conclude by discussing the limitations of this approach and under what conditions LLM-generated norms could be useful to researchers.
Open Science: Preprint
Use of LLM: Data Generation
Research Discipline(s): Languages

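The core comparison in this kind of norm-augmentation study can be sketched briefly. This is not the author's code; it assumes two aligned lists of numeric ratings for the same words (one model-generated, one human "gold standard") and uses rank correlation as the agreement measure.

```python
# Hedged sketch: agreement between LLM-generated and human-generated norms.
from scipy.stats import spearmanr

def norm_agreement(model_ratings, human_ratings):
    """Spearman correlation between model and human judgments for the
    same items, in the same order."""
    rho, p_value = spearmanr(model_ratings, human_ratings)
    return rho, p_value
```
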
Title: Friend or Foe? Exploring the Implications of Large Language Models on the Science System
Type of Resource: Research Article
Description of Resource: The advent of ChatGPT by OpenAI has prompted extensive discourse on its potential implications for science and higher education. While the impact on education has been a primary focus, there is limited empirical research on the effects of large language models (LLMs) and LLM-based chatbots on science and scientific practice. To investigate this further, we conducted a Delphi study involving 72 experts specialising in research and AI. The study focused on applications and limitations of LLMs, their effects on the science system, ethical and legal considerations, and the required competencies for their effective use. Our findings highlight the transformative potential of LLMs in science, particularly in administrative, creative, and analytical tasks. However, risks related to bias, misinformation, and quality assurance need to be addressed through proactive regulation and science education. This research contributes to informed discussions on the impact of generative AI in science and helps identify areas for future action.
Open Science: Preprint, Open Source
Use of LLM: Other
Research Discipline(s): Other

Title: Taking Advice from ChatGPT
Type of Resource: Research Article
Description of Resource: A growing literature studies how humans incorporate advice from algorithms. This study examines an algorithm with millions of daily users: ChatGPT. We conduct a lab experiment in which 118 student participants answer 2,828 multiple-choice questions across 25 academic subjects. We present participants with answers from a GPT model and allow them to update their initial responses. We find that the advisor's identity ("AI chatbot" versus a human "expert"), the presence of written justification, and advice correctness do not significantly affect the weight placed on advice. Instead, we show that participants weigh advice more heavily if they (1) are unfamiliar with the topic, (2) used ChatGPT in the past, or (3) received more accurate advice previously. These three effects (task difficulty, algorithm familiarity, and experience, respectively) appear to be stronger with an AI chatbot as the advisor. Moreover, we find that participants are able to place greater weight on correct advice only when written justifications are provided. In a parallel analysis, we find that the student participants are miscalibrated and significantly underestimate the accuracy of ChatGPT on 10 of 25 topics. Students under-weigh advice by over 50% and would have scored better if they trusted ChatGPT more.
Open Science: Preprint
Use of LLM: Data Collection, Data Generation
Research Discipline(s): Psychology

Title: Out of One, Many: Using Language Models to Simulate Human Samples
Type of Resource: Research Article
Description of Resource: We propose and explore the possibility that language models can be studied as effective proxies for specific human subpopulations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the "algorithmic bias" within one such tool, the GPT-3 language model, is instead both fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property algorithmic fidelity and explore its extent in GPT-3. We create "silicon samples" by conditioning the model on thousands of sociodemographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and sociocultural context that characterize human attitudes. We suggest that language models with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.
Open Science: Open Source
Use of LLM: Data Collection
Research Discipline(s): Political Science

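The "silicon sampling" step can be sketched in a few lines, heavily hedged: this is not the authors' code, query_llm(prompt) is an assumed helper around the model, and the prompt template and option formatting are placeholders.

```python
# Hedged sketch: condition a language model on a first-person sociodemographic
# backstory, then ask a survey question, once per backstory.
def silicon_sample(query_llm, backstories, survey_question, options):
    """Return one raw model response per backstory."""
    responses = []
    for backstory in backstories:
        prompt = (
            f"{backstory}\n\n"
            f"Question: {survey_question}\n"
            f"Options: {', '.join(options)}\n"
            f"My answer:"
        )
        responses.append(query_llm(prompt).strip())
    return responses
```
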
Title: Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding
Type of Resource: Research Article
Description of Resource: Qualitative analysis of textual contents unpacks rich and valuable information by assigning labels to the data. However, this process is often labor-intensive, particularly when working with large datasets. While recent AI-based tools demonstrate utility, researchers may not have readily available AI resources or expertise, and such task-specific models often have limited generalizability. In this study, we explored the use of large language models (LLMs) in supporting deductive coding, a major category of qualitative analysis where researchers use pre-determined codebooks to label the data into a fixed set of codes. Instead of training task-specific models, a pre-trained LLM can be used directly for various tasks without fine-tuning through prompt learning. Using a curiosity-driven questions coding task as a case study, we found that, by combining GPT-3 with expert-drafted codebooks, our proposed approach achieved fair to substantial agreement with expert-coded results. We lay out challenges and opportunities in using LLMs to support qualitative coding and beyond.
Open Science: Preprint
Use of LLM: Data Analysis
Research Discipline(s): Computer Science

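Codebook-in-the-prompt deductive coding can be sketched as follows. This is not the paper's pipeline: query_llm(prompt) is an assumed helper, the prompt wording is a placeholder, and Cohen's kappa is used here simply as a common agreement statistic (the "fair to substantial agreement" phrasing suggests a kappa-style measure, but that is an inference).

```python
# Hedged sketch: label texts with an LLM using a pre-determined codebook,
# then check agreement against expert-coded labels.
from sklearn.metrics import cohen_kappa_score

def llm_deductive_code(query_llm, codebook, texts):
    """codebook: dict mapping code name -> short definition."""
    code_list = "\n".join(f"- {code}: {definition}"
                          for code, definition in codebook.items())
    labels = []
    for text in texts:
        prompt = (
            "Assign exactly one code from the codebook to the text below.\n"
            f"Codebook:\n{code_list}\n\nText: {text}\nCode:"
        )
        labels.append(query_llm(prompt).strip())
    return labels

def agreement_with_experts(llm_labels, expert_labels):
    return cohen_kappa_score(llm_labels, expert_labels)
```
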
Title: Surveying Generative AI's Economic Expectations
Type of Resource: Research Article
Description of Resource: I introduce a survey of economic expectations formed by querying a large language model (LLM) for its expectations of various financial and macroeconomic variables, based on a sample of news articles from the Wall Street Journal between 1984 and 2021. I find the resulting expectations closely match existing surveys, including the Survey of Professional Forecasters (SPF), the American Association of Individual Investors, and the Duke CFO Survey. Importantly, I document that LLM-based expectations match many of the deviations from full-information rational expectations exhibited in these existing survey series. The LLM's macroeconomic expectations exhibit the under-reaction commonly found in consensus SPF forecasts. Additionally, its return expectations are extrapolative, disconnected from objective measures of expected returns, and negatively correlated with future realized returns. Finally, using a sample of articles outside of the LLM's training period, I find that the correlation with existing survey measures persists, indicating that these results reflect generalization rather than memorization on the part of the LLM. My results provide evidence for the potential of LLMs to help us better understand human beliefs and navigate possible models of nonrational expectations.
Open Science: Preprint, Open Source
Use of LLM: Data Generation, Data Analysis
Research Discipline(s): Economics

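The expectation-elicitation step can be sketched with heavy hedging: this is not the paper's method, query_llm(prompt) is an assumed helper, the forecast variable is a placeholder, and the numeric parsing is deliberately naive.

```python
# Hedged sketch: elicit a numeric expectation from a news article, then compare
# the resulting series to an existing survey series.
import re
from scipy.stats import pearsonr

def article_expectation(query_llm, article_text,
                        variable="inflation over the next 12 months"):
    prompt = (
        f"{article_text}\n\n"
        f"Based only on this article, forecast {variable} "
        f"as a single percentage:"
    )
    match = re.search(r"-?\d+(\.\d+)?", query_llm(prompt))
    return float(match.group()) if match else None

def compare_to_survey(llm_series, survey_series):
    """Correlation between the LLM-derived series and a survey series."""
    return pearsonr(llm_series, survey_series)
```
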
Title: Generative Agents: Interactive Simulacra of Human Behavior
Type of Resource: Research Article
Description of Resource: Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents: computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture (observation, planning, and reflection) each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
Open Science: Preprint
Use of LLM: Data Generation
Research Discipline(s): Computer Science

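One piece of such an architecture, a memory store with dynamic retrieval, can be sketched briefly. This is not the authors' implementation: the exponential recency decay and the caller-supplied relevance function are assumptions made for illustration, not the paper's exact retrieval rule.

```python
# Hedged sketch: a memory stream that stores natural-language observations and
# retrieves the highest-scoring ones (recency plus caller-defined relevance).
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    created: float = field(default_factory=time.time)

class MemoryStream:
    def __init__(self, decay=0.995):
        self.decay = decay      # per-hour recency decay (assumed value)
        self.memories = []

    def observe(self, text):
        self.memories.append(Memory(text))

    def retrieve(self, relevance_fn, k=5):
        """relevance_fn(text) -> float, e.g. embedding similarity to a query."""
        now = time.time()
        scored = [
            (self.decay ** ((now - m.created) / 3600) + relevance_fn(m.text), m)
            for m in self.memories
        ]
        return [m for _, m in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
```
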
Title: Machine vs. Human, who makes better judgment? Take Large Language Model GPT-4 For Example
Type of Resource: Research Article
Description of Resource: This essay explores the topic of human decision-making and the concept of noise, which refers to random and irrelevant factors that can affect decision-making. The essay argues that while humans are prone to noise in their decision-making processes, artificial intelligence (AI) can introduce less noise due to its ability to process large amounts of data and apply logical algorithms to make decisions. The essay examines examples and studies that demonstrate the impact of noise on human decision-making, including judgments of business ideas. Additionally, the essay highlights the potential for machine intuition to outperform human judgment.
Open Science: Preprint
Use of LLM: Data Collection