Articles

Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or examples of specific workflow applications you may be considering.

Each entry below lists the following fields: Title, Type of Resource, Description of Resource, Link to Resource, Open Science, Use of LLM, and Research Discipline(s).

Title: Utilizing Machine Learning Algorithms Trained on AI-generated Synthetic Participant Recent Music-Listening Activity in Predicting Big Five Personality Traits
Type of Resource: Research Article
Description of Resource: The recent rise of publicly available artificial intelligence (AI) tools such as ChatGPT has raised a plethora of questions among users and skeptics alike. One major question asks, "Has AI gained the ability to indistinguishably mimic the psychology of its organic, human counterpart?" Since music has been known to be a positive predictor of personality traits due to the individuality of personal preference, in this paper we use machine learning (ML) algorithms to analyze the predictability of AI-generated or 'synthetic' participants' Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) using their recent music listening activity and motivations for listening to music. Recent music listening history for synthetic participants is generated using ChatGPT, and the corresponding audio features for the songs (Beats per minute, Danceability, Instrumentals, Happiness, etc.) are derived via the Spotify Application Programming Interface. This study will also administer the Uses of Music Inventory to account for synthetic participants' motivations for listening to music: emotional, cognitive, and background. The dataset will be trained and tested on scaler-model combinations to identify the predictions with the least mean absolute error, using ML models such as Random Forest, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Support Vector Machine. Both regression (continuous numeric value) and classification (Likert scale option) prediction methods will be used. An Exploratory Factor Analysis (EFA) will be conducted on the audio features to find a latent representation of the dataset on which the machine learning models are also trained and tested. A full literature review showed this is the first study to use both Spotify API data, rather than self-reported music preference, and machine learning, in addition to traditional statistical tests and regression models, to predict the personality of a synthetic college student demographic. The findings of this study show ChatGPT struggles to mimic the diverse and complex nature of human personality psychology and music taste. This paper is a pilot study for a broader ongoing investigation in which the findings from synthetic participants are compared with those of real college students using the same inventories; data collection for that comparison is ongoing.
Open Science: Preprint, Open Source
Use of LLM: Data Generation
Research Discipline(s): Engineering, Psychology

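The modeling workflow this abstract describes (scaler-model combinations scored by mean absolute error) can be illustrated with a short, hedged sketch. This is not the authors' code: the feature matrix X, target vector y, and the specific scalers shown are assumptions, and only the regression side of the study is sketched.

```python
# Hedged sketch: rank scaler-model combinations by cross-validated mean
# absolute error, as described in the abstract for the regression setting.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def rank_scaler_model_combos(X, y):
    """Return (scaler, model, MAE) triples sorted from best to worst."""
    scalers = [StandardScaler(), MinMaxScaler()]
    models = [RandomForestRegressor(), DecisionTreeRegressor(),
              KNeighborsRegressor(), SVR()]
    results = []
    for scaler in scalers:
        for model in models:
            pipe = make_pipeline(scaler, model)
            mae = -cross_val_score(pipe, X, y, cv=5,
                                   scoring="neg_mean_absolute_error").mean()
            results.append((type(scaler).__name__, type(model).__name__, mae))
    return sorted(results, key=lambda r: r[2])
```
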
Title: Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Type of Resource: Research Article
Description of Resource: We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
Open Science: Preprint
Use of LLM: Data Generation
Research Discipline(s): Computer Science, Economics, Psychology, Sociology

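As a rough illustration of how a Turing Experiment might simulate a sample of participants, here is a hedged sketch for the Ultimatum Game. It is not the paper's code: query_llm(prompt) is an assumed helper around whatever model is being evaluated, and the prompt wording, names, and answer parsing are placeholders.

```python
# Hedged sketch: simulate many Ultimatum Game responders by prompting a
# language model once per simulated participant and aggregating responses.
def ultimatum_game_te(query_llm, names, offer, total=10):
    """Return the fraction of simulated responders who accept an offer of
    `offer` dollars out of a `total`-dollar pot."""
    accepts = 0
    for name in names:
        prompt = (
            f"{name} is playing the Ultimatum Game. Another player proposes to "
            f"split ${total}, keeping ${total - offer} and giving {name} ${offer}. "
            f"Does {name} accept or reject the offer? Answer with one word."
        )
        reply = query_llm(prompt).strip().lower()
        accepts += reply.startswith("accept")
    return accepts / len(names)
```
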
Title: Can Large Language Models Help Augment English Psycholinguistic Datasets?
Type of Resource: Research Article
Description of Resource: Research on language and cognition relies extensively on large, psycholinguistic datasets, sometimes called "norms". These datasets contain judgments of lexical properties like concreteness and age of acquisition, and can be used to norm experimental stimuli, discover empirical relationships in the lexicon, and stress-test computational models. However, collecting human judgments at scale is both time-consuming and expensive. This issue of scale is made more difficult for norms containing multiple semantic dimensions and especially for norms that incorporate linguistic context. In the current work, I explore whether advances in Large Language Models (LLMs) can be leveraged to augment the creation of large, psycholinguistic datasets in English. I use GPT-4 to collect multiple kinds of semantic judgments (e.g., word similarity, contextualized sensorimotor associations, iconicity) for English words and compare these judgments against the human "gold standard". For each dataset, I find that GPT-4's judgments are positively correlated with human judgments, in some cases rivaling or even exceeding the average inter-annotator agreement displayed by humans. I then explore whether and how LLM-generated norms differ from human-generated norms systematically. I also perform several "substitution analyses", which demonstrate that replacing human-generated norms with LLM-generated norms in a statistical model does not change the sign of parameter estimates (though in select cases, there are significant changes to their magnitude). Finally, I conclude by discussing the limitations of this approach and under what conditions LLM-generated norms could be useful to researchers.
Open Science: Preprint
Use of LLM: Data Generation
Research Discipline(s): Languages

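The core comparison in this kind of norm-augmentation study can be sketched briefly. This is not the author's code; it assumes two aligned lists of numeric ratings for the same words (one model-generated, one human "gold standard") and uses rank correlation as the agreement measure.

```python
# Hedged sketch: agreement between LLM-generated and human-generated norms.
from scipy.stats import spearmanr

def norm_agreement(model_ratings, human_ratings):
    """Spearman correlation between model and human judgments for the
    same items, in the same order."""
    rho, p_value = spearmanr(model_ratings, human_ratings)
    return rho, p_value
```
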
Title: Friend or Foe? Exploring the Implications of Large Language Models on the Science System
Type of Resource: Research Article
Description of Resource: The advent of ChatGPT by OpenAI has prompted extensive discourse on its potential implications for science and higher education. While the impact on education has been a primary focus, there is limited empirical research on the effects of large language models (LLMs) and LLM-based chatbots on science and scientific practice. To investigate this further, we conducted a Delphi study involving 72 experts specialising in research and AI. The study focused on applications and limitations of LLMs, their effects on the science system, ethical and legal considerations, and the required competencies for their effective use. Our findings highlight the transformative potential of LLMs in science, particularly in administrative, creative, and analytical tasks. However, risks related to bias, misinformation, and quality assurance need to be addressed through proactive regulation and science education. This research contributes to informed discussions on the impact of generative AI in science and helps identify areas for future action.
Open Science: Preprint, Open Source
Use of LLM: Other
Research Discipline(s): Other

Title: Taking Advice from ChatGPT
Type of Resource: Research Article
Description of Resource: A growing literature studies how humans incorporate advice from algorithms. This study examines an algorithm with millions of daily users: ChatGPT. We conduct a lab experiment in which 118 student participants answer 2,828 multiple-choice questions across 25 academic subjects. We present participants with answers from a GPT model and allow them to update their initial responses. We find that the advisor's identity ("AI chatbot" versus a human "expert"), the presence of written justification, and advice correctness do not significantly affect the weight placed on advice. Instead, we show that participants weigh advice more heavily if they (1) are unfamiliar with the topic, (2) used ChatGPT in the past, or (3) received more accurate advice previously. These three effects (task difficulty, algorithm familiarity, and experience, respectively) appear to be stronger with an AI chatbot as the advisor. Moreover, we find that participants are able to place greater weight on correct advice only when written justifications are provided. In a parallel analysis, we find that the student participants are miscalibrated and significantly underestimate the accuracy of ChatGPT on 10 of 25 topics. Students under-weigh advice by over 50% and would have scored better if they trusted ChatGPT more.
Open Science: Preprint
Use of LLM: Data Collection, Data Generation
Research Discipline(s): Psychology

Title: Out of One, Many: Using Language Models to Simulate Human Samples
Type of Resource: Research Article
Description of Resource: We propose and explore the possibility that language models can be studied as effective proxies for specific human subpopulations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the "algorithmic bias" within one such tool, the GPT-3 language model, is instead both fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property algorithmic fidelity and explore its extent in GPT-3. We create "silicon samples" by conditioning the model on thousands of sociodemographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and sociocultural context that characterize human attitudes. We suggest that language models with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.
Open Science: Open Source
Use of LLM: Data Collection
Research Discipline(s): Political Science

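The "silicon sampling" step can be sketched in a few lines, heavily hedged: this is not the authors' code, query_llm(prompt) is an assumed helper around the model, and the prompt template and option formatting are placeholders.

```python
# Hedged sketch: condition a language model on a first-person sociodemographic
# backstory, then ask a survey question, once per backstory.
def silicon_sample(query_llm, backstories, survey_question, options):
    """Return one raw model response per backstory."""
    responses = []
    for backstory in backstories:
        prompt = (
            f"{backstory}\n\n"
            f"Question: {survey_question}\n"
            f"Options: {', '.join(options)}\n"
            f"My answer:"
        )
        responses.append(query_llm(prompt).strip())
    return responses
```
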
Title: Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding
Type of Resource: Research Article
Description of Resource: Qualitative analysis of textual contents unpacks rich and valuable information by assigning labels to the data. However, this process is often labor-intensive, particularly when working with large datasets. While recent AI-based tools demonstrate utility, researchers may not have readily available AI resources or expertise, and such task-specific models often have limited generalizability. In this study, we explored the use of large language models (LLMs) in supporting deductive coding, a major category of qualitative analysis where researchers use pre-determined codebooks to label the data into a fixed set of codes. Instead of training task-specific models, a pre-trained LLM can be used directly for various tasks without fine-tuning through prompt learning. Using a curiosity-driven questions coding task as a case study, we found that, by combining GPT-3 with expert-drafted codebooks, our proposed approach achieved fair to substantial agreement with expert-coded results. We lay out challenges and opportunities in using LLMs to support qualitative coding and beyond.
Open Science: Preprint
Use of LLM: Data Analysis
Research Discipline(s): Computer Science

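Codebook-in-the-prompt deductive coding can be sketched as follows. This is not the paper's pipeline: query_llm(prompt) is an assumed helper, the prompt wording is a placeholder, and Cohen's kappa is used here simply as a common agreement statistic (the "fair to substantial agreement" phrasing suggests a kappa-style measure, but that is an inference).

```python
# Hedged sketch: label texts with an LLM using a pre-determined codebook,
# then check agreement against expert-coded labels.
from sklearn.metrics import cohen_kappa_score

def llm_deductive_code(query_llm, codebook, texts):
    """codebook: dict mapping code name -> short definition."""
    code_list = "\n".join(f"- {code}: {definition}"
                          for code, definition in codebook.items())
    labels = []
    for text in texts:
        prompt = (
            "Assign exactly one code from the codebook to the text below.\n"
            f"Codebook:\n{code_list}\n\nText: {text}\nCode:"
        )
        labels.append(query_llm(prompt).strip())
    return labels

def agreement_with_experts(llm_labels, expert_labels):
    return cohen_kappa_score(llm_labels, expert_labels)
```
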
Title: Surveying Generative AI's Economic Expectations
Type of Resource: Research Article
Description of Resource: I introduce a survey of economic expectations formed by querying a large language model (LLM) for its expectations of various financial and macroeconomic variables, based on a sample of news articles from the Wall Street Journal between 1984 and 2021. I find the resulting expectations closely match existing surveys, including the Survey of Professional Forecasters (SPF), the American Association of Individual Investors, and the Duke CFO Survey. Importantly, I document that LLM-based expectations match many of the deviations from full-information rational expectations exhibited in these existing survey series. The LLM's macroeconomic expectations exhibit the under-reaction commonly found in consensus SPF forecasts. Additionally, its return expectations are extrapolative, disconnected from objective measures of expected returns, and negatively correlated with future realized returns. Finally, using a sample of articles outside of the LLM's training period, I find that the correlation with existing survey measures persists, indicating that these results reflect generalization rather than memorization on the part of the LLM. My results provide evidence for the potential of LLMs to help us better understand human beliefs and navigate possible models of nonrational expectations.
Open Science: Preprint, Open Source
Use of LLM: Data Generation, Data Analysis
Research Discipline(s): Economics

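The expectation-elicitation step can be sketched with heavy hedging: this is not the paper's method, query_llm(prompt) is an assumed helper, the forecast variable is a placeholder, and the numeric parsing is deliberately naive.

```python
# Hedged sketch: elicit a numeric expectation from a news article, then compare
# the resulting series to an existing survey series.
import re
from scipy.stats import pearsonr

def article_expectation(query_llm, article_text,
                        variable="inflation over the next 12 months"):
    prompt = (
        f"{article_text}\n\n"
        f"Based only on this article, forecast {variable} "
        f"as a single percentage:"
    )
    match = re.search(r"-?\d+(\.\d+)?", query_llm(prompt))
    return float(match.group()) if match else None

def compare_to_survey(llm_series, survey_series):
    """Correlation between the LLM-derived series and a survey series."""
    return pearsonr(llm_series, survey_series)
```
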
Title: Generative Agents: Interactive Simulacra of Human Behavior
Type of Resource: Research Article
Description of Resource: Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents: computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture (observation, planning, and reflection) each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
Open Science: Preprint
Use of LLM: Data Generation
Research Discipline(s): Computer Science

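One piece of such an architecture, a memory store with dynamic retrieval, can be sketched briefly. This is not the authors' implementation: the exponential recency decay and the caller-supplied relevance function are assumptions made for illustration, not the paper's exact retrieval rule.

```python
# Hedged sketch: a memory stream that stores natural-language observations and
# retrieves the highest-scoring ones (recency plus caller-defined relevance).
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    created: float = field(default_factory=time.time)

class MemoryStream:
    def __init__(self, decay=0.995):
        self.decay = decay      # per-hour recency decay (assumed value)
        self.memories = []

    def observe(self, text):
        self.memories.append(Memory(text))

    def retrieve(self, relevance_fn, k=5):
        """relevance_fn(text) -> float, e.g. embedding similarity to a query."""
        now = time.time()
        scored = [
            (self.decay ** ((now - m.created) / 3600) + relevance_fn(m.text), m)
            for m in self.memories
        ]
        return [m for _, m in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
```
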
Title: Machine vs. Human, who makes better judgment? Take Large Language Model GPT-4 For Example
Type of Resource: Research Article
Description of Resource: This essay explores the topic of human decision-making and the concept of noise, which refers to random and irrelevant factors that can affect decision-making. The essay argues that while humans are prone to noise in their decision-making processes, artificial intelligence (AI) can introduce less noise due to its ability to process large amounts of data and apply logical algorithms to make decisions. The essay examines examples and studies that demonstrate the impact of noise on human decision-making, including judgments of business ideas. Additionally, the essay highlights the potential for machine intuition to outperform human judgment.
Open Science: Preprint
Use of LLM: Data Collection