Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or examples of specific workflow applications you may be considering.
Title | Type of Resource | Link to Resource | Date Recorded | Open Science | Use of LLM | Research Discipline(s) | Description of Resource |
---|---|---|---|---|---|---|---|
PaperBench: Evaluating AI's Ability to Replicate AI Research | Research Article | PaperBench | April 11, 2025 | Preprint | Other | Computer Science | We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents. (A minimal sketch of hierarchical rubric scoring appears after this table.) |
Do Two AI Scientists Agree? | Research Article | 2 AI | April 4, 2025 | Preprint | Research Design, Data Analysis, Describing Results, Other | Computer Science, Other | When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data become available. We show the same story is true for AI scientists. As more systems are provided in the training data, AI scientists tend to converge in the theories they learn, although sometimes they form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and to quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI Scientists, trained on standard problems in physics, aggregating training results across many seeds simulating the different configurations of AI scientists. Our findings suggest that AI scientists switch from learning a Hamiltonian theory in simple setups to a Lagrangian formulation when more complex systems are introduced. We also observe strong seed dependence of the training dynamics and final learned weights, controlling the rise and fall of relevant theories. We finally demonstrate that not only can our neural networks aid interpretability, they can also be applied to higher-dimensional problems. (A minimal Hamiltonian-neural-network sketch appears after this table.) |
Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning | Research Article | Physicist | April 3, 2025 | Preprint | Other | Computer Science, Other | Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, which has not previously been explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery. |
LLM4SR: A Survey on Large Language Models for Scientific Research | Research Article | LLM4SR | April 2, 2025 | Preprint | Other | Any Discipline | In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. |
Accelerating Scientific Research Through a Multi-LLM Framework | Research Article | multi-llm | April 2, 2025 | Preprint | Other | Computer Science | The exponential growth of academic publications poses challenges for the research process, such as literature review and procedural planning. Large Language Models (LLMs) have emerged as powerful AI tools, especially when combined with additional tools and resources. Recent LLM-powered frameworks offer promising solutions for handling complex domain-specific tasks, yet their domain-specific implementation limits broader applicability. This highlights the need for LLM-integrated systems that can assist in cross-disciplinary tasks, such as streamlining the research process across science and engineering disciplines. To address this need, we introduce Artificial Research Innovator Assistant (ARIA), a four-agent, multi-LLM framework. By emulating a team of expert assistants, ARIA systematically replicates the human research workflow to autonomously search, retrieve, and filter hundreds of papers, subsequently synthesizing relevant literature into actionable research procedures. In a case study on dropwise condensation enhancement, ARIA demonstrates its capability to streamline research tasks within an hour, maintaining user oversight during execution and ultimately liberating researchers from time-intensive tasks. |
LLMs, Virtual Users, and Bias: Predicting Any Survey Question Without Human Data | Research Article | LLMSurveys | March 24, 2025 | Preprint | Data Collection | Computer Science, Any Discipline | Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs (including GPT-4o, GPT-3.5, Claude 3.5 Sonnet, and versions of the Llama and Mistral models), comparing their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. On the other hand, Random Forests demonstrate stronger performance than LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research. (A minimal sketch of the virtual-respondent prompting pattern appears after this table.) |
Chatbots for Data Collection in Surveys: A Comparison of Four Theory-Based Interview Probes | Research Article | Survey | March 12, 2025 | Preprint | Data Collection | Computer Science | Surveys are a widespread method for collecting data at scale, but their rigid structure often limits the depth of qualitative insights obtained. While interviews naturally yield richer responses, they are challenging to conduct across diverse locations and large participant pools. To partially bridge this gap, we investigate the potential of using LLM-based chatbots to support qualitative data collection through interview probes embedded in surveys. We assess four theory-based interview probes: descriptive, idiographic, clarifying, and explanatory. Through a split-plot study design (N=64), we compare the probes' impact on response quality and user experience across three key stages of HCI research: exploration, requirements gathering, and evaluation. Our results show that probes facilitate the collection of high-quality survey data, with specific probes proving effective at different research stages. We contribute practical and methodological implications for using chatbots as research tools to enrich qualitative data collection. (A sketch of probe templates as chatbot follow-up turns appears after this table.) |
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Research Article, Application or Tool | EAIRA | March 2, 2025 | Preprint | Other | Computer Science, Any Discipline | Recent advancements have positioned AI, and particularly Large Language Models (LLMs), as transformative tools for scientific research, capable of addressing complex tasks that require reasoning, problem-solving, and decision-making. Their exceptional capabilities suggest their potential as scientific research assistants but also highlight the need for holistic, rigorous, and domain-specific evaluation to assess effectiveness in real-world scientific applications. This paper describes a multifaceted methodology for Evaluating AI models as scientific Research Assistants (EAIRA) developed at Argonne National Laboratory. This methodology incorporates four primary classes of evaluations: 1) Multiple Choice Questions to assess factual recall; 2) Open Response to evaluate advanced reasoning and problem-solving skills; 3) Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments; and 4) Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications. These complementary methods enable a comprehensive analysis of LLM strengths and weaknesses with respect to their scientific knowledge, reasoning abilities, and adaptability. Recognizing the rapid pace of LLM advancements, we designed the methodology to evolve and adapt so as to ensure its continued relevance and applicability. This paper describes the state of the methodology as of the end of February 2025. Although developed within a subset of scientific domains, the methodology is designed to be generalizable to a wide range of scientific domains. |
Towards an AI co-scientist | Research Article | Co-scientist | February 27, 2025 | Preprint | Research Design | Any Discipline | Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypothesis generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While the system is general-purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and antimicrobial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher in an era of AI-empowered scientists. (A sketch of a tournament-style generate-debate-evolve loop appears after this table.) |
Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation | Research Article, Discussion Article | Transforming | February 10, 2025 | Preprint | Research Design, Science Communication, Other | Computer Science, Any Discipline | With the advent of large multimodal language models, science is now at the threshold of an AI-based technological transformation. Recently, a plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. This includes all aspects of the research cycle, especially (1) searching for relevant literature; (2) generating research ideas and conducting experimentation; generating (3) text-based and (4) multimodal content (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. In this survey, we provide an in-depth overview of these exciting recent developments, which promise to fundamentally alter the scientific research process for good. Our survey covers the five aspects outlined above, indicating relevant datasets, methods and results (including evaluation) as well as limitations and scope for future research. Ethical concerns regarding shortcomings of these tools and potential for misuse (fake science, plagiarism, harms to research integrity) take a particularly prominent place in our discussion. We hope that our survey will not only become a reference guide for newcomers to the field but also a catalyst for new AI-based initiatives in the area of "AI4Science". |
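
Several of the entries above describe mechanisms concrete enough to sketch in code. The sketches below are illustrative only: names, prompts, weights, and helper functions are our assumptions, not the authors' implementations. First, the PaperBench entry grades replication attempts against rubrics that hierarchically decompose each task into weighted, individually gradable sub-tasks. A minimal sketch of how such a rubric tree might be aggregated into a single replication score (the node names, weights, and schema are invented for illustration, not PaperBench's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node in a hierarchical grading rubric.

    Leaves carry a binary grade set by a grader (human or LLM judge);
    internal nodes aggregate their children as a weighted average.
    """
    name: str
    weight: float = 1.0          # relative weight among siblings
    passed: bool | None = None   # leaf grade; None counts as not passed
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:    # leaf: graded directly
            return 1.0 if self.passed else 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total

# Illustrative rubric for one paper-replication task.
rubric = RubricNode("replicate_paper", children=[
    RubricNode("code_development", weight=2, children=[
        RubricNode("model_implemented", passed=True),
        RubricNode("training_loop_runs", passed=False),
    ]),
    RubricNode("experiments_executed", weight=1, passed=True),
])

print(f"replication score: {rubric.score():.2f}")  # (2*0.5 + 1*1.0)/3 = 0.67
```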
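The MASS entry ("Do Two AI Scientists Agree?") trains Hamiltonian-Lagrangian neural networks on standard physics problems. The sketch below shows the core Hamiltonian-network idea: learn a scalar H(q, p) and derive dynamics from Hamilton's equations, dq/dt = dH/dp and dp/dt = -dH/dq. The architecture, sizes, and the harmonic-oscillator training signal are assumptions for illustration, not the paper's MASS configuration:

```python
import torch
import torch.nn as nn

class HamiltonianNN(nn.Module):
    """Learns a scalar H(q, p); dynamics follow Hamilton's equations:
    dq/dt = dH/dp, dp/dt = -dH/dq."""
    def __init__(self, dim: int = 1, hidden: int = 64):
        super().__init__()
        self.H = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def time_derivatives(self, q: torch.Tensor, p: torch.Tensor):
        q = q.requires_grad_(True)
        p = p.requires_grad_(True)
        H = self.H(torch.cat([q, p], dim=-1)).sum()  # sum over batch -> scalar
        dHdq, dHdp = torch.autograd.grad(H, (q, p), create_graph=True)
        return dHdp, -dHdq  # (dq/dt, dp/dt)

# Toy supervision from a harmonic oscillator, H = (q**2 + p**2) / 2,
# whose true dynamics are dq/dt = p and dp/dt = -q.
model = HamiltonianNN()
q, p = torch.randn(32, 1), torch.randn(32, 1)
dq_dt, dp_dt = model.time_derivatives(q, p)
loss = ((dq_dt - p) ** 2 + (dp_dt + q) ** 2).mean()
loss.backward()  # one training step's gradients; wrap in an optimizer loop
```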
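The virtual-population survey entry predicts survey answers by conditioning an LLM on demographic profiles. A minimal sketch of that prompting pattern follows, using the OpenAI chat-completions client; the model name, profile fields, and prompt wording are assumptions for illustration, not the study's protocol:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat LLM works

client = OpenAI()

def virtual_respondent(profile: dict, question: str, options: list[str]) -> str:
    """Ask an LLM to answer a closed survey item as a persona built
    from demographic attributes (the virtual-population idea)."""
    persona = ", ".join(f"{k}: {v}" for k, v in profile.items())
    prompt = (
        f"You are a survey respondent with this profile: {persona}.\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Answer with exactly one option."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # sample, so repeated calls approximate a distribution
    )
    return resp.choices[0].message.content.strip()

# Sampling many profiles yields predicted response shares that can be
# compared against human survey marginals (e.g. from the WVS).
answer = virtual_respondent(
    {"age": 34, "country": "Germany", "education": "university"},
    "How important is religion in your life?",
    ["Very important", "Rather important",
     "Not very important", "Not at all important"],
)
```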
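The chatbot-probes entry embeds four theory-based interview probes (descriptive, idiographic, clarifying, explanatory) in surveys. One way such probes could be templated as chatbot follow-up turns is sketched below; the wording is invented for illustration and is not the study's instrument:

```python
# Follow-up probe templates keyed by the four theory-based probe types.
# Wording is illustrative; the study's actual instruments may differ.
PROBES = {
    "descriptive": "Can you walk me through what happened in more detail?",
    "idiographic": "How does this relate to your own experience or situation?",
    "clarifying":  "When you say '{phrase}', what do you mean exactly?",
    "explanatory": "Why do you think that is the case?",
}

def probe_followup(probe_type: str, answer: str, phrase: str = "") -> str:
    """Build the chatbot's follow-up turn after an open survey answer."""
    template = PROBES[probe_type]
    # Templates without a {phrase} slot ignore the keyword argument.
    return template.format(phrase=phrase or answer.split()[0])

# After a participant answers an open-ended item, the chatbot appends one
# probe, e.g. a clarifying probe on a term the participant used:
followup = probe_followup("clarifying", "The interface felt clunky",
                          phrase="clunky")
```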
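Finally, the AI co-scientist entry evolves hypotheses through a tournament process. The sketch below shows a generic pairwise tournament with an evolve step; `judge_prefers` and `revise` stand in for LLM calls (a debate-style judge and a rewrite prompt) and are assumptions, not the system's implementation:

```python
import random

def tournament_round(hypotheses, judge_prefers, revise):
    """One generate-debate-evolve step: pair hypotheses, keep each
    pairwise winner, and evolve it into a revised variant.

    judge_prefers(a, b) -> bool : True if a beats b (an LLM debate judge)
    revise(h) -> str            : an improved variant of h (an LLM rewrite)
    Assumes an even-sized pool for brevity; an odd last entry is dropped.
    """
    random.shuffle(hypotheses)
    survivors = []
    for a, b in zip(hypotheses[::2], hypotheses[1::2]):
        winner = a if judge_prefers(a, b) else b
        survivors.append(winner)
        survivors.append(revise(winner))  # evolve the winner
    return survivors

# Toy stand-ins: a judge that prefers longer hypotheses and a "revision"
# that merely annotates the text, just to show the control flow.
pool = [
    "candidate 1: repurpose drug A for AML",
    "candidate 2: target pathway Y with drug B",
    "candidate 3: combine drug A with epigenetic inhibitor C",
    "candidate 4: repurpose drug D via mechanism Z",
]
for _ in range(3):  # several tournament rounds
    pool = tournament_round(
        pool,
        judge_prefers=lambda a, b: len(a) >= len(b),
        revise=lambda h: h + " (refined)",
    )
```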