Articles

Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline or to look for specific workflow applications you may be considering.

Each entry below lists the resource title; its type, date recorded, open-science status, use of LLM, and research discipline(s); and a description of the resource.

AgentRxiv: Towards Collaborative Autonomous Research
Research Article | Date Recorded: June 13, 2025 | Open Science: Preprint | Use of LLM: Research Design, Other | Discipline(s): Computer Science
Progress in scientific discovery is rarely the result of a single "Eureka" moment, but is rather the product of hundreds of scientists incrementally working together toward a common goal. While existing agent workflows are capable of producing research autonomously, they do so in isolation, without the ability to continuously improve upon prior research results. To address these challenges, we introduce AgentRxiv—a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and iteratively build on each other’s research. We task agent laboratories to develop new reasoning and prompting techniques and find that agents with access to their prior research achieve higher performance improvements compared to agents operating in isolation (11.4% relative improvement over baseline on MATH-500). We find that the best-performing strategy generalizes to benchmarks in other domains (improving on average by 3.3%). Multiple agent laboratories sharing research through AgentRxiv are able to work together towards a common goal, progressing more rapidly than isolated laboratories and achieving higher overall accuracy (13.7% relative improvement over baseline on MATH-500). These findings suggest that autonomous agents may play a role in designing future AI systems alongside humans. We hope that AgentRxiv allows agents to collaborate toward research goals and enables researchers to accelerate discovery.

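The collaboration loop this abstract describes (retrieve prior agent reports, build on them, upload new results to a shared server) can be pictured with a short sketch. The `PreprintServer` class and its methods below are invented for illustration; they are not AgentRxiv's actual API.

```python
# Hypothetical sketch of an AgentRxiv-style loop: agent labs retrieve prior
# reports from a shared store, build on them, and upload new results.
# PreprintServer and Report are invented names, not AgentRxiv's real API.
from dataclasses import dataclass


@dataclass
class Report:
    title: str
    findings: str


class PreprintServer:
    """Shared store that agent laboratories read from and write to."""

    def __init__(self) -> None:
        self._reports: list[Report] = []

    def retrieve(self, query: str) -> list[Report]:
        # Naive keyword match stands in for real retrieval.
        return [r for r in self._reports if query.lower() in r.title.lower()]

    def upload(self, report: Report) -> None:
        self._reports.append(report)


server = PreprintServer()
server.upload(Report("Chain-of-thought prompting baseline", "71% on MATH-500"))

# A second laboratory consults prior work before starting its own run.
prior = server.retrieve("prompting")
print([r.title for r in prior])
```
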
Artificial Intelligence Software to Accelerate Screening for Living Systematic Reviews
Research Article | Date Recorded: June 13, 2025 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Any Discipline
Background: Systematic and meta-analytic reviews provide gold-standard evidence but are static and outdate quickly. Here we provide performance data on a new software platform that uses artificial intelligence technologies to (1) accelerate screening of titles and abstracts from library literature searches, and (2) provide a software solution for enabling Living Systematic Reviews by maintaining a saved AI algorithm for updated searches. Methods: Performance testing was based on Living Review System (LRS) data from seven systematic reviews. LRS efficiency was estimated as the proportion (%) of the total yield of an initial literature search (titles/abstracts) that needed human screening prior to reaching the in-built stop threshold. LRS algorithm performance was measured as work saved over sampling (WSS) for a certain recall. LRS accuracy was estimated as the proportion of incorrectly classified papers in the rejected pool, as determined by two independent human raters. Results: On average, around 36% of the total yield of a literature search needed to be human-screened prior to reaching the stop-point. However, this ranged from 22% to 53% depending on the complexity of language structure across papers included in specific reviews. Accuracy was 99% at an interrater reliability of 95%, and 0% of titles/abstracts were incorrectly assigned. Conclusion: Findings suggest that the LRS can be a cost-effective and time-efficient solution for supporting living systematic reviews, particularly for rapidly developing areas of science. Further development of the LRS is planned, including facilitated full-text data extraction and community-of-practice access to living systematic review findings.

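The abstract reports work saved over sampling (WSS) without giving the formula. A standard definition in the screening-automation literature is WSS@R = (TN + FN)/N − (1 − R); the sketch below computes it under that assumption, with made-up screening counts. Whether the LRS uses exactly this definition is not stated in the abstract.

```python
def wss_at_recall(tn: int, fn: int, n_total: int, recall: float) -> float:
    """Work saved over sampling at a given recall level.

    WSS@R = (TN + FN) / N - (1 - R): the fraction of screening work saved
    relative to screening records in random order to reach the same recall.
    """
    return (tn + fn) / n_total - (1.0 - recall)


# Hypothetical numbers: a 10,000-record search where the tool lets reviewers
# stop after the classifier has rejected 6,400 records at 95% recall.
print(wss_at_recall(tn=6300, fn=100, n_total=10000, recall=0.95))  # -> 0.59
```
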
Artificial intelligence in applied family research involving families with young children: A scoping review
Research Article | Date Recorded: June 6, 2025 | Open Science: Open Source | Use of LLM: Other | Discipline(s): Psychology
This scoping review systematically examined the applied family science literature involving families raising young children to understand how relevant studies have applied artificial intelligence (AI)-facilitated technologies. Family research is exploring the application of AI; however, there is a critical need for a review that systematically examines the varied uses of AI in applied family science to inform family practitioners and policymakers. Comprehensive literature searches were conducted in nine databases. Of the 10,022 studies identified, 21 met the inclusion criteria: peer-reviewed journal article; published between 2014 and 2024; written in English; involved the use of AI in collecting data, analyzing data, or providing family-centered services; included families raising young children aged 0–5 years; and was quantitative in analysis. Most studies focused on maternal and child health outcomes in low- and middle-income countries. All studies identified fell within the AI use domain of data analysis, with 76% of the studies focused on identifying the most important predictors. Random forest was the best-performing machine learning model. Only one study directly mentioned the ethical use of AI.

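As a rough illustration of the analysis pattern the review found most often (fitting a random forest and ranking predictor importance), here is a minimal scikit-learn sketch on synthetic data; the variable names and data are placeholders, not drawn from any reviewed study.

```python
# Fit a random forest and rank predictors by importance, the pattern the
# scoping review describes. Data and predictor names are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
predictors = [f"predictor_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least influential.
ranked = sorted(zip(predictors, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```
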
Large Language Models: A Survey with Applications in Political Science
Research Article | Date Recorded: June 5, 2025 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Political Science
Large language models (LLMs) have taken the world by storm, but political scientists have been slow to adopt the tool. Attempts to use LLMs have been very limited in scope, with scholars using LLMs for simple binary classification tasks or text generation. Whether this lack of uptake is due to a lack of programming ability or to hesitancy about using LLMs, political science is leaving a valuable tool on the table. This paper attempts to encourage uptake of LLMs for political science and makes three primary contributions: (1) it surveys LLMs from the practitioner’s perspective; (2) it demonstrates the applicability of LLMs to a variety of political textual analysis applications; and (3) it provides example software and data to encourage researchers to explore use of the tools.

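The simple binary classification use the survey mentions can be sketched in a few lines. This is an illustrative example, not the paper's released software; the model name and prompt wording are assumptions, and an OpenAI API key is required in the environment.

```python
# Illustrative LLM-based binary classification of political text.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()


def classify_statement(text: str) -> str:
    """Label a statement as FOREIGN or DOMESTIC policy (toy task)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word: FOREIGN or DOMESTIC."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()


print(classify_statement("The senator urged new tariffs on imported steel."))
```
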
It Knew Too Much: On the Unsuitability of LLMs as Replacements for Human Subjects
Research Article | Date Recorded: May 31, 2025 | Open Science: Preprint | Use of LLM: Data Collection | Discipline(s): Sociology
Psychometric and moral benchmarks are increasingly used to evaluate large language models (LLMs), aiming to measure their capabilities, surface implicit biases, and assess alignment with human values. However, interpreting LLM responses to these benchmarks is methodologically challenging, a nuance often overlooked in existing literature. We empirically demonstrate that LLM responses to a standard psychometric benchmark (generalized trust from the World Values Survey) correlate strongly with known survey results across language communities. Critically, we observe LLMs achieve this while explicitly referencing known survey results and the broader literature, even without direct prompting. We further show these correlations can be amplified or effectively eliminated by subtle changes in evaluation task design, revealing that replicating known results does not validate LLMs as naive subjects. Given LLMs' access to relevant literature, their ability to replicate known human behavior constitutes an invalid evaluation for assessing the suitability of large language models as naive subjects. Fascinating though it may be, this ability provides no evidence of generalizability to novel or out-of-sample behaviors. We discuss implications for alignment research and benchmarking practices.

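For context, the World Values Survey generalized trust item reads roughly: "Generally speaking, would you say that most people can be trusted or that you need to be very careful in dealing with people?" The sketch below shows the shape of the correlational check the paper describes, comparing model-elicited trust shares with published survey shares per community; all numbers are placeholders, not the paper's data.

```python
# Correlate model-elicited "most people can be trusted" rates with survey
# rates per country. All values below are illustrative placeholders.
from scipy.stats import pearsonr

wvs_trust = {"SE": 0.63, "US": 0.37, "BR": 0.07}  # survey shares (made up)
llm_trust = {"SE": 0.58, "US": 0.41, "BR": 0.12}  # model shares (made up)

countries = sorted(wvs_trust)
r, p = pearsonr([wvs_trust[c] for c in countries],
                [llm_trust[c] for c in countries])
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```

A high correlation here is exactly what the paper argues is uninformative: the model may simply be reciting survey results it saw in training.
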
Artificial Intelligence, Scientific Discovery, and Product Innovation
Research Article | Date Recorded: November 7, 2024 | Open Science: Open Source | Use of LLM: Other | Discipline(s): Economics
THIS PAPER HAS BEEN RETRACTED: https://economics.mit.edu/news/assuring-accurate-research-record.
This paper studies the impact of artificial intelligence on innovation, exploiting the randomized introduction of a new materials discovery technology to 1,018 scientists in the R&D lab of a large U.S. firm. AI-assisted researchers discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation. These compounds possess more novel chemical structures and lead to more radical inventions. However, the technology has strikingly disparate effects across the productivity distribution: while the bottom third of scientists see little benefit, the output of top researchers nearly doubles. Investigating the mechanisms behind these results, I show that AI automates 57% of “idea-generation” tasks, reallocating researchers to the new task of evaluating model-produced candidate materials. Top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives. Together, these findings demonstrate the potential of AI-augmented research and highlight the complementarity between algorithms and expertise in the innovative process. Survey evidence reveals that these gains come at a cost, however, as 82% of scientists report reduced satisfaction with their work due to decreased creativity and skill underutilization.

Leveraging Large Language Models in Message Stimuli Generation and Validation for Experimental Research
Research Article | Date Recorded: May 11, 2025 | Open Science: Preprint | Use of LLM: Data Collection | Discipline(s): Psychology, Sociology
Despite the wide application of message stimuli in communication experiments, creating effective stimuli is often challenging and costly. However, the advent of generative artificial intelligence (AI) and large language models (LLMs) suggests great potential to facilitate this process. To advance AI-assisted communication research, we examined the performance of ChatGPT (powered by GPT-4) in generating message stimuli for experimental research. Through four pre-registered experiments, we compared GPT-generated stimuli with human-generated stimuli in (1) manipulating target variables (discrete emotions and moral intuitions) and (2) controlling unintended variables. We found that GPT-generated message stimuli performed equivalently to or even surpassed human-generated stimuli in manipulating target variables, while performance in controlling unintended variables was mixed. Our study suggests that LLMs can generate effective message stimuli for communication experimental research. This research serves as a foundational resource for integrating LLMs into stimuli generation across various communication contexts; its effectiveness, opportunities, and challenges are discussed.

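A minimal sketch of the stimulus-generation step this study evaluates: prompting a GPT-4-class model for a message that manipulates one target variable (here, a discrete emotion) while holding unintended variables steady. The prompt wording is an assumption for illustration, not the authors' materials.

```python
# Generate a candidate message stimulus with GPT-4, constraining unintended
# variables (length, reading level, other emotions). Prompt text is invented.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Write a 3-sentence news-style message about food safety designed to "
    "evoke anger. Keep length and reading level comparable to a neutral "
    "control message, and avoid evoking fear or sadness."
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In the study's design, stimuli generated this way would still be validated in a manipulation check against human-generated counterparts before use.
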
Empowering Scientific Workflows with Federated Agents
Research Article | Date Recorded: May 11, 2025 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Computer Science
Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, the agentic frameworks used to build these systems have not previously enabled use with research cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To demonstrate the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems.

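The "stateful agent" abstraction with asynchronous execution that the abstract mentions can be pictured with a short, hypothetical sketch; the class and method names below are invented for illustration and are not Academy's actual API.

```python
# Hypothetical stateful agent: keeps state between invocations and runs
# asynchronously, as the middleware described in the abstract supports.
import asyncio


class StatefulAgent:
    """Toy agent that maintains a running mean across async invocations."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: dict[str, float] = {}

    async def act(self, observation: float) -> float:
        # Persist a running mean between calls as the agent's state.
        n = self.state.get("n", 0.0) + 1
        mean = self.state.get("mean", 0.0)
        mean += (observation - mean) / n
        self.state.update(n=n, mean=mean)
        await asyncio.sleep(0)  # stand-in for remote or HPC-bound work
        return mean


async def main() -> None:
    agent = StatefulAgent("materials-screener")
    for obs in (1.0, 3.0, 5.0):
        print(await agent.act(obs))  # prints 1.0, 2.0, 3.0


asyncio.run(main())
```
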
PaperBench: Evaluating AI's Ability to Replicate AI Research
Research Article | Date Recorded: April 11, 2025 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Computer Science
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents.

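The hierarchical rubric design can be pictured as a weighted tree whose leaves are individually gradable requirements, with a parent's score taken as the weighted mean of its children. The field names and aggregation rule in this sketch are illustrative, not PaperBench's actual schema.

```python
# Sketch of a hierarchical, weighted replication rubric: leaves hold judge
# grades in [0, 1]; internal nodes aggregate children by weighted mean.
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    requirement: str
    weight: float = 1.0
    score: float = 0.0  # leaf grade in [0, 1], assigned by the judge
    children: list["RubricNode"] = field(default_factory=list)

    def grade(self) -> float:
        if not self.children:
            return self.score
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.grade() for c in self.children) / total


rubric = RubricNode("Replicate paper", children=[
    RubricNode("Reimplement method", weight=2, children=[
        RubricNode("Model code matches Section 3", score=1.0),
        RubricNode("Training loop runs end-to-end", score=0.5),
    ]),
    RubricNode("Reproduce experiments", weight=1, score=0.0),
])
print(f"Replication score: {rubric.grade():.2f}")  # -> 0.50
```
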
Do Two AI Scientists Agree?
Research Article | Date Recorded: April 4, 2025 | Open Science: Preprint | Use of LLM: Research Design, Data Analysis, Describing Results, Other | Discipline(s): Computer Science, Other
When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data becomes available. We show the same story is true for AI scientists. With increasingly many systems provided in the training data, AI scientists tend to converge on the theories they learn, although they sometimes form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and to quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI Scientists, trained on standard problems in physics and aggregating training results across many seeds that simulate different configurations of AI scientists. Our findings suggest that AI scientists switch from learning a Hamiltonian theory in simple setups to a Lagrangian formulation when more complex systems are introduced. We also observe strong seed dependence in the training dynamics and final learned weights, controlling the rise and fall of the relevant theories. Finally, we demonstrate that our neural networks not only aid interpretability but can also be applied to higher-dimensional problems.

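For readers unfamiliar with the two formulations the abstract contrasts, these are the textbook equations of motion each network family encodes (the paper's own parameterization may differ):

```latex
% Hamiltonian picture: the network learns H(q, p);
% dynamics follow Hamilton's equations
\dot{q} = \frac{\partial H}{\partial p}, \qquad
\dot{p} = -\frac{\partial H}{\partial q}

% Lagrangian picture: the network learns L(q, \dot{q});
% dynamics follow the Euler-Lagrange equation
\frac{d}{dt}\,\frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q} = 0
```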