Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or to look for specific workflow applications you may be considering.
Title | Type of Resource | Link to Resource | Date Recorded | Open Science | Use of LLM | Research Discipline(s) | Description of Resource |
---|---|---|---|---|---|---|---|
It Knew Too Much: On the Unsuitability of LLMs as Replacements for Human Subjects | Research Article | TooMuch | May 31, 2025 | Preprint | Data Collection | Sociology | Psychometric and moral benchmarks are increasingly used to evaluate large language models (LLMs), aiming to measure their capabilities, surface implicit biases, and assess alignment with human values. However, interpreting LLM responses to these benchmarks is methodologically challenging, a nuance often overlooked in existing literature. We empirically demonstrate that LLM responses to a standard psychometric benchmark (generalized trust from the World Values Survey) correlate strongly with known survey results across language communities. Critically, we observe LLMs achieve this while explicitly referencing known survey results and the broader literature, even without direct prompting. We further show these correlations can be amplified or effectively eliminated by subtle changes in evaluation task design, revealing that replicating known results does not validate LLMs as naive subjects. Given LLMs' access to relevant literature, their ability to replicate known human behavior constitutes an invalid evaluation for assessing the suitability of large language models as naive subjects. Fascinating though it may be, this ability provides no evidence of generalizability to novel or out-of-sample behaviors. We discuss implications for alignment research and benchmarking practices. |
Artificial Intelligence, Scientific Discovery, and Product Innovation | Research Article | Innovation | November 7, 2024 | Open Source | Other | Economics | THIS PAPER HAS BEEN RETRACTED: https://economics.mit.edu/news/assuring-accurate-research-record. This paper studies the impact of artificial intelligence on innovation, exploiting the randomized introduction of a new materials discovery technology to 1,018 scientists in the R&D lab of a large U.S. firm. AI-assisted researchers discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation. These compounds possess more novel chemical structures and lead to more radical inventions. However, the technology has strikingly disparate effects across the productivity distribution: while the bottom third of scientists see little benefit, the output of top researchers nearly doubles. Investigating the mechanisms behind these results, I show that AI automates 57% of “idea-generation” tasks, reallocating researchers to the new task of evaluating model-produced candidate materials. Top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives. Together, these findings demonstrate the potential of AI-augmented research and highlight the complementarity between algorithms and expertise in the innovative process. Survey evidence reveals that these gains come at a cost, however, as 82% of scientists report reduced satisfaction with their work due to decreased creativity and skill underutilization. |
Leveraging Large Language Models in Message Stimuli Generation and Validation for Experimental Research | Research Article | Stimuli | May 11, 2025 | Preprint | Data Collection | Psychology, Sociology | Despite the wide application of message stimuli in communication experiments, creating effective stimuli is often challenging and costly. However, the advent of generative artificial intelligence (AI) and large language models (LLMs) suggests great potential to facilitate this process. To advance AI-assisted communication research, we examined the performance of ChatGPT (powered by GPT-4) in generating message stimuli for experimental research. Through four pre-registered experiments, we compared GPT-generated stimuli with human-generated stimuli in (1) manipulating target variables (discrete emotions and moral intuitions) and (2) controlling unintended variables. We found that GPT-generated message stimuli performed equivalently to or even surpassed human-generated stimuli in manipulating target variables, while their performance in controlling unintended variables was mixed. Our study suggests that LLMs can generate effective message stimuli for experimental communication research. This research serves as a foundational resource for integrating LLMs in stimuli generation across various communication contexts, and we discuss its effectiveness, opportunities, and challenges. |
Empowering Scientific Workflows with Federated Agents | Research Article | Federated | May 11, 2025 | Preprint | Other | Computer Science | Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, the agentic frameworks used to build these systems have not previously enabled use with research cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To demonstrate the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems. |
PaperBench: Evaluating AI's Ability to Replicate AI Research | Research Article | PaperBench | April 11, 2025 | Preprint | Other | Computer Science | We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents. |
Do Two AI Scientists Agree? | Research Article | 2 AI | April 4, 2025 | Preprint | Research Design, Data Analysis, Describing Results, Other | Computer Science, Other | When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data becomes available. We show the same story is true for AI scientists. With increasingly more systems provided in training data, AI scientists tend to converge in the theories they learn, although sometimes they form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI Scientists, trained on standard problems in physics, aggregating training results across many seeds to simulate the different configurations of AI scientists. Our findings suggest that AI scientists switch from learning a Hamiltonian theory in simple setups to a Lagrangian formulation when more complex systems are introduced. We also observe strong seed dependence of the training dynamics and final learned weights, controlling the rise and fall of relevant theories. We finally demonstrate that not only can our neural networks aid interpretability, they can also be applied to higher-dimensional problems. |
Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning | Research Article | Physicist | April 3, 2025 | Preprint | Other | Computer Science, Other | Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, an aspect not previously explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery. |
LLM4SR: A Survey on Large Language Models for Scientific Research | Research Article | LLM4SR | April 2, 2025 | Preprint | Other | Any Discipline | In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. |
Accelerating Scientific Research Through a Multi-LLM Framework | Research Article | multi-llm | April 2, 2025 | Preprint | Other | Computer Science | The exponential growth of academic publications poses challenges for the research process, such as literature review and procedural planning. Large Language Models (LLMs) have emerged as powerful AI tools, especially when combined with additional tools and resources. Recent LLM-powered frameworks offer promising solutions for handling complex domain-specific tasks, yet their domain-specific implementation limits broader applicability. This highlights the need for LLM-integrated systems that can assist in cross-disciplinary tasks, such as streamlining the research process across science and engineering disciplines. To address this need, we introduce Artificial Research Innovator Assistant (ARIA), a four-agent, multi-LLM framework. By emulating a team of expert assistants, ARIA systematically replicates the human research workflow to autonomously search, retrieve, and filter hundreds of papers, subsequently synthesizing relevant literature into actionable research procedures. In a case study on dropwise condensation enhancement, ARIA demonstrates its capability to streamline research tasks within an hour, maintaining user oversight during execution and ultimately liberating researchers from time-intensive tasks. |
LLMs, Virtual Users, and Bias: Predicting Any Survey Question Without Human Data | Research Article | LLMSurveys | March 24, 2025 | Preprint | Data Collection | Computer Science, Any Discipline | Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs, including GPT-4o, GPT-3.5, Claude 3.5 Sonnet, and versions of the Llama and Mistral models, comparing their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. On the other hand, Random Forests demonstrate stronger performance than LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research. |
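
To make the virtual-respondent workflow described in the last entry above more concrete, the sketch below shows one way a researcher might prompt an LLM with a demographic persona to answer a World Values Survey-style question. This is a minimal illustration only, not the method used in any of the papers listed here: the model name, the persona fields, the question wording, and the `ask_virtual_respondent` helper are all assumptions for this example, and it presumes the OpenAI Python SDK (v1+) is installed with an API key configured in the environment.

```python
# Minimal sketch of an LLM "virtual respondent" answering a survey-style question.
# Assumptions: OpenAI Python SDK v1+, OPENAI_API_KEY set in the environment, and a
# hypothetical persona and question chosen purely for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_virtual_respondent(persona: dict, question: str, options: list[str]) -> str:
    """Ask one survey question 'in character' and return the model's chosen option."""
    persona_text = ", ".join(f"{k}: {v}" for k, v in persona.items())
    prompt = (
        f"You are answering a survey as the following respondent: {persona_text}.\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(options)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o",   # assumed model; swap for whichever LLM you are evaluating
        temperature=1.0,  # sampling variation stands in for respondent variation
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


# Example usage with a hypothetical persona and a WVS-style generalized-trust item.
persona = {"age": 34, "country": "Germany", "education": "university degree"}
answer = ask_virtual_respondent(
    persona,
    "Generally speaking, would you say that most people can be trusted, "
    "or that you need to be very careful in dealing with people?",
    ["Most people can be trusted", "Need to be very careful"],
)
print(answer)
```

Note that the first entry in the table above ("It Knew Too Much") cautions that agreement between such simulated answers and published survey results is not, by itself, evidence that LLMs are valid stand-ins for human respondents.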