Articles

Below are articles describing research workflows that use LLMs. You can use the Search option to find examples from your discipline or examples of specific workflow applications you may be considering.

Each entry lists Title | Type of Resource | Date Recorded | Open Science | Use of LLM | Research Discipline(s), followed by a description of the resource.
Leveraging Large Language Models in Message Stimuli Generation and Validation for Experimental Research | Research Article | May 11, 2025 | Preprint | Data Collection | Psychology, Sociology
Despite the wide application of message stimuli in communication experiments, creating effective stimuli is often challenging and costly. However, the advent of generative artificial intelligence (AI) and large language models (LLMs) suggests great potential to facilitate this process. To advance AI-assisted communication research, we examined the performance of ChatGPT (powered by GPT-4) in generating message stimuli for experimental research. Through four pre-registered experiments, we compared GPT-generated stimuli with human-generated stimuli in (1) manipulating target variables (discrete emotions and moral intuitions) and (2) controlling unintended variables. We found GPT-generated message stimuli performed equivalently to or even surpassed human-generated stimuli in manipulating target variables, while the performance in controlling unintended variables was mixed. Our study suggests that LLMs can generate effective message stimuli for communication experimental research. This research serves as a foundational resource for integrating LLMs in stimuli generation across various communication contexts, with its effectiveness, opportunities, and challenges discussed.
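As a minimal sketch of what LLM-assisted stimulus generation might look like in practice: the prompt wording, model name, and example topic below are illustrative assumptions, not the authors' materials or procedure.

```python
# Illustrative sketch: generating emotion-targeted message stimuli with an LLM.
# The prompt, model name, and topic are assumptions for demonstration only,
# not the stimuli or protocol used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_stimuli(topic: str, target_emotion: str, n_variants: int = 3) -> list[str]:
    """Ask the model for short persuasive messages designed to evoke one target emotion."""
    prompt = (
        f"Write {n_variants} short persuasive messages about {topic}. "
        f"Each message should primarily evoke {target_emotion}, while keeping "
        "length, readability, and argument strength roughly constant. "
        "Return one message per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the study used ChatGPT powered by GPT-4
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]


# Example: stimuli for a fear-appeal condition, to be validated with a manipulation check.
fear_messages = generate_stimuli("texting while driving", "fear")
```

Generated stimuli would still need the kind of validation the paper describes (manipulation checks on target variables and checks on unintended variables) before use in an experiment.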
Empowering Scientific Workflows with Federated Agents | Research Article | May 11, 2025 | Preprint | Other | Computer Science
Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, the agentic frameworks used to build these systems have not previously enabled use with research cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To demonstrate the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems.
PaperBench: Evaluating AI's Ability to Replicate AI Research | Research Article | April 11, 2025 | Preprint | Other | Computer Science
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
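To make the idea of a hierarchically decomposed rubric concrete, here is a small sketch of one way such a structure could be represented, with leaf sub-tasks graded individually and parent scores rolled up as weighted averages. The node fields, weights, and example scores are assumptions for illustration, not PaperBench's actual schema.

```python
# Illustrative sketch of a hierarchical replication rubric: leaves are individually
# gradable sub-tasks (scored by a human or LLM judge), and parent scores are
# weight-averaged from their children. Field names and weights are assumptions.
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    score: float | None = None               # set on leaves by the judge, in [0, 1]
    children: list["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        """Roll leaf scores up the tree as a weighted average."""
        if not self.children:
            return self.score or 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total_weight


rubric = RubricNode("replicate_paper", children=[
    RubricNode("understand_contributions", weight=1, score=0.8),
    RubricNode("develop_codebase", weight=2, children=[
        RubricNode("implement_model", score=0.5),
        RubricNode("write_training_loop", score=0.3),
    ]),
    RubricNode("execute_experiments", weight=2, score=0.1),
])
print(f"Replication score: {rubric.aggregate():.2f}")  # 0.36 for this toy tree
```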
Do Two AI Scientists Agree? | Research Article | April 4, 2025 | Preprint | Research Design, Data Analysis, Describing Results, Other | Computer Science, Other
When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data becomes available. We show the same story is true for AI scientists. As more systems are provided in the training data, AI scientists tend to converge on the theories they learn, although they sometimes form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and to quantify their agreement, we propose MASS (Hamiltonian-Lagrangian neural networks as AI Scientists), trained on standard problems in physics, aggregating training results across many seeds that simulate different configurations of AI scientists. Our findings suggest that AI scientists switch from learning a Hamiltonian theory in simple setups to a Lagrangian formulation when more complex systems are introduced. We also observe strong seed dependence in the training dynamics and final learned weights, which controls the rise and fall of the relevant theories. Finally, we demonstrate that our neural networks not only aid interpretability but can also be applied to higher-dimensional problems.
Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning | Research Article | April 3, 2025 | Preprint | Other | Computer Science, Other
Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, an aspect not previously explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery.
LLM4SR: A Survey on Large Language Models for Scientific Research | Research Article | April 2, 2025 | Preprint | Other | Any Discipline
In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry.
Accelerating Scientific Research Through a Multi-LLM Framework | Research Article | April 2, 2025 | Preprint | Other | Computer Science
The exponential growth of academic publications poses challenges for the research process, such as literature review and procedural planning. Large Language Models (LLMs) have emerged as powerful AI tools, especially when combined with additional tools and resources. Recent LLM-powered frameworks offer promising solutions for handling complex domain-specific tasks, yet their domain-specific implementation limits broader applicability. This highlights the need for LLM-integrated systems that can assist in cross-disciplinary tasks, such as streamlining the research process across science and engineering disciplines. To address this need, we introduce Artificial Research Innovator Assistant (ARIA), a four-agent, multi-LLM framework. By emulating a team of expert assistants, ARIA systematically replicates the human research workflow to autonomously search, retrieve, and filter hundreds of papers, subsequently synthesizing relevant literature into actionable research procedures. In a case study on dropwise condensation enhancement, ARIA demonstrates its capability to streamline research tasks within an hour, maintaining user oversight during execution and ultimately liberating researchers from time-intensive tasks.
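The search-retrieve-filter-synthesize workflow described above lends itself to a simple staged pipeline. The sketch below shows one possible wiring of four agent roles; the interfaces, role names, and stand-in implementations are assumptions in the spirit of ARIA, not the paper's actual architecture.

```python
# Illustrative sketch of a four-stage, multi-agent literature pipeline
# (search -> retrieve -> filter -> synthesize). Agent interfaces are assumed.
from typing import Callable

Paper = dict  # e.g. {"title": ..., "abstract": ...}


def run_pipeline(
    query: str,
    search: Callable[[str], list[str]],                      # agent 1: find candidate paper IDs
    retrieve: Callable[[list[str]], list[Paper]],             # agent 2: fetch metadata / full text
    filter_relevant: Callable[[list[Paper]], list[Paper]],    # agent 3: LLM relevance screening
    synthesize: Callable[[list[Paper]], str],                 # agent 4: draft a research procedure
) -> str:
    ids = search(query)
    papers = retrieve(ids)
    relevant = filter_relevant(papers)
    return synthesize(relevant)


# Example wiring with trivial stand-ins (real agents would call search APIs and LLMs):
procedure = run_pipeline(
    "dropwise condensation enhancement",
    search=lambda q: ["paper-1", "paper-2"],
    retrieve=lambda ids: [{"title": i, "abstract": "..."} for i in ids],
    filter_relevant=lambda papers: papers[:1],
    synthesize=lambda papers: f"Procedure draft based on {len(papers)} paper(s).",
)
```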
LLMs, Virtual Users, and Bias: Predicting Any Survey Question Without Human Data | Research Article | March 24, 2025 | Preprint | Data Collection | Computer Science, Any Discipline
Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs, including GPT-4o, GPT-3.5, Claude 3.5 Sonnet, and versions of the Llama and Mistral models, comparing their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. On the other hand, Random Forests demonstrate stronger performance than LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research.
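A minimal sketch of the "virtual respondent" idea: condition an LLM on a demographic profile and ask it to answer a closed-form survey item. The prompt wording, profile fields, model name, and example item are assumptions for illustration, not the study's protocol.

```python
# Illustrative sketch of a persona-conditioned virtual survey respondent.
# Prompt wording, profile fields, and model choice are assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def virtual_response(profile: dict, question: str, options: list[str]) -> str:
    """Ask the model to answer a survey item as a respondent with the given profile."""
    persona = ", ".join(f"{k}: {v}" for k, v in profile.items())
    prompt = (
        f"You are answering a survey as a person with this profile: {persona}.\n"
        f"Question: {question}\n"
        f"Options: {'; '.join(options)}\n"
        "Reply with exactly one of the options."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed; the study compared GPT-4o, GPT-3.5, Claude, Llama, and Mistral
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()


# Aggregating answers over many sampled profiles approximates a population-level distribution.
answer = virtual_response(
    {"age": 34, "country": "Brazil", "education": "secondary"},
    "How important is religion in your life?",
    ["Very important", "Rather important", "Not very important", "Not at all important"],
)
```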
Chatbots for Data Collection in Surveys: A Comparison of Four Theory-Based Interview Probes | Research Article | March 12, 2025 | Preprint | Data Collection | Computer Science
Surveys are a widespread method for collecting data at scale, but their rigid structure often limits the depth of qualitative insights obtained. While interviews naturally yield richer responses, they are challenging to conduct across diverse locations and large participant pools. To partially bridge this gap, we investigate the potential of using LLM-based chatbots to support qualitative data collection through interview probes embedded in surveys. We assess four theory-based interview probes: descriptive, idiographic, clarifying, and explanatory. Through a split-plot study design (N=64), we compare the probes' impact on response quality and user experience across three key stages of HCI research: exploration, requirements gathering, and evaluation. Our results show that probes facilitate the collection of high-quality survey data, with specific probes proving effective at different research stages. We contribute practical and methodological implications for using chatbots as research tools to enrich qualitative data collection.
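For readers curious how such probes might be embedded in a survey chatbot, here is a small sketch of the four probe types as prompt templates. The exact wording is an assumption for demonstration, not the study's instrument.

```python
# Illustrative templates for the four theory-based interview probes compared in
# the paper (descriptive, idiographic, clarifying, explanatory). Wording is assumed.
PROBE_TEMPLATES = {
    "descriptive": "Could you describe that in more detail? What exactly happened?",
    "idiographic": "Can you give a specific example from your own experience?",
    "clarifying": "When you say '{term}', what do you mean by that?",
    "explanatory": "Why do you think that was the case?",
}


def follow_up(probe_type: str, term: str = "") -> str:
    """Build a chatbot follow-up probe to append after an open-ended survey answer."""
    template = PROBE_TEMPLATES[probe_type]
    return template.format(term=term) if "{term}" in template else template


# Example: a clarifying probe keyed on a term from the participant's answer.
print(follow_up("clarifying", term="clunky"))
```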
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Research Article, Application or Tool | March 2, 2025 | Preprint | Other | Computer Science, Any Discipline
Recent advancements have positioned AI, and particularly Large Language Models (LLMs), as transformative tools for scientific research, capable of addressing complex tasks that require reasoning, problem-solving, and decision-making. Their exceptional capabilities suggest their potential as scientific research assistants but also highlight the need for holistic, rigorous, and domain-specific evaluation to assess effectiveness in real-world scientific applications. This paper describes a multifaceted methodology for Evaluating AI models as scientific Research Assistants (EAIRA) developed at Argonne National Laboratory. This methodology incorporates four primary classes of evaluations: 1) Multiple Choice Questions to assess factual recall; 2) Open Response to evaluate advanced reasoning and problem-solving skills; 3) Lab-Style Experiments involving detailed analysis of capabilities as research assistants in controlled environments; and 4) Field-Style Experiments to capture researcher-LLM interactions at scale in a wide range of scientific domains and applications. These complementary methods enable a comprehensive analysis of LLM strengths and weaknesses with respect to their scientific knowledge, reasoning abilities, and adaptability. Recognizing the rapid pace of LLM advancements, we designed the methodology to evolve and adapt so as to ensure its continued relevance and applicability. This paper describes the state of the methodology as of the end of February 2025. Although developed within a subset of scientific domains, the methodology is designed to be generalizable to a wide range of scientific domains.