Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or examples of the specific workflow applications you may be considering.
Title | Type of Resource | Link to Resource | Date Recorded | Open Science | Use of LLM | Research Discipline(s) | Description of Resource |
---|---|---|---|---|---|---|---|
Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science | Research Article | Lit Reviews (with ads) | February 9, 2025 | Open Source | Data Collection | Computer Science | Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers in performing systematic literature reviews (SLR). We evaluate the performance of LLMs for SLR tasks in these case studies. In each, we explore the impact of changing parameters on the accuracy of LLM responses. The LLM was tasked with extracting evidence from chosen academic papers to answer specific research questions. We evaluate the models’ performance in faithfully reproducing quotes from the literature, and subject experts were asked to assess the model performance in answering the research questions. We developed a semantic text highlighting tool to facilitate expert review of LLM responses. We found that state-of-the-art LLMs were able to reproduce quotes from texts with greater than 95% accuracy and answer research questions with an accuracy of approximately 83%. We use two methods to determine the correctness of LLM responses: expert review and the cosine similarity of transformer embeddings of LLM and expert answers. The correlation between these methods ranged from 0.48 to 0.77, providing evidence that the latter is a valid metric for measuring semantic similarity (a minimal embedding-similarity sketch appears after this table). |
Agent Laboratory: Using LLM Agents as Research Assistants | Research Article, Application/Tool | Lab Asst | February 9, 2025 | Preprint | Research Design, Data Collection, Data Analysis, Describing Results, Science Communication | Computer Science | Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery. |
Beyond Likert Scales: Convergent Validity of an NLP-Based Future Self-Continuity Assessment from Verbal Data | Research Article | Future Self Validation | February 8, 2025 | Preprint | Data Analysis | Psychology | Psychological assessment using self-report Likert items suffers from numerous inherent biases. These biases limit our capability to assess complex psychological constructs such as Future Self-Continuity (FSC), i.e., the perceived connection between one's present and future self. However, recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) have opened new possibilities for psychological assessment. In this paper, we introduce a novel method of psychological assessment, applied to measuring FSC, that uses an LLM for NLP of transcripts from self-recorded audio of responses to 15 structured interview prompts developed from FSC theory and research. A total of 164 whitelisted MTurk workers completed an online survey and interview task. Claude 3.5 Sonnet was used to process the transcripts and generate quantitative scores. The resulting FSC scores (including the total score and the similarity, vividness, and positivity components) showed significant correlations with scores on the Future Self-Continuity Questionnaire (FSCQ), a well-validated Likert item measure of FSC, supporting the new method's convergent validity. A Bland-Altman analysis indicating general agreement with standard FSCQ scores, replication using an updated Claude 3.5 Sonnet model, and the strong correlations between NLP-based FSC scores using the two models support the new assessment method's validity and robustness. This measurement approach can inform treatment planning and interventions by providing clinicians with a more authentic FSC assessment. Beyond FSC, this NLP/LLM approach can enhance psychological assessment broadly, with significant implications for research and clinical practice. |
From Assistance to Autonomy -- A Researcher Study on the Potential of AI Support for Qualitative Data Analysis | Research Article | Qualitative | February 3, 2025 | Preprint | Data Analysis | Computer Science | The advent of AI tools, such as Large Language Models, has introduced new possibilities for Qualitative Data Analysis (QDA), offering both opportunities and challenges. To help navigate the responsible integration of AI into QDA, we conducted semi-structured interviews with 15 HCI researchers experienced in QDA. While our participants were open to AI support in their QDA workflows, they expressed concerns about data privacy, autonomy, and the quality of AI outputs. In response, we developed a framework that spans from minimal to high AI involvement, providing tangible scenarios for integrating AI into HCI researchers' QDA practices while addressing their needs and concerns. Aligned with real-life QDA workflows, we identify potential for AI tools in areas such as data pre-processing, researcher onboarding, and mediation. Our framework aims to provoke further discussion on the development of AI-supported QDA and to help establish community standards for its responsible use. |
Quantifying the use and potential benefits of artificial intelligence in scientific research | Research Article | Growth | January 16, 2025 | Open Source | Other | Data Science, Any Discipline | The rapid advancement of artificial intelligence (AI) is poised to reshape almost every line of work. Despite enormous efforts devoted to understanding AI’s economic impacts, we lack a systematic understanding of the benefits to scientific research associated with the use of AI. Here we develop a measurement framework to estimate the direct use of AI and associated benefits in science. We find that the use and benefits of AI appear widespread throughout the sciences, growing especially rapidly since 2015. However, there is a substantial gap between AI education and its application in research, highlighting a misalignment between AI expertise supply and demand. Our analysis also reveals demographic disparities, with disciplines with higher proportions of women or Black scientists reaping fewer benefits from AI, potentially exacerbating existing inequalities in science. These findings have implications for the equity and sustainability of the research enterprise, especially as the integration of AI with science continues to deepen. |
Use of large language models as artificial intelligence tools in academic research and publishing among global clinical researchers | Research Article | Use of LLMs | January 10, 2025 | Preprint | Other | Medicine | With breakthroughs in Natural Language Processing and Artificial Intelligence (AI), the usage of Large Language Models (LLMs) in academic research has increased tremendously. Models such as Generative Pre-trained Transformer (GPT) are used by researchers in literature review, abstract screening, and manuscript drafting. However, these models also present the attendant challenge of providing ethically questionable scientific information. Our study provides a snapshot of global researchers’ perceptions of current trends and future impacts of LLMs in research. Using a cross-sectional design, we surveyed 226 medical and paramedical researchers from 59 countries across 65 specialties, trained in the Global Clinical Scholars’ Research Training certificate program of Harvard Medical School between 2020 and 2024. A majority (57.5%) of these participants practiced in an academic setting, with a median of 7 (2, 18) PubMed-indexed published articles. 198 respondents (87.6%) were aware of LLMs, and those who were aware had a higher number of publications (p < 0.001). 18.7% of the respondents who were aware (n = 37) had previously used LLMs in publications, especially for grammatical errors and formatting (64.9%); however, most (40.5%) did not acknowledge this use in their papers. 50.8% of aware respondents (n = 95) predicted an overall positive future impact of LLMs, while 32.6% were unsure of its scope. 52% of aware respondents (n = 102) believed that LLMs would have a major impact in areas such as grammatical errors and formatting (66.3%), revision and editing (57.2%), writing (57.2%), and literature review (54.2%). 58.1% of aware respondents opined that journals should allow the use of AI in research, and 78.3% believed that regulations should be put in place to avoid its abuse. Given researchers' perceptions of LLMs and the significant association between awareness of LLMs and number of published works, we emphasize the importance of developing comprehensive guidelines and an ethical framework to govern the use of AI in academic research and address the current challenges. |
Rise of Generative Artificial Intelligence in Science | Research Article | Rise | January 9, 2025 | Preprint | Other | Any Discipline | Generative Artificial Intelligence (GenAI, generative AI) has rapidly become available as a tool in scientific research. To explore the use of generative AI in science, we conduct an empirical analysis using OpenAlex. Analyzing GenAI publications and other AI publications from 2017 to 2023, we profile growth patterns, the diffusion of GenAI publications across fields of study, and the geographical spread of scientific research on generative AI. We also investigate team size and international collaborations to explore whether GenAI, as an emerging scientific research area, shows different collaboration patterns compared to other AI technologies. The results indicate that generative AI has experienced rapid growth and increasing presence in scientific publications. The use of GenAI now extends beyond computer science to other scientific research domains. Over the study period, U.S. researchers contributed nearly two-fifths of global GenAI publications. The U.S. is followed by China, with several small and medium-sized advanced economies demonstrating relatively high levels of GenAI deployment in their research publications. Although scientific research overall is becoming increasingly specialized and collaborative, our results suggest that GenAI research groups tend to have slightly smaller team sizes than found in other AI fields. Furthermore, notwithstanding recent geopolitical tensions, GenAI research continues to exhibit levels of international collaboration comparable to other AI technologies. |
Hypothesis Generation with Large Language Models | Research Article | Hypotheses | December 20, 2024 | Preprint | Research Design | Any Discipline | Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation through painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of large language models (LLMs) to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle arbitrarily long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve the quality of hypotheses. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform supervised learning by 12.8% and 11.2% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks. |
Scaling up the Evaluation of Collaborative Problem Solving: Promises and Challenges of Coding Chat Data with ChatGPT | Research Article | Coding | November 19, 2024 | Preprint | Data Analysis | Computer Science | Collaborative problem solving (CPS) is widely recognized as a critical 21st-century skill. Efficiently coding communication data is a major challenge in scaling up research on assessing CPS. This paper reports findings on using ChatGPT to directly code CPS chat data, benchmarking performance across multiple datasets and coding frameworks. We found that ChatGPT-based coding outperformed human coding in tasks where the discussions were characterized by colloquial language but fell short in tasks where the discussions dealt with specialized scientific terminology and contexts. The findings offer practical guidelines for researchers to develop strategies for efficient and scalable analysis of communication data from CPS tasks. |
AI-Augmented Cultural Sociology: Guidelines for LLM-assisted text analysis and an illustrative example | Research Article, Use Case Example | Sociology | December 3, 2024 | Preprint | Data Analysis | Sociology | The advent of large language models (LLMs) presents a promising opportunity for how we analyze text and, by extension, study the role of culture and symbolic meanings in social life. Using an illustrative example focused on the concept of “personalized service” within Michelin-starred restaurants, this research note demonstrates how LLMs can reliably identify complex, multifaceted concepts similarly to a qualitative data analyst, but in a more scalable manner. We extend existing validation approaches, offering guidelines on the amount of manually coded data needed to evaluate LLM-generated outputs, drawing on sampling theory and a data simulation. We also discuss broader applications of LLMs in cultural sociology, such as investigations of established concepts (e.g., cultural consecration) and emerging concepts (e.g., future-oriented deliberation). This discussion underscores that AI tools can significantly augment the empirical scope of research projects, building on rather than replacing traditional qualitative approaches. Our study ultimately advocates for an optimistic yet cautious engagement with AI tools in social scientific inquiry, highlighting both their analytic potential and the need for ongoing reflection on their ethical implications. |
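
The first entry above checks LLM answers against expert answers by computing the cosine similarity of their transformer embeddings. The sketch below shows what such a check can look like in practice; the `sentence-transformers` library and the `all-MiniLM-L6-v2` model are illustrative assumptions, not necessarily the components used in the cited study.

```python
# Minimal sketch: compare an LLM answer with an expert answer via cosine
# similarity of sentence embeddings. Library and model choice are assumptions
# for illustration; the cited study may have used different components.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose embedding model

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical answer pair for illustration only.
llm_answer = "The review found that drought stress reduced crop yield in most studies."
expert_answer = "Most reviewed studies reported lower yields under drought conditions."

emb_llm, emb_expert = model.encode([llm_answer, expert_answer])
print(f"Semantic similarity: {cosine_similarity(emb_llm, emb_expert):.2f}")
```

Scores near 1.0 indicate close semantic agreement. As the first entry reports, correlations of 0.48 to 0.77 between this metric and expert review suggest it is a useful complement to, not a replacement for, expert judgment.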