Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or for specific workflow applications you may be considering.

Title | Type of Resource | Description of Resource | Link to Resource | Open Science | Use of LLM | Research Discipline(s)
Leveraging generative artificial intelligence to simulate student learning behavior Research Article Student simulation presents a transformative approach to enhance learning outcomes, advance educational research, and ultimately shape the future of effective pedagogy. We explore the feasibility of using large language models (LLMs), a remarkable achievement in AI, to simulate student learning behaviors. Unlike conventional machine learning based prediction, we leverage LLMs to instantiate virtual students with specific demographics and uncover intricate correlations among learning experiences, course materials, understanding levels, and engagement. Our objective is not merely to predict learning outcomes but to replicate learning behaviors and patterns of real students. We validate this hypothesis through three experiments. The first experiment, based on a dataset of N = 145, simulates student learning outcomes from demographic data, revealing parallels with actual students concerning various demographic factors. The second experiment (N = 4524) results in increasingly realistic simulated behaviors with more assessment history for virtual students modelling. The third experiment (N = 27), incorporating prior knowledge and course interactions, indicates a strong link between virtual students' learning behaviors and fine-grained mappings from test questions, course materials, engagement and understanding levels. Collectively, these findings deepen our understanding of LLMs and demonstrate its viability for student simulation, empowering more adaptable curricula design to enhance inclusivity and educational effectiveness. Preprint Data Generation, Data Analysis Data Science, Education
Reaching the Gold Standard: Automated Text Analysis with Generative Pre-trained Transformers Matches Human-Level Performance Research Article Natural language is a vital source of evidence for the social sciences. Yet quantifying large volumes of text rigorously and precisely is extremely difficult, and automated methods have struggled to match the “gold standard” of human coding. The present work used GPT-4 to conduct an automated analysis of 1,356 essays, rating the authors’ spirituality on a continuous scale. This presents an especially challenging test for automated methods, due to the subtlety of the concept and the difficulty of inferring complex personality traits from a person’s writing. Nonetheless, we found that GPT-4’s ratings demonstrated excellent internal reliability, remarkable consistency with a human rater, and strong correlations with self-report measures and behavioral indicators of spirituality. These results suggest that, even on nuanced tasks requiring a high degree of conceptual sophistication, automated text analysis with Generative Pre-trained Transformers can match human-level performance. Hence, these results demonstrate the extraordinary potential for such tools to advance social scientific research. Preprint Data Analysis Computer Science
AUTOGEN: A Personalized Large Language Model for Academic Enhancement—Ethics and Proof of Principle Research Article Fine-tuning on authors' previously published papers Open Source Data Generation Other
Can large language models provide useful feedback on research papers? A large-scale empirical analysis Research Article Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings have especially hard times getting timely feedback. With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback with human peer reviewer feedback in 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is larger for the weaker papers. We then conducted a prospective user study with 308 researchers from 110 US institutions in the field of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4 generated feedback helpful/very helpful and 82.4% found it more beneficial than feedback from at least some human reviewers. While our findings show that LLM-generated feedback can help researchers, we also identify several limitations. Preprint Describing Results Computer Science
A Large Language Model Approach to Educational Survey Feedback Analysis Research Article This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys. Exploration of LLM use cases in education has focused on teaching and learning, with less exploration of capabilities in education feedback analysis. Survey analysis in education involves goals such as finding gaps in curricula or evaluating teachers, often requiring time-consuming manual processing of textual responses. LLMs have the potential to provide a flexible means of achieving these goals without specialized machine learning models or fine-tuning. We demonstrate a versatile approach to such goals by treating them as sequences of natural language processing (NLP) tasks including classification (multi-label, multi-class, and binary), extraction, thematic analysis, and sentiment analysis, each performed by LLM. We apply these workflows to a real-world dataset of 2500 end-of-course survey comments from biomedical science courses, and evaluate a zero-shot approach (i.e., requiring no examples or labeled training data) across all tasks, reflecting education settings, where labeled data is often scarce. By applying effective prompting practices, we achieve human-level performance on multiple tasks with GPT-4, enabling workflows necessary to achieve typical goals. We also show the potential of inspecting LLMs' chain-of-thought (CoT) reasoning for providing insight that may foster confidence in practice. Moreover, this study features development of a versatile set of classification categories, suitable for various course types (online, hybrid, or in-person) and amenable to customization. Our results suggest that LLMs can be used to derive a range of insights from survey text. Preprint Data Analysis Education
The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation Research Article Background: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. Objective: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, high-impact research articles were generated by analyzing the quality of reports generated by ChatGPT and assessing the application’s impact on the research framework, data analysis, and the literature review. The study also explored concerns around ownership and the integrity of research when using AI-generated text. Methods: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. Results: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. Conclusions: AI-generated text has the potential to improve the quality of high-impact research articles. The findings of this study suggest that decision makers and researchers should focus more on the methodology part of the research, which includes research design, developing research tools, and analyzing data in depth, to draw strong theoretical and practical implications, thereby establishing a revolution in scientific research in the era of AI. The practical implications of this study can be used in different fields such as medical education to deliver materials to develop the basic competencies for both medicine students and faculty members. Open Source Other Education, Other
ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing? (ver. 23Q3) Research Article Historically, proficient writing was deemed essential for human advancement, with creative expression viewed as one of the hallmarks of human achievement. However, recent advances in generative AI have marked an inflection point in this narrative, including for scientific writing. This article provides a comprehensive analysis of the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology. The methodology was based on tagging AI generated content for quantitative accuracy and qualitative precision by human experts. Quantitative accuracy assessed the factual correctness, while qualitative precision gauged the scientific contribution. While the AI chatbots, especially ChatGPT-4, demonstrated proficiency in recombining existing knowledge, they failed in generating original scientific content. As a side note, our results also suggest that with ChatGPT-4 the size of the LLMs has plateaued. Furthermore, the paper underscores the intricate and recursive nature of human research. This process of transforming raw data into refined knowledge is computationally irreducible, which highlights the challenges AI chatbots face in emulating human originality in scientific writing. In conclusion, while large language models have revolutionised content generation, their ability to produce original scientific contributions in the humanities remains limited. We expect that this will change in the near future with the evolution of current LLM-based AI chatbots towards LLM-powered software. Preprint Describing Results Other
Diminished Diversity-of-Thought in a Standard Large Language Model Research Article We test whether Large Language Models (LLMs) can be used to simulate human participants in social-science studies. To do this, we run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model, colloquially known as GPT3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the "correct answer" effect. Different runs of GPT3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses: with the supposedly "correct answer." In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt. In another, we found that most but not all "correct answers" were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey results, where we found GPT3.5 identifying as a political conservative in 99.6% of the cases, and as a liberal in 99.3% of the cases in the reverse-order condition. However, both self-reported 'GPT conservatives' and 'GPT liberals' showed right-leaning moral foundations. Our results cast doubts on the validity of using LLMs as a general replacement for human participants in the social sciences. Our results also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity-of-thought. Preprint Data Generation Computer Science
Using OpenAI models as a new tool for text analysis in political leaders’ unstructured discourse Research Article This study explores the application of Large Language Models (LLMs) and Automatic Speech Recognition (ASR) models in the analysis of right-wing unstructured political discourse in Peru, focusing on how the concept of freedom is framed. Three types of freedom are identified: personal autonomy, economic freedom, and civil liberties. Utilizing the transcription of OpenAI’s ASR Whisper and GPT-3.5 and GPT-4 models, interviews with three Peruvian right-wing political leaders are analyzed: Rafael López Aliaga, Hernando de Soto and Keiko Fujimori. The results show that GPT-4 beats GPT-3.5 in identifying dimensions of freedom, although there are discrepancies compared to human coding. Despite challenges in classifying abstract and ambiguous concepts, the findings demonstrate GPT-4's ability to classify complexities within political discourse at comparatively small costs and easy access. The research suggests the need for additional refinement, ethical consideration, and ongoing exploration in the analysis of political speeches through AI. Preprint Data Analysis Political Science
Extracting protest events from newspaper articles with ChatGPT Research Article This research note examines the abilities of a large language model (LLM), ChatGPT, to extract structured data on protest events from media accounts. Based on our analysis of 500 articles on Black Lives Matter protests, after an iterative process of prompt improvement on a training dataset, ChatGPT can produce data comparable to or better than a hand-coding method with an enormous reduction in time and minimal cost. While the technique has limitations, LLMs show promise and deserve further study for their use in protest event analysis. Preprint Data Collection Other