Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or examples of specific workflow applications you may be considering.
Title | Type of Resource | Link to Resource | Date Recorded | Open Science | Use of LLM | Research Discipline(s) | Description of Resource |
---|---|---|---|---|---|---|---|
Tweetorial Hooks: Generative AI Tools to Motivate Science on Social Media (science communication) | Research Article | SciComm | December 6, 2023 | Preprint | Science Communication | Computer Science | Communicating science and technology is essential for the public to understand and engage in a rapidly changing world. Tweetorials are an emerging phenomenon where experts explain STEM topics on social media in creative and engaging ways. However, STEM experts struggle to write an engaging "hook" in the first tweet that captures the reader's attention. We propose methods to use large language models (LLMs) to help users scaffold their process of writing a relatable hook for complex scientific topics. We demonstrate that LLMs can help writers find everyday experiences that are relatable and interesting to the public, avoid jargon, and spark curiosity. Our evaluation shows that the system reduces cognitive load and helps people write better hooks. Lastly, we discuss the importance of interactivity with LLMs to preserve the correctness, effectiveness, and authenticity of the writing. |
LLMs for Science: Usage for Code Generation and Data Analysis | Research Article | arxiv.org/abs/2311.16733 | November 30, 2023 | Preprint, Open Source, Open Data, Open Code | Data Generation, Data Analysis | Computer Science | Large language models (LLMs) have been touted to enable increased productivity in many areas of today's work life. Scientific research as an area of work is no exception: the potential of LLM-based tools to assist in the daily work of scientists has become a highly discussed topic across disciplines. However, we are only at the very onset of this subject of study. It is still unclear how the potential of LLMs will materialise in research practice. With this study, we give first empirical evidence on the use of LLMs in the research process. We have investigated a set of use cases for LLM-based tools in scientific research, and conducted a first study to assess to which degree current tools are helpful. In this paper we report specifically on use cases related to software engineering, such as generating application code and developing scripts for data analytics. While we studied seemingly simple use cases, results across tools differ significantly. Our results highlight the promise of LLM-based tools in general, yet we also observe various issues, particularly regarding the integrity of the output these tools provide. |
AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction | Research Article | LLM Surveys | November 28, 2023 | Preprint | Data Collection, Data Cleaning/Preparation | Computer Science | Large language models (LLMs) that produce human-like responses have begun to revolutionize research practices in the social sciences. This paper shows how we can integrate LLMs and social surveys to accurately predict individual responses to survey questions that were not asked before. We develop a novel methodological framework to personalize LLMs by considering the meaning of survey questions derived from their text, the latent beliefs of individuals inferred from their response patterns, and the temporal contexts across different survey periods through fine-tuning LLMs with survey data. Using the General Social Survey from 1972 to 2021, we show that the fine-tuned model based on Alpaca-7b can predict individual responses to survey questions that are partially missing as well as entirely missing. The remarkable prediction capabilities allow us to fill in missing trends with high confidence and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. We discuss practical constraints, socio-demographic representation, and ethical concerns regarding individual autonomy and privacy when using LLMs for opinion prediction. This study demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs broaden survey potential, while surveys improve the alignment of LLMs. (A minimal, illustrative sketch of the instruction-tuning data format this kind of fine-tuning workflow implies appears after the table.) |
Human-AI Collaboration in Thematic Analysis using ChatGPT: A User Study and Design Recommendations | Research Article | Thematic Analysis | November 9, 2023 | Preprint | Data Analysis | Computer Science | Generative artificial intelligence (GenAI) offers promising potential for advancing human-AI collaboration in qualitative research. However, existing work has focused on conventional machine-learning and pattern-based AI systems, and little is known about how researchers interact with GenAI in qualitative research. This work delves into researchers' perceptions of their collaboration with GenAI, specifically ChatGPT. Through a user study involving ten qualitative researchers, we found ChatGPT to be a valuable collaborator for thematic analysis, enhancing coding efficiency, aiding initial data exploration, offering granular quantitative insights, and assisting comprehension for non-native speakers and non-experts. Yet, concerns about its trustworthiness and accuracy, reliability and consistency, limited contextual understanding, and broader acceptance within the research community persist. We contribute five actionable design recommendations to foster effective human-AI collaboration. These include incorporating transparent explanatory mechanisms, enhancing interface and integration capabilities, prioritising contextual understanding and customisation, embedding human-AI feedback loops and iterative functionality, and strengthening trust through validation mechanisms. |
Leveraging generative artificial intelligence to simulate student learning behavior | Research Article | Simulating Students | October 31, 2023 | Preprint | Data Generation, Data Analysis | Data Science, Education | Student simulation presents a transformative approach to enhance learning outcomes, advance educational research, and ultimately shape the future of effective pedagogy. We explore the feasibility of using large language models (LLMs), a remarkable achievement in AI, to simulate student learning behaviors. Unlike conventional machine-learning-based prediction, we leverage LLMs to instantiate virtual students with specific demographics and uncover intricate correlations among learning experiences, course materials, understanding levels, and engagement. Our objective is not merely to predict learning outcomes but to replicate the learning behaviors and patterns of real students. We validate this hypothesis through three experiments. The first experiment, based on a dataset of N = 145, simulates student learning outcomes from demographic data, revealing parallels with actual students concerning various demographic factors. The second experiment (N = 4524) produces increasingly realistic simulated behaviors as more assessment history is provided for virtual student modelling. The third experiment (N = 27), incorporating prior knowledge and course interactions, indicates a strong link between virtual students' learning behaviors and fine-grained mappings from test questions, course materials, engagement, and understanding levels. Collectively, these findings deepen our understanding of LLMs and demonstrate their viability for student simulation, empowering more adaptable curriculum design to enhance inclusivity and educational effectiveness. |
Reaching the Gold Standard: Automated Text Analysis with Generative Pre-trained Transformers Matches Human-Level Performance | Research Article | Gold Standard | October 12, 2023 | Preprint | Data Analysis | Computer Science | Natural language is a vital source of evidence for the social sciences. Yet quantifying large volumes of text rigorously and precisely is extremely difficult, and automated methods have struggled to match the “gold standard” of human coding. The present work used GPT-4 to conduct an automated analysis of 1,356 essays, rating the authors’ spirituality on a continuous scale. This presents an especially challenging test for automated methods, due to the subtlety of the concept and the difficulty of inferring complex personality traits from a person’s writing. Nonetheless, we found that GPT-4’s ratings demonstrated excellent internal reliability, remarkable consistency with a human rater, and strong correlations with self-report measures and behavioral indicators of spirituality. These results suggest that, even on nuanced tasks requiring a high degree of conceptual sophistication, automated text analysis with Generative Pre-trained Transformers can match human-level performance. Hence, these results demonstrate the extraordinary potential for such tools to advance social scientific research. |
Can large language models provide useful feedback on research papers? A large-scale empirical analysis | Research Article | Improve Research Articles | October 4, 2023 | Preprint | Describing Results | Computer Science | Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings have an especially hard time getting timely feedback. With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback with human peer reviewer feedback in 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is larger for the weaker papers. We then conducted a prospective user study with 308 researchers from 110 US institutions in the fields of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4-generated feedback helpful/very helpful, and 82.4% found it more beneficial than feedback from at least some human reviewers. While our findings show that LLM-generated feedback can help researchers, we also identify several limitations. (A minimal, illustrative sketch of this kind of PDF-to-feedback pipeline appears after the table.) |
A Large Language Model Approach to Educational Survey Feedback Analysis | Research Article | Class Feedback Analysis | October 2, 2023 | Preprint | Data Analysis | Education | This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys. Exploration of LLM use cases in education has focused on teaching and learning, with less exploration of capabilities in education feedback analysis. Survey analysis in education involves goals such as finding gaps in curricula or evaluating teachers, and often requires time-consuming manual processing of textual responses. LLMs have the potential to provide a flexible means of achieving these goals without specialized machine learning models or fine-tuning. We demonstrate a versatile approach to such goals by treating them as sequences of natural language processing (NLP) tasks, including classification (multi-label, multi-class, and binary), extraction, thematic analysis, and sentiment analysis, each performed by an LLM. We apply these workflows to a real-world dataset of 2,500 end-of-course survey comments from biomedical science courses, and evaluate a zero-shot approach (i.e., requiring no examples or labeled training data) across all tasks, reflecting education settings, where labeled data is often scarce. By applying effective prompting practices, we achieve human-level performance on multiple tasks with GPT-4, enabling the workflows necessary to achieve typical goals. We also show the potential of inspecting LLMs' chain-of-thought (CoT) reasoning for providing insight that may foster confidence in practice. Moreover, this study features the development of a versatile set of classification categories, suitable for various course types (online, hybrid, or in-person) and amenable to customization. Our results suggest that LLMs can be used to derive a range of insights from survey text. (A minimal, illustrative zero-shot classification sketch of this kind of workflow appears after the table.) |
The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation | Research Article | ChatGPT Papers | September 27, 2023 | Open Source | Other | Education, Other | Background: Artificial intelligence (AI) has many applications in various aspects of our daily life, including health, criminal, education, civil, business, and liability law. One aspect of AI that has gained significant attention is natural language processing (NLP), which refers to the ability of computers to understand and generate human language. Objective: This study aims to examine the potential for, and concerns of, using AI in scientific research. For this purpose, high-impact research articles were generated by analyzing the quality of reports generated by ChatGPT and assessing the application's impact on the research framework, data analysis, and the literature review. The study also explored concerns around ownership and the integrity of research when using AI-generated text. Methods: A total of 4 articles were generated using ChatGPT, and thereafter evaluated by 23 reviewers. The researchers developed an evaluation form to assess the quality of the articles generated. Additionally, 50 abstracts were generated using ChatGPT and their quality was evaluated. The data were subjected to ANOVA and thematic analysis to analyze the qualitative data provided by the reviewers. Results: When using detailed prompts and providing the context of the study, ChatGPT would generate high-quality research that could be published in high-impact journals. However, ChatGPT had a minor impact on developing the research framework and data analysis. The primary area needing improvement was the development of the literature review. Moreover, reviewers expressed concerns around ownership and the integrity of the research when using AI-generated text. Nonetheless, ChatGPT has a strong potential to increase human productivity in research and can be used in academic writing. Conclusions: AI-generated text has the potential to improve the quality of high-impact research articles. The findings of this study suggest that decision makers and researchers should focus more on the methodology part of the research, which includes research design, developing research tools, and analyzing data in depth, to draw strong theoretical and practical implications, thereby establishing a revolution in scientific research in the era of AI. The practical implications of this study can be used in different fields such as medical education to deliver materials to develop the basic competencies for both medicine students and faculty members. |
ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing? (ver. 23Q3) | Research Article | Scientific Writing | September 19, 2023 | Preprint | Describing Results | Other | Historically, proficient writing was deemed essential for human advancement, with creative expression viewed as one of the hallmarks of human achievement. However, recent advances in generative AI have marked an inflection point in this narrative, including for scientific writing. This article provides a comprehensive analysis of the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology. The methodology was based on human experts tagging AI-generated content for quantitative accuracy and qualitative precision. Quantitative accuracy assessed factual correctness, while qualitative precision gauged the scientific contribution. While the AI chatbots, especially ChatGPT-4, demonstrated proficiency in recombining existing knowledge, they failed to generate original scientific content. As a side note, our results also suggest that with ChatGPT-4 the size of LLMs has plateaued. Furthermore, the paper underscores the intricate and recursive nature of human research. This process of transforming raw data into refined knowledge is computationally irreducible, which highlights the challenges AI chatbots face in emulating human originality in scientific writing. In conclusion, while large language models have revolutionised content generation, their ability to produce original scientific contributions in the humanities remains limited. We expect that this will change in the near future with the evolution of current LLM-based AI chatbots towards LLM-powered software. |
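
The AI-Augmented Surveys entry above fine-tunes an Alpaca-7b model on survey responses. As a rough illustration of what that implies for data preparation, here is a minimal sketch that converts simplified, hypothetical GSS-like records into Alpaca-style instruction/response pairs. The field names, prompt wording, and output file name are assumptions for illustration, not the authors' actual pipeline, which also encodes question meaning, latent beliefs, and survey-period context.

```python
import json

# Illustrative only: simplified, hypothetical GSS-like records.
survey_rows = [
    {"year": 2018, "respondent": "r001",
     "question": "Do you favor or oppose same-sex marriage?",
     "answer": "Favor"},
    {"year": 2021, "respondent": "r002",
     "question": "Should the government reduce income differences?",
     "answer": "Neither agree nor disagree"},
]

def to_alpaca_example(row: dict) -> dict:
    """Convert one survey response into an instruction-tuning example."""
    return {
        "instruction": ("You are simulating a survey respondent. "
                        "Answer the question as that respondent would."),
        "input": (f"Survey year: {row['year']}. "
                  f"Respondent ID: {row['respondent']}. "
                  f"Question: {row['question']}"),
        "output": row["answer"],
    }

# Write one JSON object per line, the format many fine-tuning scripts expect.
with open("survey_finetune.jsonl", "w", encoding="utf-8") as f:
    for row in survey_rows:
        f.write(json.dumps(to_alpaca_example(row)) + "\n")
```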
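
The row on LLM-generated feedback for research papers describes a GPT-4 pipeline that comments on full PDFs. Below is a minimal sketch of that kind of workflow, assuming the pypdf and openai (v1+) Python packages and an OPENAI_API_KEY in the environment; the model name, prompt, and truncation limit are illustrative assumptions rather than the authors' released system.

```python
from openai import OpenAI      # pip install openai   (v1+ client)
from pypdf import PdfReader    # pip install pypdf

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def feedback_on_pdf(path: str, model: str = "gpt-4") -> str:
    """Extract a manuscript's text and ask an LLM for reviewer-style comments."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    text = text[:60000]  # crude truncation to stay within the context window
    prompt = (
        "Act as a journal reviewer. List the main strengths, weaknesses, "
        "and concrete suggestions for the following manuscript:\n\n" + text
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # "manuscript.pdf" is a placeholder path for illustration.
    print(feedback_on_pdf("manuscript.pdf"))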
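
Several rows above (the essay-rating and course-feedback studies) use zero-shot prompting of GPT-4 to code free-text responses. The following minimal sketch shows the general pattern for single-label, zero-shot classification of a survey comment, again assuming the openai (v1+) package and an OPENAI_API_KEY; the category labels and prompt are hypothetical placeholders, not the validated schemes developed in those papers.

```python
from openai import OpenAI  # pip install openai   (v1+ client)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical label set; the cited studies define and validate their own
# category schemes and rating scales.
CATEGORIES = ["course content", "instructor", "assessment", "logistics", "other"]

def classify_comment(comment: str, model: str = "gpt-4") -> str:
    """Zero-shot, single-label classification of one survey comment."""
    prompt = (
        "Classify the following end-of-course survey comment into exactly one "
        f"of these categories: {', '.join(CATEGORIES)}.\n"
        "Reply with the category name only.\n\n"
        f"Comment: {comment}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    print(classify_comment("The weekly quizzes were far too long for the material covered."))
```

The same pattern extends to the other zero-shot tasks described in that row (extraction, thematic analysis, sentiment analysis) by changing the instruction text, and reliability is typically checked by comparing the model's labels against a human-coded subset.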