Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or to look for specific workflow applications you may be considering.
Title | Type of Resource | Link to Resource | Date Recorded | Open Science | Use of LLM | Research Discipline(s) | Description of Resource |
---|---|---|---|---|---|---|---|
Advancing Instrument Validation in Social Sciences: An AI-Powered Chatbot and Interactive Website based on Research Instrument Validation Framework (RIVF) | Research Article | Validity | September 22, 2024 | Open Source | Research Design, Data Collection | Other | Background: In social sciences, ensuring a high level of instrument validation is crucial for upholding the principles of scientific rigor and maintaining the overall quality of research. Objectives: To develop and evaluate an AI chatbot and website for instrument validation, assess their impact on instrument validity improvement, and analyze user perceptions. Methods: Adopting a quantitative design, the study was anchored in the Research Instrument Validation Framework (RIVF) developed by Villarino (2024). The chatbot and website were evaluated through users' perceptions (n=100) via an online survey: paired t-tests contrasted instrument validity scores before and after RIVF use, and one-way ANOVA tested whether users' perceptions were related to the overall improvement in instrument validity. A G*Power analysis indicated sufficient statistical power for the analyses: 99.73% for the paired t-tests (n = 100, dz = 0.5, α = 0.05) and 80.95% for the one-way ANOVA (n = 100, f = 0.25, α = 0.05, four groups). All data were analyzed using IBM SPSS version 26. Results: After RIVF use, all validity domains showed significant improvements (p<0.001), with the largest gain in construct validity [mean difference = 1.20±0.60, t(49) = 14.14]. Participants perceived the AI chatbot as more useful than the RIVF website [4.30±0.70 vs. 3.80±0.80, p<0.001]. Conclusion: This AI-powered environment shows potential for increasing the validity of research instruments within the RIVF, with the AI chatbot being particularly effective at improving construct validity. These findings suggest that AI technologies, used alongside traditional validation methods, can enhance the quality of research instruments in the social sciences. |
Let's Get to the Point: LLM-Supported Planning, Drafting, and Revising of Research-Paper Blog Posts | Research Article | paper blogs | September 22, 2024 | Preprint | Science Communication | Computer Science | Research-paper blog posts help scientists disseminate their work to a larger audience, but translating papers into this format requires substantial additional effort. Blog post creation is not simply transforming a long-form article into a short output, as studied in most prior work on human-AI summarization. In contrast, blog posts are typically full-length articles that require a combination of strategic planning grounded in the source document, well-organized drafting, and thoughtful revisions. Can tools powered by large language models (LLMs) assist scientists in writing research-paper blog posts? To investigate this question, we conducted a formative study (N=6) to understand the main challenges of writing such blog posts with an LLM: high interaction costs for 1) reviewing and utilizing the paper content and 2) recurrent sub-tasks of generating and modifying the long-form output. To address these challenges, we developed Papers-to-Posts, an LLM-powered tool that implements a new Plan-Draft-Revise workflow, which 1) leverages an LLM to generate bullet points from the full paper to help users find and select content to include (Plan) and 2) provides default yet customizable LLM instructions for generating and modifying text (Draft, Revise). Through a within-subjects lab study (N=20) and between-subjects deployment study (N=37 blog posts, 26 participants) in which participants wrote blog posts about their papers, we compared Papers-to-Posts to a strong baseline tool that provides an LLM-generated draft and access to free-form LLM prompting. Results show that Papers-to-Posts helped researchers to 1) write significantly more satisfying blog posts and make significantly more changes to their blog posts in a fixed amount of time without a significant change in cognitive load (lab) and 2) make more changes to their blog posts for a fixed number of writing actions (deployment). |
An Examination of the Use of Large Language Models to Aid Analysis of Textual Data | Research Article | Textual Data | September 22, 2024 | Open Source | Data Analysis | Statistics, Other | The increasing use of machine learning and Large Language Models (LLMs) opens up opportunities to use these artificially intelligent algorithms in novel ways. This article proposes a methodology using LLMs to support traditional deductive coding in qualitative research. We began our analysis with three different sample texts taken from existing interviews. Next, we created a codebook and provided the sample text and codebook to an LLM. We asked the LLM to determine if the codes were present in the sample text provided and requested evidence to support the coding. Each sample text was submitted 160 times to record changes between iterations of the LLM's responses. Each iteration was analogous to a new coder deductively analyzing the text with the codebook information. In our results, we present the outputs for these recursive analyses, along with a comparison of the LLM coding to evaluations made by human coders using traditional coding methods. We argue that LLM analysis can aid qualitative researchers by deductively coding transcripts, providing a systematic and reliable platform for code identification, and offering a means of avoiding analysis misalignment. Implications of using LLMs in research praxis are discussed, along with current limitations. (A minimal illustrative sketch of this deductive-coding workflow appears below the table.) |
ChatGPT for Education Research: Exploring the Potential of Large Language Models for Qualitative Codebook Development | Research Article | ChatGPT for Ed | September 18, 2024 | | Data Analysis | Education | In qualitative data analysis, codebooks offer a systematic framework for establishing shared interpretations of themes and patterns. While the utility of codebooks is well-established in educational research, the manual process of developing and refining codes that emerge bottom-up from data presents a challenge in terms of time, effort, and potential for human error. This paper explores the potentially transformative role that could be played by Large Language Models (LLMs), specifically ChatGPT (GPT-4), in addressing these challenges by automating aspects of the codebook development process. We compare four approaches to codebook development – a fully manual approach, a fully automated approach, and two approaches that leverage ChatGPT within specific steps of the codebook development process. We do so in the context of studying transcripts from math tutoring lessons. The resultant four codebooks were evaluated in terms of whether the codes could reliably be applied to data by human coders, in terms of the human-rated quality of codes and codebooks, and whether different approaches yielded similar or overlapping codes. The results show that approaches that automate early stages of codebook development take less time to complete overall. Hybrid approaches (whether GPT participates early or late in the process) produce codebooks that can be applied more reliably and were rated as better quality by humans. Hybrid approaches and a fully human approach produce similar codebooks; the fully automated approach was an outlier. Findings indicate that ChatGPT can be valuable for improving qualitative codebooks for use in AIED research, but human participation is still essential. |
From nCoder to ChatGPT: From Automated Coding to Refining Human Coding | Research Article (conference paper) | nCoder to ChatGPT | September 18, 2024 | | Data Analysis (qualitative) | Other | This paper investigates the potential of utilizing ChatGPT (GPT-4) as a tool for supporting coding processes for Quantitative Ethnography research. We compare the use of ChatGPT and nCoder, the most widely used automated coding tool in the QE community, on a dataset of press releases and public addresses delivered by governmental leaders from seven countries from late February to late March 2020. The study assesses the accuracy of the automated coding procedures between the two tools, and the role that ChatGPT’s explanations of its coding decisions can play in improving the consistency and construct validity of human-generated codes. Results suggest that both ChatGPT and nCoder have advantages and disadvantages depending on the context, nature of the data, and researchers’ goals. While nCoder is useful for straightforward coding schemes represented through regular expressions, ChatGPT can better capture a variety of language structures. ChatGPT's ability to provide explanations for its decisions can also help enhance construct validity, identify ambiguity in code definitions, and assist human coders in achieving high interrater reliability. Although we identify limitations of ChatGPT in coding constructs open to human interpretations and encompassing multiple concepts, we highlight opportunities and potential benefits provided by ChatGPT as a tool to support human researchers in their coding process. |
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers | Research Article | Novel Research Ideas | September 9, 2024 | Preprint | Other | Computer Science | Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome. |
GPT Takes the SAT: Tracing Changes in Test Difficulty and Students' Math Performance | Research Article | GPT Takes the SAT | August 17, 2024 | Open Source | Data Generation | Education | The Scholastic Aptitude Test (SAT) is crucial for college admissions, but its effectiveness and relevance are increasingly questioned. This paper enhances Synthetic Control methods by introducing Transformed Control, a novel method that employs Large Language Models (LLMs) powered by Artificial Intelligence to generate control groups. We utilize OpenAI's API to generate a control group where GPT-4, or ChatGPT, takes multiple SATs annually from 2008 to 2023. This control group helps analyze shifts in SAT math difficulty over time, starting from the baseline year of 2008. Using parallel trends, we calculate the Average Difference in Scores (ADS) to assess changes in high school students' math performance. Our results indicate a significant decrease in the difficulty of the SAT math section over time, alongside a decline in students' math performance. The analysis shows a 71-point drop in the rigor of SAT math from 2008 to 2023, with student performance decreasing by 36 points, resulting in a 107-point total divergence in average student math performance. We investigate possible mechanisms for this decline in math proficiency, such as changing university selection criteria, increased screen time, grade inflation, and worsening adolescent mental health. Disparities among demographic groups show a 104-point drop for White students, 84 points for Black students, and 53 points for Asian students. Male students saw a 117-point reduction, while female students had a 100-point decrease. This research highlights the need to reconsider the SAT's role in admissions and to update educational strategies to enhance high school math performance. (A minimal illustrative sketch of this control-group comparison appears below the table.) |
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [Github Repo] | Research Article, Use Case Example, Tutorial w/ Code, Application/Tool | AI Scientists | August 15, 2024 | Open Source | Data Generation, Data Analysis, Science Communication | Computer Science | One of the grand challenges of artificial intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used to aid human scientists, e.g. for brainstorming ideas or writing code, they still require extensive manual supervision or are heavily constrained to a specific task. We're excited to introduce The AI Scientist, the first comprehensive system for fully automatic scientific discovery, enabling Foundation Models such as Large Language Models (LLMs) to perform research independently. |
A Survey of Large Language Models | Research Article | https://arxiv.org/abs/2303.18223 | June 21, 2024 | Preprint | Other | Computer Science | Ever since the Turing Test was proposed in the 1950s, humans have explored how machines can master language intelligence. Language is essentially a complex, intricate system of human expression governed by grammatical rules, and developing capable artificial intelligence (AI) algorithms for comprehending and producing language poses a significant challenge. As a major approach, language modeling has been widely studied for language understanding and generation over the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling can lead to improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to even larger sizes. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also exhibit special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To distinguish language models of different parameter scales, the research community has coined the term large language models (LLMs) for PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Research on LLMs has recently been advanced substantially by both academia and industry, and a remarkable milestone is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has had an important impact on the entire AI community and is changing the way we develop and use AI algorithms. Considering this rapid technical progress, this survey reviews recent advances in LLMs by introducing the background, key findings, and mainstream techniques, focusing on four major aspects: pre-training, adaptation tuning, utilization, and capacity evaluation. It also summarizes the available resources for developing LLMs and discusses remaining issues for future directions. This survey provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers. |
Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review | Research Article | https://arxiv.org/abs/2402.10350 | June 21, 2024 | Preprint | Other | Computer Science | This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection, highlighting the current state of research, inherent challenges, and prospective future directions. LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains. However, this review identifies several critical challenges that impede their broader adoption and effectiveness, including the reliance on vast historical datasets, issues with generalizability across different contexts, the phenomenon of model hallucinations, limitations within the models’ knowledge boundaries, and the substantial computational resources required. Through detailed analysis, this review discusses potential solutions and strategies to overcome these obstacles, such as integrating multimodal data, advancements in learning methodologies, and emphasizing model explainability and computational efficiency. Moreover, this review outlines critical trends that are likely to shape the evolution of LLMs in these fields, including the push toward real-time processing, the importance of sustainable modeling practices, and the value of interdisciplinary collaboration. In conclusion, this review underscores the transformative impact LLMs could have on forecasting and anomaly detection while emphasizing the need for continuous innovation, ethical considerations, and practical solutions to realize their full potential. |
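
The deductive-coding workflow described in "An Examination of the Use of Large Language Models to Aid Analysis of Textual Data" above can be illustrated with a short script. The sketch below is a minimal example under stated assumptions, not the authors' actual pipeline: it assumes the OpenAI Python client with an `OPENAI_API_KEY` environment variable, and the codebook, excerpt, prompt wording, and model name are all hypothetical.

```python
# Minimal sketch of LLM-assisted deductive coding (illustrative only).
# Assumes the OpenAI Python client (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; the codebook, excerpt, and model name
# below are hypothetical, not the article's materials.
import json
from openai import OpenAI

client = OpenAI()

codebook = {
    "belonging": "Speaker describes feeling accepted or excluded by a group.",
    "self_efficacy": "Speaker expresses confidence or doubt in their own ability.",
}

excerpt = (
    "Honestly, I never felt like I fit in with the lab, "
    "but I knew I could finish the analysis on my own."
)

prompt = (
    "You are assisting with deductive qualitative coding.\n"
    f"Codebook: {json.dumps(codebook, indent=2)}\n"
    f'Excerpt: "{excerpt}"\n'
    "For each code, state whether it is present in the excerpt and quote the "
    "supporting text. Respond as JSON mapping each code to "
    '{"present": true/false, "evidence": "..."}.'
)

# The article reports submitting each text 160 times and treating every
# response as a new "coder"; a few iterations are enough to show the idea.
for i in range(3):
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(f"Iteration {i + 1}:", response.choices[0].message.content)
```

Comparing the JSON outputs across iterations, and against human coders' decisions, is the consistency check the article describes.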
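Similarly, the Transformed Control comparison in "GPT Takes the SAT" above comes down to contrasting the change in the LLM's score on each year's test (a proxy for test difficulty) with the change in students' average score, both measured against the 2008 baseline. The sketch below uses made-up placeholder scores and one plausible reading of the divergence arithmetic; it is not the study's data, nor necessarily its exact definition of the Average Difference in Scores (ADS).

```python
# Minimal sketch of a Transformed Control style comparison (illustrative only).
# The scores below are made-up placeholders, not the study's data, and the
# divergence formula is one plausible reading of the abstract.

BASELINE_YEAR = 2008

# year -> (GPT-4 score on that year's SAT math, national student average)
scores = {
    2008: (620, 515),
    2016: (655, 500),
    2023: (691, 479),
}

gpt_base, student_base = scores[BASELINE_YEAR]

for year, (gpt, student) in sorted(scores.items()):
    difficulty_shift = gpt - gpt_base            # higher LLM score ~ easier test
    performance_shift = student - student_base   # change in student performance
    divergence = difficulty_shift - performance_shift
    print(
        f"{year}: difficulty shift {difficulty_shift:+d}, "
        f"student shift {performance_shift:+d}, divergence {divergence:+d}"
    )
```

With placeholder shifts of +71 (easier test) and -36 (lower student scores) by 2023, the divergence works out to 107 points, matching the pattern the abstract reports.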