Articles

Below are articles that use LLMs in their research workflows. You can use the Search option to find examples from your discipline, or for specific workflow applications you may be considering.

Each entry lists the following fields: Title, Type of Resource, Link to Resource, Date Recorded, Open Science, Use of LLM, Research Discipline(s), and Description of Resource.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Type of Resource: Research Article | Date Recorded: September 9, 2024 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Computer Science
Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.
GPT Takes the SAT: Tracing Changes in Test Difficulty and Students' Math Performance
Type of Resource: Research Article | Date Recorded: August 17, 2024 | Open Science: Open Source | Use of LLM: Data Generation | Discipline(s): Education
The Scholastic Aptitude Test (SAT) is crucial for college admissions, but its effectiveness and relevance are increasingly questioned. This paper enhances Synthetic Control methods by introducing Transformed Control, a novel method that employs Large Language Models (LLMs) powered by Artificial Intelligence to generate control groups. We utilize OpenAI's API to generate a control group in which GPT-4 (ChatGPT) takes multiple SATs annually from 2008 to 2023. This control group helps analyze shifts in SAT math difficulty over time, starting from the baseline year of 2008. Using parallel trends, we calculate the Average Difference in Scores (ADS) to assess changes in high school students' math performance. Our results indicate a significant decrease in the difficulty of the SAT math section over time, alongside a decline in students' math performance. The analysis shows a 71-point drop in the rigor of SAT math from 2008 to 2023, with student performance decreasing by 36 points, resulting in a 107-point total divergence in average student math performance. We investigate possible mechanisms for this decline in math proficiency, such as changing university selection criteria, increased screen time, grade inflation, and worsening adolescent mental health. Disparities among demographic groups show a 104-point drop for White students, 84 points for Black students, and 53 points for Asian students. Male students saw a 117-point reduction, while female students had a 100-point decrease. This research highlights the need to reconsider the SAT's role in admissions and to update educational strategies to enhance high school math performance.
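The Transformed Control idea in this entry can be sketched in a few lines. This is an illustrative sketch only, not the paper's code: the score dictionaries are invented solely to mirror the reported headline arithmetic (the control group's scores rising 71 points as the test eases, students' average falling 36 points, a 107-point divergence); in the study, the control scores would come from submitting each year's SAT math section to GPT-4 via the OpenAI API.

```python
def average_difference_in_scores(control, students, base=2008, end=2023):
    """ADS sketch: how far student performance drifted from the LLM
    control group between the baseline year and the end year."""
    control_shift = control[end] - control[base]    # test eases -> control scores rise
    student_shift = students[end] - students[base]  # students' average falls
    return control_shift - student_shift

# Invented values chosen only to reproduce the reported shifts (+71 and -36):
control_scores = {2008: 629, 2023: 700}
student_scores = {2008: 536, 2023: 500}
divergence = average_difference_in_scores(control_scores, student_scores)  # 107
```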
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [GitHub Repo]
Type of Resource: Research Article, Use Case Example, Tutorial w/ Code, Application/Tool | Date Recorded: August 15, 2024 | Open Science: Open Source | Use of LLM: Data Generation, Data Analysis, Science Communication | Discipline(s): Computer Science
One of the grand challenges of artificial intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used to aid human scientists, e.g., for brainstorming ideas or writing code, they still require extensive manual supervision or are heavily constrained to a specific task. We're excited to introduce The AI Scientist, the first comprehensive system for fully automatic scientific discovery, enabling Foundation Models such as Large Language Models (LLMs) to perform research independently.
A Survey of Large Language Models
Type of Resource: Research Article | Date Recorded: June 21, 2024 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Computer Science
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastery of language intelligence by machines. Language is essentially a complex, intricate system of human expression governed by grammatical rules, and developing capable artificial intelligence (AI) algorithms for comprehending and grasping a language poses a significant challenge. As a major approach, language modeling has been widely studied for language understanding and generation over the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs), built by pre-training Transformer models over large-scale corpora, have shown strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling leads to improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to even larger sizes. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also exhibit special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To distinguish language models at different parameter scales, the research community has coined the term large language models (LLMs) for PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, research on LLMs has been advanced rapidly by both academia and industry, and a remarkable milestone is the launch of ChatGPT (a powerful AI chatbot developed on top of LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community and may revolutionize the way we develop and use AI algorithms.
Considering this rapid technical progress, this survey reviews recent advances in LLMs, introducing the background, key findings, and mainstream techniques. In particular, it focuses on four major aspects of LLMs: pre-training, adaptation tuning, utilization, and capacity evaluation. It also summarizes the available resources for developing LLMs and discusses remaining issues and future directions. The survey provides an up-to-date review of the literature on LLMs and can be a useful resource for both researchers and engineers.
Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review
Type of Resource: Research Article | Date Recorded: June 21, 2024 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Computer Science
This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection, highlighting the current state of research, inherent challenges, and prospective future directions. LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains. However, this review identifies several critical challenges that impede their broader adoption and effectiveness, including the reliance on vast historical datasets, issues with generalizability across different contexts, the phenomenon of model hallucinations, limitations within the models' knowledge boundaries, and the substantial computational resources required. Through detailed analysis, this review discusses potential solutions and strategies to overcome these obstacles, such as integrating multimodal data, advancements in learning methodologies, and emphasizing model explainability and computational efficiency. Moreover, this review outlines critical trends that are likely to shape the evolution of LLMs in these fields, including the push toward real-time processing, the importance of sustainable modeling practices, and the value of interdisciplinary collaboration. Conclusively, this review underscores the transformative impact LLMs could have on forecasting and anomaly detection while emphasizing the need for continuous innovation, ethical considerations, and practical solutions to realize their full potential.
Experimental Evidence on Large Language Models
Type of Resource: Research Article | Date Recorded: May 19, 2024 | Open Science: Preprint | Use of LLM: Data Analysis | Discipline(s): Economics
This paper investigates the formation of inflation expectations using Large Language Models (LLMs) based on different text data. Employing a new experimental design, I integrate generative AI with economic analysis to explore the impact of different information treatments on LLMs' responses. Results from six distinct knowledge sources reveal that the type of information accessible to an LLM significantly affects the variance of its generated expectations. LLMs with access to relevant economic documents exhibit lower variance than those given irrelevant information. Furthermore, information treatments, particularly the one related to mortgage rates, influence the updating of LLMs' prior inflation expectations, mirroring findings from human surveys. The findings underscore the importance of providing domain-specific knowledge to LLMs and showcase the potential of AI agents in studying expectation formation and decision-making processes in economics.
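The variance comparison at the heart of this design can be sketched as follows. This is an illustrative sketch, not the paper's code: the numeric responses are invented, whereas in the study they would come from repeated LLM queries conditioned on each information treatment.

```python
from statistics import pvariance

def expectation_variance(responses):
    """Population variance of an LLM's stated inflation expectations (percent)."""
    return pvariance(responses)

# Invented example responses under two hypothetical information treatments:
with_relevant_docs = [2.9, 3.0, 3.1, 3.0]    # conditioned on economic documents
with_irrelevant_docs = [1.0, 4.5, 2.0, 6.0]  # conditioned on unrelated text

# The paper's pattern: variance is lower with domain-relevant information.
tighter = expectation_variance(with_relevant_docs) < expectation_variance(with_irrelevant_docs)
```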
Is that a Guideline? Addressing Learning in Ethics Guidelines Through a PRISMA-ETHICS informed Scoping Review of Guidelines
Type of Resource: Research Article | Date Recorded: May 10, 2024 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Other
There have been recent calls for new ethics guidelines regarding the use of artificial intelligence in research. How should we go about developing such ethics guidance documents with respect to emerging contexts such as new technologies, and established domains such as research in education? This paper provides a PRISMA-ETHICS informed scoping review of approaches to ethics guideline development, the structures of ethics guidelines, and their audiences and purposes, particularly in the context of education and AI.
Emergent autonomous scientific research capabilities of large language models
Type of Resource: Research Article | Date Recorded: February 9, 2024 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Computer Science
Transformer-based large language models are rapidly advancing in the field of machine learning research, with applications spanning natural language, biology, chemistry, and computer programming. Extreme scaling and reinforcement learning from human feedback have significantly improved the quality of generated text, enabling these models to perform various tasks and reason about their choices. In this paper, we present an Intelligent Agent system that combines multiple large language models for autonomous design, planning, and execution of scientific experiments. We showcase the Agent's scientific research capabilities with three distinct examples, with the most complex being the successful performance of catalyzed cross-coupling reactions. Finally, we discuss the safety implications of such systems and propose measures to prevent their misuse.
Machine Learning as a Tool for Hypothesis Generation
Type of Resource: Research Article | Date Recorded: January 17, 2024 | Open Science: Preprint | Use of LLM: Other | Discipline(s): Economics
While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a procedure that uses machine learning algorithms—and their capacity to notice patterns people might not—to generate novel hypotheses about human behavior. We illustrate the procedure with a concrete application: judge decisions. We begin with a striking fact: up to half of the predictable variation in whom judges jail is explained solely by the pixels in the defendant's mugshot—that is, the predictions from an algorithm built using just facial images. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: They are not explained by factors implied by existing research (demographics, facial features emphasized by previous psychology studies), nor are they already known (even if just tacitly) to people or even experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional dataset (e.g., cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our paper is that hypothesis generation is in and of itself a valuable activity, and we hope this encourages future work in this largely "pre-scientific" stage of science.
Mathematical discoveries from program search with large language models
Type of Resource: Research Article | Date Recorded: December 15, 2023 | Open Science: Open Source | Use of LLM: Data Generation, Data Analysis, Other | Discipline(s): Math
Large Language Models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements [1,2]. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pre-trained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best known results in important problems, pushing the boundary of existing LLM-based approaches [3]. Applying FunSearch to a central problem in extremal combinatorics — the cap set problem — we discover new constructions of large cap sets going beyond the best known ones, both in finite dimensional and asymptotic cases. This represents the first discoveries made for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve upon widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.
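The generate-then-verify pattern this entry describes — an LLM proposing candidates, a deterministic evaluator filtering out confabulated ones — can be shown in miniature. This is a sketch of the pattern only, not DeepMind's implementation: the evolved "program" is abstracted to a single tunable parameter, and `llm_mutate` is a hypothetical stand-in for prompting a pre-trained LLM with the best programs found so far.

```python
import random

random.seed(0)

def evaluate(param: float) -> float:
    # Systematic evaluator: deterministic scorer whose quality peaks at
    # param == 3.0 (a stand-in for, say, measured bin-packing performance).
    return -(param - 3.0) ** 2

def llm_mutate(best: float) -> float:
    # Placeholder for an LLM proposing a modified program given the best one.
    return best + random.uniform(-0.5, 0.5)

def funsearch(iterations: int = 200) -> float:
    """Evolve a small population: mutate the best candidate, keep the
    mutation only if the evaluator confirms it beats the current worst."""
    population = [random.uniform(-10.0, 10.0) for _ in range(5)]
    for _ in range(iterations):
        candidate = llm_mutate(max(population, key=evaluate))
        worst = min(population, key=evaluate)
        # The evaluator, not the LLM, decides what survives:
        if evaluate(candidate) > evaluate(worst):
            population[population.index(worst)] = candidate
    return max(population, key=evaluate)

best_program = funsearch()  # ends near the optimum at 3.0
```

The key design point mirrored here is that proposals are never trusted directly: every candidate passes through the evaluator before entering the population, which is what lets the loop tolerate plausible-but-wrong generations.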