As a starting point, here are initial recommendations for questions to ask when reviewing research in which an LLM was used as part of the scientific workflow.

  • Was an initial context or ‘seed’ used, and if so is it available?
  • Was the prompt history empty when initial prompts were queried?
  • Were multiple prompts created, tested, or used (i.e., prompt engineering)?
  • Were any data files uploaded, and if so are exact copies of those files available?
  • Is the complete history of the prompting available?
  • Are the dates/times of the prompts included with the history?
  • Were completion parameters (e.g., temperature, presence penalty, frequency penalty, max tokens, logit bias) used, and are those provided? [typically available through API only]
  • Did completion parameters vary among prompts, and if so are those provided for each prompt?
  • Were multiple combinations of completion parameters tested?
  • Were quality review checks performed on LLM-generated results?
  • Did the researcher(s) validate the LLM-generated results through experimentation or simulation?
  • Is the code available?
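Several of the questions above (complete prompt history, dates/times, and per-prompt completion parameters) can be addressed by keeping a machine-readable log of every LLM call alongside the research artifacts. A minimal sketch in Python, assuming a hypothetical `log_llm_call` helper and an append-only JSONL file; the function and file name are illustrative, not part of any particular LLM provider's API:

```python
import datetime
import json

def log_llm_call(prompt, params, log_path="llm_prompt_log.jsonl"):
    """Append one prompt and its completion parameters to a JSONL audit log.

    Recording the timestamp and the exact parameter values for each prompt
    makes the prompting history reconstructible by later reviewers.
    """
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt": prompt,
        "completion_parameters": params,  # e.g. temperature, max tokens
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example: log one call with its completion parameters before sending it.
record = log_llm_call(
    "Summarize the attached dataset.",
    {"temperature": 0.2, "presence_penalty": 0.0,
     "frequency_penalty": 0.0, "max_tokens": 512},
)
```

Because each line is a self-contained JSON record, the resulting file can be shared as supplementary material and diffed or queried to answer whether parameters varied among prompts.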