As a starting point, here are initial recommended questions to ask when reviewing research in which an LLM was used as part of the scientific workflow.
- Which language model was fine-tuned (e.g., OpenAI’s GPT-3.5 model)?
- Were multiple language models tested for performance before one was selected?
- What tool(s) were used for fine-tuning the model (e.g., LoRA, PEFT, OpenAI tools)?
- Which data were used for fine-tuning?
- Was a training/testing split used, and if so, in what proportions (e.g., 80/20)?
- Which (if any) evaluation libraries were used to assess the fine-tuned model?
- Did the researcher(s) evaluate the LLM’s performance against other benchmarks or standards?
- Is the code available?
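When assessing the splitting question above, it helps to see what a reproducible split looks like in practice. The following is a minimal sketch in standard-library Python; the 80/20 proportions, the record structure, and the fixed seed are illustrative assumptions, not a recommendation.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle records deterministically and split into train/test sets.

    The 0.2 test fraction mirrors the 80/20 example above; the seed is
    recorded so reviewers can reproduce the exact split.
    """
    rng = random.Random(seed)
    shuffled = records[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Illustrative usage with placeholder records
data = [{"prompt": f"q{i}", "completion": f"a{i}"} for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

A reviewer asking the splitting question is, in effect, asking whether the paper reports the equivalents of `test_fraction` and `seed` so that the split can be reconstructed.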
* Note that at this time there are no standards for setting completion parameters (such as temperature). As standards become available, we will post updates.
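In the absence of standards for completion parameters, one pragmatic step is simply to record whatever settings were used alongside the results. The sketch below shows one way to do that with standard-library Python; the parameter names follow common completion-API conventions (temperature, top_p, etc.), and the model name and values are hypothetical placeholders rather than recommended settings.

```python
import json

# Completion parameters used for a run. The names follow common
# completion-API conventions and the values are placeholders -- the
# point is to record whatever was actually used, not these numbers.
completion_params = {
    "model": "example-model-name",  # hypothetical identifier
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 256,
    "seed": 42,
}

# Write the settings next to the experiment outputs so reviewers can
# check them even though no standard values exist yet.
with open("completion_params.json", "w") as f:
    json.dump(completion_params, f, indent=2)
```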