Fine Tuning

As a starting point, here are initial recommendations for questions to ask when reviewing research in which an LLM was used as part of the scientific workflow.

  • Which language model was fine-tuned (e.g., OpenAI’s GPT-3.5 model)?
  • Which (if any) packages were used (e.g., DSPy, RAGAS, etc.)?
  • Were multiple language models tested for performance before selecting?
  • Which tool(s) were used to fine-tune the model (e.g., LoRA, PEFT, OpenAI tools)?
  • Which data were used for fine tuning?
  • Was a training/testing split used, and if so, in what proportions (e.g., 80/20)?
  • Which (if any) evaluation libraries were used to assess the fine-tuned model?
  • Did the researcher(s) evaluate the LLM’s performance against other benchmarks or standards?
  • Is the code available?
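One of the questions above concerns training/testing splits. A minimal sketch of a reproducible 80/20 split in plain Python (the function name, seed, and example data are illustrative, not from any particular study):

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle records reproducibly, then split into train and test sets."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 100 hypothetical fine-tuning examples split 80/20.
data = [f"example-{i}" for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

Recording the split proportions and the random seed, as this sketch does, is what makes the split auditable by a reviewer.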

* Note that at this time there are no standards for setting completion parameters (such as temperature). As standards become available, we will post updates.
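In the absence of such standards, researchers can at least record the completion parameters they used. A minimal sketch of a request payload in the style of OpenAI's chat completions API (the model ID is a placeholder and the parameter values are illustrative, not recommendations):

```python
# Payload in the style of an OpenAI chat completions request; the model
# ID is hypothetical and the parameter values are for illustration only.
request = {
    "model": "my-fine-tuned-model",   # placeholder fine-tuned model ID
    "messages": [{"role": "user", "content": "Summarize the abstract."}],
    "temperature": 0.2,   # lower values make sampling more deterministic
    "top_p": 1.0,         # nucleus-sampling cutoff
    "max_tokens": 256,    # cap on completion length
}
print(request["temperature"], request["top_p"], request["max_tokens"])
```

Publishing these values alongside the code lets others reproduce the generation settings even before formal reporting standards exist.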

** Fine Tuning vs. Embedding