As a starting point, here are initial recommended questions to ask when reviewing research in which an LLM was used as part of the scientific workflow.
- Which language model was fine-tuned (e.g., OpenAI’s GPT-3.5 model)?
- Were multiple language models tested for performance before one was selected?
- What tool(s) were used for fine-tuning the model (e.g., LoRA, PEFT, OpenAI tools)?
- Which data were used for fine-tuning?
- Was a training/testing split used, and if so, in what proportions (e.g., 80/20)?
- Which (if any) evaluation libraries were used to assess the fine-tuned model?
- Did the researcher(s) evaluate the LLM’s performance against other benchmarks or standards?
- Is the code available?
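When assessing the splitting question above, it helps to see what a reproducible split looks like in practice. The following is a minimal sketch in standard-library Python; the 80/20 proportions, the record structure, and the fixed seed are illustrative assumptions, not a recommendation.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle records deterministically and split into train/test sets.

    The 0.2 test fraction mirrors the 80/20 example above; the seed is
    recorded so reviewers can reproduce the exact split.
    """
    rng = random.Random(seed)
    shuffled = records[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Illustrative usage with placeholder records
data = [{"prompt": f"q{i}", "completion": f"a{i}"} for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

A reviewer asking the splitting question is, in effect, asking whether the paper reports the equivalents of `test_fraction` and `seed` so that the split can be reconstructed.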
* Note that at this time there are no standards for setting completion parameters (such as temperature). As standards become available, we will post updates.
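In the absence of standards for completion parameters, one pragmatic step is simply to record whatever settings were used alongside the results. The sketch below shows one way to do that with standard-library Python; the parameter names follow common completion-API conventions (temperature, top_p, etc.), and the model name and values are hypothetical placeholders rather than recommended settings.

```python
import json

# Completion parameters used for a run. The names follow common
# completion-API conventions and the values are placeholders -- the
# point is to record whatever was actually used, not these numbers.
completion_params = {
    "model": "example-model-name",  # hypothetical identifier
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 256,
    "seed": 42,
}

# Write the settings next to the experiment outputs so reviewers can
# check them even though no standard values exist yet.
with open("completion_params.json", "w") as f:
    json.dump(completion_params, f, indent=2)
```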