As a starting point, here are some initial questions to ask when considering the use of an LLM as part of your scientific research workflow.
- What protocols (guidelines) have been developed to ensure systematic use of LLMs?
- What completion parameters (e.g., temperature, presence penalty, frequency penalty, max tokens, logit bias) will be used?
- Are pre-established decisions related to completion parameters readily available to those prompting the LLM?
- Will embeddings be utilized in the research (e.g., for retrieval-augmented generation, RAG)?
- What size chunks will be used in creating embeddings?
- What size of overlap will be permitted between chunks?
- What tool(s) will be used for similarity matching (i.e., a vector database or index such as FAISS), and will they be described?
- What retrieval tools/techniques will be used (e.g., compression, context, rerank)?
- Will embedding files be publicly available after the research is complete?
- Will LLM agent(s) be used in the research?
- Will LLMs be fine-tuned (e.g., with LoRA)?
- Will fine-tuned models be quantized (e.g., 8-bit or 4-bit)?
- Will any code associated with the use of LLMs be documented?
- Will the code be publicly available?
- Will LLM responses be systematically checked for accuracy, bias, and other limitations?
- What data management systems will be put in place to secure the data (inputs and outputs) of the LLMs?
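
Several of the questions above (completion parameters, chunk size, chunk overlap) come down to fixing a few values in advance and making them available to everyone who prompts the model. The following is a minimal sketch of one way to record those decisions and apply them, not a prescribed implementation. The class names, default values, and the character-based chunking rule are all assumptions made for illustration; a real workflow might chunk by tokens or sentences instead.

```python
from dataclasses import dataclass, asdict, field
import json


@dataclass(frozen=True)
class CompletionParams:
    """Hypothetical record of the completion parameters agreed for a study."""
    temperature: float = 0.0          # illustrative default, not a recommendation
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    max_tokens: int = 512
    logit_bias: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ChunkingParams:
    """Hypothetical record of how documents are split before embedding."""
    chunk_size: int = 200             # characters per chunk (assumed unit)
    overlap: int = 50                 # characters shared by consecutive chunks


def chunk_text(text: str, params: ChunkingParams) -> list[str]:
    """Split text into fixed-size character chunks with a fixed overlap."""
    if params.chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= params.overlap < params.chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = params.chunk_size - params.overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + params.chunk_size])
        # Stop once the current chunk reaches the end of the text, so the
        # last chunk is never fully contained in the one before it.
        if start + params.chunk_size >= len(text):
            break
        start += step
    return chunks


def protocol_record(completion: CompletionParams, chunking: ChunkingParams) -> str:
    """Serialize the agreed settings to JSON so they can be shared and archived."""
    record = {"completion": asdict(completion), "chunking": asdict(chunking)}
    return json.dumps(record, indent=2, sort_keys=True)
```

Calling `protocol_record(CompletionParams(), ChunkingParams())` yields a JSON string that could be stored alongside the research data, and passing the same `ChunkingParams` instance to `chunk_text` keeps the documented and actual chunking settings in agreement.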