
Large Language Model as a Judge


It is amazing what large language models can do with text, and with libraries like LangChain it is trivial to create chatbots. In a few hours you can build a small application that answers questions from your internal documentation. I did it with my docs, and it works pretty well even though I am no expert in LLMs or machine learning. It is robust: without any tuning you get something working.
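To give an idea of how little code this takes, here is a minimal sketch of such a retrieval chatbot. It is not my exact code: the folder path and model names are placeholders, and the imports follow the pre-1.0 LangChain layout, so adjust them to your version.

    # Minimal retrieval-augmented chatbot over internal docs (illustrative sketch).
    from langchain_community.document_loaders import DirectoryLoader, TextLoader
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import RetrievalQA

    docs = DirectoryLoader("internal_docs/", glob="**/*.md", loader_cls=TextLoader).load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo"),
        retriever=store.as_retriever(search_kwargs={"k": 4}),  # k = context documents per answer
    )

    print(qa.invoke({"query": "What is the XYZ framework?"})["result"])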

But suddenly you start to see something wrong. For instance, I stated in the prompt that I wanted to know the source of every answer. Nothing bad in that, no? But to a simple question like “what is the XYZ framework?” the creative LLM machine started replying “The XYZ framework does this and that and has been created by Mohammed…”. Ohhh, the word SOURCE also means author, and each of my pages has an AUTHOR. I don’t want the tool to start finger-pointing people. The same happened with an incident post-mortem page: “According to Sebastien, you should check the metric abc-def to verify if the system is slowing down…”

Given the context the answers were correct, but somebody may not be happy with what the machine can tell, and I don’t want to deal with personally identifiable information at all. Mine is an internal tool; if it were a public site, this would have been a real problem.
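A first mitigation is at the prompt level: pin down what “source” means and forbid naming people. The wording below is illustrative, not the exact prompt I use.

    # Illustrative system prompt: "source" restricted to pages, people excluded.
    SYSTEM_PROMPT = """Answer using only the provided context.
    After the answer, cite the source as the title and URL of the context
    pages you used. Never mention document authors or any person's name,
    even if names appear in the context."""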

Now the questions are: how can I be sure I have solved this issue? How can I be sure I won’t reintroduce it later? In the end, the output of the chatbot is random and in natural language, so it is not a trivial problem where you can say that for question X the answer is exactly Y.

One idea would be to create a panel of questions and prompts to validate the answers: rather than evaluating every single answer myself, I would trust an LLM to judge the chatbot’s output. Of course I do not want to spend my time checking hundreds of questions every time I change a parameter.
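As a sketch of what such a panel could look like for my specific issue: run each question through the chatbot and let a second model check that no answer names a person. This assumes the qa chain from the sketch above and the official openai client; the questions are just examples.

    # Regression panel: fail if any answer mentions a person by name.
    from openai import OpenAI

    client = OpenAI()
    PANEL = [
        "What is the XYZ framework?",
        "How do I check if the system is slowing down?",
    ]

    def names_a_person(text: str) -> bool:
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": "Does the following text mention any person "
                                  "by name? Reply YES or NO.\n\n" + text}],
        )
        return reply.choices[0].message.content.strip().upper().startswith("YES")

    for question in PANEL:
        answer = qa.invoke({"query": question})["result"]
        assert not names_a_person(answer), f"PII regression on: {question}"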

Actually, evaluating a chatbot is much more complex: I started looking for articles about it and found the Ragas framework and this paper:

Lianmin Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track.

The authors explored the idea of using an LLM as a judge to compare the answers of different LLMs. In their case they compare different models, but I think this could also be useful to evaluate the answers of different versions of your own chatbot.

In their case they have a set of questions and ask GPT-4 to decide whether the answer of system X is better than that of system Y. In my case I could change some parameter of my chatbot, for instance the number of context documents used, and decide which setting is best.
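For example, the two “systems” to compare could be the same chain built with different retriever settings (reusing store, RetrievalQA and ChatOpenAI from the first sketch):

    # Two candidate configurations differing only in the number of context documents.
    qa_small = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo"),
        retriever=store.as_retriever(search_kwargs={"k": 2}),
    )
    qa_large = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo"),
        retriever=store.as_retriever(search_kwargs={"k": 6}),
    )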

In their article they describe various approaches:

  • Pairwise comparison. An LLM judge is presented with a question and two answers, and is tasked to determine which one is better or declare a tie (a code sketch follows this list).
  • Single answer grading. The LLM judge is asked to directly assign a score to an answer.
  • Reference-guided grading. In certain cases, it may be beneficial to provide a reference solution, if applicable.
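A pairwise judge can be a single prompt. The sketch below follows the spirit of the paper’s MT-Bench judge prompt, including the [[A]]/[[B]]/[[C]] verdict tokens they use, but the wording is my paraphrase; it reuses the client from the panel sketch above.

    # Pairwise comparison judge (sketch, loosely based on the paper's setup).
    def judge(question: str, answer_a: str, answer_b: str) -> str:
        prompt = (
            "You are an impartial judge. Compare the two assistant answers to the "
            "user question below. Do not let answer length or position influence "
            "you. Reply with [[A]] if assistant A is better, [[B]] if assistant B "
            "is better, or [[C]] for a tie.\n\n"
            f"Question: {question}\n\n"
            f"Assistant A: {answer_a}\n\n"
            f"Assistant B: {answer_b}"
        )
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        for token in ("[[A]]", "[[B]]", "[[C]]"):
            if token in reply:
                return token
        return "[[C]]"  # treat an unparseable verdict as a tie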

I do not much like the second and third approaches: asking for a score can result in high variability in the answers, and providing reference solutions is hard because you need to prepare those solutions.

Concerning pairwise comparison, the authors highlight some nontrivial issues:

Position bias: the LLM can prefer the first answer over the second simply because of its position. So you should accept that X is better than Y only if swapping the positions of the answers leads to a consistent verdict.
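This matches the consistency check described in the paper. A small wrapper around the judge function sketched above could look like this:

    # Position-bias guard: accept a winner only if both orderings agree.
    def judge_consistent(question: str, ans_x: str, ans_y: str) -> str:
        first = judge(question, ans_x, ans_y)   # X shown as assistant A
        second = judge(question, ans_y, ans_x)  # swapped: Y shown as assistant A
        if first == "[[A]]" and second == "[[B]]":
            return "X"
        if first == "[[B]]" and second == "[[A]]":
            return "Y"
        return "tie"  # inconsistent or tied verdicts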

Verbosity bias: the LLM could prefer a verbose answer… the more content the better! This seems difficult to address.

Self-enhancement bias: for instance, GPT-4 favors its own answers over those of other LLMs in about 10% of cases. This won’t affect my chatbot much, as I won’t switch from one model to another: I can only choose between GPT-3.5 and GPT-4, and it is evident which one is the best.

Limited math and reasoning capabilities: LLMs are still limited in these domains, so the judge will not be good enough on such questions.

In the end, GPT-4 seems a good enough judge, so I plan to use pairwise comparison in the future as a method to assess the quality of each new chatbot release.

Written by Giovanni

May 5, 2024 at 10:12 pm

Posted in Varie

