Posts Tagged ‘llm’
Large-language-model as a Judge

It is amazing what large language models can do with text, and with libraries like LangChain it is trivial to create chatbots. You can build a small application that answers questions using your internal documentation in a few hours. I did it with my docs, and it works pretty well even though I am not an expert in LLMs and machine learning. It is robust: without any tuning you get something that works.
But then you start to see something wrong. For instance, I stated in the prompt that I wanted to know the source of every answer. Nothing bad in that, no? To a simple question like “what is the XYZ framework” the creative LLM machine started replying “The XYZ framework does this and that and was created by Mohammed…”. Ohhh, the word SOURCE also means author, and each of my pages lists its AUTHOR. I don’t want the tool to start finger-pointing at people. The same happened with an incident post-mortem page: “According to Sebastien, you should check the metric abc-def to verify if the system is slowing down…”
Given the context the answers were correct, but somebody may not be happy with what the machine can tell, and I do not want to deal with personally identifiable information at all. Mine is an internal tool; if it were a public site it would have been a real problem.
Now the questions are: how can I be sure I have solved this issue? How can I be sure I won’t introduce it again? In the end the output of the chatbot is random and in natural language, so it is not a trivial problem where you can say that for question X the answer is exactly Y.
One idea would be to create a panel of questions and prompts to validate the answers: I would not evaluate each single answer myself but trust an LLM to judge its own output. This is because, of course, I do not want to spend my time checking hundreds of questions every time I change a parameter.
Actually, evaluating a chatbot is much more complex: I started looking for articles about it and I found the Ragas framework and this paper:
Lianmin Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Track on Datasets and Benchmarks.
The authors explore the idea of using an LLM as a judge, comparing the answers of different LLMs. In their case they compare different models, but I think this could also be useful to evaluate the answers of different versions of your own chatbot.
In the paper, they take a set of questions and ask ChatGPT 4 to decide whether the answer of system X is better than that of system Y. In my case I could change some parameter of my chatbot, for instance the number of context documents used, and decide which setting is best.
In their paper they describe various approaches:
- Pairwise comparison. An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
- Single answer grading. The LLM judge is asked to directly assign a score to an answer.
- Reference-guided grading. In certain cases, it may be beneficial to provide a reference solution if applicable.
I do not like the second and third approaches much: asking for a score can result in high variability in the answers, and providing reference solutions is hard because you need to prepare those solutions first.
Concerning pairwise comparison, the authors highlight some non-trivial issues:
- Position bias. The LLM can prefer the first answer over the second just because of its position. So you should accept that X is better than Y only if swapping the positions of the answers leads to a consistent result.
- Verbosity bias. The LLM could prefer a more verbose answer… the more content the better! This seems difficult to address.
- Self-enhancement bias. For instance, ChatGPT 4 favors itself over other LLMs in 10% of the cases. This won’t affect my chatbot much, as I won’t switch from one model to another: I only have ChatGPT 3.5 or 4 to choose from and it is evident which one is the best.
- Limited math and reasoning capabilities. LLMs are still limited in these domains, so the judge will not be good enough in such cases.
In the end ChatGPT 4 seems a good enough judge, so I plan to use pairwise comparison in the future as a method to assess the quality of a new chatbot release.
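To make the idea concrete, here is a minimal sketch of how such a pairwise judge could look with the OpenAI Python client; the prompt wording and the judge_pairwise helper are my own illustrative choices, not the paper’s exact setup. It asks for a verdict twice, swapping the answer positions, and declares a winner only when the two runs agree:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A", "B" or "tie" depending on which answer is better.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which answer is better ("A", "B" or "tie")."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  answer_a=answer_a,
                                                  answer_b=answer_b)}],
    )
    return response.choices[0].message.content.strip()

def judge_pairwise(question: str, answer_x: str, answer_y: str) -> str:
    """Judge twice with swapped positions to mitigate position bias."""
    first = ask_judge(question, answer_x, answer_y)   # X shown as answer A
    second = ask_judge(question, answer_y, answer_x)  # Y shown as answer A
    if first == "A" and second == "B":
        return "X wins"
    if first == "B" and second == "A":
        return "Y wins"
    return "tie / inconsistent"
```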
FAISS, Chroma and Pinecone in action
Developing RAG (retrieval-augmented generation) applications requires you to process your documents and store them, along with embedding vectors, in a sort of database. You need to store the documents because you want to retrieve the exact text that will help the LLM craft an answer. Nowadays you may not want to do plain textual search on your documents: leveraging an LLM model you can map a text to a vector of real numbers that represents the document’s position in a sort of Euclidean brain space. In this space “the bank of the river” and “the bank downtown opens at 09:00” will be mapped to different places, and it will be clear that the two phrases speak about different things even though the word “bank” appears in both. One such embedding model is text-embedding-ada-002.
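As a rough sketch of what this means in practice (assuming the OpenAI Python client and an OPENAI_API_KEY in the environment), the two “bank” sentences could be embedded and compared like this:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sentences = ["the bank of the river",
             "the bank downtown opens at 09:00"]

# Map each sentence to a vector of real numbers (1536 dimensions for ada-002).
response = client.embeddings.create(model="text-embedding-ada-002", input=sentences)
vectors = [np.array(item.embedding) for item in response.data]

# Cosine similarity: close to 1.0 for related texts, lower for unrelated ones.
similarity = np.dot(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(f"similarity: {similarity:.3f}")
```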
When the user asks a question, the question is mapped to a vector of real numbers and you have to search for the documents most similar to that vector. This requires some sort of specialized database. Looking at the LangChain documentation I found that there are at least three different databases to explore: FAISS, Chroma and Pinecone.
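For instance, with the raw FAISS API (which LangChain wraps for you) the similarity search itself is only a few lines; this sketch uses random vectors in place of real embeddings just to show the calls involved:

```python
import faiss
import numpy as np

# doc_vectors: one embedding per document chunk; question_vector: the query embedding.
# Random here only to make the sketch self-contained.
dimension = 1536
doc_vectors = np.random.rand(100, dimension).astype("float32")
question_vector = np.random.rand(1, dimension).astype("float32")

index = faiss.IndexFlatL2(dimension)   # exact nearest-neighbour search with L2 distance
index.add(doc_vectors)                 # store the document vectors

distances, doc_ids = index.search(question_vector, 10)  # 10 most similar documents
print(doc_ids[0])  # indices of the chunks to put into the LLM prompt
```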
I searched for a comparison between them: I did not find one, but I came across this interesting video, Vector Databases with FAISS, Chromadb, and Pinecone: A comprehensive guide. It will not teach you about the differences between the databases, but it will show you how to use their APIs and tell you about the difficulties you may have installing them. Happy watching!
RAG Applications

As a non-native English speaker, it is always funny to try to understand the meaning of a new acronym; this time it is the turn of RAG applications. I knew about “rag time” music, like this Ragtime Sea Shanty, and searching the dictionary I saw that a rag is a piece of old cloth, especially one torn from a larger piece, used typically for cleaning things. Funny to see that kind of picture associated with Retrieval-Augmented Generation (RAG).
What does Retrieval-Augmented Generation actually mean? It is a technique to generate content using AI tools like ChatGPT: instead of just asking the LLM “write a documentation page on how to shut down a nuclear reactor”, you propose a prompt like “write a documentation page on how to shut down a nuclear reactor rephrasing the content of these 10 pages: page 1. The supermega control panel has 5 buttons…”.
What is the advantage of including these examples in the prompt? The large language model (LLM) may not know anything about your domain for various reasons: maybe you want it to write about something too recent, which was not part of its training data. Another reason may be that you work in a very specific domain, where the documentation is not public and therefore the LLM has no knowledge of it.
In my case my team has a private web site where we host our internal documentation: there we describe APIs, troubleshooting guides, internal procedures, meeting minutes, etc. The LLM does not know anything about our project, so it is impossible for it to provide detailed answers about it.
But how can you retrieve the right 10 pages to embed in the LLM prompt? If you are familiar with search engines like Solr or Elasticsearch you may have many ideas right now: just export the HTML from the site, preprocess it with BeautifulSoup to convert it into plain text (see the sketch below) and then index it…
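A minimal sketch of that preprocessing step could look like this (the site_export and text_export folder names are just placeholders for wherever you keep the exported pages):

```python
from pathlib import Path
from bs4 import BeautifulSoup

Path("text_export").mkdir(exist_ok=True)

# Convert every exported HTML page into plain text, ready to be indexed.
for html_file in Path("site_export").glob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    Path("text_export", html_file.stem + ".txt").write_text(text, encoding="utf-8")
```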
Today you can do better than that. You no longer have just LLMs that answer questions, you also have models that compute the embeddings of a text, like these from OpenAI. You take an HTML page from your site, you split it into multiple pieces, you ask for the embeddings of each piece and you store them in a specialized engine like ChromaDB. When a question is asked, you compute the question embeddings, ask ChromaDB for the 10/20 most similar contents it knows, and you format a prompt for ChatGPT that contains those snippets.
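Putting the pieces together, a toy version of this pipeline with ChromaDB and the OpenAI client could look roughly like this; the snippets and the question are invented, and a real application would add the page splitting and a proper embedding model as discussed above:

```python
import chromadb
from openai import OpenAI

client = OpenAI()               # assumes OPENAI_API_KEY is set in the environment
chroma = chromadb.Client()      # in-memory instance; use a persistent client for real data
collection = chroma.create_collection("internal_docs")

# Index: store each text snippet with an id (Chroma's default embedding
# function is used here; you could plug in text-embedding-ada-002 instead).
snippets = ["To get a security token call the /auth endpoint…",
            "The ingest API has two mandatory parameters: source and format…"]
collection.add(documents=snippets, ids=[f"doc-{i}" for i in range(len(snippets))])

# Query: retrieve the most similar snippets and build the ChatGPT prompt around them.
question = "What are the mandatory parameters of the ingest API?"
results = collection.query(query_texts=[question], n_results=2)
context = "\n\n".join(results["documents"][0])

prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```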
It is amazing: we started chatting with our documentation! We can ask how to get a security token, what the mandatory parameters of a specific API are, in which storage accounts we are putting a certain kind of data…
Why do you need to split your contents into snippets? Simple: an LLM prompt has a size limit, you cannot just write an encyclopedia in the prompt every time and expect an answer… ChatGPT 3.5 accepts fewer words but is very fast; when trying the 4 we had to wait a long time before seeing the full response. You also need to consider the price of using the 4 instead of the 3.5.
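The splitting itself can be delegated to a utility; for instance, with LangChain’s text splitter it could look like this sketch (the chunk sizes are arbitrary example values, and the sample text is a stand-in for one of the exported pages):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# long_page_text would be one of the plain-text pages produced earlier.
long_page_text = "The supermega control panel has 5 buttons… " * 200

# Split the page into overlapping chunks small enough to fit in the prompt.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(long_page_text)
print(f"{len(chunks)} chunks of at most 1000 characters each")
```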
Will you trust the LLM company not to collect your private material and use it later as they want? You had better read the user agreement! At least with this approach you do not send the whole material at once to fine-tune a new LLM model.
Some frameworks can help you build this kind of application: Microsoft’s Semantic Kernel and LangChain. I tried the first one, but personally I am not very happy with its Python support; maybe it is better to start with LangChain next time.