RAG Applications

As a non-native English speaker, it is always fun to try to understand the meaning of a new acronym; this time it is the turn of RAG applications. I knew about "ragtime" music, like this Ragtime Sea Shanty, and looking in the dictionary I saw that a rag is a piece of old cloth, especially one torn from a larger piece, typically used for cleaning things. Funny to see that kind of image associated with Retrieval-Augmented Generation (RAG).
What does Retrieval-Augmented Generation actually mean? It is a technique for generating content with AI tools like ChatGPT: instead of just asking the LLM "write a documentation page on how to shut down a nuclear reactor", you build a prompt like "write a documentation page on how to shut down a nuclear reactor, rephrasing the content of these 10 pages: page 1. The supermega control panel has 5 buttons…".
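Just to make that concrete, here is a minimal Python sketch of how such an augmented prompt could be assembled; the helper name and the texts are placeholders of mine, not part of any real API:

```python
# Hypothetical sketch of prompt assembly; build_prompt and the
# placeholder texts are illustrative, not a real API.
def build_prompt(question: str, snippets: list[str]) -> str:
    pages = "\n".join(
        f"page {i}. {snippet}" for i, snippet in enumerate(snippets, start=1)
    )
    return (
        f"{question}, rephrasing the content of these "
        f"{len(snippets)} pages:\n{pages}"
    )


print(build_prompt(
    "write a documentation page on how to shut down a nuclear reactor",
    ["The supermega control panel has 5 buttons…"],
))
```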
What is the advantage of including these examples in the prompt? The large language model (LLM) may know nothing about your domain, for various reasons: maybe you want it to write about something too recent to be part of its training data. Another reason may be that you work in a very specific domain where the documentation is not public, so the LLM has no knowledge of it.
In my case, my team has a private website where we host our internal documentation: we describe APIs, troubleshooting guides, internal procedures, meeting minutes, and so on. The LLM knows nothing about our project, so it cannot possibly provide detailed answers about it.
But how can you retrieve the right 10 pages to embed in the LLM prompt? If you are familiar with search engines like Solr or Elasticsearch, you may have many ideas right now: just export the HTML from the site, preprocess it with BeautifulSoup to convert it into plain text, and then index it…
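For instance, that preprocessing step could look roughly like this (a sketch, assuming the pages are exported as local .html files; the file name is a placeholder):

```python
# Sketch: flatten one exported HTML page to plain text with BeautifulSoup.
from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Drop markup that carries no prose, then flatten the rest to plain text.
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
```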
Today you can do better than that. You no longer have just LLMs that answer questions; you also have models that compute the embeddings of a text, like these from OpenAI. You take an HTML page from your site, split it into multiple pieces, request the embeddings of each piece, and store them in a specialized engine like ChromaDB. When a question is asked, you compute the question's embeddings and ask ChromaDB for the 10-20 most similar snippets it knows, then you format a prompt for ChatGPT that contains those snippets.
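A minimal sketch of that index-and-retrieve loop, assuming ChromaDB with its bundled OpenAI embedding helper; the collection name, ids, documents and the API key are all placeholders:

```python
import chromadb
from chromadb.utils import embedding_functions

# Embedding function backed by OpenAI (placeholder key and model choice).
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",  # placeholder
    model_name="text-embedding-ada-002",
)

client = chromadb.Client()
collection = client.create_collection(
    "internal-docs", embedding_function=openai_ef
)

# Indexing: store the snippets obtained by splitting each HTML page.
collection.add(
    ids=["tokens-0", "api-params-0"],
    documents=[
        "To get a security token, call the auth endpoint…",
        "The mandatory parameters of the search API are…",
    ],
)

# Query time: embed the question and fetch the most similar snippets
# (use 10-20 results in practice; 2 here because the toy index is tiny).
results = collection.query(
    query_texts=["How do I get a security token?"],
    n_results=2,
)
snippets = results["documents"][0]  # ready to be pasted into the prompt
```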
It is amazing: we started chatting with our documentation! We can ask how to get a security token, what the mandatory parameters of a specific API are, on which storage accounts we are putting a certain kind of data…
Why do you need to split your content into snippets? Simple: an LLM prompt has a size limit, you cannot just send an encyclopedia as a prompt every time and expect an answer… GPT-3.5 accepts fewer tokens but is very fast; when we tried GPT-4 we had to wait a long time before seeing the full response. You also need to consider the price of GPT-4 compared to 3.5.
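How you split is up to you; a naive approach, just to illustrate the idea, is fixed-size word windows with a small overlap (the sizes below are arbitrary choices, not tuned values):

```python
# Naive splitter sketch: cut a page's text into overlapping word windows
# so each snippet fits comfortably in the prompt.
def split_text(text: str, chunk_words: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_words - overlap
    return [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), step)
    ]
```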
Will you trust the LLM vendor not to collect your private material and use it later as they please? You had better read the user agreement! At least you are not sending the whole corpus at once to fine-tune a new LLM.
Some frameworks can help you build this kind of application: Microsoft's Semantic Kernel and LangChain. I tried the first one, but personally I am not very happy with its Python support; maybe it is better to start with LangChain next time.