LLMs can only answer questions reliably when the relevant information was part of their training data or is supplied in the prompt. Many use cases, such as a Q&A application built on users' bespoke content, would otherwise require training or fine-tuning an AI model on that content. However, training a custom model is not always an option due to cost, privacy constraints, and complexity. The solution? We can use Retrieval Augmented Generation (RAG) to achieve the goal without breaking the bank.

For the AI model to answer questions based on your exclusive context and knowledge, you first find (retrieve) the relevant documents - usually unstructured data - and then feed them, together with the question, to the model in a single prompt (often zero-shot, with no additional training). The model can then generate a response, be it answering a question, summarising, or making recommendations, to name a few.
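As a minimal sketch of that single combined prompt, here is one way to merge retrieved documents with the user's question (the template wording and function name are illustrative assumptions, not a fixed format):

```python
def build_prompt(context_docs: list[str], question: str) -> str:
    # Join the retrieved documents into one context block, then append the question.
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example usage with a made-up document snippet.
print(build_prompt(["Acme's return window is 30 days."], "How long can I return an item?"))
```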

The first step of RAG is to prepare the documents containing the appropriate context and knowledge. The standard approach is to chunk the data into small pieces and run them through a Text Embedding Model (TEM), which returns vectors that approximate the semantic similarity between the chunks. Each vector is essentially a compressed representation of a chunk's semantic content. Given the speed benefits of vector search, the price of losing some semantic nuance is often worth paying. Some models and platforms can use connectors to link to your private data sources, such as internal documents, databases or the Internet.
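A minimal sketch of this preparation step, assuming the sentence-transformers package and a simple fixed-size character chunking strategy (both are illustrative choices, not requirements of RAG):

```python
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split a document into overlapping, fixed-size character chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Load a small text embedding model (the model name is an illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["... your bespoke content goes here ..."]  # placeholder corpus
chunks = [piece for doc in documents for piece in chunk_text(doc)]

# Each chunk becomes a dense vector; together they form the vector index.
embeddings = model.encode(chunks)  # shape: (num_chunks, embedding_dim)
```

In production you would typically persist these vectors in a dedicated vector database rather than keeping them in memory.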

When a question is received, your application looks up the vector index to find the closest (most relevant) chunks. Then, it constructs a single prompt combining the retrieved document chunks, the user input, and instructions (the system message) for the AI model. This combined prompt is the final input sent to the LLM to generate the response.
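Continuing the earlier sketches (reusing the hypothetical `model`, `chunks`, `embeddings`, and `build_prompt` names, and assuming cosine similarity as the closeness measure), the lookup and prompt assembly might look like this:

```python
import numpy as np

def top_k_chunks(question: str, k: int = 3) -> list[str]:
    # Embed the question and rank the stored chunks by cosine similarity.
    q = model.encode([question])[0]
    scores = (embeddings @ q) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Prepend a system message to the combined context-plus-question prompt.
system_message = "You are a helpful assistant. Answer only from the provided context."

def final_prompt(question: str) -> str:
    return system_message + "\n\n" + build_prompt(top_k_chunks(question), question)
```

The resulting string (or an equivalent list of chat messages) is what gets sent to the LLM to generate the answer.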
