
Mastering RAG: Advanced Chunking Techniques for LLM Applications
Feb 23, 2024
What is chunking?
Chunking involves breaking down texts into smaller, manageable pieces called "chunks." Each chunk becomes a unit of information that is vectorized and stored in a database, fundamentally shaping the efficiency and effectiveness of natural language processing tasks.
Here is an example of how the character splitter splits the above paragraph into chunks using Chunkviz.

Impact of chunking
Chunking plays a central role in several aspects of a RAG system, influencing not only retrieval quality but also storage cost, latency, and the quality of the generated response. Let's look at each of these aspects in more detail.
Retrieval quality
The primary objective of chunking is to enhance the retrieval quality of information from vector databases. By defining the unit of information that is stored, chunking allows for retrieval of the most relevant information needed for the task.
Vector database cost
Efficient chunking techniques help optimize storage by balancing granularity. Storage cost grows roughly linearly with the number of chunks, so chunks should be as large as practical to keep the chunk count, and therefore the cost, low.
Vector database query latency
Maintaining low latency is essential for real-time applications. Minimizing the number of chunks reduces latency.
LLM latency and cost
The mind-blowing capabilities of LLMs come at a considerable price. Improved context from larger chunk sizes increases latency and serving costs.
LLM hallucinations
While adding more context may seem better, excessive context can lead to hallucinations in LLMs. Choosing the right chunking size for the job plays a large role in determining the quality of generated content. Balancing contextual richness with retrieval precision is essential to prevent hallucinatory outputs and ensure a coherent and accurate generation process.
Factors influencing chunking
We understand the importance of taking chunking seriously, but what factors influence it? A better understanding of these parameters will enable us to select an appropriate strategy.
Text structure
The text structure, whether it's a sentence, paragraph, code, table, or transcript, significantly impacts the chunk size. Understanding how structure relates to the type of content will help influence chunking strategy.
Embedding model
The capabilities and limitations of the embedding model play a crucial role in defining chunk size. Factors such as the model's context input length and its ability to maintain high-quality embeddings guide the optimal chunking strategy.
LLM context length
LLMs have finite context windows. Chunk size directly affects how much context can be fed into the LLM: with large chunks, you are forced to keep the retrieval top-k as low as possible to stay within the context limit.
Type of questions
The questions that users will be asking help determine the chunking techniques best suited for your use case. Specific factual questions, for instance, may require a different chunking approach than complex questions that require synthesizing information from multiple chunks.
Types of chunking
As you can see, selecting the right chunk size involves a delicate balance of multiple factors. There is no one-size-fits-all approach, which is why it's important to find a chunking technique tailored to your RAG application's specific needs.
Let's explore common chunking techniques to help AI builders optimize their RAG performance.

Text splitter
Let's first understand the base class used by all Langchain splitters. The _merge_splits method of the TextSplitter class is responsible for combining smaller pieces of text into medium-sized chunks. It takes a sequence of text splits and a separator, and then iteratively merges these splits into chunks, ensuring that the combined size of the chunks is within specified limits.
The method uses chunk_size and chunk_overlap to determine the maximum size of the resulting chunks and their allowed overlap. It also considers factors such as the length of the separator and whether to strip whitespace from the chunks.
The logic maintains a running total of the length of the chunks and the separator. As splits are added to the current chunk, the method checks if adding a new split would exceed the specified chunk size. If so, it creates a new chunk, considering the chunk overlap, and removes splits from the beginning to meet size constraints.
This process continues until all splits are processed, resulting in a list of merged chunks. The method ensures that the chunks are within the specified size limits and handles edge cases such as chunks longer than the specified size by issuing a warning.
Source: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/text_splitter.py#L99
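To make the flow concrete, here is a simplified, self-contained sketch of that merging loop. It illustrates the logic described above; it is not the actual LangChain implementation (see the source link for that), and the function name and defaults are our own.

```python
from typing import Iterable, List


def merge_splits(splits: Iterable[str], separator: str,
                 chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Greedily merge small splits into chunks of at most ~chunk_size characters,
    carrying roughly chunk_overlap characters of trailing context into the next chunk."""
    chunks: List[str] = []
    current: List[str] = []   # splits accumulated for the chunk being built
    total = 0                 # running length of `current`, including separators

    for split in splits:
        sep_len = len(separator) if current else 0
        # Would adding this split push the current chunk past the limit?
        if current and total + sep_len + len(split) > chunk_size:
            chunks.append(separator.join(current))
            # Drop splits from the front until we are back under the overlap budget.
            while current and total > chunk_overlap:
                total -= len(current[0]) + (len(separator) if len(current) > 1 else 0)
                current.pop(0)
        current.append(split)
        total += len(split) + (len(separator) if len(current) > 1 else 0)

    if current:
        chunks.append(separator.join(current))
    return chunks
```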
Character splitter
Langchain’s CharacterTextSplitter class is responsible for breaking down a given text into smaller chunks. It uses a separator such as "\n" to identify points where the text should be split.
The method first splits the text using the specified separator and then merges the resulting splits into a list of chunks. The size of these chunks is determined by parameters like chunk_size and chunk_overlap defined in the parent class TextSplitter.
Source: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/text_splitter.py#L289
Let's analyze its behavior with an example text. An empty-string separator ("") treats every character as a split point; the splitter then merges those single-character splits back into chunks according to the chunk size and chunk overlap.
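Here is a minimal usage sketch (import paths may differ slightly depending on your LangChain version, and the chunk sizes are illustrative):

```python
# pip install langchain   (newer releases ship the splitters in langchain-text-splitters)
from langchain.text_splitter import CharacterTextSplitter

text = (
    "Chunking involves breaking down texts into smaller, manageable pieces "
    "called 'chunks.' Each chunk becomes a unit of information that is "
    "vectorized and stored in a database."
)

# An empty-string separator makes every character a split point; the splits
# are then merged back into ~35-character chunks with 5 characters of overlap.
splitter = CharacterTextSplitter(separator="", chunk_size=35, chunk_overlap=5)

for chunk in splitter.split_text(text):
    print(repr(chunk))
```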
Recursive Character Splitter
Langchain’s RecursiveCharacterTextSplitter class is designed to break down a given text into smaller chunks by recursively attempting to split it using different separators. This class is particularly useful when a single separator may not be sufficient to identify the desired chunks.
The method starts by trying to split the text using a list of potential separators specified in the _separators attribute. It iteratively checks each separator to find the one that works for the given text. If a separator is found, the text is split, and the process is repeated recursively on the resulting chunks until the chunks are of a manageable size.
The separators are listed in descending order of preference, and the method attempts to split the text using the most specific ones first. For example, in the context of the Python language, it tries to split along class definitions ("\nclass "), function definitions ("\ndef "), and other common patterns. If a separator is found, it proceeds to split the text recursively.
The resulting chunks are then merged and returned as a list. The size of the chunks is determined by parameters like chunk_size and chunk_overlap defined in the parent class TextSplitter. This approach allows for a more flexible and adaptive way of breaking down a text into meaningful sections.
Source: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/text_splitter.py#L851
In this example, the text is initially divided based on "\n\n" and subsequently on "" to adhere to the specified chunk size and overlap. The separator "\n\n" is removed from the text.
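To reproduce this kind of behavior yourself, here is a minimal usage sketch (again, import paths may vary across LangChain versions):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Chunking involves breaking down texts into smaller, manageable pieces.\n\n"
    "Each chunk becomes a unit of information that is vectorized and stored "
    "in a database."
)

# Default separators are ["\n\n", "\n", " ", ""]; the splitter falls back to the
# next separator only when the current one still yields oversized pieces.
splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=10)

for chunk in splitter.split_text(text):
    print(repr(chunk))
```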
You can also employ the above method to chunk code from various languages by utilizing well-defined separators specific to each language.
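For example, LangChain exposes language-aware separator presets through RecursiveCharacterTextSplitter.from_language. A sketch for Python source code (the chunk size here is illustrative):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_code = '''
class Greeter:
    def __init__(self, name):
        self.name = name

def greet(name):
    return f"Hello, {name}!"
'''

# from_language preloads Python-specific separators such as "\nclass " and "\ndef ".
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=80, chunk_overlap=0
)

for chunk in code_splitter.split_text(python_code):
    print(repr(chunk))
```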
Sentence splitter
Character splitting poses an issue as it tends to cut sentences midway. Despite attempts to address this using chunk size and overlap, sentences can still be cut off prematurely. Let's explore a novel approach that considers sentence boundaries instead.
The SpacySentenceTokenizer takes a piece of text and divides it into smaller chunks, with each chunk containing a certain number of sentences. It uses the Spacy library to analyze the input text and identify individual sentences.
The method allows you to control the size of the chunks by specifying the stride and overlap parameters. The stride determines how many sentences are skipped between consecutive chunks, and the overlap determines how many sentences from the previous chunk are included in the next one.
The example below shows how a text with pronouns like "they" requires the context of the previous sentence to make sense. Our brute-force overlap approach helps here, but it is also redundant in places and leads to longer chunks 🙁
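Here is a minimal sketch of such a sentence tokenizer, assuming spaCy and its en_core_web_sm model are installed. The class name and the stride/overlap parameters mirror the description above, but the implementation details are our own illustration:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy


class SpacySentenceTokenizer:
    def __init__(self, model: str = "en_core_web_sm"):
        self.nlp = spacy.load(model)

    def split_text(self, text: str, stride: int = 2, overlap: int = 1) -> list[str]:
        """Group sentences into chunks: step `stride` sentences forward each time,
        pulling `overlap` trailing sentences from the previous chunk for context."""
        sentences = [s.text.strip() for s in self.nlp(text).sents]
        chunks = []
        for i in range(0, len(sentences), stride):
            start = max(0, i - overlap)
            chunks.append(" ".join(sentences[start : i + stride]))
        return chunks


tokenizer = SpacySentenceTokenizer()
text = (
    "My dog loves the park. My robot vacuum cleans every morning. "
    "They make the house feel alive. Chunking should keep that context together."
)
for chunk in tokenizer.split_text(text, stride=2, overlap=1):
    print(repr(chunk))
```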
Semantic splitting
This brings us to my favorite splitting method!
The SimilarSentenceSplitter takes a piece of text and divides it into groups of sentences based on their similarity. It utilizes a similarity model to measure how similar each sentence is to its neighboring sentences. The method uses a sentence splitter to break the input text into individual sentences.
The goal is to create groups of sentences where each group contains related sentences, according to the specified similarity model. The method starts with the first sentence in the first group and then iterates through the remaining sentences. It decides whether to add a sentence to the current group based on its similarity to the previous sentence.
The group_max_sentences parameter controls the maximum number of sentences allowed in each group. If a group reaches this limit, a new group is started. Additionally, a new group is initiated if the similarity between consecutive sentences falls below a specified similarity_threshold.
In simpler terms, this method organizes a text into clusters of sentences, where sentences within each cluster are considered similar to each other. It's useful for identifying coherent and related chunks of information within a larger body of text.
Credits: https://github.com/agamm/semantic-split
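Here is a usage sketch based on the linked semantic-split repo. The class names and setup follow that project's README (not LangChain) and may change, so treat them as assumptions; install the package per the repo's instructions:

```python
from semantic_split import (
    SentenceTransformersSimilarity,
    SimilarSentenceSplitter,
    SpacySentenceSplitter,
)

text = (
    "My dog loves the park. My robot vacuum cleans every morning. "
    "They make the house feel alive. The stock market closed higher today."
)

model = SentenceTransformersSimilarity()      # sentence-transformers based similarity scorer
sentence_splitter = SpacySentenceSplitter()   # spaCy-based sentence boundary detection
splitter = SimilarSentenceSplitter(model, sentence_splitter)

# Consecutive sentences stay in the same group while their similarity is high enough.
for group in splitter.split(text):
    print(group)
```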
Applying semantic chunking to the text, the model correctly groups the sentences about the dog and the robot, which are later referred to as "they," by factoring in the previous context. This approach lets us sidestep several challenges at once: picking a chunk size, managing overlap, and avoiding sentences being cropped in the middle.
This covers all the major practical text chunking methods commonly used in production. Now let’s look at one final approach to chunking for complicated documents. Or, you can download our comprehensive, end-to-end guide on building enterprise RAG systems.

LLM based chunking
These popular methods are all fine and good, but can we push them further? Let’s use the power of LLMs to go beyond traditional chunking!
Propositions
Unlike the conventional use of passages or sentences, a new paper Dense X Retrieval: What Retrieval Granularity Should We Use? introduces a novel retrieval unit for dense retrieval called "propositions." Propositions are atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format.
The three principles below define propositions as atomic expressions of meanings in text:
Each proposition should represent a distinct piece of meaning in the text, collectively embodying the semantics of the entire text.
A proposition must be minimal and cannot be further divided into separate propositions.
A proposition should contextualize itself and be self-contained, encompassing all the necessary context from the text (e.g., coreference) to interpret its meaning.

Let's put this novel approach into practice. The authors themselves trained the propositionizer model used in the code snippet below. As evident in the output, the model successfully performs coreference resolution, ensuring each sentence is self-sufficient.
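Here is a hedged sketch of calling that model with Hugging Face transformers. The checkpoint name and the "Title/Section/Content" prompt format below are taken from the paper's public release and may change, so treat them as assumptions:

```python
import json

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Propositionizer checkpoint released alongside the Dense X Retrieval paper (assumed name).
MODEL_NAME = "chentong00/propositionizer-wiki-flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

title, section = "Leaning Tower of Pisa", ""
passage = (
    "Prior to restoration work performed between 1990 and 2001, the tower leaned "
    "at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees."
)

# The model expects "Title: ... Section: ... Content: ..." and emits a JSON list
# of self-contained propositions.
input_text = f"Title: {title}. Section: {section}. Content: {passage}"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
raw = tokenizer.decode(outputs[0], skip_special_tokens=True)

try:
    propositions = json.loads(raw)
except json.JSONDecodeError:
    propositions = [raw]  # fall back to the raw string if the output is not valid JSON

for p in propositions:
    print(p)
```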
Multi-vector indexing
Another approach involves multi-vector indexing, where semantic search is performed for a vector derived from something other than the raw text. There are various methods to create multiple vectors per document.
Smaller chunks
Divide a document into smaller chunks and embed them (referred to as ParentDocumentRetriever).
Summary
Generate a summary for each document and embed it along with, or instead of, the document.
Hypothetical questions
Form hypothetical questions that each document would be appropriate to answer, and embed them along with, or instead of, the document.
Each of these approaches uses either a text-to-text model or an LLM with a prompt to produce the derived chunk. The system then indexes both the newly generated chunk and the original text, improving the recall of the retrieval system. You can find more details on these techniques in Langchain’s official documentation.
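As a sketch of the "smaller chunks" variant using LangChain's ParentDocumentRetriever: the import paths, the choice of Chroma, and OpenAI embeddings are illustrative and version-dependent, and the document content is a placeholder.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

docs = [Document(page_content="...full document text...", metadata={"source": "report.pdf"})]

# Small child chunks are embedded for precise retrieval; the docstore holds the parents.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
vectorstore = Chroma(collection_name="child_chunks", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
)
retriever.add_documents(docs)

# Similarity search runs over the small chunks, but the full parent document is returned.
results = retriever.get_relevant_documents("What does the report say about storage costs?")
```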
Document specific splitting
Is parsing documents really chunking? Despite my hesitation, let’s cover this related topic.
Remember the buzz around "Chat with PDF"? Things are easy with plain text, but once we introduce tables, images, and unconventional formatting, things get a lot more complicated. And what about .doc, .epub, and .xml formats? This leads us to a very powerful library…
Unstructured, with its diverse set of document type support and flexible partitioning strategies, offers several benefits for reading documents efficiently.
Supports all major document types
Unstructured supports a wide range of document types, including .pdf, .docx, .doc, .odt, .pptx, .ppt, .xlsx, .csv, .tsv, .eml, .msg, .rtf, .epub, .html, .xml, .png, .jpg, and .txt files. This ensures that users can seamlessly work with different file formats within a unified framework.
Adaptive partitioning
The "auto" strategy in Unstructured provides an adaptive approach to partitioning. It automatically selects the most suitable partitioning strategy based on the characteristics of the document. This feature simplifies the user experience and optimizes the processing of documents without the need for manual intervention in selecting partitioning strategies.
Specialized strategies for varied use cases
Unstructured provides specific strategies for different needs. The "fast" strategy quickly extracts information using traditional NLP techniques, "hi_res" uses detectron2-based document layout detection for precise element classification, and "ocr_only" is designed specifically for Optical Character Recognition on image-based files. These strategies accommodate various use cases, offering users flexibility and precision in their document processing workflows.
Unstructured's comprehensive document type support, adaptive partitioning strategies, and customization options make it a powerful tool for efficiently reading and processing a diverse range of documents.
Let's run an example and see how it partitions the Gemini 1.5 technical report.
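A sketch using Unstructured's partition_pdf: the file path is a placeholder for a locally downloaded copy of the report, and the hi_res strategy requires Unstructured's optional layout and OCR dependencies.

```python
# pip install "unstructured[pdf]"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="gemini_1_5_report.pdf",   # placeholder path to the downloaded report
    strategy="hi_res",                  # layout-aware parsing, needed for table structure
    infer_table_structure=True,         # keep an HTML rendering of detected tables
)

# Each element carries a category (Title, NarrativeText, Table, ...) and its text.
for element in elements[:10]:
    print(element.category, "->", element.text[:80])
```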
It effectively extracted various sections from the PDF and organized them into distinct elements. Now, let's examine the data to confirm if we successfully parsed the table below.
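We can pull out the Table elements and inspect their HTML rendering; metadata.text_as_html is populated when infer_table_structure=True is set, as in the snippet above.

```python
# Table elements keep an HTML version of the detected rows and columns.
tables = [el for el in elements if el.category == "Table"]
print(tables[0].metadata.text_as_html)
```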

Look how similar the two tables are! Unstructured identifies the columns and rows and reconstructs the table in HTML format, which makes tabular Q&A much easier!

How to measure chunking effectiveness

To optimize RAG performance, improving retrieval with effective chunking is crucial. Here are two chunk evaluation metrics to help you debug RAG faster.
Chunk Attribution
Chunk attribution evaluates whether each retrieved chunk influences the model's response. It employs a binary metric, categorizing each chunk as either Attributed or Not Attributed. Chunk Attribution is closely linked to Chunk Utilization (see below), with Attribution determining if a chunk impacted the response and Utilization measuring the extent of the chunk's text involved in the effect. Only Attributed chunks can have Utilization scores greater than zero.
Chunk Attribution helps pinpoint areas for improvement in RAG systems, such as adjusting the number of retrieved chunks. If the system provides satisfactory responses but has many Not Attributed chunks, reducing the retrieved chunks per example may enhance system efficiency, leading to lower costs and latency.
Additionally, when investigating individual examples with unusual or unsatisfactory model behavior, Attribution helps identify the specific chunks influencing the response, facilitating quicker troubleshooting.
Chunk Utilization
Chunk Utilization gauges the fraction of text in each retrieved chunk that impacts the model's response. The metric ranges from 0 to 1: a value of 1 means the entire chunk affected the response, while a lower value, such as 0.5, signals the presence of "extraneous" text that did not contribute. As noted above, only Attributed chunks can have Utilization scores greater than zero.
Low Chunk Utilization scores suggest that chunks may be longer than necessary. In such cases, reducing the parameters that control the chunk size is recommended to enhance system efficiency by lowering costs and latency.

Experiment
Learn more about how to use these metrics in this webinar, where we build a Q&A system over earnings call transcripts using Langchain. If you are looking for an end-to-end example with code, have a look at this blog.
Conclusion
In summary, effective chunking is crucial for optimizing RAG systems. It ensures accurate information retrieval and influences factors like response latency and storage costs. Choosing the right chunking strategy involves weighing multiple factors, but metrics like Chunk Attribution and Chunk Utilization make the evaluation much easier.
Galileo's RAG analytics offer a transformative approach, providing unparalleled visibility into RAG systems and simplifying evaluation to improve RAG performance. Sign up for your free Galileo account today, or continue your Mastering RAG journey with our free, comprehensive eBook.
