Chunking Strategy
Simple Chat processes a wide range of document types using a consistent chunking strategy to enable optimal context handling and LLM performance. Below is an overview of how we chunk content depending on the file type:
General Principles
- Chunk Size: Targeting ~400 words of meaningful content per chunk. Some formats (HTML, Markdown) require larger text windows (up to 1200 words) due to formatting overhead that inflates token count.
- Minimum Chunk Size: Chunks are merged if they contain fewer than 600 words, ensuring minimal context fragmentation.
- Table Handling: When chunking might split tables, we replicate the table header in each chunk to preserve readability.
- Code Blocks (Markdown): Ensure any split code blocks retain full formatting (```) in each chunk to maintain integrity.
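The minimum-chunk-size rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `merge_small_chunks` is hypothetical, and it assumes an undersized chunk is appended to the chunk before it (as the HTML section describes) and that words are counted by whitespace splitting.

```python
def merge_small_chunks(chunks, min_words=600):
    """Fold any chunk with fewer than min_words words into the preceding chunk."""
    merged = []
    for chunk in chunks:
        word_count = len(chunk.split())
        if merged and word_count < min_words:
            # Too small to stand alone: append it to the previous chunk.
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            # Large enough, or there is no preceding chunk to merge into.
            merged.append(chunk)
    return merged
```

Note that in this sketch an undersized first chunk is kept as-is, because there is nothing before it to merge into.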
File-Type Specific Chunking
PDF
- Sent to Document Intelligence for OCR + layout parsing.
- If a PDF exceeds 500 MB or 2000 pages, it is split into 500‐page parts.
- Each part is sent separately to Document Intelligence.
- All chunks from each part are saved under the original document in AI Search; parts exist only to work around service limits.
- Chunked by page.
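To make the part-splitting rule concrete, here is a small sketch of how the page ranges could be computed. The names `needs_split` and `pdf_part_ranges` are hypothetical, and the thresholds are simply the values quoted above; the real implementation may decide differently.

```python
MAX_SIZE_MB = 500      # size threshold quoted above
MAX_PAGES = 2000       # page threshold quoted above
PAGES_PER_PART = 500   # size of each part

def needs_split(size_mb, page_count):
    """Return True when the PDF exceeds either threshold."""
    return size_mb > MAX_SIZE_MB or page_count > MAX_PAGES

def pdf_part_ranges(page_count, pages_per_part=PAGES_PER_PART):
    """Return 1-based, inclusive (first_page, last_page) ranges for each part."""
    return [
        (start, min(start + pages_per_part - 1, page_count))
        for start in range(1, page_count + 1, pages_per_part)
    ]
```

For example, a 2300-page PDF would yield five ranges, the last one covering pages 2001 to 2300.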
DOCX
- Sent to Document Intelligence.
- Chunked by ~400 words, approximating an A4 page.
DOC / DOCM
- Processed with the Python package docx2txt.
- Chunked by ~400 words, approximating an A4 page.
PPTX
- Sent to Document Intelligence.
- Chunked by slide (page).
Images (.jpg, .jpeg, .png, .bmp, .tiff, .tif, .heif)
- Sent to Document Intelligence for OCR.
- One chunk per image.
TXT
- Processed using regex word splitting.
- Chunked by 400 words.
- See process_txt_file.
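A minimal sketch of regex word splitting into fixed-size chunks is shown below. It is not a copy of process_txt_file; the function name `chunk_by_words` and the choice of `\S+` as the word pattern are assumptions made for illustration.

```python
import re

def chunk_by_words(text, chunk_size=400):
    """Split text into chunks of at most chunk_size whitespace-delimited words."""
    words = re.findall(r"\S+", text)
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Rejoining words with single spaces discards the original line breaks, which is usually acceptable for plain text sent to an embedding model.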
HTML
- Uses RecursiveCharacterTextSplitter.
- Header-based chunking:
  - Initially split by <h1> tags.
  - Chunks >1200 words are recursively split using <h2> → <h3> → ... <h5>.
- Tables: If a table spans chunks, ensure headers are repeated per chunk.
- Minimum chunk size: Merge chunks <600 words into preceding ones.
- Goal: Maintain 400 words of informational content per chunk, accounting for token inflation from HTML tags.
- See process_html_file.
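The header-based recursion can be illustrated with a simplified, standard-library sketch. It uses a regular expression as a stand-in for the splitter named above, so it is not the actual process_html_file logic; `split_on_heading` and `chunk_html` are hypothetical names, and table handling and small-chunk merging are left out.

```python
import re

def split_on_heading(html, level):
    """Split html immediately before every <hN> tag of the given level."""
    parts = re.split(rf"(?=<h{level}\b)", html, flags=re.IGNORECASE)
    return [part for part in parts if part.strip()]

def chunk_html(html, max_words=1200, level=1, max_level=5):
    """Split by <h1>, then re-split oversized chunks by <h2>, <h3>, and so on."""
    chunks = []
    for part in split_on_heading(html, level):
        if len(part.split()) > max_words and level < max_level:
            # Still too large: descend one heading level.
            chunks.extend(chunk_html(part, max_words, level + 1, max_level))
        else:
            chunks.append(part)
    return chunks
```

A chunk that is still oversized after the <h5> split is returned unchanged here; a real pipeline would need a fallback such as character-based splitting.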
Markdown (.md)
- Uses MarkdownHeaderTextSplitter.
- Initial split by # headers (h1 → h5).
- Chunks >1200 words undergo recursive splitting.
- Table & Code Block Handling:
  - Tables: Re-add headers if split.
  - Code: Wrap with code block syntax (```) if a split occurs.
- Minimum 600-word chunks enforced.
- See process_md_file.
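One way to keep code blocks intact across a split is to close an open fence at the end of a chunk and reopen it at the start of the next. The sketch below shows that idea only; `rebalance_fences` is a hypothetical name, and this version does not carry the language tag over to the reopened fence.

```python
FENCE = "`" * 3  # the Markdown code fence marker

def fence_count(text):
    """Count lines that open or close a fenced code block."""
    return sum(1 for line in text.splitlines() if line.lstrip().startswith(FENCE))

def rebalance_fences(chunks):
    """Make every chunk contain an even number of fence lines."""
    fixed = []
    inside = False  # True when the previous chunk ended inside a code block
    for chunk in chunks:
        # Reopen the block that the previous chunk had to close.
        text = FENCE + "\n" + chunk if inside else chunk
        inside = fence_count(text) % 2 == 1
        if inside:
            # The chunk ends inside a code block: close it here.
            text = text.rstrip("\n") + "\n" + FENCE
        fixed.append(text)
    return fixed
```

After rebalancing, each chunk renders correctly on its own, at the cost of losing syntax highlighting hints in continuation chunks.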
JSON
- Uses RecursiveJsonSplitter - a specialized splitter designed for JSON data structures.
- Structural splitting:
  - Understands JSON objects, arrays, and nesting
  - max_chunk_size=4000 characters
  - convert_lists=True - intelligently handles JSON arrays
- Maintains validity: Each chunk is valid, parseable JSON.
- Empty chunk filtering: Skips trivial chunks like {}, [], or empty strings.
- Goal: Preserve JSON structure while creating semantically meaningful chunks that respect object/array boundaries.
- See process_json.
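The empty-chunk filter is easy to show in isolation. The sketch below covers only that step, not the splitter itself; `is_trivial_chunk` and `drop_trivial_chunks` are hypothetical names, and it assumes each chunk arrives as a JSON string.

```python
import json

def is_trivial_chunk(chunk):
    """Return True for blank strings and for chunks that parse to {} or []."""
    stripped = chunk.strip()
    if not stripped:
        return True
    try:
        return json.loads(stripped) in ({}, [])
    except json.JSONDecodeError:
        # Not parseable: leave the decision to the caller rather than drop it.
        return False

def drop_trivial_chunks(chunks):
    """Keep only chunks that carry some content."""
    return [chunk for chunk in chunks if not is_trivial_chunk(chunk)]
```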
XML
- Uses RecursiveCharacterTextSplitter with XML-aware separators.
- Structure-preserving chunking:
  - Separators prioritized: \n\n → \n → > (end of XML tags) → space → character
  - Splits at logical boundaries to maintain tag integrity
- Chunked by 4000 characters.
- Goal: Preserve XML structure by splitting at tag boundaries rather than mid-element, ensuring chunks are more semantically meaningful for LLM processing.
- See process_xml.
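To show how a prioritized separator list behaves, here is a simplified, standard-library sketch of the idea. It is a stand-in written for illustration, not the library splitter or process_xml: `split_recursive` is a hypothetical name, and features such as chunk overlap are omitted.

```python
XML_SEPARATORS = ["\n\n", "\n", ">", " "]  # priority order described above

def split_recursive(text, separators, max_chars=4000):
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # Last resort: cut at arbitrary character positions.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator so that joining the chunks restores the input.
    pieces = [piece + sep for piece in pieces[:-1]] + [pieces[-1]]
    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= max_chars:
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > max_chars:
            # This piece alone is too large: retry with a finer separator.
            chunks.extend(split_recursive(piece, finer, max_chars))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because ">" is tried before a space, a cut lands right after a closing angle bracket whenever possible, which keeps most tags whole.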
YAML / YML
- Uses RecursiveCharacterTextSplitter with YAML-aware separators.
- Structure-preserving chunking:
  - Separators prioritized: \n\n → \n → - (YAML list items) → space → character
  - Splits at logical boundaries to maintain YAML structure
- Chunked by 4000 characters.
- Goal: Preserve YAML hierarchy and list structures by splitting at section boundaries and list items rather than mid-key or mid-value.
- See process_yaml.
LOG
- Processed using line-based chunking to maintain log record integrity.
- Never splits mid-line to preserve complete log entries.
- Line-Level Chunking:
  - Split file by lines using splitlines(keepends=True) to preserve line endings.
  - Accumulate complete lines until reaching target word count ≈1000 words.
  - When adding next line would exceed target AND chunk already has content:
    - Finalize current chunk
    - Start new chunk with current line
  - If single line exceeds target, it gets its own chunk to prevent infinite loops.
  - Emit chunks with complete log records.
- Goal: Provide substantial log context (1000 words) while ensuring no log entry is split across chunks.
- See process_log.
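The line-level steps above translate fairly directly into code. The following is a sketch of that loop rather than the actual process_log source; the name `chunk_log` is hypothetical, and words are counted by whitespace splitting.

```python
def chunk_log(text, target_words=1000):
    """Group whole lines into chunks of roughly target_words words each."""
    chunks, current, count = [], [], 0
    for line in text.splitlines(keepends=True):
        line_words = len(line.split())
        # Finalize only if the chunk already has content; an oversized
        # single line therefore always lands in a chunk of its own.
        if current and count + line_words > target_words:
            chunks.append("".join(current))
            current, count = [], 0
        current.append(line)
        count += line_words
    if current:
        chunks.append("".join(current))
    return chunks
```

Because lines keep their endings and are never cut, concatenating the chunks reproduces the original file exactly.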
Tabular (CSV, XLSX, XLS, XLSM)
- CSV:
  - Parsed with pandas.
  - Chunked by rows into ~800-character chunks (1+ full rows per chunk).
  - Header row prepended to each chunk.
- XLSX / XLS:
  - Each worksheet is treated as a separate file:
    - Filename format: filename-tabname.ext
  - Chunked by rows (~800 characters).
  - Header rows are preserved.
  - Excel formulas are evaluated to return only the computed value.
- See process_tabular_file.
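Row-based chunking with a repeated header can be sketched as below. To stay dependency-free, this example uses the standard csv module as a stand-in for pandas, so it is an illustration of the rule rather than the process_tabular_file code; `chunk_csv` is a hypothetical name.

```python
import csv
import io

def render_row(row):
    """Serialize one row back to a CSV line ending in a newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()

def chunk_csv(text, target_chars=800):
    """Group full rows into chunks of about target_chars, each led by the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header = render_row(rows[0])
    chunks, body, size = [], [], len(header)
    for row in rows[1:]:
        line = render_row(row)
        # Flush only when the chunk already holds a row, so every chunk
        # contains at least one full data row.
        if body and size + len(line) > target_chars:
            chunks.append(header + "".join(body))
            body, size = [], len(header)
        body.append(line)
        size += len(line)
    if body:
        chunks.append(header + "".join(body))
    return chunks
```

Repeating the header costs a few characters per chunk but lets each chunk be interpreted without access to the rest of the file.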
Video Files
- Transcript Extraction:
  - Use Azure Video Indexer to generate a full transcript with timestamps and, if needed, confidence scores.
  - Retrieve the index JSON via the Get Video Index API; inspect the insights.transcript array for line segments with instances[start/end].
- Line-Level Chunking (preferred for simplicity):
  - Iterate insights.transcript, splitting each segment’s text into words.
  - Accumulate segments until the aggregate is approximately 30 seconds long.
  - Emit a chunk with:
    - startTime = first segment’s instances[0].start.
    - text = concatenation of segment texts.
  - Reset word count and continue.
- See process_video_file.
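The accumulation loop can be sketched as follows. This is an illustration under stated assumptions, not the process_video_file source: `to_seconds` and `chunk_transcript` are hypothetical names, and the sketch assumes that start and end are strings of the form H:MM:SS.fff and that only the first instance of each segment matters.

```python
def to_seconds(timestamp):
    """Convert an 'H:MM:SS.fff' string to seconds (format is an assumption)."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def chunk_transcript(segments, window_seconds=30.0):
    """Group transcript segments into chunks spanning about window_seconds."""
    chunks, texts, start = [], [], None
    for segment in segments:
        instance = segment["instances"][0]
        if start is None:
            start = instance["start"]  # start time of the chunk's first segment
        texts.append(segment["text"])
        if to_seconds(instance["end"]) - to_seconds(start) >= window_seconds:
            chunks.append({"startTime": start, "text": " ".join(texts)})
            texts, start = [], None
    if texts:
        # Flush whatever remains after the last full window.
        chunks.append({"startTime": start, "text": " ".join(texts)})
    return chunks
```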
Audio Files
- Transcript Extraction (Azure Speech Services):
  - Segments include text, offset (start time), and duration.
- Line-Level Chunking:
  - Iterate the segments, splitting each segment’s text into words.
  - Emit a chunk with:
    - startTime = the first segment’s offset formatted as HH:MM:SS.sss.
    - text = concatenation of segment texts.
- See process_audio_file.
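Formatting an offset as HH:MM:SS.sss might look like the sketch below. It assumes the offset is an integer count of 100-nanosecond ticks, which is how the Azure Speech SDK reports offsets; if process_audio_file receives a different unit, the conversion factor changes. The name `format_offset` is hypothetical.

```python
TICKS_PER_MILLISECOND = 10_000  # assumes offsets are in 100-nanosecond ticks

def format_offset(ticks):
    """Format a tick offset as HH:MM:SS.sss."""
    total_ms = ticks // TICKS_PER_MILLISECOND
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, milliseconds = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{milliseconds:03d}"
```

Integer arithmetic is used throughout so that the millisecond field is never affected by floating-point rounding.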