Unstructured

Package: unstructured · 8 nodes · Document ingestion

Parse PDF, DOCX, HTML, and image files into structured text elements. Chunk, filter, and extract content for downstream processing.

Node Reference

Node | Type | Inputs | Outputs
Partition | statement | File Path (str) | Elements (unstruct.Elements)
Partition Pdf | statement | File Path (str) | Elements (unstruct.Elements)
Chunk | statement | Elements (unstruct.Elements) | Chunks (unstruct.Elements)
Filter Category | statement | Elements (unstruct.Elements) | Filtered (unstruct.Elements)
To Texts | expression | Elements (unstruct.Elements) | Texts (list)
To Dicts | expression | Elements (unstruct.Elements) | Dicts (list)
Element Count | expression | Elements (unstruct.Elements) | Count (int)
Get Metadata | expression | Elements (unstruct.Elements) | Metadata (list)
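
As a rough illustration of what the filter and expression nodes compute, here is a minimal Python sketch. It assumes the nodes operate on objects shaped like the open-source unstructured library's elements (a `text` string, a `category` string, and a `metadata` object with `to_dict()`); that mapping is an assumption, not something this page confirms. `StubElement`, `StubMetadata`, and the helper function names are hypothetical stand-ins chosen to mirror the node names.

```python
from dataclasses import dataclass, field


@dataclass
class StubMetadata:
    """Hypothetical stand-in for an element's metadata object."""
    page_number: int = 1

    def to_dict(self):
        return {"page_number": self.page_number}


@dataclass
class StubElement:
    """Hypothetical stand-in for a parsed document element."""
    text: str
    category: str
    metadata: StubMetadata = field(default_factory=StubMetadata)


def filter_category(elements, category):
    # Filter Category: keep only elements whose category matches.
    return [el for el in elements if el.category == category]


def to_texts(elements):
    # To Texts: one plain string per element.
    return [el.text for el in elements]


def element_count(elements):
    # Element Count: number of elements in the collection.
    return len(elements)


def get_metadata(elements):
    # Get Metadata: one metadata dict per element.
    return [el.metadata.to_dict() for el in elements]


elements = [
    StubElement("Introduction", "Title"),
    StubElement("Body paragraph.", "NarrativeText", StubMetadata(page_number=2)),
]
titles = to_texts(filter_category(elements, "Title"))
```

The helpers only read attributes, so they should work equally on any objects exposing `text`, `category`, and `metadata.to_dict()`.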

Typical Pipeline

Partition → Chunk → To Texts, then feed the resulting texts into ChromaDB or an LLM context.
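
For readers who want to see the same flow outside the node editor, the pipeline can be sketched in Python. This assumes the nodes correspond to the open-source unstructured library's `partition` and `chunk_by_title` functions, which this page does not confirm. The helper names (`partition_chunk_texts`, `chunk_texts`, `make_ids`) and the `max_characters=500` default are illustrative choices, not part of the package.

```python
def chunk_texts(chunks):
    # To Texts step: extract stripped text, skipping empty chunks.
    return [c.text.strip() for c in chunks if c.text and c.text.strip()]


def make_ids(texts, prefix="chunk"):
    # Vector stores generally need one unique ID per stored document.
    return [f"{prefix}-{i}" for i in range(len(texts))]


def partition_chunk_texts(file_path, max_characters=500):
    # Imported lazily so the helpers above stay usable without the
    # unstructured library installed.
    from unstructured.partition.auto import partition
    from unstructured.chunking.title import chunk_by_title

    elements = partition(filename=file_path)  # Partition step
    chunks = chunk_by_title(elements, max_characters=max_characters)  # Chunk step
    return chunk_texts(chunks)
```

The returned texts, paired with IDs from `make_ids`, could then be handed to a ChromaDB collection (for example through `collection.add(documents=texts, ids=ids)`) or concatenated into an LLM prompt.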