Unstructured

Package: unstructured · 8 nodes · Document ingestion

Parse PDF, DOCX, HTML, and image files into structured text elements. Chunk, filter, and extract content for downstream processing.

Node Reference

Node             Type        Inputs                        Outputs
Partition        statement   File Path (str)               Elements (unstruct.Elements)
Partition Pdf    statement   File Path (str)               Elements (unstruct.Elements)
Chunk            statement   Elements (unstruct.Elements)  Chunks (unstruct.Elements)
Filter Category  statement   Elements (unstruct.Elements)  Filtered (unstruct.Elements)
To Texts         expression  Elements (unstruct.Elements)  Texts (list<any>)
To Dicts         expression  Elements (unstruct.Elements)  Dicts (list<any>)
Element Count    expression  Elements (unstruct.Elements)  Count (int)
Get Metadata     expression  Elements (unstruct.Elements)  Metadata (list<any>)
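
To make the input and output types of the element-level nodes concrete, here is a minimal Python sketch of what they compute. Everything in it is an assumption made for illustration: `Element` is a hypothetical stand-in for one item in unstruct.Elements, the function names only mirror the node names, and the category values are example data. It is not the package's implementation.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Element:
    """Hypothetical stand-in for one item in unstruct.Elements."""
    category: str
    text: str
    metadata: dict = field(default_factory=dict)


def filter_category(elements, category):
    """Filter Category: keep only elements whose category matches."""
    return [el for el in elements if el.category == category]


def to_texts(elements):
    """To Texts: one plain string per element."""
    return [el.text for el in elements]


def to_dicts(elements):
    """To Dicts: one dictionary per element."""
    return [asdict(el) for el in elements]


def element_count(elements):
    """Element Count: number of elements."""
    return len(elements)


def get_metadata(elements):
    """Get Metadata: one metadata dictionary per element."""
    return [el.metadata for el in elements]


# Example data (assumed, not produced by a real parser).
elements = [
    Element("Title", "Quarterly Report", {"page_number": 1}),
    Element("NarrativeText", "Revenue grew in Q3.", {"page_number": 1}),
    Element("NarrativeText", "Costs were flat.", {"page_number": 2}),
]

narrative = filter_category(elements, "NarrativeText")
print(element_count(elements), to_texts(narrative))
```

In this sketch the statement nodes return a new element list, while the expression nodes project that list into plain values (strings, dictionaries, or a count) that other nodes can consume directly.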

Typical Pipeline

Partition → Chunk → To Texts, then feed the resulting texts into ChromaDB or an LLM's context.
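
The data flow of this pipeline can be sketched in plain Python. The sketch below rests on stated assumptions: `partition` here just splits a string on blank lines (the real node parses a file from a path), `chunk` greedily merges neighbouring elements up to a hypothetical `max_chars` limit, and `Element` is a stand-in type. The actual nodes may chunk differently.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for one item in unstruct.Elements."""
    text: str


def partition(raw_text):
    """Stand-in for Partition: one element per blank-line-separated block."""
    blocks = [block.strip() for block in raw_text.split("\n\n")]
    return [Element(block) for block in blocks if block]


def chunk(elements, max_chars=40):
    """Stand-in for Chunk: greedily merge neighbours up to max_chars."""
    chunks, current = [], ""
    for el in elements:
        candidate = f"{current} {el.text}".strip()
        if current and len(candidate) > max_chars:
            # Adding this element would overflow, so close the current chunk.
            chunks.append(Element(current))
            current = el.text
        else:
            current = candidate
    if current:
        chunks.append(Element(current))
    return chunks


def to_texts(elements):
    """Stand-in for To Texts: one plain string per element."""
    return [el.text for el in elements]


raw = "Intro\n\nFirst short paragraph.\n\nSecond paragraph that is a bit longer."
texts = to_texts(chunk(partition(raw)))
for text in texts:
    print(text)
```

The resulting `texts` list is the value that would be handed to a vector store such as ChromaDB or concatenated into an LLM prompt.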