Unstructured
Package: unstructured · 8 nodes · Document ingestion
Parse PDF, DOCX, HTML, and image files into structured text elements. Chunk, filter, and extract content for downstream processing.
Node Reference
| Node | Type | Inputs | Outputs |
|---|---|---|---|
| Partition | statement | File Path (str) | Elements (unstruct.Elements) |
| Partition Pdf | statement | File Path (str) | Elements (unstruct.Elements) |
| Chunk | statement | Elements (unstruct.Elements) | Chunks (unstruct.Elements) |
| Filter Category | statement | Elements (unstruct.Elements) | Filtered (unstruct.Elements) |
| To Texts | expression | Elements (unstruct.Elements) | Texts (list<any>) |
| To Dicts | expression | Elements (unstruct.Elements) | Dicts (list<any>) |
| Element Count | expression | Elements (unstruct.Elements) | Count (int) |
| Get Metadata | expression | Elements (unstruct.Elements) | Metadata (list<any>) |
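
The table only gives node names and types, so the following is a minimal sketch of what the element-level nodes plausibly do, not the package's implementation. The `Element` dataclass, its field names, and the category strings (`Title`, `NarrativeText`) are assumptions made for illustration. The real `unstruct.Elements` type may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Element:
    """Hypothetical stand-in for one item inside unstruct.Elements."""
    category: str
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_category(elements: list[Element], category: str) -> list[Element]:
    # Filter Category: keep only elements whose category matches.
    return [el for el in elements if el.category == category]


def to_texts(elements: list[Element]) -> list[str]:
    # To Texts: one plain string per element.
    return [el.text for el in elements]


def to_dicts(elements: list[Element]) -> list[dict[str, Any]]:
    # To Dicts: one plain dictionary per element.
    return [
        {"category": el.category, "text": el.text, "metadata": dict(el.metadata)}
        for el in elements
    ]


def element_count(elements: list[Element]) -> int:
    # Element Count: number of elements in the collection.
    return len(elements)


def get_metadata(elements: list[Element]) -> list[dict[str, Any]]:
    # Get Metadata: one metadata dictionary per element.
    return [dict(el.metadata) for el in elements]


# Assumed sample data; a real run would get this from Partition.
elements = [
    Element("Title", "Quarterly Report", {"page_number": 1}),
    Element("NarrativeText", "Revenue grew in Q3.", {"page_number": 1}),
    Element("NarrativeText", "Costs were flat.", {"page_number": 2}),
]
narrative = filter_category(elements, "NarrativeText")
```

In this sketch the statement nodes (such as Filter Category) return a new element collection, while the expression nodes reduce a collection to plain Python values that other nodes can consume.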
Typical Pipeline
Partition → Chunk → To Texts, then pass the resulting texts to ChromaDB or into an LLM context.
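
The data flow of this pipeline can be sketched with standard-library stand-ins. None of the functions below are the package's nodes: `partition` here simply splits a text file on blank lines, `chunk` greedily merges consecutive elements up to a hypothetical `max_chars` limit, and the final hand-off to ChromaDB or an LLM is left out.

```python
import tempfile
from pathlib import Path


def partition(file_path: str) -> list[str]:
    """Stand-in for Partition: one element per blank-line-separated paragraph."""
    raw = Path(file_path).read_text(encoding="utf-8")
    return [block.strip() for block in raw.split("\n\n") if block.strip()]


def chunk(elements: list[str], max_chars: int = 60) -> list[str]:
    """Stand-in for Chunk: merge consecutive elements while they fit in max_chars."""
    chunks: list[str] = []
    current = ""
    for text in elements:
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            # The next element does not fit, so close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def to_texts(chunks: list[str]) -> list[str]:
    """Stand-in for To Texts: the chunks are already plain strings here."""
    return list(chunks)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(
        "Intro\n\nFirst paragraph of body text.\n\nSecond paragraph of body text.\n",
        encoding="utf-8",
    )
    elements = partition(str(path))          # Partition
    chunks = chunk(elements, max_chars=40)   # Chunk
    texts = to_texts(chunks)                 # To Texts
```

The resulting `texts` list is the kind of value a vector store or prompt builder would consume next. Chunking before To Texts keeps each string within a size budget, which is the usual reason for placing Chunk in the middle of the pipeline.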