Data Extraction

KARLI-hosted data-extraction models turn uploaded files into structured text that downstream components can consume. They are selected from the Read File component when its Extraction Backend is set to karli.

KARLI is currently the only provider offering this category of model.

Available Models

Model	Accepts	Notes
`karli/default-data-extraction`	Any	KARLI-managed default; routes the file to a sensible extractor.
`karli/data-extraction-moe-latest`	Any	Mixture-of-Experts router — picks the optimal extractor per file type (and per page for PDFs). See details below.
`docling-project/docling`	Documents	Docling, run server-side by KARLI.
`datalab-to/marker`	Documents	Marker.
`opendatalab/MinerU`	Documents	MinerU.
`karli/multimodal-data-extraction`	Documents	Multimodal hybrid pipeline.
`openai/whisper-large-v3`	Audio	Audio transcription via Whisper.

The Read File component checks the selected file against the chosen model's accepted type before uploading it. Submitting a PDF to the Whisper model, for example, produces an error rather than an upload.
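The pre-upload check can be pictured roughly as follows. This is a minimal sketch, not the component's actual code: the function name `validate_file_for_model`, the lowercase category labels, and the use of the audio extensions from the routing table later on this page to decide what counts as audio are all assumptions made for illustration.

```python
from pathlib import Path

# Accepted categories per model, copied from the table above.
MODEL_ACCEPTS = {
    "karli/default-data-extraction": "any",
    "karli/data-extraction-moe-latest": "any",
    "docling-project/docling": "documents",
    "datalab-to/marker": "documents",
    "opendatalab/MinerU": "documents",
    "karli/multimodal-data-extraction": "documents",
    "openai/whisper-large-v3": "audio",
}

# Assumption: audio files are recognised by these extensions (taken from the
# MoE routing table on this page); the real check may work differently.
AUDIO_EXTENSIONS = {"aac", "mpeg", "wav", "webm", "mp3", "mp4"}


def validate_file_for_model(file_name: str, model: str) -> None:
    """Raise ValueError if the file does not match the model's accepted type."""
    accepted = MODEL_ACCEPTS[model]
    extension = Path(file_name).suffix.lower().lstrip(".")
    is_audio = extension in AUDIO_EXTENSIONS

    if accepted == "audio" and not is_audio:
        raise ValueError(f"{model} accepts audio files, got .{extension}")
    if accepted == "documents" and is_audio:
        raise ValueError(f"{model} accepts documents, got .{extension}")
    # Models that accept "any" file need no further check.
```

Calling `validate_file_for_model("report.pdf", "openai/whisper-large-v3")` raises in this sketch, mirroring the PDF-to-Whisper case described above.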

Mixture-of-Experts Routing

The `karli/data-extraction-moe-latest` model automatically selects an extractor for each file based on its extension:

File Category	Extensions	Extractor
PDF	pdf	Per-page router (see below)
Word documents	doc, docx	Docling
Presentations	ppt, pptx	Docling
Spreadsheets / tables	xls, xlsx, csv	Default
HTML	htm, html, xhtml	MarkItDown
Images	png, jpg, jpeg, gif, bmp, tiff, tif, webp	Vision
Audio	aac, mpeg, wav, webm, mp3, mp4	Audio (Whisper)
Email	eml, msg, pst	Default
Plain text	txt	Default

Per-page PDF routing

For PDFs the MoE model inspects each page individually, classifies its complexity, and dispatches it to the most suitable extractor:

Page Category	Extractor	Needs Vision
Scanned page	Marker	Yes
Handwriting	Marker	Yes
Image-dominant	Marker	Yes
Complex layout	Marker	No
Plain text	MinerU	No
Mixed content	Marker	No

This means a single PDF can be processed by multiple extractors — for example, a text-heavy page goes through MinerU while a scanned page is routed to Marker with vision support.
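To make the per-page dispatch concrete, the sketch below maps already-classified pages to extractors using the table above. How KARLI classifies a page is not described here, so the classifier is left out entirely. The category keys (`"scanned"`, `"plain_text"`, and so on) and the function `plan_pdf_extraction` are hypothetical names chosen for this example.

```python
from typing import List, Tuple

# Page category -> (extractor, needs_vision), transcribed from the table above.
PAGE_ROUTES = {
    "scanned": ("Marker", True),
    "handwriting": ("Marker", True),
    "image_dominant": ("Marker", True),
    "complex_layout": ("Marker", False),
    "plain_text": ("MinerU", False),
    "mixed_content": ("Marker", False),
}


def plan_pdf_extraction(page_categories: List[str]) -> List[Tuple[int, str, bool]]:
    """Return (page_number, extractor, needs_vision) for each classified page."""
    plan = []
    for page_number, category in enumerate(page_categories, start=1):
        extractor, needs_vision = PAGE_ROUTES[category]
        plan.append((page_number, extractor, needs_vision))
    return plan
```

A two-page PDF classified as `["plain_text", "scanned"]` would produce one MinerU page without vision and one Marker page with vision, which is the mixed case described above.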

Request Shape

When a file is sent for extraction, the component issues a POST to {KARLI_BASE_URL}/data-extraction/extract as a multipart upload:

- The form field extractorModel carries the selected model (mapped to its KARLI identifier).
- The file part carries the document or audio file.
- Authentication follows the priority described below.
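As a rough illustration of the request described above, the helper below assembles the endpoint URL and form data without sending anything. It is a sketch under assumptions: the function name `build_extraction_request` is hypothetical, the model string is expected to be the already-mapped KARLI identifier, and the multipart field name `"file"` is a guess, since this page only says "the file part".

```python
from pathlib import Path


def build_extraction_request(base_url: str, extractor_model: str, file_path: str) -> dict:
    """Assemble the pieces of an extraction upload; nothing is sent here."""
    path = Path(file_path)
    return {
        # {KARLI_BASE_URL}/data-extraction/extract, as given in the text above.
        "url": f"{base_url.rstrip('/')}/data-extraction/extract",
        # Form field named in the text above.
        "data": {"extractorModel": extractor_model},
        # Assumption: the file part is named "file"; the source does not say.
        "file_field": "file",
        "file_name": path.name,
    }
```

With an HTTP client such as `requests`, the result could then be sent along the lines of `requests.post(req["url"], data=req["data"], files={req["file_field"]: open(file_path, "rb")}, headers=auth_headers)`, where `auth_headers` stands for whichever authentication header applies.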

The response is a JSON object whose segments are concatenated into a single text payload; segments with a title are emitted as ## <title> Markdown headers.
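The concatenation step might look like the following. The exact response schema is not given on this page, so the keys `"segments"`, `"title"`, and `"text"`, the blank-line separator, and the function name `segments_to_text` are all assumptions made for this sketch.

```python
def segments_to_text(response: dict) -> str:
    """Join response segments into one text payload with Markdown headers."""
    parts = []
    # Assumption: segments live under a "segments" key as a list of objects.
    for segment in response.get("segments", []):
        title = segment.get("title")
        if title:
            # Titled segments get a "## <title>" header, as described above.
            parts.append(f"## {title}")
        text = segment.get("text", "")
        if text:
            parts.append(text)
    # Assumption: parts are separated by a blank line.
    return "\n\n".join(parts)
```

Under these assumptions, a response with one titled segment ("Intro" / "Hello") and one untitled segment ("World") becomes a payload starting with `## Intro`, followed by the two text blocks.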

Authentication

Extraction requests authenticate using the following priority:

1. JWT, if a Karli Studio session is active, sent as Authorization: Bearer <token>.
2. KARLI_API_KEY (from the component attribute, provider variables, or the KARLI_API_KEY environment variable), sent as X-API-Key: <key>.
3. Error: if neither is available, a ValueError is raised instructing the user to configure KARLI_API_KEY or access via Karli Studio.
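The priority order above can be sketched as a single function. This is an illustration rather than the component's code: `build_auth_headers` is a hypothetical name, and the component attribute and provider variables are collapsed into one `api_key` argument for brevity.

```python
import os
from typing import Optional


def build_auth_headers(jwt_token: Optional[str] = None,
                       api_key: Optional[str] = None) -> dict:
    """Pick the authentication header following the documented priority."""
    # 1. An active Karli Studio session supplies a JWT.
    if jwt_token:
        return {"Authorization": f"Bearer {jwt_token}"}

    # 2. Otherwise use an API key. The api_key argument stands in for the
    #    component attribute or provider variable (an assumption of this
    #    sketch), with the environment variable as the last source.
    key = api_key or os.environ.get("KARLI_API_KEY")
    if key:
        return {"X-API-Key": key}

    # 3. Neither credential is available.
    raise ValueError("Configure KARLI_API_KEY or access via Karli Studio.")
```

Note that a JWT wins even when an API key is also present, so `build_auth_headers(jwt_token="t", api_key="k")` returns only the `Authorization` header.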

External API users

If you call the Agentlab API directly and your flows include a FileComponent with document extraction, you must configure the KARLI_API_KEY provider variable or set the KARLI_API_KEY environment variable on the server. See the Model Providers overview for details.

See Document Extraction for how the Read File component uses these models in practice, including the downstream Data shape.