Turn any document into structured data.
A purpose-built extraction engine — not a general-purpose LLM squinting at a PDF. Layout-aware computer vision plus vision-language correction. Built to win on the documents general models break on: scanned, stamped, multi-column, handwritten, rotated, half-readable.
Four operations. One pipeline.
Parse, Extract, Split, Redact — composable as a single chain or callable independently. The same engine powers all four, so the accuracy you see in extraction holds for redaction.
Layout-grounded reading order.
Computer-vision pass first: tables, multi-column layouts, rotated scans, stamps, signatures, marks, handwriting — captured with positional accuracy before any text is generated. Then a vision-language pass corrects edge cases.
Field-level structured output.
Auto-detected schemas or your own templates. Every field comes with a confidence score and a bounding box. Markdown for LLMs; JSON for warehouses; raw text when you need it.
Multi-document packets, untangled.
A 100-page PDF containing twelve documents comes back as exactly those twelve documents — automatically classified, indexed, and queued. Reviewer overrides supported.
Remove what shouldn’t exist.
Pixel-level redaction overlays driven by the same detection engine that powers the Vault. Output: a redacted PDF you can send to a partner, plus a tokenized mirror you can reverse with policy.
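A pixel-level overlay is conceptually simple: black out every pixel inside each detected bounding box before the sanitized copy is rendered. The sketch below is an illustration of that idea on a toy grayscale page, not Logikol's implementation — the box format `(x0, y0, x1, y1)` is an assumption.

```python
def apply_redactions(page, boxes):
    """Black out pixel regions given redaction boxes as (x0, y0, x1, y1).

    `page` is a row-major 2D list of grayscale values; 0 = black.
    Box format is a hypothetical example, not the actual overlay schema.
    """
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                page[y][x] = 0
    return page

page = [[255] * 8 for _ in range(4)]          # toy 8x4 all-white page
redacted = apply_redactions(page, [(2, 1, 5, 3)])
```

The tokenized mirror works the other way: instead of destroying the pixels, the detected value is swapped for a vault token, so the redaction is reversible under policy.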
Vision first. Language as the corrector.
We don't hand a PDF to an LLM and pray. A purpose-built computer-vision pipeline does the heavy lifting first; a vision-language model only enters where it's most useful — correcting low-confidence cells and inferring context.
Layout pass
A trained CV model reads the document like a human: column order, table boundaries, stamps, signatures, handwriting locations, rotation.
OCR + structure
Per-region OCR with structure preserved. Tables stay tables. Forms stay forms. Confidence scored at the cell level.
VLM correction
A vision-language model reviews low-confidence regions and reasons over context (e.g. 'is that a 0 or an O?'), pushing the field-level score higher.
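The routing logic behind that third step — run the cheap OCR pass everywhere, escalate only low-confidence regions to the expensive VLM — can be sketched as follows. The threshold value and region shape here are illustrative assumptions, not the engine's internals.

```python
CONFIDENCE_FLOOR = 0.90  # hypothetical cutoff; the real threshold is internal

def route_regions(ocr_regions, vlm_correct):
    """Escalate only low-confidence OCR regions to the VLM correction pass."""
    corrected = []
    for region in ocr_regions:
        if region["confidence"] < CONFIDENCE_FLOOR:
            corrected.append(vlm_correct(region))  # expensive, context-aware
        else:
            corrected.append(region)               # CV/OCR result stands
    return corrected

# Toy VLM stand-in: resolves an ambiguous O/0 and raises confidence.
fix = lambda r: {**r, "text": r["text"].replace("O", "0"), "confidence": 0.98}

regions = [
    {"text": "ACCT 4O12", "confidence": 0.61},   # ambiguous character
    {"text": "2026-04-30", "confidence": 0.99},  # already confident
]
result = route_regions(regions, fix)
```

This is why the VLM stays cheap at scale: it only ever sees the regions the vision pass is unsure about.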
Pick the format your downstream stack wants.
Every extraction returns the same underlying graph; you choose how it materializes. Mix and match in one request.
Markdown
LLM-ready. Headings, lists, tables, key-value blocks. Ideal as direct input to any RAG or agent pipeline.
JSON (typed fields)
Strict schema with confidence scores and bounding boxes. Drop straight into your warehouse or pipeline.
Redaction overlays
Pixel-coordinate redaction maps. Render a sanitized PDF for sharing while keeping a tokenized original.
Tables (CSV / Parquet)
Tables extracted as actual tables — column headers, row alignment, cell-level confidence. Not just markdown.
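Because extracted tables arrive with explicit headers and aligned rows (the `headers` / `rows` shape shown in the response example below), serializing one to CSV downstream is a few lines. This is a consumer-side sketch assuming that response shape, not part of the API itself.

```python
import csv
import io

def table_to_csv(table):
    """Serialize one extracted table (headers + rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buf.getvalue()

table = {
    "headers": ["date", "amount"],
    "rows": [["2026-04-01", "120.00"], ["2026-04-15", "-45.50"]],
}
csv_text = table_to_csv(table)
```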
One call. Markdown, JSON, tokens.
Single sync request. No polling, no two-step submit/retrieve dance. Optional inline tokenization wires extraction into the Privacy Vault in the same call.
curl -X POST https://api.logikol.com/v1/extract \
-H "Authorization: Bearer $LOGIKOL_KEY" \
-F "file=@statement.pdf" \
-F 'options={
"output": ["markdown", "fields", "tables"],
"schema": "auto",
"tokenize_pii": true
}'

{
"markdown": "# Account Statement\n\n**Holder:** [name_a3f]\n...",
"fields": {
"account_number": { "value": "[acct_4de]", "confidence": 0.99, "bbox": [...] },
"balance": { "value": 48200.00, "confidence": 1.00, "bbox": [...] },
"statement_date": { "value": "2026-04-30", "confidence": 1.00, "bbox": [...] }
},
"tables": [ { "rows": [...], "headers": [...] } ],
"tokens": { "[name_a3f]": "vault_ref://...", "[acct_4de]": "vault_ref://..." },
"pages": 4,
"ms": 1840
}

On the documents general models actually fail on.
Real-world documents aren't clean. They're scanned, rotated, stamped, marked up, multi-language, sometimes hand-fed into an old fax machine before they ever reached your inbox. Logikol is built for those.
Scanned + rotated documents
Auto-deskew, auto-rotate, auto-orient. The vision pass doesn't care which way up the page came in.
Multi-column + nested tables
Reading order is layout-derived, not heuristic. Nested tables stay nested.
Stamps, signatures, handwriting
Detected as objects, scored separately, surfaced as fields you can route for review.
100+ languages
Including RTL scripts and CJK. Same accuracy posture across the board.
Document Intelligence + Privacy Vault = parse and protect in one call.
Pass tokenize_pii: true on any extraction. Sensitive fields come back as format-preserving tokens; the originals are encrypted under your keys before the response ever leaves the gateway.
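A common pattern on the consuming side is to act on the per-field confidence scores: accept high-confidence fields automatically and queue the rest for human review. The sketch below assumes the `fields` shape from the response example above; the threshold is a hypothetical choice, tuned per field criticality.

```python
REVIEW_THRESHOLD = 0.95  # hypothetical cutoff; tune per field criticality

def fields_for_review(response):
    """Return names of extracted fields whose confidence is below threshold."""
    return sorted(
        name
        for name, field in response["fields"].items()
        if field["confidence"] < REVIEW_THRESHOLD
    )

response = {
    "fields": {
        "account_number": {"value": "[acct_4de]", "confidence": 0.99},
        "balance": {"value": 48200.00, "confidence": 1.00},
        "payee": {"value": "ACME Corp", "confidence": 0.72},
    }
}
flagged = fields_for_review(response)  # → ['payee']
```

Tokenized values like `[acct_4de]` pass through this untouched: review routing works on confidence, while the Vault holds the originals under your keys.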