Supported document formats and conversions

This document is the single source of truth for which document conversions are supported and which return 501 Not Implemented in this deployment. Do not advertise conversions that return 501 as supported.

Document understanding pipeline

A single pipeline coordinates layout, classification, and extraction:

  1. Layout (OCR/structure) – Optional. Runs locally via the Donut model (medha_os.core.pdf_engine.layout_extractor, medha_os.services.document_understanding_pipeline). PDFs are rendered to one image per page; each page image, and any standalone image, is processed by Donut to recover structure (titles, paragraphs, tables). No external OCR service is required when Donut is available.
  2. Classification – Optional. Document type/category can be derived from layout output or from text using tenant classifiers.
  3. Extraction – Optional. Schema-based or intelligent extraction runs on the text/layout output.
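The three optional stages above can be sketched as a small function that runs only the stages it is given. This is a minimal illustration, not the DocumentUnderstandingPipeline implementation: PipelineResult, run_pipeline, toy_layout, and toy_classify are hypothetical names, and the toy stages stand in for Donut and the tenant classifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PipelineResult:
    """Collects whatever each enabled stage produced (hypothetical type)."""
    layout: Optional[dict] = None
    doc_type: Optional[str] = None
    fields: dict = field(default_factory=dict)


def run_pipeline(
    text: str,
    layout_fn: Optional[Callable[[str], dict]] = None,
    classify_fn: Optional[Callable[[str, Optional[dict]], str]] = None,
    extract_fn: Optional[Callable[[str, Optional[dict]], dict]] = None,
) -> PipelineResult:
    """Run only the stages that were supplied; every stage is optional."""
    result = PipelineResult()
    if layout_fn is not None:
        result.layout = layout_fn(text)
    if classify_fn is not None:
        # Classification may use layout output when present, else plain text.
        result.doc_type = classify_fn(text, result.layout)
    if extract_fn is not None:
        result.fields = extract_fn(text, result.layout)
    return result


# Toy stand-ins for illustration only; the real layout stage is Donut.
def toy_layout(text: str) -> dict:
    lines = text.splitlines()
    return {"titles": lines[:1], "paragraphs": lines[1:], "tables": []}


def toy_classify(text: str, layout: Optional[dict]) -> str:
    title = layout["titles"][0] if layout and layout["titles"] else text
    return "invoice" if "invoice" in title.lower() else "unknown"


result = run_pipeline("Invoice 42\nTotal: 10 EUR", toy_layout, toy_classify)
print(result.doc_type)  # invoice
```

Because the extraction stage is not passed here, result.fields stays empty, which mirrors how a deployment can enable layout and classification without extraction.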

For image-heavy documents, an optional vision-capable LLM step can be configured (e.g. GPT-4V, Claude vision). When the router selects a vision-capable backend and image bytes are provided, the image bytes can be included in the request payload; whether they are used depends on the backend implementation.
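As a rough sketch of that rule, the helper below attaches image bytes only when the selected backend is vision-capable and bytes were actually provided. The function name build_payload, the backend_supports_vision flag, and the payload keys "prompt" and "image_base64" are assumptions made for illustration; the real payload shape depends on the backend.

```python
import base64
from typing import Optional


def build_payload(
    prompt: str,
    image_bytes: Optional[bytes],
    backend_supports_vision: bool,
) -> dict:
    """Build a request payload, adding the image only for vision backends."""
    payload = {"prompt": prompt}  # hypothetical key name
    if backend_supports_vision and image_bytes:
        # Base64 keeps the bytes safe inside a JSON body.
        payload["image_base64"] = base64.b64encode(image_bytes).decode("ascii")
    return payload


with_image = build_payload("Describe this page", b"\x89PNG", True)
text_only = build_payload("Describe this page", b"\x89PNG", False)
```

With a text-only backend the image is silently dropped rather than sent to a service that cannot use it.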

Pipeline entry point: medha_os.services.document_understanding_pipeline.DocumentUnderstandingPipeline (methods such as run_layout and layout_then_classify). The AI service uses the same layout extractor for /api/v1/ai/layout/analyze and for document classification.

Supported

  • Core PDF ingest – PDF upload, indexing, and core document operations.
  • Layout extraction – Local Donut for PDFs and images (PNG, JPEG, etc.); structure output (titles, paragraphs, tables).
  • OCR – Where implemented and where its dependencies (e.g. ocrmypdf, Tesseract) are installed. When Donut is used for layout, it provides OCR-free document understanding.
  • PDF to HTML – Where local conversion is available (otherwise returns 501).
  • PDF to Markdown – Where implemented (otherwise returns 501).
  • HTML to PDF – Where implemented (otherwise returns 501).
  • Markdown to PDF – Where implemented (otherwise returns 501).
  • ZIP to PDF – Where implemented (otherwise returns 501).

Specific endpoints may return 200 or 501 depending on deployment and available tools.
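One way to picture this deployment-dependent behavior is a check that maps missing command-line tools to a 501 status. This is a sketch under assumptions, not the service's actual logic: conversion_status is a hypothetical helper, and the detail text follows the "conversion not available" pattern used in this document.

```python
import shutil
from typing import Sequence, Tuple


def conversion_status(name: str, required_tools: Sequence[str]) -> Tuple[int, str]:
    """Return (HTTP status, detail) based on which tools are on PATH."""
    # shutil.which returns None when an executable cannot be found.
    missing = [tool for tool in required_tools if shutil.which(tool) is None]
    if missing:
        return 501, f"{name} conversion not available"
    return 200, "ok"


# The result depends on the machine: 200 if both tools are installed, else 501.
status, detail = conversion_status("OCR", ["ocrmypdf", "tesseract"])
print(status, detail)
```

Running the check at request time (or once at startup) lets the same code base report 200 on one deployment and 501 on another.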

Return 501 (not available in this deployment)

The following conversions may return 501 Not Implemented with a detail such as "… conversion not available". Do not advertise these as supported unless your deployment implements them.

  • PDF to vector – Vector conversion not available.
  • ZIP – ZIP conversion not available (when not implemented).
  • Markdown – Markdown conversion not available (when not implemented).
  • EML – EML conversion not available.
  • Office (DOCX, XLSX, etc.) – Office conversion not available.
  • HTML – HTML conversion not available (when not implemented).
  • Comic – Comic conversion not available.
  • PS/EPS – PS/EPS conversion not available.
  • Local conversion – Various endpoints return "Local conversion not available" when the backing service or tool is not available.

API behavior

Endpoints that can return 501 document this in OpenAPI (responses={501: {"description": "This conversion is not available in this deployment."}}). The 501 response is intentional for non-core features; any feature that returns 501 in this deployment must not be advertised as supported in marketing or public docs.
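Since the 501 responses are declared in OpenAPI, the schema itself can be scanned before publishing docs. The sketch below lists operations that declare a 501 response in an OpenAPI document loaded as a dict; operations_declaring_501 is a hypothetical helper, and the paths in sample_spec are invented for illustration rather than taken from this service.

```python
from typing import Dict, List

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operations_declaring_501(openapi_spec: Dict) -> List[str]:
    """List 'METHOD /path' entries whose responses include a 501."""
    found = []
    for path, item in openapi_spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            # OpenAPI JSON uses string status codes as response keys.
            if "501" in operation.get("responses", {}):
                found.append(f"{method.upper()} {path}")
    return sorted(found)


# Hypothetical schema fragment; real paths come from the service's OpenAPI JSON.
sample_spec = {
    "paths": {
        "/convert/eml": {
            "post": {
                "responses": {
                    "200": {"description": "Converted."},
                    "501": {"description": "This conversion is not available in this deployment."},
                }
            }
        },
        "/documents": {"get": {"responses": {"200": {"description": "OK"}}}},
    }
}

print(operations_declaring_501(sample_spec))  # ['POST /convert/eml']
```

Note that a declared 501 only means the endpoint can return it; whether it does depends on the deployment, so the list is a set of candidates to verify, not a final verdict.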