lesyk c6308dc822 [MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update READMEs

* Refactor import statements for consistency and improve formatting in converter and test files
2026-03-10 09:17:17 -07:00

MarkItDown OCR Plugin

LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.

Uses the same llm_client / llm_model pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.

Features

  • Enhanced PDF Converter: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
  • Enhanced DOCX Converter: OCR for images in Word documents
  • Enhanced PPTX Converter: OCR for images in PowerPoint presentations
  • Enhanced XLSX Converter: OCR for images in Excel spreadsheets
  • Context Preservation: Maintains document structure and flow when inserting extracted text

Installation

pip install markitdown-ocr

The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:

pip install openai

Usage

Command Line

markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o

Python API

Pass llm_client and llm_model to MarkItDown() exactly as you would for image descriptions:

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)

If no llm_client is provided, the plugin still loads, but OCR is silently skipped and conversion falls back to the standard built-in converters.

Custom Prompt

Override the default extraction prompt for specialized documents:

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    llm_prompt="Extract all text from this image, preserving table structure.",
)

Any OpenAI-Compatible Client

Works with any client that follows the OpenAI API:

from openai import AzureOpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=AzureOpenAI(
        api_key="...",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    ),
    llm_model="gpt-4o",
)

How It Works

When MarkItDown(enable_plugins=True, llm_client=..., llm_model=...) is called:

  1. MarkItDown discovers the plugin via the markitdown.plugin entry point group
  2. It calls register_converters(), forwarding all kwargs including llm_client and llm_model
  3. The plugin creates an LLMVisionOCRService from those kwargs
  4. Four OCR-enhanced converters are registered at priority -1.0 — before the built-in converters at priority 0.0
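The priority mechanism in step 4 can be sketched in plain Python. The class and converter names below are illustrative stand-ins, not MarkItDown's actual registry types; the only grounded detail is that lower-priority converters are tried first, so -1.0 beats the built-in 0.0:

```python
# Minimal sketch of priority-based converter selection. Class and
# converter names are illustrative; MarkItDown's real registry differs.

class Registration:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority

registry = [
    Registration("BuiltinPdfConverter", 0.0),  # built-in default
    Registration("OcrPdfConverter", -1.0),     # plugin, registered later
]

# Lower priority is tried first: sorting puts the OCR converter ahead of
# the built-in one, so it gets the first chance to accept the file.
registry.sort(key=lambda r: r.priority)
first = registry[0].name
```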

When a file is converted:

  1. The OCR converter accepts the file
  2. It extracts embedded images from the document
  3. Each image is sent to the LLM with an extraction prompt
  4. The returned text is inserted inline, preserving document structure
  5. If the LLM call fails, conversion continues without that image's text
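Steps 3–5 of the conversion loop can be sketched as follows. `ocr_images` and `fake_llm` are hypothetical names for illustration; the real plugin's internals differ, but the grounded behavior is that a failed LLM call for one image does not abort the conversion:

```python
# Sketch of the per-image OCR loop: each extracted image is sent to an
# LLM callable; a failure for one image skips it and keeps converting.

def ocr_images(images, llm):
    blocks = []
    for image_bytes in images:
        try:
            text = llm(image_bytes)
        except Exception:
            continue  # step 5: drop this image's text, continue conversion
        if text:
            blocks.append(f"*[Image OCR]\n{text}\n[End OCR]*")
    return blocks

def fake_llm(data):
    if data == b"bad":
        raise RuntimeError("API error")
    return "Hello"

result = ocr_images([b"ok", b"bad"], fake_llm)
```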

Supported File Formats

PDF

  • Embedded images are extracted by position (via page.images / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
  • Scanned PDFs (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
  • Malformed PDFs that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
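The scanned-page check above hinges on one condition: the page has no usable text layer. A minimal sketch of that test, with an illustrative helper name (the rendering and LLM call are elided):

```python
# Sketch of scanned-page detection: a page whose extracted text layer is
# empty or whitespace-only is treated as scanned and routed to full-page
# OCR instead of image-by-image extraction.

def is_scanned_page(extracted_text):
    """True when a page has no usable text layer."""
    return not (extracted_text or "").strip()

pages = ["Intro text", "", "   \n"]
scanned = [i for i, t in enumerate(pages) if is_scanned_page(t)]
```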

DOCX

  • Images are extracted via document part relationships (doc.part.rels).
  • OCR runs before the DOCX→HTML→Markdown pipeline executes. Placeholder tokens are injected into the HTML so the Markdown converter does not escape the OCR markers; after conversion, the placeholders are replaced with the formatted *[Image OCR]...[End OCR]* blocks.
  • Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
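The placeholder round-trip described above can be sketched with plain string operations. The `%%OCR_n%%` token format and both helper names are assumptions for illustration; the grounded idea is that an opaque token survives the HTML→Markdown step unescaped and is swapped for the formatted block afterwards:

```python
# Sketch of the DOCX placeholder round-trip: inject an opaque token into
# the intermediate HTML, convert, then replace the token with the final
# *[Image OCR]...[End OCR]* block.

def inject_placeholder(html, index):
    return html + f"<p>%%OCR_{index}%%</p>"

def resolve_placeholders(markdown, ocr_texts):
    for i, text in enumerate(ocr_texts):
        markdown = markdown.replace(
            f"%%OCR_{i}%%", f"*[Image OCR]\n{text}\n[End OCR]*"
        )
    return markdown

html = inject_placeholder("<p>Body</p>", 0)
# ...HTML -> Markdown conversion happens here; the token survives verbatim...
markdown = "Body\n\n%%OCR_0%%"
final = resolve_placeholders(markdown, ["Sample extracted text"])
```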

PPTX

  • Picture shapes, placeholder shapes with images, and images inside groups are all supported.
  • Shapes are processed in top-to-bottom, left-to-right reading order per slide.
  • If an llm_client is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
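The reading order above amounts to sorting shapes by their vertical offset, then their horizontal one. A sketch using dicts as stand-ins for python-pptx shape objects (which expose `top` and `left` offsets):

```python
# Sketch of slide reading order: sort shapes top-to-bottom, then
# left-to-right. Dicts stand in for python-pptx shape objects.

shapes = [
    {"name": "footer", "top": 600, "left": 0},
    {"name": "title", "top": 0, "left": 100},
    {"name": "logo", "top": 0, "left": 500},
]

ordered = sorted(shapes, key=lambda s: (s["top"], s["left"]))
names = [s["name"] for s in ordered]
```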

XLSX

  • Images embedded in worksheets (sheet._images) are extracted per sheet.
  • Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
  • Images are listed under a `### Images in this sheet:` heading after the sheet's data table; they are not interleaved into the table rows.
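The column/row → letter-notation step can be sketched as a small helper. The function name is hypothetical; the algorithm is the standard base-26 conversion (A–Z, then AA, AB, ...) applied to a 0-based anchor:

```python
# Sketch of mapping a 0-based image anchor (column, row) to Excel "A1"
# notation, as used to report where an embedded image sits.

def anchor_to_cell(col, row):
    letters = ""
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row + 1}"
```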

Output format

Every extracted OCR block is wrapped as:

*[Image OCR]
<extracted text>
[End OCR]*
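For reference, the wrapping can be expressed as a one-line helper (the function name is hypothetical, not part of the plugin's public API):

```python
# Wraps extracted text in the plugin's OCR block format shown above.
def format_ocr_block(text):
    return f"*[Image OCR]\n{text}\n[End OCR]*"
```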

Troubleshooting

OCR text missing from output

The most likely cause is a missing llm_client or llm_model. Verify:

from openai import OpenAI
from markitdown import MarkItDown

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),   # required
    llm_model="gpt-4o",    # required
)

Plugin not loading

Confirm the plugin is installed and discovered:

markitdown --list-plugins   # should show: ocr

API errors

The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.
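The warn-and-continue policy can be sketched like this. `safe_ocr` and `failing_llm` are illustrative names; the grounded behavior is that an API failure becomes a warning rather than an exception, so conversion proceeds:

```python
# Sketch of the warn-and-continue error policy: an LLM API failure is
# surfaced as a warning instead of raising, so conversion keeps going.
import warnings

def safe_ocr(image_bytes, llm):
    try:
        return llm(image_bytes)
    except Exception as exc:
        warnings.warn(f"OCR skipped for one image: {exc}")
        return None

def failing_llm(data):
    raise RuntimeError("quota exceeded")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = safe_ocr(b"img", failing_llm)
```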

Development

Running Tests

cd packages/markitdown-ocr
pytest tests/ -v

Building from Source

git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-ocr
pip install -e .

Contributing

Contributions are welcome! See the MarkItDown repository for guidelines.

License

MIT — see LICENSE.

Changelog

0.1.0 (Initial Release)

  • LLM Vision OCR for PDF, DOCX, PPTX, XLSX
  • Full-page OCR fallback for scanned PDFs
  • Context-aware inline text insertion
  • Priority-based converter replacement (no code changes required)