c6308dc822
* Add OCR test data and implement tests for various document formats - Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy. * Enhance OCR functionality and validation in document converters - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction. * Add support for scanned PDFs with full-page OCR fallback and implement tests * Bump version to 0.1.6b1 in __about__.py * Refactor OCR services to support LLM Vision, update README and tests accordingly * Add OCR-enabled converters and ensure consistent OCR format across document types * Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters * Refactor exception imports for consistency across converters and tests * Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling * Bump version to 0.1.6b1 in __about__.py * Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing * Add comprehensive OCR test suite for various document formats - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms. * Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework. * Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs * Revert * Revert * Update REDMEs * Refactor import statements for consistency and improve formatting in converter and test files
201 lines
5.8 KiB
Markdown
201 lines
5.8 KiB
Markdown
# MarkItDown OCR Plugin
|
|
|
|
LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.
|
|
|
|
Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.
|
|
|
|
## Features
|
|
|
|
- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
|
|
- **Enhanced DOCX Converter**: OCR for images in Word documents
|
|
- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
|
|
- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
|
|
- **Context Preservation**: Maintains document structure and flow when inserting extracted text
|
|
|
|
## Installation
|
|
|
|
```bash
|
|
pip install markitdown-ocr
|
|
```
|
|
|
|
The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:
|
|
|
|
```bash
|
|
pip install openai
|
|
```
|
|
|
|
## Usage
|
|
|
|
### Command Line
|
|
|
|
```bash
|
|
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
|
|
```
|
|
|
|
### Python API
|
|
|
|
Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:
|
|
|
|
```python
|
|
from markitdown import MarkItDown
|
|
from openai import OpenAI
|
|
|
|
md = MarkItDown(
|
|
enable_plugins=True,
|
|
llm_client=OpenAI(),
|
|
llm_model="gpt-4o",
|
|
)
|
|
|
|
result = md.convert("document_with_images.pdf")
|
|
print(result.text_content)
|
|
```
|
|
|
|
If no `llm_client` is provided the plugin still loads, but OCR is silently skipped — falling back to the standard built-in converter.
|
|
|
|
### Custom Prompt
|
|
|
|
Override the default extraction prompt for specialized documents:
|
|
|
|
```python
|
|
md = MarkItDown(
|
|
enable_plugins=True,
|
|
llm_client=OpenAI(),
|
|
llm_model="gpt-4o",
|
|
llm_prompt="Extract all text from this image, preserving table structure.",
|
|
)
|
|
```
|
|
|
|
### Any OpenAI-Compatible Client
|
|
|
|
Works with any client that follows the OpenAI API:
|
|
|
|
```python
|
|
from openai import AzureOpenAI
|
|
|
|
md = MarkItDown(
|
|
enable_plugins=True,
|
|
llm_client=AzureOpenAI(
|
|
api_key="...",
|
|
azure_endpoint="https://your-resource.openai.azure.com/",
|
|
api_version="2024-02-01",
|
|
),
|
|
llm_model="gpt-4o",
|
|
)
|
|
```
|
|
|
|
## How It Works
|
|
|
|
When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:
|
|
|
|
1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
|
|
2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
|
|
3. The plugin creates an `LLMVisionOCRService` from those kwargs
|
|
4. Four OCR-enhanced converters are registered at **priority -1.0** — before the built-in converters at priority 0.0
|
|
|
|
When a file is converted:
|
|
|
|
1. The OCR converter accepts the file
|
|
2. It extracts embedded images from the document
|
|
3. Each image is sent to the LLM with an extraction prompt
|
|
4. The returned text is inserted inline, preserving document structure
|
|
5. If the LLM call fails, conversion continues without that image's text
|
|
|
|
## Supported File Formats
|
|
|
|
### PDF
|
|
|
|
- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
|
|
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
|
|
- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
|
|
|
|
### DOCX
|
|
|
|
- Images are extracted via document part relationships (`doc.part.rels`).
|
|
- OCR is run before the DOCX→HTML→Markdown pipeline executes: placeholder tokens are injected into the HTML so that the markdown converter does not escape the OCR markers, and the final placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks after conversion.
|
|
- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
|
|
|
|
### PPTX
|
|
|
|
- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
|
|
- Shapes are processed in top-to-left reading order per slide.
|
|
- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
|
|
|
|
### XLSX
|
|
|
|
- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
|
|
- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
|
|
- Images are listed under a `### Images in this sheet:` section after the sheet's data table — they are not interleaved into the table rows.
|
|
|
|
### Output format
|
|
|
|
Every extracted OCR block is wrapped as:
|
|
|
|
```text
|
|
*[Image OCR]
|
|
<extracted text>
|
|
[End OCR]*
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### OCR text missing from output
|
|
|
|
The most likely cause is a missing `llm_client` or `llm_model`. Verify:
|
|
|
|
```python
|
|
from openai import OpenAI
|
|
from markitdown import MarkItDown
|
|
|
|
md = MarkItDown(
|
|
enable_plugins=True,
|
|
llm_client=OpenAI(), # required
|
|
llm_model="gpt-4o", # required
|
|
)
|
|
```
|
|
|
|
### Plugin not loading
|
|
|
|
Confirm the plugin is installed and discovered:
|
|
|
|
```bash
|
|
markitdown --list-plugins # should show: ocr
|
|
```
|
|
|
|
### API errors
|
|
|
|
The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.
|
|
|
|
## Development
|
|
|
|
### Running Tests
|
|
|
|
```bash
|
|
cd packages/markitdown-ocr
|
|
pytest tests/ -v
|
|
```
|
|
|
|
### Building from Source
|
|
|
|
```bash
|
|
git clone https://github.com/microsoft/markitdown.git
|
|
cd markitdown/packages/markitdown-ocr
|
|
pip install -e .
|
|
```
|
|
|
|
## Contributing
|
|
|
|
Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.
|
|
|
|
## License
|
|
|
|
MIT — see [LICENSE](LICENSE).
|
|
|
|
## Changelog
|
|
|
|
### 0.1.0 (Initial Release)
|
|
|
|
- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
|
|
- Full-page OCR fallback for scanned PDFs
|
|
- Context-aware inline text insertion
|
|
- Priority-based converter replacement (no code changes required)
|