[MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats
  - Created HTML file with multiple images for testing OCR extraction.
  - Added several PDF files with different layouts and image placements to validate OCR functionality.
  - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
  - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
  - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
* Enhance OCR functionality and validation in document converters
  - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
  - Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
  - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
  - Improve error handling and logging for better debugging during OCR extraction.
* Add support for scanned PDFs with full-page OCR fallback and implement tests
* Bump version to 0.1.6b1 in __about__.py
* Refactor OCR services to support LLM Vision, update README and tests accordingly
* Add OCR-enabled converters and ensure consistent OCR format across document types
* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters
* Refactor exception imports for consistency across converters and tests
* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling
* Bump version to 0.1.6b1 in __about__.py
* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
* Add comprehensive OCR test suite for various document formats
  - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
  - Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
  - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
  - Added expected OCR results for validation against ground truth.
  - Included tests for scanned documents to verify OCR fallback mechanisms.
* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency
  - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
  - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
  - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs
* Revert
* Revert
* Update READMEs
* Refactor import statements for consistency and improve formatting in converter and test files
@@ -0,0 +1,68 @@
"""
Plugin registration for markitdown-ocr.
Registers OCR-enhanced converters with priority-based replacement strategy.
"""

from typing import Any
from markitdown import MarkItDown

from ._ocr_service import LLMVisionOCRService
from ._pdf_converter_with_ocr import PdfConverterWithOCR
from ._docx_converter_with_ocr import DocxConverterWithOCR
from ._pptx_converter_with_ocr import PptxConverterWithOCR
from ._xlsx_converter_with_ocr import XlsxConverterWithOCR


__plugin_interface_version__ = 1


def register_converters(markitdown: MarkItDown, **kwargs: Any) -> None:
    """
    Register OCR-enhanced converters with MarkItDown.

    This plugin provides OCR support for PDF, DOCX, PPTX, and XLSX files.
    The converters are registered with priority -1.0 to run BEFORE built-in
    converters (which have priority 0.0), effectively replacing them when
    the plugin is enabled.

    Args:
        markitdown: MarkItDown instance to register converters with
        **kwargs: Additional keyword arguments that may include:
            - llm_client: OpenAI-compatible client for LLM-based OCR (required for OCR to work)
            - llm_model: Model name (e.g., 'gpt-4o')
            - llm_prompt: Custom prompt for text extraction
    """
    # Create OCR service; reads the same llm_client/llm_model kwargs
    # that MarkItDown itself already accepts for image descriptions
    llm_client = kwargs.get("llm_client")
    llm_model = kwargs.get("llm_model")
    llm_prompt = kwargs.get("llm_prompt")

    ocr_service: LLMVisionOCRService | None = None
    if llm_client and llm_model:
        ocr_service = LLMVisionOCRService(
            client=llm_client,
            model=llm_model,
            default_prompt=llm_prompt,
        )

    # Register converters with priority -1.0 (before built-ins at 0.0)
    # This effectively "replaces" the built-in converters when plugin is installed
    # Pass the OCR service to each converter's constructor
    PRIORITY_OCR_ENHANCED = -1.0

    markitdown.register_converter(
        PdfConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
    )

    markitdown.register_converter(
        DocxConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
    )

    markitdown.register_converter(
        PptxConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
    )

    markitdown.register_converter(
        XlsxConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
    )
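
The registration logic in this file can be illustrated with a small standalone sketch. It does not import markitdown: `ToyConverter`, `ToyRegistry`, and `register_toy_converters` are hypothetical stand-ins invented for illustration, and the only behaviour modelled is what the docstring describes, namely that a lower priority value is tried first and that the OCR service is only created when both `llm_client` and `llm_model` are supplied. Tie-breaking between equal priorities is not modelled.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class ToyConverter:
    """Hypothetical stand-in for a converter; not a MarkItDown class."""

    name: str
    ocr_service: Optional[Any] = None


@dataclass
class ToyRegistry:
    """Hypothetical registry that tries converters in ascending priority order."""

    entries: List[Tuple[float, ToyConverter]] = field(default_factory=list)

    def register_converter(
        self, converter: ToyConverter, *, priority: float = 0.0
    ) -> None:
        self.entries.append((priority, converter))

    def ordered(self) -> List[ToyConverter]:
        # Lower priority values come first, so -1.0 is tried before 0.0.
        return [conv for _, conv in sorted(self.entries, key=lambda e: e[0])]


def register_toy_converters(registry: ToyRegistry, **kwargs: Any) -> None:
    """Mirror the shape of the registration function shown in the diff."""
    client = kwargs.get("llm_client")
    model = kwargs.get("llm_model")
    # The service exists only when both a client and a model were passed.
    service = {"client": client, "model": model} if client and model else None
    registry.register_converter(ToyConverter("pdf+ocr", service), priority=-1.0)


registry = ToyRegistry()
registry.register_converter(ToyConverter("pdf-builtin"), priority=0.0)
register_toy_converters(registry, llm_client="fake-client", llm_model="gpt-4o")
print([conv.name for conv in registry.ordered()])
```

In this toy model the OCR-enhanced converter is listed ahead of the built-in one even though it was registered later, which is the "replacement" effect the docstring describes. Calling `register_toy_converters` without `llm_client` or `llm_model` still registers the converter, but with `ocr_service` set to `None`.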