lesyk c6308dc822 [MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update READMEs

* Refactor import statements for consistency and improve formatting in converter and test files
2026-03-10 09:17:17 -07:00


[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "markitdown-ocr"
dynamic = ["version"]
description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision"]
authors = [
    { name = "Contributors", email = "noreply@github.com" },
]
classifiers = [
    "Development Status :: 4 - Beta",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Programming Language :: Python :: Implementation :: CPython",
]
# Core dependencies: these match the file-format libraries markitdown already uses
dependencies = [
    "markitdown>=0.1.0",
    "pdfminer.six>=20251230",
    "pdfplumber>=0.11.9",
    "PyMuPDF>=1.24.0",
    "mammoth~=1.11.0",
    "python-docx",
    "python-pptx",
    "pandas",
    "openpyxl",
    "Pillow>=9.0.0",
]
# llm_client is passed in by the user (same as for markitdown image descriptions);
# install openai or any OpenAI-compatible SDK separately.
[project.optional-dependencies]
llm = [
    "openai>=1.0.0",
]
[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Issues = "https://github.com/microsoft/markitdown/issues"
Source = "https://github.com/microsoft/markitdown"
[tool.hatch.version]
path = "src/markitdown_ocr/__about__.py"
# CRITICAL: MarkItDown discovers this plugin through the "markitdown.plugin" entry-point group
[project.entry-points."markitdown.plugin"]
ocr = "markitdown_ocr"