[MS] Add OCR layer service for embedded images and PDF scans (#1541)

* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update READMEs

* Refactor import statements for consistency and improve formatting in converter and test files
This commit is contained in:
lesyk
2026-03-10 16:17:17 +00:00
committed by GitHub
parent 4a5340f93b
commit c6308dc822
45 changed files with 5382 additions and 2 deletions
@@ -9,7 +9,7 @@
> [!IMPORTANT]
> Breaking changes between 0.0.1 to 0.1.0:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, which also accepted text file-like objects, such as io.StringIO.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
@@ -132,6 +132,38 @@ markitdown --use-plugins path-to-file.pdf
To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.
#### markitdown-ocr Plugin
The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision — the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.
**Installation:**
```bash
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
```
**Usage:**
Pass the same `llm_client` and `llm_model` you would use for image descriptions:
```python
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```
If no `llm_client` is provided the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.
See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
### Azure Document Intelligence
To use Microsoft Document Intelligence for conversion:
@@ -0,0 +1,21 @@
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,200 @@
# MarkItDown OCR Plugin
LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.
Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.
## Features
- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
- **Enhanced DOCX Converter**: OCR for images in Word documents
- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
- **Context Preservation**: Maintains document structure and flow when inserting extracted text
## Installation
```bash
pip install markitdown-ocr
```
The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:
```bash
pip install openai
```
## Usage
### Command Line
```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```
### Python API
Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:
```python
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```
If no `llm_client` is provided the plugin still loads, but OCR is silently skipped — falling back to the standard built-in converter.
### Custom Prompt
Override the default extraction prompt for specialized documents:
```python
md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
llm_prompt="Extract all text from this image, preserving table structure.",
)
```
### Any OpenAI-Compatible Client
Works with any client that follows the OpenAI API:
```python
from openai import AzureOpenAI
md = MarkItDown(
enable_plugins=True,
llm_client=AzureOpenAI(
api_key="...",
azure_endpoint="https://your-resource.openai.azure.com/",
api_version="2024-02-01",
),
llm_model="gpt-4o",
)
```
## How It Works
When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:
1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
3. The plugin creates an `LLMVisionOCRService` from those kwargs
4. Four OCR-enhanced converters are registered at **priority -1.0** — before the built-in converters at priority 0.0
When a file is converted:
1. The OCR converter accepts the file
2. It extracts embedded images from the document
3. Each image is sent to the LLM with an extraction prompt
4. The returned text is inserted inline, preserving document structure
5. If the LLM call fails, conversion continues without that image's text
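The priority-based replacement in step 4 can be illustrated with a minimal sketch (the `Converter` class, `registry` list, and `pick_converter` helper below are illustrative stand-ins, not the plugin's actual code; only the priority values match the description above):

```python
# Minimal sketch of priority-ordered converter dispatch: lower priority
# values are tried first, so the plugin's converters (-1.0) shadow the
# built-ins (0.0) for the formats they accept.

class Converter:
    def __init__(self, name: str, priority: float):
        self.name = name
        self.priority = priority

    def accepts(self, extension: str) -> bool:
        return extension == ".pdf"

registry = [
    Converter("BuiltinPdfConverter", 0.0),
    Converter("PdfConverterWithOCR", -1.0),  # registered by the plugin
]

def pick_converter(extension: str):
    # Sort by priority so the OCR converter is consulted before the
    # built-in one; the first converter that accepts the file wins.
    for conv in sorted(registry, key=lambda c: c.priority):
        if conv.accepts(extension):
            return conv.name
    return None

print(pick_converter(".pdf"))  # → PdfConverterWithOCR
```

Because dispatch is purely priority-driven, no built-in code has to change: uninstalling the plugin restores the original converters.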
## Supported File Formats
### PDF
- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.
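The scanned-PDF detection boils down to a simple heuristic, sketched here (the function name is illustrative; the real check runs on pdfplumber's per-page text extraction):

```python
# Sketch of the scanned-PDF heuristic described above: a document is
# treated as scanned when no page yields any extractable text, which
# triggers the full-page render-and-OCR fallback.

def is_scanned(page_texts: list[str]) -> bool:
    """True when every page's extracted text is empty or whitespace."""
    return not any(text.strip() for text in page_texts)

print(is_scanned(["", "  \n", ""]))  # → True  (full-page OCR fallback)
print(is_scanned(["Hello", ""]))     # → False (normal extraction)
```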
### DOCX
- Images are extracted via document part relationships (`doc.part.rels`).
- OCR is run before the DOCX→HTML→Markdown pipeline executes: placeholder tokens are injected into the HTML so that the markdown converter does not escape the OCR markers, and the final placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks after conversion.
- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.
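The placeholder round-trip can be sketched as follows (simplified from the plugin's DOCX converter; the two helper names are illustrative):

```python
import re

# Sketch of the placeholder round-trip described above: <img> tags are
# swapped for plain tokens before HTML->Markdown conversion so the
# converter cannot escape the OCR markers, then the tokens are replaced
# with the formatted OCR blocks afterwards.
PLACEHOLDER = "MARKITDOWNOCRBLOCK{}"

def inject_placeholders(html: str, n_images: int) -> str:
    """Replace each <img> tag with a numbered plain-text token."""
    counter = iter(range(n_images))

    def repl(match: re.Match) -> str:
        i = next(counter, None)
        return f"<p>{PLACEHOLDER.format(i)}</p>" if i is not None else ""

    return re.sub(r"<img[^>]*>", repl, html)

def restore_ocr_blocks(markdown: str, ocr_texts: list[str]) -> str:
    """Swap tokens back for formatted OCR blocks after conversion."""
    for i, text in enumerate(ocr_texts):
        markdown = markdown.replace(
            PLACEHOLDER.format(i), f"*[Image OCR]\n{text}\n[End OCR]*"
        )
    return markdown

html = '<h1>Report</h1><img src="chart.png"/>'
staged = inject_placeholders(html, 1)
print(staged)  # → <h1>Report</h1><p>MARKITDOWNOCRBLOCK0</p>
print(restore_ocr_blocks(staged, ["Q3 revenue"]))
```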
### PPTX
- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
- Shapes are processed in top-to-bottom, left-to-right reading order per slide.
- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.
### XLSX
- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
- Images are listed under a `### Images in this sheet:` section after the sheet's data table — they are not interleaved into the table rows.
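The anchor-to-cell calculation mentioned above amounts to a base-26 letter conversion, sketched here (the function name is illustrative; openpyxl also ships a `get_column_letter` utility that does the same job):

```python
# Sketch of the anchor -> cell-reference calculation: a 0-indexed
# (column, row) pair from the image anchor becomes an Excel-style
# reference such as "B3".

def anchor_to_cell(col: int, row: int) -> str:
    """Convert a 0-indexed (col, row) anchor to Excel letter notation."""
    letters = ""
    col += 1  # switch to 1-indexed for the base-26 letter arithmetic
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row + 1}"

print(anchor_to_cell(0, 0))   # → A1
print(anchor_to_cell(27, 9))  # → AB10
```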
### Output format
Every extracted OCR block is wrapped as:
```text
*[Image OCR]
<extracted text>
[End OCR]*
```
## Troubleshooting
### OCR text missing from output
The most likely cause is a missing `llm_client` or `llm_model`. Verify:
```python
from openai import OpenAI
from markitdown import MarkItDown
md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(), # required
llm_model="gpt-4o", # required
)
```
### Plugin not loading
Confirm the plugin is installed and discovered:
```bash
markitdown --list-plugins # should show: ocr
```
### API errors
The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.
## Development
### Running Tests
```bash
cd packages/markitdown-ocr
pytest tests/ -v
```
### Building from Source
```bash
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-ocr
pip install -e .
```
## Contributing
Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.
## License
MIT — see [LICENSE](LICENSE).
## Changelog
### 0.1.0 (Initial Release)
- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
- Full-page OCR fallback for scanned PDFs
- Context-aware inline text insertion
- Priority-based converter replacement (no code changes required)
@@ -0,0 +1,57 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "markitdown-ocr"
dynamic = ["version"]
description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision"]
authors = [
{ name = "Contributors", email = "noreply@github.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: Implementation :: CPython",
]
# Core dependencies — matches the file-format libraries markitdown already uses
dependencies = [
"markitdown>=0.1.0",
"pdfminer.six>=20251230",
"pdfplumber>=0.11.9",
"PyMuPDF>=1.24.0",
"mammoth~=1.11.0",
"python-docx",
"python-pptx",
"pandas",
"openpyxl",
"Pillow>=9.0.0",
]
# llm_client is passed in by the user (same as for markitdown image descriptions);
# install openai or any OpenAI-compatible SDK separately.
[project.optional-dependencies]
llm = [
"openai>=1.0.0",
]
[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Issues = "https://github.com/microsoft/markitdown/issues"
Source = "https://github.com/microsoft/markitdown"
[tool.hatch.version]
path = "src/markitdown_ocr/__about__.py"
# CRITICAL: Plugin entry point - MarkItDown will discover this plugin through this entry point
[project.entry-points."markitdown.plugin"]
ocr = "markitdown_ocr"
@@ -0,0 +1,4 @@
# SPDX-FileCopyrightText: 2025-present Contributors
# SPDX-License-Identifier: MIT
__version__ = "0.1.0"
@@ -0,0 +1,31 @@
# SPDX-FileCopyrightText: 2025-present Contributors
# SPDX-License-Identifier: MIT
"""
markitdown-ocr: OCR plugin for MarkItDown
Adds LLM Vision-based text extraction from images embedded in PDF, DOCX, PPTX, and XLSX files.
"""
from ._plugin import __plugin_interface_version__, register_converters
from .__about__ import __version__
from ._ocr_service import (
OCRResult,
LLMVisionOCRService,
)
from ._pdf_converter_with_ocr import PdfConverterWithOCR
from ._docx_converter_with_ocr import DocxConverterWithOCR
from ._pptx_converter_with_ocr import PptxConverterWithOCR
from ._xlsx_converter_with_ocr import XlsxConverterWithOCR
__all__ = [
"__version__",
"__plugin_interface_version__",
"register_converters",
"OCRResult",
"LLMVisionOCRService",
"PdfConverterWithOCR",
"DocxConverterWithOCR",
"PptxConverterWithOCR",
"XlsxConverterWithOCR",
]
@@ -0,0 +1,189 @@
"""
Enhanced DOCX Converter with OCR support for embedded images.
Extracts images from Word documents and performs OCR while maintaining context.
"""
import io
import re
import sys
from typing import Any, BinaryIO, Optional
from markitdown.converters import HtmlConverter
from markitdown.converter_utils.docx.pre_process import pre_process_docx
from markitdown import DocumentConverterResult, StreamInfo
from markitdown._exceptions import (
MissingDependencyException,
MISSING_DEPENDENCY_MESSAGE,
)
from ._ocr_service import LLMVisionOCRService
# Try loading dependencies
_dependency_exc_info = None
try:
import mammoth
from docx import Document
except ImportError:
_dependency_exc_info = sys.exc_info()
# Placeholder injected into HTML so that mammoth never sees the OCR markers.
# Must be a single token with no special markdown characters.
_PLACEHOLDER = "MARKITDOWNOCRBLOCK{}"
class DocxConverterWithOCR(HtmlConverter):
"""
Enhanced DOCX Converter with OCR support for embedded images.
Maintains document flow while extracting text from images inline.
"""
def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
super().__init__()
self._html_converter = HtmlConverter()
self.ocr_service = ocr_service
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension == ".docx":
return True
if mimetype.startswith(
"application/vnd.openxmlformats-officedocument.wordprocessingml"
):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".docx",
feature="docx",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # type: ignore[union-attr]
# Get OCR service if available (from kwargs or instance)
ocr_service: Optional[LLMVisionOCRService] = (
kwargs.get("ocr_service") or self.ocr_service
)
if ocr_service:
# 1. Extract and OCR images — returns raw text per image
file_stream.seek(0)
image_ocr_map = self._extract_and_ocr_images(file_stream, ocr_service)
# 2. Convert DOCX → HTML via mammoth
file_stream.seek(0)
pre_process_stream = pre_process_docx(file_stream)
html_result = mammoth.convert_to_html(
pre_process_stream, style_map=kwargs.get("style_map")
).value
# 3. Replace <img> tags with plain placeholder tokens so that
# mammoth's HTML→markdown step never escapes our OCR markers.
html_with_placeholders, ocr_texts = self._inject_placeholders(
html_result, image_ocr_map
)
# 4. Convert HTML → markdown
md_result = self._html_converter.convert_string(
html_with_placeholders, **kwargs
)
md = md_result.markdown
# 5. Swap placeholders for the actual OCR blocks (post-conversion
# so * and _ are never escaped by the markdown converter).
for i, raw_text in enumerate(ocr_texts):
placeholder = _PLACEHOLDER.format(i)
ocr_block = f"*[Image OCR]\n{raw_text}\n[End OCR]*"
md = md.replace(placeholder, ocr_block)
return DocumentConverterResult(markdown=md)
else:
# Standard conversion without OCR
style_map = kwargs.get("style_map", None)
pre_process_stream = pre_process_docx(file_stream)
return self._html_converter.convert_string(
mammoth.convert_to_html(pre_process_stream, style_map=style_map).value,
**kwargs,
)
def _extract_and_ocr_images(
self, file_stream: BinaryIO, ocr_service: LLMVisionOCRService
) -> dict[str, str]:
"""
Extract images from DOCX and OCR them.
Returns:
Dict mapping image relationship IDs to raw OCR text (no markers).
"""
ocr_map = {}
try:
file_stream.seek(0)
doc = Document(file_stream)
for rel in doc.part.rels.values():
if "image" in rel.target_ref.lower():
try:
image_bytes = rel.target_part.blob
image_stream = io.BytesIO(image_bytes)
ocr_result = ocr_service.extract_text(image_stream)
if ocr_result.text.strip():
# Store raw text only — markers added later
ocr_map[rel.rId] = ocr_result.text.strip()
except Exception:
continue
except Exception:
pass
return ocr_map
def _inject_placeholders(
self, html: str, ocr_map: dict[str, str]
) -> tuple[str, list[str]]:
"""
Replace <img> tags with numbered placeholder tokens.
Returns:
(html_with_placeholders, ordered list of raw OCR texts)
"""
if not ocr_map:
return html, []
ocr_texts = list(ocr_map.values())
used: list[int] = []
def replace_img(match: re.Match) -> str: # type: ignore[type-arg]
for i in range(len(ocr_texts)):
if i not in used:
used.append(i)
return f"<p>{_PLACEHOLDER.format(i)}</p>"
return "" # remove image if all OCR texts already used
result = re.sub(r"<img[^>]*>", replace_img, html)
# Any OCR texts that had no matching <img> tag go at the end
for i in range(len(ocr_texts)):
if i not in used:
result += f"<p>{_PLACEHOLDER.format(i)}</p>"
return result, ocr_texts
@@ -0,0 +1,110 @@
"""
OCR Service Layer for MarkItDown
Provides LLM Vision-based image text extraction.
"""
import base64
from typing import Any, BinaryIO
from dataclasses import dataclass
from markitdown import StreamInfo
@dataclass
class OCRResult:
"""Result from OCR extraction."""
text: str
confidence: float | None = None
backend_used: str | None = None
error: str | None = None
class LLMVisionOCRService:
"""OCR service using LLM vision models (OpenAI-compatible)."""
def __init__(
self,
client: Any,
model: str,
default_prompt: str | None = None,
) -> None:
"""
Initialize LLM Vision OCR service.
Args:
client: OpenAI-compatible client
model: Model name (e.g., 'gpt-4o', 'gemini-2.0-flash')
default_prompt: Default prompt for OCR extraction
"""
self.client = client
self.model = model
self.default_prompt = default_prompt or (
"Extract all text from this image. "
"Return ONLY the extracted text, maintaining the original "
"layout and order. Do not add any commentary or description."
)
def extract_text(
self,
image_stream: BinaryIO,
prompt: str | None = None,
stream_info: StreamInfo | None = None,
**kwargs: Any,
) -> OCRResult:
"""Extract text using LLM vision."""
if self.client is None:
return OCRResult(
text="",
backend_used="llm_vision",
error="LLM client not configured",
)
try:
image_stream.seek(0)
content_type: str | None = None
if stream_info:
content_type = stream_info.mimetype
if not content_type:
try:
from PIL import Image
image_stream.seek(0)
img = Image.open(image_stream)
fmt = img.format.lower() if img.format else "png"
content_type = f"image/{fmt}"
except Exception:
content_type = "image/png"
image_stream.seek(0)
base64_image = base64.b64encode(image_stream.read()).decode("utf-8")
data_uri = f"data:{content_type};base64,{base64_image}"
actual_prompt = prompt or self.default_prompt
response = self.client.chat.completions.create(
model=self.model,
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": actual_prompt},
{
"type": "image_url",
"image_url": {"url": data_uri},
},
],
}
],
)
text = response.choices[0].message.content
return OCRResult(
text=text.strip() if text else "",
backend_used="llm_vision",
)
except Exception as e:
return OCRResult(text="", backend_used="llm_vision", error=str(e))
finally:
image_stream.seek(0)
@@ -0,0 +1,422 @@
"""
Enhanced PDF Converter with OCR support for embedded images.
Extracts images from PDFs and performs OCR while maintaining document context.
"""
import io
import sys
from typing import Any, BinaryIO, Optional
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from markitdown._exceptions import (
MissingDependencyException,
MISSING_DEPENDENCY_MESSAGE,
)
from ._ocr_service import LLMVisionOCRService
# Import dependencies
_dependency_exc_info = None
try:
import pdfminer
import pdfminer.high_level
import pdfplumber
from PIL import Image
except ImportError:
_dependency_exc_info = sys.exc_info()
def _extract_images_from_page(page: Any) -> list[dict]:
"""
Extract images from a PDF page by rendering page regions.
Returns:
List of dicts with 'stream', 'bbox', 'name', 'y_pos' keys
"""
images_info = []
try:
# Try multiple methods to detect images
images = []
# Method 1: Use page.images (standard approach)
if hasattr(page, "images") and page.images:
images = page.images
# Method 2: If no images found, try underlying PDF objects
if not images and hasattr(page, "objects") and "image" in page.objects:
images = page.objects.get("image", [])
# Method 3: Try filtering all objects for image types
if not images and hasattr(page, "objects"):
all_objs = page.objects
for obj_type in all_objs.keys():
if "image" in obj_type.lower() or "xobject" in obj_type.lower():
potential_imgs = all_objs.get(obj_type, [])
if potential_imgs:
images = potential_imgs
break
for i, img_dict in enumerate(images):
try:
# Try to get the actual image stream from the PDF
img_stream = None
y_pos = 0
# Method A: If img_dict has 'stream' key, use it directly
if "stream" in img_dict and hasattr(img_dict["stream"], "get_data"):
try:
img_bytes = img_dict["stream"].get_data()
# Try to open as PIL Image to validate/decode
pil_img = Image.open(io.BytesIO(img_bytes))
# Convert to RGB if needed (handle CMYK, etc.)
if pil_img.mode not in ("RGB", "L"):
pil_img = pil_img.convert("RGB")
# Save to stream as PNG
img_stream = io.BytesIO()
pil_img.save(img_stream, format="PNG")
img_stream.seek(0)
y_pos = img_dict.get("top", 0)
except Exception:
pass
# Method B: Fallback to rendering page region
if img_stream is None:
x0 = img_dict.get("x0", 0)
y0 = img_dict.get("top", 0)
x1 = img_dict.get("x1", 0)
y1 = img_dict.get("bottom", 0)
y_pos = y0
# Check if dimensions are valid
if x1 <= x0 or y1 <= y0:
continue
# Use pdfplumber's within_bbox to crop, then render
# This preserves coordinate system correctly
bbox = (x0, y0, x1, y1)
cropped_page = page.within_bbox(bbox)
# Render at 150 DPI (balance between quality and size)
page_img = cropped_page.to_image(resolution=150)
# Save to stream
img_stream = io.BytesIO()
page_img.original.save(img_stream, format="PNG")
img_stream.seek(0)
if img_stream:
images_info.append(
{
"stream": img_stream,
"name": f"page_{page.page_number}_img_{i}",
"y_pos": y_pos,
}
)
except Exception:
continue
except Exception:
pass
return images_info
class PdfConverterWithOCR(DocumentConverter):
"""
Enhanced PDF Converter with OCR support for embedded images.
Maintains document structure while extracting text from images inline.
"""
def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
super().__init__()
self.ocr_service = ocr_service
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension == ".pdf":
return True
if mimetype.startswith("application/pdf") or mimetype.startswith(
"application/x-pdf"
):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".pdf",
feature="pdf",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # type: ignore[union-attr]
# Get OCR service if available (from kwargs or instance)
ocr_service: LLMVisionOCRService | None = (
kwargs.get("ocr_service") or self.ocr_service
)
# Read PDF into BytesIO
file_stream.seek(0)
pdf_bytes = io.BytesIO(file_stream.read())
markdown_content = []
try:
with pdfplumber.open(pdf_bytes) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
markdown_content.append(f"\n## Page {page_num}\n")
# If OCR is enabled, interleave text and images by position
if ocr_service:
images_on_page = self._extract_page_images(pdf_bytes, page_num)
if images_on_page:
# Extract text lines with Y positions
chars = page.chars
if chars:
# Group chars into lines based on Y position
lines_with_y = []
current_line = []
current_y = None
for char in sorted(
chars, key=lambda c: (c["top"], c["x0"])
):
y = char["top"]
if current_y is None:
current_y = y
elif abs(y - current_y) > 2: # New line threshold
if current_line:
text = "".join(
[c["text"] for c in current_line]
)
lines_with_y.append(
{"y": current_y, "text": text.strip()}
)
current_line = []
current_y = y
current_line.append(char)
# Add last line
if current_line:
text = "".join([c["text"] for c in current_line])
lines_with_y.append(
{"y": current_y, "text": text.strip()}
)
else:
# Fallback: use simple text extraction
text_content = page.extract_text() or ""
lines_with_y = [
{"y": i * 10, "text": line}
for i, line in enumerate(text_content.split("\n"))
]
# OCR all images
image_data = []
for img_info in images_on_page:
ocr_result = ocr_service.extract_text(
img_info["stream"]
)
if ocr_result.text.strip():
image_data.append(
{
"y_pos": img_info["y_pos"],
"name": img_info["name"],
"ocr_text": ocr_result.text,
"backend": ocr_result.backend_used,
"type": "image",
}
)
# Add text items
content_items = [
{
"y_pos": item["y"],
"text": item["text"],
"type": "text",
}
for item in lines_with_y
if item["text"]
]
content_items.extend(image_data)
# Sort all items by Y position (top to bottom)
content_items.sort(key=lambda x: x["y_pos"])
# Build markdown by interleaving text and images
for item in content_items:
if item["type"] == "text":
markdown_content.append(item["text"])
else: # image
ocr_text = item["ocr_text"]
img_marker = (
f"\n\n*[Image OCR]\n{ocr_text}\n[End OCR]*\n"
)
markdown_content.append(img_marker)
else:
# No images detected - just extract regular text
text_content = page.extract_text() or ""
if text_content.strip():
markdown_content.append(text_content.strip())
else:
# No OCR, just extract text
text_content = page.extract_text() or ""
if text_content.strip():
markdown_content.append(text_content.strip())
# Build final markdown
markdown = "\n\n".join(markdown_content).strip()
# Fallback to pdfminer if empty
if not markdown:
pdf_bytes.seek(0)
markdown = pdfminer.high_level.extract_text(pdf_bytes)
except Exception:
# Fallback to pdfminer
try:
pdf_bytes.seek(0)
markdown = pdfminer.high_level.extract_text(pdf_bytes)
except Exception:
markdown = ""
# Final fallback: If still empty/whitespace and OCR is available,
# treat as scanned PDF and OCR full pages
if ocr_service and (not markdown or not markdown.strip()):
pdf_bytes.seek(0)
markdown = self._ocr_full_pages(pdf_bytes, ocr_service)
return DocumentConverterResult(markdown=markdown)
def _extract_page_images(self, pdf_bytes: io.BytesIO, page_num: int) -> list[dict]:
"""
Extract images from a PDF page using pdfplumber.
Args:
pdf_bytes: PDF file as BytesIO
page_num: Page number (1-indexed)
Returns:
List of image info dicts with 'stream', 'bbox', 'name', 'y_pos'
"""
images = []
try:
pdf_bytes.seek(0)
with pdfplumber.open(pdf_bytes) as pdf:
if page_num <= len(pdf.pages):
page = pdf.pages[page_num - 1] # 0-indexed
images = _extract_images_from_page(page)
except Exception:
pass
# Sort by vertical position (top to bottom)
images.sort(key=lambda x: x["y_pos"])
return images
def _ocr_full_pages(
self, pdf_bytes: io.BytesIO, ocr_service: LLMVisionOCRService
) -> str:
"""
Fallback for scanned PDFs: Convert entire pages to images and OCR them.
Used when text extraction returns empty/whitespace results.
Args:
pdf_bytes: PDF file as BytesIO
ocr_service: OCR service to use
Returns:
Markdown text extracted from OCR of full pages
"""
markdown_parts = []
try:
pdf_bytes.seek(0)
with pdfplumber.open(pdf_bytes) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
try:
markdown_parts.append(f"\n## Page {page_num}\n")
# Render page to image
page_img = page.to_image(resolution=300)
img_stream = io.BytesIO()
page_img.original.save(img_stream, format="PNG")
img_stream.seek(0)
# Run OCR
ocr_result = ocr_service.extract_text(img_stream)
if ocr_result.text.strip():
text = ocr_result.text.strip()
markdown_parts.append(f"*[Image OCR]\n{text}\n[End OCR]*")
else:
markdown_parts.append(
"*[No text could be extracted from this page]*"
)
except Exception as e:
markdown_parts.append(
f"*[Error processing page {page_num}: {str(e)}]*"
)
continue
except Exception:
# pdfplumber failed (e.g. malformed EOF) — try PyMuPDF for rendering
markdown_parts = []
try:
import fitz # PyMuPDF
pdf_bytes.seek(0)
doc = fitz.open(stream=pdf_bytes.read(), filetype="pdf")
for page_num in range(1, doc.page_count + 1):
try:
markdown_parts.append(f"\n## Page {page_num}\n")
page = doc[page_num - 1]
mat = fitz.Matrix(300 / 72, 300 / 72) # 300 DPI
pix = page.get_pixmap(matrix=mat)
img_stream = io.BytesIO(pix.tobytes("png"))
img_stream.seek(0)
ocr_result = ocr_service.extract_text(img_stream)
if ocr_result.text.strip():
text = ocr_result.text.strip()
markdown_parts.append(f"*[Image OCR]\n{text}\n[End OCR]*")
else:
markdown_parts.append(
"*[No text could be extracted from this page]*"
)
except Exception as e:
markdown_parts.append(
f"*[Error processing page {page_num}: {str(e)}]*"
)
continue
doc.close()
except Exception:
return "*[Error: Could not process scanned PDF]*"
return "\n\n".join(markdown_parts).strip()
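Both rendering paths (pdfplumber and the PyMuPDF fallback) emit the same per-page block format before joining with blank lines. A minimal standalone sketch of that formatting — the helper name is illustrative, not part of the converter:

```python
def format_page_ocr(page_num: int, ocr_text: str) -> str:
    """Wrap one page's OCR output in the unified block format used above."""
    parts = [f"\n## Page {page_num}\n"]
    if ocr_text.strip():
        parts.append(f"*[Image OCR]\n{ocr_text.strip()}\n[End OCR]*")
    else:
        parts.append("*[No text could be extracted from this page]*")
    return "\n\n".join(parts).strip()
```

Keeping the block markers identical across both code paths is what lets the tests assert one expected format regardless of which PDF backend rendered the page.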
@@ -0,0 +1,68 @@
"""
Plugin registration for markitdown-ocr.
Registers OCR-enhanced converters with priority-based replacement strategy.
"""
from typing import Any
from markitdown import MarkItDown
from ._ocr_service import LLMVisionOCRService
from ._pdf_converter_with_ocr import PdfConverterWithOCR
from ._docx_converter_with_ocr import DocxConverterWithOCR
from ._pptx_converter_with_ocr import PptxConverterWithOCR
from ._xlsx_converter_with_ocr import XlsxConverterWithOCR
__plugin_interface_version__ = 1
def register_converters(markitdown: MarkItDown, **kwargs: Any) -> None:
"""
Register OCR-enhanced converters with MarkItDown.
This plugin provides OCR support for PDF, DOCX, PPTX, and XLSX files.
The converters are registered with priority -1.0 to run BEFORE built-in
converters (which have priority 0.0), effectively replacing them when
the plugin is enabled.
Args:
markitdown: MarkItDown instance to register converters with
**kwargs: Additional keyword arguments that may include:
- llm_client: OpenAI-compatible client for LLM-based OCR (required for OCR to work)
- llm_model: Model name (e.g., 'gpt-4o')
- llm_prompt: Custom prompt for text extraction
"""
# Create OCR service — reads the same llm_client/llm_model kwargs
# that MarkItDown itself already accepts for image descriptions
llm_client = kwargs.get("llm_client")
llm_model = kwargs.get("llm_model")
llm_prompt = kwargs.get("llm_prompt")
ocr_service: LLMVisionOCRService | None = None
if llm_client and llm_model:
ocr_service = LLMVisionOCRService(
client=llm_client,
model=llm_model,
default_prompt=llm_prompt,
)
# Register converters with priority -1.0 (before built-ins at 0.0)
# This effectively "replaces" the built-in converters when the plugin is installed
# Pass the OCR service to each converter's constructor
PRIORITY_OCR_ENHANCED = -1.0
markitdown.register_converter(
PdfConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
)
markitdown.register_converter(
DocxConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
)
markitdown.register_converter(
PptxConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
)
markitdown.register_converter(
XlsxConverterWithOCR(ocr_service=ocr_service), priority=PRIORITY_OCR_ENHANCED
)
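The priority mechanism can be modeled in isolation. This is an illustrative sketch, not the real MarkItDown internals: converters with lower priority values are tried first, so registering at -1.0 shadows the built-ins at 0.0 without removing them.

```python
# Toy registry mirroring the priority-based replacement described above.
registry: list[tuple[float, str]] = []

def register(name: str, priority: float = 0.0) -> None:
    registry.append((priority, name))

register("BuiltinPdfConverter", priority=0.0)
register("PdfConverterWithOCR", priority=-1.0)

# Sorting by priority puts the OCR-enhanced converter ahead of the built-in,
# so it gets the first chance to accept() a PDF stream.
order = [name for _, name in sorted(registry, key=lambda item: item[0])]
```

Because the built-ins stay registered, a file the OCR converter declines still falls through to the stock converter.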
@@ -0,0 +1,249 @@
"""
Enhanced PPTX Converter with improved OCR support.
The built-in converter already supports LLM-based image descriptions; this version adds an OCR fallback for extracting embedded text.
"""
import io
import sys
from typing import Any, BinaryIO, Optional
from markitdown.converters import HtmlConverter
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from markitdown._exceptions import (
MissingDependencyException,
MISSING_DEPENDENCY_MESSAGE,
)
from ._ocr_service import LLMVisionOCRService
_dependency_exc_info = None
try:
import pptx
except ImportError:
_dependency_exc_info = sys.exc_info()
class PptxConverterWithOCR(DocumentConverter):
"""Enhanced PPTX Converter with OCR fallback."""
def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
super().__init__()
self._html_converter = HtmlConverter()
self.ocr_service = ocr_service
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension == ".pptx":
return True
if mimetype.startswith(
"application/vnd.openxmlformats-officedocument.presentationml"
):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".pptx",
feature="pptx",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # type: ignore[union-attr]
# Get OCR service (from kwargs or instance)
ocr_service: Optional[LLMVisionOCRService] = (
kwargs.get("ocr_service") or self.ocr_service
)
llm_client = kwargs.get("llm_client")
presentation = pptx.Presentation(file_stream)
md_content = ""
slide_num = 0
for slide in presentation.slides:
slide_num += 1
md_content += f"\n\n<!-- Slide number: {slide_num} -->\n"
title = slide.shapes.title
def get_shape_content(shape, **kwargs):
nonlocal md_content
# Pictures
if self._is_picture(shape):
# Get image data
image_stream = io.BytesIO(shape.image.blob)
# Try LLM description first if available
llm_description = ""
if llm_client and kwargs.get("llm_model"):
try:
from ._llm_caption import llm_caption
image_filename = shape.image.filename
image_extension = None
if image_filename:
import os
image_extension = os.path.splitext(image_filename)[1]
image_stream_info = StreamInfo(
mimetype=shape.image.content_type,
extension=image_extension,
filename=image_filename,
)
llm_description = llm_caption(
image_stream,
image_stream_info,
client=llm_client,
model=kwargs.get("llm_model"),
prompt=kwargs.get("llm_prompt"),
)
except Exception:
pass
# Try OCR if LLM failed or not available
ocr_text = ""
if not llm_description and ocr_service:
try:
image_stream.seek(0)
ocr_result = ocr_service.extract_text(image_stream)
if ocr_result.text.strip():
ocr_text = ocr_result.text.strip()
except Exception:
pass
# Format extracted content using unified OCR block format
content = (llm_description or ocr_text or "").strip()
if content:
md_content += f"\n*[Image OCR]\n{content}\n[End OCR]*\n"
# Tables
if self._is_table(shape):
md_content += self._convert_table_to_markdown(shape.table, **kwargs)
# Charts
if shape.has_chart:
md_content += self._convert_chart_to_markdown(shape.chart)
# Text areas
elif shape.has_text_frame:
if shape == title:
md_content += "# " + shape.text.lstrip() + "\n"
else:
md_content += shape.text + "\n"
# Group Shapes
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
sorted_shapes = sorted(
shape.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for subshape in sorted_shapes:
get_shape_content(subshape, **kwargs)
sorted_shapes = sorted(
slide.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for shape in sorted_shapes:
get_shape_content(shape, **kwargs)
md_content = md_content.strip()
if slide.has_notes_slide:
md_content += "\n\n### Notes:\n"
notes_frame = slide.notes_slide.notes_text_frame
if notes_frame is not None:
md_content += notes_frame.text
md_content = md_content.strip()
return DocumentConverterResult(markdown=md_content.strip())
def _is_picture(self, shape):
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE:
return True
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PLACEHOLDER:
if hasattr(shape, "image"):
return True
return False
def _is_table(self, shape):
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.TABLE:
return True
return False
def _convert_table_to_markdown(self, table, **kwargs):
import html
html_table = "<html><body><table>"
first_row = True
for row in table.rows:
html_table += "<tr>"
for cell in row.cells:
if first_row:
html_table += "<th>" + html.escape(cell.text) + "</th>"
else:
html_table += "<td>" + html.escape(cell.text) + "</td>"
html_table += "</tr>"
first_row = False
html_table += "</table></body></html>"
return (
self._html_converter.convert_string(html_table, **kwargs).markdown.strip()
+ "\n"
)
def _convert_chart_to_markdown(self, chart):
try:
md = "\n\n### Chart"
if chart.has_title:
md += f": {chart.chart_title.text_frame.text}"
md += "\n\n"
data = []
category_names = [c.label for c in chart.plots[0].categories]
series_names = [s.name for s in chart.series]
data.append(["Category"] + series_names)
for idx, category in enumerate(category_names):
row = [category]
for series in chart.series:
row.append(series.values[idx])
data.append(row)
markdown_table = []
for row in data:
markdown_table.append("| " + " | ".join(map(str, row)) + " |")
header = markdown_table[0]
separator = "|" + "|".join(["---"] * len(data[0])) + "|"
return md + "\n".join([header, separator] + markdown_table[1:])
except Exception:
# Covers unsupported plot types and any other chart-reading failure;
# always return a string so the caller's += never receives None.
return "\n\n[unsupported chart]\n\n"
@@ -0,0 +1,225 @@
"""
Enhanced XLSX Converter with OCR support for embedded images.
Extracts images from Excel spreadsheets and performs OCR while maintaining cell context.
"""
import io
import sys
from typing import Any, BinaryIO, Optional
from markitdown.converters import HtmlConverter
from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo
from markitdown._exceptions import (
MissingDependencyException,
MISSING_DEPENDENCY_MESSAGE,
)
from ._ocr_service import LLMVisionOCRService
# Try loading dependencies
_xlsx_dependency_exc_info = None
try:
import pandas as pd
from openpyxl import load_workbook
except ImportError:
_xlsx_dependency_exc_info = sys.exc_info()
class XlsxConverterWithOCR(DocumentConverter):
"""
Enhanced XLSX Converter with OCR support for embedded images.
Extracts images with their cell positions and performs OCR.
"""
def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
super().__init__()
self._html_converter = HtmlConverter()
self.ocr_service = ocr_service
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension == ".xlsx":
return True
if mimetype.startswith(
"application/vnd.openxmlformats-officedocument.spreadsheetml"
):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
if _xlsx_dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".xlsx",
feature="xlsx",
)
) from _xlsx_dependency_exc_info[1].with_traceback(
_xlsx_dependency_exc_info[2]
) # type: ignore[union-attr]
# Get OCR service if available (from kwargs or instance)
ocr_service: Optional[LLMVisionOCRService] = (
kwargs.get("ocr_service") or self.ocr_service
)
if ocr_service:
# Remove ocr_service from kwargs to avoid duplicate argument error
kwargs_without_ocr = {k: v for k, v in kwargs.items() if k != "ocr_service"}
return self._convert_with_ocr(
file_stream, ocr_service, **kwargs_without_ocr
)
else:
return self._convert_standard(file_stream, **kwargs)
def _convert_standard(
self, file_stream: BinaryIO, **kwargs: Any
) -> DocumentConverterResult:
"""Standard conversion without OCR."""
file_stream.seek(0)
sheets = pd.read_excel(file_stream, sheet_name=None, engine="openpyxl")
md_content = ""
for sheet_name in sheets:
md_content += f"## {sheet_name}\n"
html_content = sheets[sheet_name].to_html(index=False)
md_content += (
self._html_converter.convert_string(
html_content, **kwargs
).markdown.strip()
+ "\n\n"
)
return DocumentConverterResult(markdown=md_content.strip())
def _convert_with_ocr(
self, file_stream: BinaryIO, ocr_service: LLMVisionOCRService, **kwargs: Any
) -> DocumentConverterResult:
"""Convert XLSX with image OCR."""
file_stream.seek(0)
wb = load_workbook(file_stream)
md_content = ""
for sheet_name in wb.sheetnames:
sheet = wb[sheet_name]
md_content += f"## {sheet_name}\n\n"
# Convert sheet data to markdown table
file_stream.seek(0)
try:
df = pd.read_excel(
file_stream, sheet_name=sheet_name, engine="openpyxl"
)
html_content = df.to_html(index=False)
md_content += (
self._html_converter.convert_string(
html_content, **kwargs
).markdown.strip()
+ "\n\n"
)
except Exception:
# If pandas fails, just skip the table
pass
# Extract and OCR images in this sheet
images_with_ocr = self._extract_and_ocr_sheet_images(sheet, ocr_service)
if images_with_ocr:
md_content += "### Images in this sheet:\n\n"
for img_info in images_with_ocr:
ocr_text = img_info["ocr_text"]
md_content += f"*[Image OCR]\n{ocr_text}\n[End OCR]*\n\n"
return DocumentConverterResult(markdown=md_content.strip())
def _extract_and_ocr_sheet_images(
self, sheet: Any, ocr_service: LLMVisionOCRService
) -> list[dict]:
"""
Extract and OCR images from an Excel sheet.
Args:
sheet: openpyxl worksheet
ocr_service: OCR service
Returns:
List of dicts with 'cell_ref' and 'ocr_text'
"""
results = []
try:
# Check if sheet has images
if hasattr(sheet, "_images"):
for img in sheet._images:
try:
# Get image data
if hasattr(img, "_data"):
image_data = img._data()
elif hasattr(img, "image"):
# Some versions store it differently
image_data = img.image
else:
continue
# Create image stream
image_stream = io.BytesIO(image_data)
# Get cell reference
cell_ref = "unknown"
if hasattr(img, "anchor"):
anchor = img.anchor
if hasattr(anchor, "_from"):
from_cell = anchor._from
if hasattr(from_cell, "col") and hasattr(
from_cell, "row"
):
# Convert column number to letter
col_letter = self._column_number_to_letter(
from_cell.col
)
cell_ref = f"{col_letter}{from_cell.row + 1}"
# Perform OCR
ocr_result = ocr_service.extract_text(image_stream)
if ocr_result.text.strip():
results.append(
{
"cell_ref": cell_ref,
"ocr_text": ocr_result.text.strip(),
"backend": ocr_result.backend_used,
}
)
except Exception:
continue
except Exception:
pass
return results
@staticmethod
def _column_number_to_letter(n: int) -> str:
"""Convert a 0-indexed column number to its Excel column letter (0 -> A, 26 -> AA)."""
result = ""
n = n + 1 # Make 1-indexed
while n > 0:
n -= 1
result = chr(65 + (n % 26)) + result
n //= 26
return result
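The column-letter helper implements standard bijective base-26 numbering; a standalone copy of the same logic for illustration:

```python
def column_number_to_letter(n: int) -> str:
    """Convert a 0-indexed column number to its Excel letter (0 -> A, 26 -> AA)."""
    result = ""
    n += 1  # bijective base-26 operates on 1-indexed values
    while n > 0:
        n -= 1
        result = chr(65 + (n % 26)) + result
        n //= 26
    return result
```

The `n -= 1` inside the loop is what distinguishes bijective base-26 from ordinary base-26: there is no zero digit, so `Z` (26) is followed by `AA` rather than `A0`.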
@@ -0,0 +1,79 @@
%PDF-1.3
%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 80 /Length 4282 /Subtype /Image
/Type /XObject /Width 400
>>
stream
Gb"/k$+*^]+31jd1_Sc48j,Pi+@:`R01h=9+]FPXQDmE0%*Lb4@[Wi36jU!;cssJbQ5,g%R?K'+$#.h<qu?Z`Dn#2Gqj`$$\bE9$XS)%of4Vd>cT_6mF8#7^^Y_6P]N!%L#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j4Z3bU/Gm9s<T86G'Ht,"?C)(`3j[rI\Y+2%=-fXJAVq>eJc&9=)%mQH&Yh<W#b:QHSf&hc5bR6(7?1RO+2U+,2j=JTJHV/n1*m;JAGbDu[IX.pg30l0S'.0*($<'u;[b/GRDJ[J=c-W0HaX>?jIi$uR^]%u?lJ6Z*VV,Z28T=.3[G"!N]2!6iqW[_CVOQZ9Um#Qd)&t%d!r@Y0=g5[M*c.,qcc"UaVkc?<W;kud8W>KHVDN2.L1\=-s4#kKB4PQPI/e#*[DZR^Y$^Xi6K`(0^so>.#p<K8pN:ein%PZ#Y7*R@MH861DpLo<$rl6(&#6H_8?WBC8!u2*l:OF52PthFZhN<gX0$3^m^nt74tEo^W)lR[oP32CBgdV=rlCn/?MttP@YH\'hL=3DKjK>7hgY.jk*u=UQX(s.K*Fd*WL[eV0^25,*fS$V)Q-Z/Ii`SFMMDe;iM2HMUsg37cCLf8L+b)KY6rDWN#jR5YQ.1R1gba'M2,4kCR4578^=b\Bn#r]R"h8?'u=7,fh/#GD*$m:^@g'YaL,g&OD*(\9V3qM@J4qIp#mRhYee0oG3^KQSpS`k(a)%\\KrFo]NtW+D/curY.W3&31Xd58H(q0_cISK<s@\A@tq[3b*;)prprplo?NNWP"U[H"mT)":p2cCY.e)nGK>lIaB)]QNP9.mWE2l3Yi$/1lIFGq?b"J/%A'=e_4!7DM'qA.H+Eb6+$Wn7]h3.+R^:;FLJDKh6Z0V`KM>R1?7!q\hg`Vs6PqO%XsU&A'e9Y:\qjB8:9X.p&5omkN]TihU"VSM0Gdu%IekLW7Z.T+=gZ8?G+)N`8D1/:))EVV'V%>@o^?^e2`FI#RXkRcVk5<aX)Gb<Anp($Eo(tDRA['co9=J!r?\(k+3obVpgT(rh)[&!*=m"?fb<WboEW]&a*9n`H`'s&IkQkBo"rK"ncnu1$k!hAk*UR=.2DO6/^buFp=jJKa-\QsgioA9CY[S7mOaK!FH$moNAmYB\)0A"En%&3d-/qKIR!Etjt(g9!ZEncRkEWkI'W[)AEhgn9/$+_o'-r&1h\!<m^U0S31\a,_&Rd"r.'Np%gT9f8?\3>1O$"8"QQu<lL!5$/k#`#:KEulTq4-7/YE)\g=Sb$OCcUO)BHRhQu09oJWIZkkosFc_B+&7'c_j!")cW%]@9&7fl>'od45V^,W"[b:mfhGls_c]o[:8?WXO3sS%ABFnD/;VaJp_j5H4#BX4qPO#9Rd2UQp3!11,Hp.<#+W;VjHWUp(CD>tmalGRY\Uq<)U7[bb1;!ICRCfbRQSW`R7B!G^;uZXfo5`5U7D;,E%89G,#:1)%4QDA%S)!5IL>R;C=R4rV;kBDYiJMaSpYGjR@-K1X!l^X,hul.*@fk/SRgZtX.(?1#F1Z,3(>l3p>i<Pf+sbdtG`]h4ZQR\8ke79"3MknReA"?c^RDe@Fjk1.cu5MjEDpdNhJ+7mGf2KIY:S*Y\2Cs77pae4B]4nt\0_9hN"cX*)6Y95CMGu-b.h<l/f)o087<L+.ZYQN\"^nJd(2$BIfm6^Z>hKjO-5DXhZKVbK,R$;#1+h5rhZ?WW+cIfDMR5M:U;WJk^U1M2V=1pp;4^.2,/RU"b7N@$b8R4LOO?H"DR.Lf`L[*m,BTYmDZ_t`L-M$_)#8#p!)[O706GPi_l#Yq>cO^MHRc(Jp:hO`,H*Y]jp")!6$Iu21q$\8nLN&Ju<?TEli:_c^Fu;7mar?jW@Fi5=&+@XX3Du$V
p!Z:kp'-MBe4(Gq5273Z*<l$oQj,ndL:>:,=6H/*LPHo45Js7W8j$_!Qm0FH1P&^"`>@W4%?`Nma<X,sJlXF*,/9?-'cJp]Gl[CD(*jN88AiD,rcf:jl=)$?G1A+QH`L1Y,qGh381N!)?4VfakRqR\de*W_P5=i_rQ88,Nf"08ju'!L3:gtBn9tR`<1O'UehuL-ao(I9mcdD[iu:\EjK;,iTiXhVd0(hgkW_rte\s*ID1Wu(.MjQ`-_-:KRW+1tA<S?3r8>E^)_qfq#N;4tr+%k&Ep8k#92@_4?NnV=N@"8F,!hg:if"abZSI*B&dFMB&j8pk=5i_MJAeY/_a-bBH!b7VKr\Kt#C"Ke<_A>`"`=AC>VJ=jpNj/XAJ.8N&11/:hfIr$D^^R2#qRLKK:(9GU8"CB@_;$5Fq-q:K0TBPN]^2`GM'aEs1Y+T=D'>N2JXWoc8.%IYO^gsm'1RJSeGm+YDRQhLku5aKi&&h'k:Ae':8oK<la[fL\k0;fH3(LIfJts]t<l4*,ri:knWWe!M_E[M,&V9JH2`"=)ml_1[8!OOU7V,rHd]X#^@U_hK>1_Fu*NH]a>^r>**\J#14;Ei@8Dd[B!VZ.j64i(icM@UQ_>]1i+QL[q8@sXNl,qq<0pH2r<c]E5`R>K@bgt+3u4X[=5N,XXpe$Pa+h/i2Ns+!9@kBH_P,uQG__S.W7M^frRPr4EZHW;p0Je?#:'3`%IWs^jMgsS>TFs]-96.iKS'H_`---RRk+q]Jr]FS(In4Pq-F!6Cm%,U[%@0.OI2<<)q%YS\L]"SQrA8jisi-Yc]j.NcUR5eZO4@bV<6:Q<7Y8Tbc.:)0RB[f;uae0#hXi-F,V+Y7!Mj#7a2'<d>UX7up@?R.l5hdJ`J2qIRW9l3nLb6mCBmOi<W\odW.t='rA%`7sRbXB5/RD_LA/<@gLr;i'i3jlV::Z3F&:]ir"sAd&Y6P"h>gnWA-O?D51eitk>F^2j&Iq6CcN2Ju0jXH_V;7Z"7$/f.cVY>Mu"+'&]*\$$EFH_au5?=QCNV/dCcC.k5.]`boT#$n8q"$7k7cbB=_S?6!sI(ERNS%rY/q#(V&?"M=dPp_pD^a<mS>iJ84-qUUOnpsEBD(@=c8&j(fD<_iW8Y:1]3'Z*lk814$BMEn>20Z3q9`%2[odf^kVG8_KfBHJTq!iP-bZf!WUjfi-Z(mjNh$1Mk%I4bUXT_KbmDgHtQ7Z[%/U=`ol;d(+7MLe^9J.%pG(>?Z7,R7p2_!_Qbsrj-nZ^jp1P_<pXaK9!2C;b6ck"=Cj_ThjdJFfoo]T[$FN[^)H53%_>QETt1O4#d3il&h>-]FM?E7.3BltXHM-bbl_r^C;;uMGdf.Kh%L(0?a^%V$SMIKn-g/OBA,Ng_8qOt.G4*;07b-d&^'[LU$f5ngd%r-XNimO'c=1SVor0:Eg?<1-k=*lR5.^@!L"%EH/XBn&hq=*'_o;%t#(A>I6JN':Wh5=&pRCU'1C"15l6HQiH<#l)E>c9A33g31NEH\$h]o'o:W53E#msr(FBMb0g*jP1nCIbQ^<-?M19Kr3mq8.j:>;q*:p4Rb"@"DU#`i.DU&`=Vn-ANGOK'T46_'jF^$R0`j>ib(E*\_<8o*cItM:B3D-9Z>Of29HcT0]Z'G'co.PNW`2:qpYXp0-36TIRP-&3V+PPe>^kkuHt*7[/f`Z?74q^`DXV.TS7@]I@7J7#?[&(&hPL%\629`r50o^;oKq?P9#!l9@Fff9p3njK2nUHBg!&A`c[uXD61%4M,a"/_P#gZUo)#L[uI,Q>:BQkk3P?Scmo]DXk])TLK"NX2u"><@[CElgT\uF2.fcn<iiPL)@TrV2\AYDo>2%@`(OZ<M6L#'7K_ZStJZ)]&Fp39s]tR`?'J?rE-I11YEH*I?3FE.#D8]B:lU#l-Q&"X6RDb@GL2>K[lYeY=buQU?HWK8[#]q-;`G![Hb_HQO`H8MPPH7g](3ooEl5/4f<4*3K2RX<59b07B^anj+,ZW
=XiC0F_c7bFM+n7)(]:<Ao2d9eEHWd<E81SK4JM$5dbM`T<KnN,YjlTi4>kV>d%&i?1&P=:i,4>V2MnI*kV+_s8='X"H,gcL;Uo:%-"-M]-mmX/gFJ;bSiNq;:Y3_r5g<a"7!Y]Bk,;T:p3c2CBn/b6lYENkm?LZ[fW1tg10cT`#9kR&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j46GWU$GHRA>~>endstream
endobj
4 0 obj
<<
/Contents 8 0 R /MediaBox [ 0 0 612 792 ] /Parent 7 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.0315aed9f6006a101b3226a3b7404028 3 0 R
>>
>> /Rotate 0 /Trans <<
>>
/Type /Page
>>
endobj
5 0 obj
<<
/PageMode /UseNone /Pages 7 0 R /Type /Catalog
>>
endobj
6 0 obj
<<
/Author (anonymous) /CreationDate (D:20260126172022+01'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20260126172022+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (unspecified) /Title (untitled) /Trapped /False
>>
endobj
7 0 obj
<<
/Count 1 /Kids [ 4 0 R ] /Type /Pages
>>
endobj
8 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 260
>>
stream
Gas3/9kseb&-h'>I`6Z84fgHmCc;"L7g6_e&889#h,kA$Zt,m0Hdcho6>O[sLZ+YF+:QDRLY`5CAhdUI=MeslW_fp84Bms2r(UspMdQW.jtWA9rW?q[M1*5b[XIYc1kOQ$55sEf7La^q2$a/'T.)S#<V#*e,['$SVK^(f9:,Nq;AW\a?Zt7p:RM+pHF)-4F;E;l5ui'$5;T>HA_.,@?H2a/)Ol=NY+4r->>:n6'/ubPg6GC78<Gb)GJls9>QKuE<U0~>endstream
endobj
xref
0 9
0000000000 65535 f
0000000073 00000 n
0000000104 00000 n
0000000211 00000 n
0000004683 00000 n
0000004939 00000 n
0000005007 00000 n
0000005303 00000 n
0000005362 00000 n
trailer
<<
/ID
[<5d5eceaa0d906ef66e559ebcd616f18d><5d5eceaa0d906ef66e559ebcd616f18d>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)
/Info 6 0 R
/Root 5 0 R
/Size 9
>>
startxref
5712
%%EOF
@@ -0,0 +1,79 @@
%PDF-1.3
%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 100 /Length 4720 /Subtype /Image
/Type /XObject /Width 500
>>
stream
Gb"0VH#OJ:qoA4A'Hn[Z$4u82K`j4ZR%PRX#F(qaK&V?K8<b&(;#dar'M8Oni!CfV<*>q$(4m``.P7J2`EVk#7#VtC6:(*s$)76DC7esMC3FfEP21e=J%f9>Up;d>4l&7WcZJp*DIk*ozzzzzzzzzzzzzzzzzzzzzzz!!#9dqY/lsQRl8pCtPt8mFiS/o[0Y;WL90Bj2R)5Y[N/Gfn0f!(^fES:Hsio97D?(Xn?5jq!"]KU?6X;&P"ZofW]f$p(q"VdAg3IP#\)UWg`=MO$9TD0>@3js(id(lnKeb'nVdJ7e;beRl:iu3cs8nIEn"+F\0s:]mE81*hAmoIah4b[;OfHn`%MZXAic<H:=+SS^nBEXOMIOI2D3Y<1k'jGjr"Mb7m#8djq#sB[J"W0CSVhD_EOgna,8=^]'+;Q/-Q29o7K`^M3`IrI+P7M:JgD-;8@g?QuaS(L!S'NU#&pVmQm&;+ooep7=EoT&nD(?bbsoCn2i[<[UJujmhj_M<u#m'ah(clITBmFV`P'C`bB@K_BHIO[n\%kN(^*?b\d]&LhJ.U*@&CM]/WZY.jbti9?:_k*Y>(J)7jH@XF(Q5('jYY[u"DM\[nu[VacM!sePdg%40X+-%@'f%P<0baG`E4oP$%Q/JgWmY\Um466eV$?Y)Q9)nh\c]Mq3f`',ShaFWth-oBcOd=q3cTY"+4B:C5D=,9ZUiS/hg5kVA4*K+[1fUQ5bbN_se_nC2++F"$]j*.j.s0:>;7?2OB:nigYB[?_a,Ulb<cmfe?*!8BYQ*mgYI_'G:#:9hc!%'nIu9\#K.OFt>Aq1i\Q1k(X_QRsW=DbSc%F'bgtZ/2>e+n:Kbn'oJ*sr;^;r/1Z+Vo?T!W4\7L>qdS!IH-WeB#r>ch5>%TRL&'caJA=^o?na4Y*tXgMf5H"N01jlPU;HhZ+Fp?gW#Ip5H[Y;u'ao8X`t6%]C0Gr*LD?+V"4C6Y<]^1GKRW1+$NV&M@2Zk6GD=d]J*qS.+7cN!h6O!e)cfIf(11iD*Y7*8G.!Fu"f5Q5o`Fk=$<gK"E?j,ZG(Q<S6H:\dHiQ8`a=>Zb+\X]m_AEjKB&FH%hV\FAFmKDpQr8j8Fc:%F6j4nD>GFYUI\FW`_flD4$L7>hqmCTh$Uf"^_*P&AuHA>O*GCsJP2O$EX=EQ9*O\8c#4&JZa4ARj7@85!.BB?cmQFmIWVr;9Tt>%neNS8ud$:Htthg-nk9OnPg[;$TI`>0jlL=-.a+_hJV!J^kl(gQd$+PUS\<me#jih7@a_WGZZ(.4K"nV+[.iR&jScFk0\%2K,Zhh0o%R)Nq/WF?l+\*Dd3krNhA#g[-\oC]W"'hnHd4_hLeQjHEBn&n644.4m-ZIetY!]Maa6"3/b.DR^j3BPL<$4Z,)s+'5XPm7A'P[OZn]2^N_0O[lEQSs)o18I6(3r!Qc+D!gf8aN/&O]Qq2:ob7jW;;79<m7q;t-1bA`T7?icP9s!jiRtHa:-1$`1XkWli8@t0Uu_-o6OtUq>"[U''if?a#-Wq5c7%ag\H8!"ALF'oU?-RQD7B?-GJ]">g6^'>CFf@5s8D[^<m$#K27ioe<`YMIH[l(oGML?\W`P:J[(9Uijd"PR!k=->Y$F-4CB"/,6\Z#s5Dlg2HM"F7VJkA+mbDo1>N-;k32'EW?6)(KY`B.a^\mY\PAK*gH+(7uU^F-po`(RMK06EPHHdD0;Bn\le,c[Y^V3lFa4(IC]mFtt!K@dP<o88m]iqJ+@)2E/k&hb!@XF*Gief89[^t6$$47L+p[?u]I+odK?/mT/=%`2+)fO@AogV9q*r!U1m1g?N]&Ti<e"o\RXmOcGj);^2<k\(R>\obnltl>ED4q/%Sh>o`U)Q4>YWf)aUl].\e=Ep??@2&sT>Dj(+.kQ]\905L.Bs_jQMg@#5Ad*SY8q<&o<LoLF#$U&]1eeYfg_3L=pCsBn2Zo8@8IZ+Xg>-8CKR8V]"=q/;I!IC.2@XQ-aimNpYWG+0>@4U4t=ghn%S,KY6
Ikm?-DX7<>iL_CI@3\nVAbo+bpOJCAW-_HQp]RX&=4gH(-^/Z@s5UCp8LSn\c))=o$*Q*<M^D^#4JMJtt=95Q%_uUo1-F7q-h)d`H"f/Jt%*3B9)\WKo2Erql0!qemE![bE'dH=+t)O0/c#?9IC9XEPfCe/A/_qsP1I:Gi93mAc2E?t738#p#IUW/S7Mm!5=<`aIfEM<Ys2>e&.Y0ZhJX-aq'tbOsIoYE.k=J%d;VB:aB<bI_rblEfBj^p1R=K*Hi'nV;J/+I,YT[]<3iSq0j"_fD5.GHQ9sRi?Hm3c3S-p!79pR,Q`;q!mC0XK\qU5$G/?oAYN4XKK9![O9M9;(JK$g@P<;orgiE)Wd/_e6&iR=?W_Xldsm><h,SP+L,3L,ntesb+_2WpLI=+=Q*1F"S(>qmjXiPmbHLOYe%@qT_T#OK#H+:rVJ+)kHmJLjHDqC"3ei1+I0_7b&fC4SN?G-:Hn<P`F4_m)X;X7>C^ac96r3OHOcmBeK9_BW&gqhjl7$/j4:&TqtBm]_@&#AP,Y']*Eo)'ONPAD?/7inL-[;Y?u405b]Ka3.k@s]eE(j,^\#rI[BQU.aF>#4B$F4/opmSM*F5.or8NVf4NVD[_hmc;1iLl9739e\*dBrn2.Z2*I#rOpSJRG>K>mOaX&`@A]5&)7CQehdOsNbC>B_D5cTCSXB*di92jW_WX!=EhTKNR%fVt.%Q<$j[iSM!0d6/k`CY(3)+XeI\r:.i,Kg1O$IGhnlT&gm[L0VDFcUFbqACJhEm'4S\ChfI]q:mQ"ZL[OBmJ_6*\'6h<$$-W"[^F9VSm/"@Ys!-Y/kBOeN9p]O$ui,lR;]Y'fWi?-gh&>)baIM5;iRQYFQq5Me#,uCc8P?h`0K;LQ;G)X')<<ZlIDq&LLPU^bo=&gcErEUfq:W`I+ft!GF'4,DQL*1IX_:-FmGd!pPJ9q(G?8P`t=nl4**0bl/9C1$Pk;?WMl,4lD^\UVMQ6bn$qBfs80*^[?ET$I!^-aH!XgKQFCSW7VR7m>c6G0oT/C6@E@rs_sQ]md5bQ1:uFPQQE5J6oF@\//u>D@rl6i0Db`9"CffLREne*h9ea#&lL)T6cQ+8d[]`uKp7-3LZ,`(u?(+fr>(mI*G/\k-YJ:uXP.0=t4*2mZ-eQ't.M\@&WZ\RXWp+0AS>cW33cqTe`:fXp?:r\D9j`52V-"$nNukEh^\mZGUTX8#HKq+^Sc;g;39(Dp^!D$\>%6A^eKD`,3r`Ehh.Y<QB$Hd6Dn\4V,K$O2eQ#]H'IHuY<'.PWh7M8sFIp`W<?e^(pf-sk`:cWX(0Rr5S=GEL-Ru1K?[mL]^3qom@^04`'ab3DaOa@KMi0rX@XE^O>K:6c7M&12fk$N'7q-hiZ)75?UbZH"N)8kd3WGbMX"P/.K^RR%.rqaT@h#u_AEs1)jILMOBks:._,[m*eQ/HMh/2T8\^k5p-Gbk4:UO]EUnspPj)`O0k>MR,M8scu:M"<+[OYCfCtV].VbWfJfu8q0hWVoO!s]=g_kY<KG3'BXc*o(K]QH6CDr&"TSejAI&r>p4`r`?FMo`BP7H7X"Km<=Xfhj^&%sjRIEf&?W*]uFIg@FfT]->7T*G\=GA%PJ<HdnOVSmLf.+KHmSZ$jZQ*BecC<(3<9grri,I2*)b1:SDnGU+d]Xg6MY8$T(:.4?Uk`u[BiGe0Oe2f<H[Ue1=Kh4riO*S$)8!@o+s?;W"![]<f%`Y5^Zc;'ok\LTFOfW\3I*DUf6[:**:Q92N&d_']\[d1Hqldmd(IV"Q.V:G^=4P&F/.]W6G(>cB1O#e,SV5<H8f\>>$gU<+7PBoDYDti\Up=RQ$_FGu!./Yu`]tGE[]1Xflr3@X<>TH,QPDG[k\(f5EN#4:dBj9Dtm'hE@kFId$O4XJRs#<fi\gX(P(\HEsYBB9>>%h8.(c#WXs*h$)D\#t'aEg:?XOs\SA`#9^5CU9:p<JsU>L#D+>hdhTPki]s+6c*j=3>f=V>+D"=D5gHfUbY*f!
X/5kZq(aU0s.TSSbiDpX_$Rm57L'='`/+(t;Rbo#i[rD<hl-MMd9XiU7<R]U8H\\)5nGm!GTqIWJoT^k&3K2m'gnqWfVra4/mgQ`:tY5PX.=H\ipm,paod-T=!9ie-6^rsOM%b,=gWO8LhMekGg^s513Ue4#ZT>@pYrm#1Im](Qt<AVg4pp1hYAJ<c+q=&_b]DnkJ,HRr<'>$A+9]t/CS)@C[llo,Jtei=]'$rSMO^!toPHeWb/:-J8LrU&LWJ)\^W(Lh_BC(Sq=]a:sW(9CiU>+L>jbY2:SMO\P;Zl(Q*_#4$"rP+Wa'D/kYlP9isqjdB"Z4^4CMs\(rfP=ldGLb]=a4-[40"Pb'F3QT9K-?3m2,`JlHL%\1Ij*8m=o$^)IJ``GG>8o)=:i+tB%*VOA&aJ4rTY`4<mp,AAS#lUS&fu(ONL&D/#q[E-aS3rEpZinTItFX7`LZA;mpPt<aK*M4`L.JdGj.pL%3[B<0^9=Js@ifcC6aG'JXr;^#lG4Z!G16L%!Kgc\r_t,%&S@[KH;Sbh`b-,X:&f!<?*P^juT/EcNB$W&AtP0_oZ%$360Lace*-_EY$)IJ\1l;I3ZnIJS%;Cu2i#mbPJcDt*f-eZa,X:4"Ls6%]AE=]p#qGtjbd%>DQu\n93U_d,;'5W.rd^OPtDft"Z(lDUVVUibkLA^[AGcHdpAzzzzzzzzzzzzzzzzzz!.Z!\J#>u+ci~>endstream
endobj
4 0 obj
<<
/Contents 8 0 R /MediaBox [ 0 0 612 792 ] /Parent 7 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.41b05a9cf8679f0fe6e7c30c9462b767 3 0 R
>>
>> /Rotate 0 /Trans <<
>>
/Type /Page
>>
endobj
5 0 obj
<<
/PageMode /UseNone /Pages 7 0 R /Type /Catalog
>>
endobj
6 0 obj
<<
/Author (anonymous) /CreationDate (D:20260126172022+01'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20260126172022+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (unspecified) /Title (untitled) /Trapped /False
>>
endobj
7 0 obj
<<
/Count 1 /Kids [ 4 0 R ] /Type /Pages
>>
endobj
8 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 250
>>
stream
Gas2BZ&Z[T&4Ckp`KUTrY_02PMb#<CFN=Wfj',kM@19sp55uUe"pptDD)Los"F*-#r%7t"K39EA8f/'^$OO.*D:jQe'n<f:3Cq8'p9Rm8qll,u+[sQj[W6hrFQL%\7G?"sX/%4LXYeUkIBuT`A)Y3?=ouE3GIShId3E("2qqVte.E2,r_bJ%q1G(F,@9C<XiC-L`O1W5it(MP9X]^nj..r=,_#ecrj!ceT&ATWd4)p.7/d!C@/gP%;p#~>endstream
endobj
xref
0 9
0000000000 65535 f
0000000073 00000 n
0000000104 00000 n
0000000211 00000 n
0000005122 00000 n
0000005378 00000 n
0000005446 00000 n
0000005742 00000 n
0000005801 00000 n
trailer
<<
/ID
[<38bd217c814ddf937f148e537dce51f8><38bd217c814ddf937f148e537dce51f8>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)
/Info 6 0 R
/Root 5 0 R
/Size 9
>>
startxref
6141
%%EOF
@@ -0,0 +1,139 @@
%PDF-1.3
%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 80 /Length 3374 /Subtype /Image
/Type /XObject /Width 400
>>
stream
Gb"/kH&U9Q'ZR>43%?O'&jT85=3qe+N1f0n5S02hEQ*VDL]ZS)-n8)hB\WWtW#6!n.\\/*',>!9/rH$h#V%%<D1W$*'jLaXkBH(D['+neI4_Y@WVLF\J+T;ghR7Y(mQ%b#VGJrM63n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&/S]&BhHiT>FDK@e$Z6UXYgOsk+@,I0-;pc?Ln*mjmTTlQ+?K]e$#D.eB)O5NG7.u<*,RJ_p,8ck3p(&R=G"JnR/3XR:fF1m&!LSDdTjBc:RULFd1\@RVR9Y?IMP'Ajfu#dnfOdRMIRU5?(oCkRY2lY[PsI[bY!Lo7-qefWg9>U_8tF1IHqd?[hMs%mDBpEmdN`p:`8WC.(r[:;%"2F3\c4Mk792G,A7iab,>+^X:Q1B)Ct4JU?`lHM9=Q*\(H$'0?&(T*=du53O+iUIN?O=m=In-UqEE?ghWKl;bQuY*`RG[FO81(L.M1]2c_&%<`*Rq.JU[hs+4A7O4D[l.+Bm#OHs>Bg2RPN#ZQX55r(P>4!naFjcC+c`URpH'c],PFQUH`f2c]IHB5B)pF^[%Q[/+G3L2gM0I)]FA`$/.TW]/!sdN]/(^i(\!C)"CY$!K4SDm&fpLsK;:U@[oi*BN6O,DaRecA5ZZ2bqqSNigQne<3\(m/F9af2d;H'i,rO4%j7$89^HKB/%EH:d:UE-7IC?3kP08QNWFSkIn&tXU0hVtf\g_heWn%H2@\EYR$R+7l,it!qkZIsA%.8%:<b4Q+4o7Re<@/=uc<D?1CkG#t*Sj+#k=3?C8<eoR]gMmuDKs]#U,%B6^\'Q>l.UaQb$n85YGMOQX1#>:k7>o*u_[[jr:Hg5Je^Y:RKsDj_5eqt.H?@qPba^,b_QC<DQE1;H6P%jjc90Rgp0)'S2.M@\$lIp4C<@5NV3A+Sq=@L,V9GDV`O9L@%f5,%,!P;IfdpJOYbPun*G`61QOD3'0Y6GmF^2XsRCVfROl*/ge#f)W1s!?*VEG*0."1`MfU_Qj-_HcIc]o.$p?:mMIQ<MK^AHukRr/l=E7?<+:@V_=m:@ob4Q**VPHYWj[Zo=CRr=V!*BT#C(LJ_:?.qW/l#KCp@t8d\[H2o7Bt=PmC=",eKi0L-.*!_c/%pOEK43<A[3H$o[#=noHdqYr^oBKc5fjO>LsH:W;>mQ8`hXnJRbG389AnU"B(jou524Uh#I;&<U<No:XESS[)bnZpSA"f;hd8gI+P]H$/\(H<O[nan=8V^RLa-H@:Xf+/INI%?Zd4q"#jj>q*.4u=YF[nrO\>7O#of";9>"U0op9fI&cNL!f<:OL7XG#UCV^-W3n&X!D@elVYI"hK-(KVsoa8aWL<4F`IGb_t#tRF<8C]_m^L^GO6G3G5S09nd$$CBb>0uZ(bhmeXG1t'(r;5s6L_nY]J6S``aIm.70L=toNMorm#gcR5B07$ZWs&l!IuphJhN(6@caN->pY971W`M`H*G1Tioii3W&eZ2TZB^Na&P9Djoa9RTg9i]Nf@JZ^+U;<o?jn>70)4?,Z*\6\f"L%[`BJ6Kj9)%X!(RqAJ!s1(OmA61d_0J*HI^@ba8Pi</l@aVqWUPab%]CO5^)(hH3sGpX-^oTZd5(_lb]&C\k)B3IsfnOdI_nbfq2>Pl-9KMgjahN*A`OY%3@&b90[&bRjRh=*NU*X?K'!LVo@=gF%20@?O=gnO^q+Jr=$XI<l#nIRPZI$e8J>159G:[l)_5.-%,a+qd/.CDY\U4-ibD!])@[SqAD!$G2,qi7HmfIX"M?gq4^a:eU]L7T>7]\Ni+_8Hg7U"jUaRtKJ]bt'P<e&53@l:okMAKlUSY_?MI_!>3`l9op5MT]g<,Eau4CBfS9qgs8'hVO^q,7ISl%l9Re9VphQHBQpHf7;ouN+#41(!]Bpq.
a"s<1RM\\qcK'V\`Q!N1@Y46f"8:opL%80Oi]-T\^Ju'R*bNtSnP$N;[5T*[5i*NZi1b"eWK@pn&8g.BWL$qSS5iR0*5>0K<j*&mC2tplBiAcJVg9(]ip]PZ1oSVKVeSV_/S3Pg9=ab"pNZ5V'2SCk-Vb@KNuhlT6mp3R?8XX`:Q.L9H,UNhF3YacG3W(VX0*uZ8>4e?>K_G;Rn$CUR?oeU93At-Arf=,=bsA0p$p(CN!F<.%bX@`pKfj\]c&XOS1!:`do;;tZ6cVZD;+'sUf$Bt,Q7PD^1rYmq+$QG]u!IsY+87o.os_h#/VHUYNfDXl;b!f:?gYC^>D*J>\&T\cBJ-sI_$N&=^rPk>LSVt;>P.46N:frEgm*GGj4=<qT,Y(1Zc+Z]h7%8,[8^^eLRgoB$=^7=$"Xl3[>=b41/JMmG;,"j2T%QRo@!%UIU4G6f=MZjM:Z2<iT63Xu]!qai:DQ:RWOH]Hn]!\#0inkU)(>L9M]+g80^DM^g"!X,S'0pD;8jH/WctXls@'Zr*MP]h7%8,[8^^e]B13k1X#5g%F[m.h'jn0tqYq>l-H\%mM%:Cju%L]qAkq0pc,X(bDM0Q0YGKF?F$>3]@tZ>A#n)hKGBreCD[e6OjG&;>-_QZ2iafV]GHA*H^f/E.J;%Om]_HNhJJ%MZ3l1UbMX*]CbunmMSLel=3ef=.@Tn,XYJoeZ)Vm1L<aCJD&:+Ijur)RA66H:%K_'"UbD-=0FNC4t9b>F/i:Y!Za@[WkasO461hJ%d@!MX3Q&p.'mXeE"=VfNAkB83#nEb-?5c%',NbTmp@Ic`7),9GbJp)qSmsX%1!Y73W6=/j,Gsc`f7.5:+Yeeq`Ai+!fL?NOM]56n(<,D$J18;P:'SXB'[uZ2^6A$<-j66f6Jk.&&c3W_A9(cRi[QK19NKCB\1b$_[lLN+.MEm1P"`E)t++5a[#+RZP^-(?@hY,mDDVfkSlm8e\\@>q0l2NF1QP^9\%[*ac\o<7!Y^`e%EF9dL)U==0$Z?$N0G;m+M)j+f_J:*uda;lh'4kU0Eu=[1gfDqKnase*WU=mVh!(:]"OUWP//]CqZjE&P4=Fd]4EP,j2"jQH?"R,$k%R$h6Q3^%g%3\k2'Q2t#0eqVHl3N]f0YdOSV61-pC:UfT.\lB:IuJp7hZ'6FlP!LEm!e%`Z.s*j]KJu),bq<-MI+2hQBC-aRH=2j-B+b#)kL'"'pn0^RRn5.!9II/5DmNHR"$lgsSKY-\*\*MgP]XVFu.o`0C'fI6ZKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&<`?/!KNiek5~>endstream
endobj
4 0 obj
<<
/Contents 12 0 R /MediaBox [ 0 0 612 792 ] /Parent 11 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.b6d21e33426b982eedc18c3d4e93428b 3 0 R
>>
>> /Rotate 0 /Trans <<
>>
/Type /Page
>>
endobj
5 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 80 /Length 3746 /Subtype /Image
/Type /XObject /Width 400
>>
stream
Gb"/kGE<NX(WT`Jj\uYAo)R):e01]`";Q!n&d*n+.):?M$;N-L92SQib[Ni],#Y)4BR+![6n$6QZkR#<71h=R.UYRO+N",lDr8pJ4*g7;J)9%2]0:`2oQ9iO]^,RBHegiOad<J[KFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgJmN6a^Irj:'BVG/#9rV!+sehf4Np$4uGT6?Z%jn76cYI/bg\b0"PYFheo17N)h[bT;Qm:q@c230t>rr(GQpkKpmDTn_bcA]O+Jd#cU@+2Zm>]f;6bs;T&C"&dtb&Y8uEeu<LE-(N92GKeb>4LdJj]\*:qP]'A(VpnpR/2-8pYO1I<E>P5N\HYDRFS?rdSVTS(Rm^Cb\t8pp[.ji7rj(S`LM@bl-r9u\(u5id79+1Zkc$%??stuVZtblhmB,pZt^mu0`%Ls1i]8CHul4%*Hj.6mVDOM9@Q>X0"[KH505<^W'N@NK#fUn2VXT_I7/`G*I"#V]/HO(ZsFbo9PDDePMK]!H;tTT$fdk/bb^Xe<o%RJ;c=p@KgDW9>;u04LW/P]CXtHuJmX$+n(Vc<?4@i#F^)<je+N!;?@B4`(2I>\^&%Tk]_oPE2P5G57Z;<354V60?+`jnL(BV81&C9p'qn^>huY?a*aVo\^AQF(LMnkjY1\5I3=GilDM^Mf#@3^HP(^=%G)"Pl2nSAIgiK?0>KOPWqO$PVHF<:_o#LfLahWdh*9(3^m/EKhl,$Q3cLgQYNON\9m^^C9op:#?rd<2$Vk!&)d<th.i@gbDG"YOud]6@72asongem@R4YF`QZs^cCb&Z^>EY^Bos&AIDEpB'*`7%OW#]/JaVk$ICr,?V+Pq1smN.e[Kbq-r/47c-[@E9!u4pDp'F`gCN0YN(T*.>3lQo[*tf$^DS35K&9pYYmC(C!8KWI=YO-Yh<iYq"0n-PcX/R22T"lAiUQ?;[;g"V[q<\)6WG:J^rp+%ZAH>K@cWZ,bpLf<4]8ob:WF?31lf84UU8ba9QV_YEY=:-f*?M'pH:5PUm1J)N_lEV.S5l4J>"ICf>9o#Q>b\(rC/e:S+@s,o'A+RfKf[#p[C^,r^GPUR4gNZ=J]6kHiV:DYI45Dd,\<-li[]NSt[l+63)OsT7L1W3eYFAoO;cK:jZ>hAN$F,_R#nIAbhgQ+SUFQjta@9[u&(LAMenD)m[`R55a*Y3l"))/k=oMU)6$+i]jVa*618S=TZ_GFf=XBs_%K:K'Fo]DciT&f4e(<2n?Vtb+e^g%Nu,1VYfeOc,e:Kg5KNLk9Rcmq(6W8>+.5UZlp$,%&K@JAXf9t/kp;B>ou]%HuU9<jA3FHc!7?g+9s6HK-!P:8IcUIVq-FJPOLMNgE:%ADX":FE$rF][b6ElT3'f8;d%Jo:3eXU\iUJe7VBlWsY`p!Z_)<M"V>Q7>RO1)%6mY+XGj^Cfi\llJ`iigb(cAMjb)$@^lQ9+"%O3RN/\G-)Fg+T0@+7sBYYOlju6E^l%O_dec#[W(oiP)i[<g)DQB?>1+.GAD<*#ee)n]ZlQc:X6"mf=0EeR3-\Rc-pb@okO8@.<a?P.;n:M[q+PD?$2E+e8+3J=j@AkSTd+T3ms/eoJ)7>\^lM5KERoAErtC0R56,oZN&WoDC3T.FfcQ):>]e:aVd1keQ_tITI3"f#+HG?-@C\E:ct,l:h<Cp?GYt&rHEP%f@DuqiK3-OdO5;eandB'^*u(E>'Y7/kYTAclDW&K5RRRs"3=c::U8&qE3Bk7N4522.XFX-a_8A&BTV-MqW1^SOa5rC:q\>me%lnUEUg#`JHMa+Iu3uP#:.&O#=ggsUgq2ho1`OG(%W"^SARTTWNR+&lC)M$Q`-sKP).<Rn<-H*Y^_.@YouJ.NulS]BgMA@p*ha_HBliRAPSEe%/q,5$rS?H&(.ebb:^u]
K\M$!o#]`(^ABPX>'>!(C!`PX,)AUtl)+6md<^LC%`&<0kH\Z:!VH;uD<4`a?Bq\X!.ko\0k6B4?aIal%rsfT"YPIS4"n?"LH<iq4aDoZS1+2c#!%I<oORlE.6243F,5XrA7$Vp<=5I%YtpMPkuDakPrW:M755EjC1M+ASW1"l%6ssh".hAcrM+K$.)r`aK*P&Hs.u$/ckS5US2A?-\M(ZV;>Fn=!Y?$@qsJM8hgJQ9F[&m!?Bq\X'Z@O/_1jZPJh0WIlPaMENKSCZH_qf=)`E/ta?<^&&/;eKN[POQ>c>>*@<Qcq<skZL!t1i)YtpMPkuDakPrZ,@m=)4N13gIaPe1(Bgc3F?fe]L"FH<.+3eU/;OU:9l)jAj1eZ7B0jUb:P*^U'nk0B7LJU1=RVTVD0$LIB(o)AMa*Y*&F=l%,@f3RrO7lp:h;as9gS_OV%PBlb1<mC&X1YMQ7Z;Q.M?N'DLPOF@X4;:2e@\4k)e#VPa.Wa&'e[fnk(@eW9s8Hp3(LD&9UR,/A3p9VHP#n-p/dS?.Tbjb2E/1mX<eeWghoetsFg_!8\hVa==/BRk&$!s#r7l.`*`fG.c&hEW(G1egFu0Agn(Uo=c'TZhS"`tF_h6I>QRnu,W>Ap+0*lZ6>P/?C=#Kk>pfEI%X5o!bF40@(o?U'<]Eiua1#T-.-6qJtTfI(.G1oN.lKY+4/`*/<nC_FrBr@t'B!tXgNRb)RL+TCgW3<iX5O9#P?a!)IFFI8oG32\=#jga2H_&2^^0Kf,.OsLu_#eO08A/mmJDDtTcmr4t6O1`Dr,Q`30k4J%!o:K3@7,[V(bXXTZYYa2Cd1qoBXD(l2cQ3/<j,7X5ml5p#+n>6f$6'lUmhZt>sFRC2D);h+q6STAmNibWJLoa!_K+fl3/2UYVS=VJ+Mu+BpgT4#ns,rG4")XqHRFhp?e]lW)6<M/i!6i]r"JcHp"^;CaL.d(qc=(QVd0f"!@6_597/@.grp/FPoFQ-+&Hk[Kh<ZWObTpod[MGb+)FWKf>msB7&pCcn]iARI''Q3&WZn4`cfmXRYX$=L#_*oT6]L\$1K[YD^6ieQ7a0Sj]d/M^g5G<=fIGJCn-7*ka$`dtNA96I:UC5Sul1`\4p@]Ls&$eZG>,T=j\`^pYF(5psRDI[ZIJUoPB8(M_b:GbMQoKNEpNmMNdcINqeCi3'cEM:qF.X06_nLu%gkBg5VlBSp+B1fTm,9!@mC$E:N.o^gaK:4o/&_:c/cq+m2[#j^;N_DJlWf4=p-!+I-6h@omP!WV_Y"`IN?:CP*<`.u@#A9nDKO*4\G2pT\?kZ'Dt?,HQ7f!)_^L]dn6@h4u8d7pi9\@8;-o?'k"lKf\AD;4odYK88Q^$"H$>rOH)/`G1DGh5_0eMt&PpsLK>qs%#:d;7WAZfKR;rBm5'b+%b!VCn!a[@b*Y1e"S\)QM"QV-!NdrV>Ws'[ojRr<li1<loMopq>5.dVU]'X/_ssN#`kAmA/umL!VA/'h#6Ik/uc^1EO5Ek,(eJ=3@$n19'-D]1fkFPi7&q%$9QZs+c](m.qMB*W<LO;^VmoE[;onAu'qSYrV;=B=>iUUHSFKA3ttm5!=55Q\.qs8;OBp(ii!qKaVIY^A]oF]ER6$H=l(;gJ?HbR]':Z$q1FFKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`fq"Z!rjOlg~>endstream
endobj
6 0 obj
<<
/Contents 13 0 R /MediaBox [ 0 0 612 792 ] /Parent 11 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.67f2b803142796cfcc78829acfff7782 5 0 R
>>
>> /Rotate 0 /Trans <<
>>
/Type /Page
>>
endobj
7 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 80 /Length 3671 /Subtype /Image
/Type /XObject /Width 400
>>
stream
Gb"/kH&NHV(WW/Z*isSg&d2e95Yqb:,+b_Ylbt"c&J%p\W#GlYi-cjh1$F4VW);.c'TXPH7"cs*'CeNo0aOLC"P*Zn'GO!upO2q-m[9Y1pX_cEoNV.hZ!I=.X/ihg[pN.eCb@(q63n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f<&gLu-H0\W/S)K\UiU/d1e=((AE1\ViZgoP7G`;;^96"eA^VhA0L.[BPc_E\[V_jF2]4eaSpjj$D"&Bno1d#['rMp*ip0m[:kfFCp?cGGD5F+!`f-%F.hd'N;+G=82r*?QTUB\d3]4;&O$@@]/UdP:"a?Mun%X(hIdfY%c!Gar_>X+@1HlIml`F@TKbp&ZMDsE$keh8Gd3hgr.uuh/jX)6'!qj^&c9!\hln?+ERl7S6Q>-NjMlpa9'WJ6Y0"C)9EqnU6a<@k<:/9+6qo^@Z'\I'[FS"YZTL.@L2sJd]G1f=ahJl&2l`Ks-M:S_+:'iN)f]\_,l;^8p>qsj0Lbs?q^q550MlTodIT_d4`uhTtM2WG=S:0CRJ?g#U89K(O`^f6E_&QM!\8c8?YcYWG^A,Rga7Gib>7NblcZ\TL@>T?RFh4gP,RLuW%NY0c='mP/s6[2ZP"RWEO$.%PqO$8NHF;:(g19^%:Vd53pXdDSgjf-D?!)SSY:tXbbWl,ljiaKo6&,Tk^%ZEq<rWC2djpdFNmk>T*#!;VcpRKUM_Ah@JTSpQ_,km?"fI5ldt/$XrDiRJ>7IaG`llTEkqIRJqXuLseKJ0E@^B[c'G&YC.*Td\lQ;0M&l<>n.Lhop@hJHBr`p<Goua16/n;ob.>5F)Zdo(A@eK#h]C[X$;ni/uM_oq\m:G*7H0QjW]3@5ik9%GVKE;f&Uf!n]DI`Nb%2C3boPu^,\d9$\T7(8@A:HcW%-c/0@u<e?e^USp7r<*.WB9Qj*&d0_/#)>2Bsh8UF<RaQg/[=u(e[]i3HG8U$[j'W<'t#[TmqC\O=RK\k4gLjcBXSgP'657?9`D%/7'<t=0\%EfUaok*dBq#D;/+Kc<ku7gob>(_H9.A]-D0Nm^Ygqir3#\O:*\f:`5QV2)9W.6-PkGX:u*s.q82:[bLF*!YSl>fWgl`93^mI>>?W-=Q[qRY)bl6lGcHZFNr),2$%FM_ALH%]g?+ZMc<[[9]ZgI@@Xj1?$uZ`fQ@E=T_=^Y)Iq>Z]n3T+,ESq+[<(gAZ#oB@"U39sQJhH7qWUVC,rjBb5Bpf01/"POnP#E1HLYF]r-FX(;KH&I":;elcp?*VMmI:t9XJ+lKojSD4*?HTLsA.bG26.GQ$>[hlK'QDo^(hmQ-cTH%506+oa4EtR1#lVq>$DdUYh/>J)/5O7A3XUoj?\?S"2`7HXiK0'jc'n-K65=5Xbo07YG+,DnJ#k)B0'A/6f!>6\]:+"l=`SG$XA)CADn0K'_K4f/f=5]6+1l9?W_X6Ot?rDnk[F$>,)cO;]%-S-9:BXpp/7pgGNTNfPU?P,hXj.lFe))E5rE0u.QQXBgC'[=5fD-+Dd7D_BiCOsStaKInr&6L*#iQ7hibR@-51:cbp\1q]mqe13@abo2#F%iXN!ou3+QMh*ChlU('RV`>Se52@/A>k8Q@LYVs5!&,>nq4C`OI,m?KN"dj,lR]4^*`J4t3RN0'e>.R)(f4&I7-<086hRGl]?\CGX4ZLuPqD?mGbY492PJuG5NhO5Rl'E?jVGU6ID)(X#%WK)7j1.B[t1a9Ju#GK#qImB754Z:n*t7Ur]Z,U]e<VdfHLtQ-nqGDhosqd,=eUe.n.A!MBr':nf?;3P9TgoKmg#m"o/ECONol,Ita.<KBmQk4*.;]m50gE0nc>k*9ufGe;5djX]Ln4+n3J_qX-GkTR1n9@0\q1VH9&8FY9fC.nhd]SpA>*.D5WHRi4b(7#_j-
Wf>90+Nlk8XG?8Zml*$eFnI4mV<54R5pU/km!].$fd-Iknlq@5,Xe$Tq95^0d<lAV<+_t?GZbWe?PK(;2'+J=hKZqF"CV9=i:=Sc3pXBNm0iUoS9^uD((5S3VosJdC=XqK(Y2"k`E5Uq'nDYoh05K4psDTXB`"b1or?HP0("#t;7\'`7fWI:CgYu@0,G<mOUj\+!#:VG'\hb3LlHB]mfcAP9ZI;M,2&bnaX]6X/Y7e('"4F+\QPhZp1:%8f78Z'U.)Ue6KD?ha_fc%12pV.ZP#..XGC/#0BU7nKDib`:Hn"\?[&('o]Qm.c(BLBHqjocLXHf@LM`\>@eFKu9Kg=%YequpeAKtGp$Y/ZWq<GeX&l?&`O&_;(00M@OlMMsp8pnZ(i0jKoBoC=e\>Nh>cB:jR9h2CeD)rjW#<;*rq4m'&E21"B#_8-[n2C1%.R]OKZKFC,\A?;GZg/0Y7QnAmMs\'7ippJ^\Xso)'/EhB"`5cHhZ>EZWOn-3.t*;K+PqCj$o$J&L5tQQ=@Q((N`qd]u'*NP.O+%`e+d_QG%XgkgA[eF6Dhl!:9<aH2$USNm2LWq28%<VYR)jaXbV4YCJ4u\jDXW7Cc+bW^I:L/(3_5/$Gmk<L%t/D89:Y9U<>n!ZO%2OGm.G+*H6L3F-L($4-&[X?+S`,)XG+'c8f#F"j,#4M.6<MY4!T\h9(FHlkcV,>FdOF]l)Y>rspUrd+UDb:`D!)fGlVc8C*chooP,GPs!_V7E]#$=\U/qWTG4Pfm%09%<@9,->21E?O4?&pP2@H$`4$?gM>J.^gGA7I8>OOjhu9o]#%SpYD:!d-.[JU5C>G.uT">4k:L+m`uN(or>=//s',t<F).:*dk3lKC#=$qW5D':K;=$Vh$e>G-.OlmLurLLn0%0^[#WL$M5f>V7B:m$9_l<#pr?m_rNDl<j,-Cn?O7'?6LH#Ilm/to:\)&a/]7'o(b@./-_E+cQ^)a4*],55ON>nbM;@Ec#YLh)mC]GG;L?q"@ifc)?NL)=1E(%%]V"'[:(Jp]+fX=<H2:\81X=InR?-T/csEXCV7l>pXRL(KCof)C>7a(8CDp=I/^[HE.V#Z]?\'*R<,cj#3Qd'QlFa+LO?d-;J@`k]u&gL%D;=rZV;%XY.6PuMmCm6;Dc%f8>TCO-`\Q1J;CGJ(.F>s_rU",SE[+39$;uD4Q!l$]cFl9np^it-Z]/KiBJ2.rd<i1"5=SRESH6hk"I1b&.Z[f2i1jldA*7J@TMK#qXgf3]<7t,gRFY%(IWDRW[lrsp-a6$p3oYdYf(Bd^OH#r>$B]'[tKFhF0C=aRdt[f,Y&iJ]18[YX+V2TDbj8FlguYXiND`1&gV9j[X(r2L6iXSoZBYBjd4#Tfh\E%5A]:QeC^]5U+TaDlRI4g@n3(]?$0/`*ag3C]s<bYFJo\=W[`.=4S7639@Ps.ou\;sq=0D>YKFND8ubs#ku->#@_ZT/BZ#nhJEtc$fKAnu*.>2c7>0%$]DfZY`<r1%=hp-6L:goFqV(ALl`"4(F?bVarqV"-(gH8)Y?-O\1!d^"N#`kAb+5=sRHml8%4?f?63n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUEtn)Jn'&1>[~>endstream
endobj
8 0 obj
<<
/Contents 14 0 R /MediaBox [ 0 0 612 792 ] /Parent 11 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.7ce3f428fed09445afad362830e52447 7 0 R
>>
>> /Rotate 0 /Trans <<
>>
/Type /Page
>>
endobj
9 0 obj
<<
/PageMode /UseNone /Pages 11 0 R /Type /Catalog
>>
endobj
10 0 obj
<<
/Author (anonymous) /CreationDate (D:20260126185515+01'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20260126185515+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (unspecified) /Title (untitled) /Trapped /False
>>
endobj
11 0 obj
<<
/Count 3 /Kids [ 4 0 R 6 0 R 8 0 R ] /Type /Pages
>>
endobj
12 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 302
>>
stream
Gas2D0i,\@'SL]1MAmuWBq2]41U`B=+JA;@)ETUJW/a5S$%>'U)FQlACq\foqfM9dic+ZLN<Q`tbE@K*:^spj"Oo"1`W"9\N8S]7/.WgR/fq$*$ITZl?0A3Yd+#RVYd`S"!VHM:q3ue\ZE&.5ico>/#%%PKVtVn!b+n6KWeM,?U:f@u6(=k$)>9=A;GQ#t3m&eV#g&$:bL-jnalu?/Fi#S%7?Zn?-:G9#d\O:D4D7XQ`j*RVq8@Qm.FMjt9rX$+<uAFWrR=.*pU4ORU>6iZ0lp3O3um&1LmEd6.tN*K;n6j'~>endstream
endobj
13 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 281
>>
stream
Gas2DYti1j&;GBn`BOD:C$`F9ZVnjI&kF'G*Fh#tN?'!3]KRM,ciKk&h=0aE:V*="Q_Ne&,OhA1lR406EhL)):sXAaA0[ug"g)PsmBSG*k#J$))")C&+kr+KmIL<Brl.":L6#Q;:T1n?*25E!Zk"i,4uuBV3G4oRN56iFD+G.*U'<hlkt*7N8pVC@\#B7T'\f?qTfO:fq24F=Moh9cYOO9_Ug3_JW1$`&3Et?9G$Rf%HgIe&37c9!:H9)*A"58?9%Ib;S.e4E4@\m25^i]720%7~>endstream
endobj
14 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 249
>>
stream
Gas2B8INBh&;BTK(%8)Q2+a"?*aMq\4K.?![FW8"e$p+akF8n$UkfH'`dI4a"80L!>ZbC60Zk4LLGE6]&5Z#qMYu/6Ns)3ldF]OCoN(cR,K(-$>Bb@Hb$Fm@B;e+Uh$?f>L6HTg25p.\@EBp=GIr"0+>.bL"Ab!5e$0H>2u,XrGS3n+\I^LXNi]kl12d&'Y,la0?'!jr\BDiS++DQrec,bZT6(6I/"hnM&*R'u?RM762ns?o2j@QC[f~>endstream
endobj
xref
0 15
0000000000 65535 f
0000000073 00000 n
0000000104 00000 n
0000000211 00000 n
0000003775 00000 n
0000004033 00000 n
0000007969 00000 n
0000008227 00000 n
0000012088 00000 n
0000012346 00000 n
0000012415 00000 n
0000012712 00000 n
0000012784 00000 n
0000013177 00000 n
0000013549 00000 n
trailer
<<
/ID
[<8efaabb9b9953607755769fba673a5bf><8efaabb9b9953607755769fba673a5bf>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)
/Info 10 0 R
/Root 9 0 R
/Size 15
>>
startxref
13889
%%EOF
@@ -0,0 +1,88 @@
%PDF-1.3
%東京 ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 80 /Length 4030 /Subtype /Image
/Type /XObject /Width 400
>>
stream
Gb"/jGApR4(<3O%]`e\8(`3O5&gh'9UL4Y/2^+lN@RPf5J1sk+NQ;)EK[=iEKil-Qc;6BQ-:LGDM9+$J',f.nAHF$>+;)0QS%B_Sc9&LGT7Z-dn+T=!BA[iT=_mDCmsXW7DDr_l&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j4;C$M\fk4SCf9@^_*csp?0E:tA:NU]#jiWj*eoUTRh)0!!`5]fOL5)!H>oFG9DVR2p+lUH`IrqKY)BXD"&`V6FB2;HMc(7';GJ3Ri.fjIH*^/fZ0Or*2Il>Q?21r^]?[Q:^FZY!?_$=YY:S0iN>)NU7,O;iAF)l:J:7R.&b*4>RZXupbn&1%rQH_7Vb5!5MER63=/j;H?_mC;n]Y$@HQVU)1)MO]*WmC]s?Bm'Eo$F't5&H1H?C?\PZOT*=k"OUBF^Z7('_KU*cW%)S*WMHX>B\o<I2:$`SBCZ%`^?q2(GB+]eubDeR*F:Vn)#4P":#10VP[s<B,;6r>eYSG,9p^:L_3O7EcSHah6r%e]`O=YOgf8dp9lDfH=\S3c8r1HgU==+3,mg#Rl>?_pYUI1'87(cTWVY:DUqL#0^"?4&$]I&kNC0`5JLC0C)C?hEoh,WmelnPO$)t=.f&emDnSAh"(gjeWnZaC>tjK_q=<Y;2^p2tgSS*;Q.a5>kWnKp@"uqSN>jhKjj-*a*6R/cmlaT]DPqQi#i_sfOu^V];l<C(qWb+)+X,K]55F9'T7-DN5.u%#%NNom8J@WaU$Ys4EcZ<p'k6@=H1U1Nf\"4dJ%Sa[;MZc\a,_<liPGacdte(N4#Ld&J7G4#qWW.gemV!7e*Ykse!lmlI6&MpTjGEYW""=AdA+bUmFtWqp@60F94#=0o#op<o8V#IIK09?Vu_>9H0E/'"u"N,<U8_fPGUCUB[J#'4)`ugV+[/l@hgImB]$Q&++O4II3)@-X=:iG5aNlrihrDt1>!G@S%i^qIJ7$3^\6AsEbQnB1qh&Ubj<rb+7oR9`51Wc:HuhG`mCl"YPN?sfuR?9TKbo,*Xsp8_GH7iUV'<j2Q"^R:?R!:`1LALo[6Bt.TJfM[:mr31c/1%Y[kk=hS"9r+/DM.<0XIVdF$A<$KL.*`4/*c#08_`F9A=5:/6g][Wq=O\\1b/3Y3fE$[VL2A^Dsg+d,P8$hWM:-_?FRhKnLi6AL;36.CAZjVR[)*:k&[VG3Q>UVt*h^pT^rHWE2?D;2KcRo]!jbUdu=J*Y^iO7NMpK&+/M?E#p8P[:&KcCI&WT>lj0hn!r'Ddu;@XC[FI)\EY_V0_d]7q%$Vah7c\%+%Y5*O#<]LtTjQE1fFkLQENDq6=GM6pd`rMIpbfSH&W.T3_Ol$<b!Gm_&PqlR5%C@JK*Ol!h2Imp7Ot.+d%:%3%3]MgkH[#Hbj+HhPP+H1'Iu;KCj>&_r2rYl%'!QJ.^n(hm(#X5AF,*LSQi,0+l+X_QCd-sXH3[JDTElP16rE1kDGP]cKR_8u#V]Y"6aMOg*%"icN@-gSBDCX=S#a-tPZm-JQ<M>nqsR%UpnUK?#%8%U]B4C%d;CZj!6'e<<Q+hZci=8cPWZ5+GD%id30L5h:gr7\PodKJi:2tP_K@\Qq+EV)a?Ul]h5_1DjI[kCso9J35<SVlOcqg71^,=fU%5!E:*Yls*-m+ARh)lulg3U$-Nba:,pm+SkJTsi93ruC,0)`CY;VB`@`"Ms*Z\d)j&;boQ1M6h3^7dju]MOg*%KdG1:BYpEDMN0Qp=1H1W:%UjRoYunt=j%eq(Yulu6#ZJULEF)i>=uH5kuE5#MQ?sdq?&m6=[kl8Tc=t$9jhO62tP_K@\QrVip+eZoCIBVVI.)e.%EMOIc+6PiK4ajcW;'k,f-)WZWXVHl1GC1G[.CW]@LAEm#oop\)2X5*2\@n4)j*X1-$m:bbYF=S8$HL
kmmrTSX5auF,e"0nU1VTA)3IC$D8-]"Dr>:d49"#,PSWbhql]_\g6^db0"bZp1dtL,AY,Hrd`/-m-ruOhB-14LQ;od4K*.0TNLGYVbWfTAdFC7kkt8JqJt7[c^Ql>:b?jjnDUs$l_[IMNbHS?"ibH+Q6-&jp=Nm3cV`0>dPSYcp:A>VG$`7Io@6oL.1Yru`9tj;1MbRC8OuD!T$h`FdRA1Y^%4"c]QM>g?3P;LgT"R'VH'Wq6-8?<USYnh?<PGk\JN:FmgdODqo^Y-[-q!`!^tV."8sD%$a=W62-nVR5dA`f_`,bBeeu3ao@Bu0gUBEIr:BIZ;o%Z0&e^p5.nho"?_Kdin'6=Th05;oSNV>N<>b"^YjQ;nmbG@*@ga"'Fml$&HKSjO@C'C@dp'!_Ffa>t?@e@l=1ULa\_XlA]C"jJ[EOb[EU<Ad0TKc?#SM"3X(Lm^XP:Ai-*%LAh7FIFeZ)VBqd3-*?CmNmaQd@AW)n[b?#",SQm'MSet=LGn*8H(EmU"ap$8frb$7:h(cmkPT'j1f=')P0OfDf-J!fq>LM]CT:sc(6S,=-4)`A+!]_LKEDDRhbe129S\o$XGLlIB_@GSM;0aiBoeY#3\$rnT",rr#-L8WThrVH3Wd>f5/m!I81VBY=an%\pL-#\@;=L#`iU,5`YFD7-fMIo%:V-`E0oiVNt"U>:mo'NpD2RN%p)fKE=Wh?#X9URXg?^k0Ij3g**Ork^G>.)NP0^Zo`HhZs,@E=NRrX<D_QiVi,Ql*A5m(A3^.6?$s:Tscqo1s(3kg6"-]oontmG$5h'tiM,?M3U6bKrXpDQZ+8OY)5\YPQ1RA1]df+(N?OL"VhJ@gqIkI.E-;o/3Qt1BZ.-^fcFk]\"'SkiU-Zp$1)VkDu]4'.6Q)Rpf>e7Rl\9C@L/t\;Z<&1+XN&%j.rRWDZ,P`5RWN'o-KfFuX0F4HJ=HdaGcm]mfpk]J#M4P%(H_.XIrZ=?Cg4eurF6+-eE^<^5E+044/<]H!b,^3/b-4Pk'SYL'H2BJX;H*1(;>.@2s+l4^Ld[GX<"m,,S8jnVW1p25U-)leT"(Rd*85eRMpFgl;HQ5B?dNZ>%sGi@`*P8u]+OF+6o8`@u[s,9A[)598-5To.NR<lP-If-_B`3<WT]66n!$kEk=@IN'deV@j'G*jECo*=n-CGGPSQS2^c0JU>Hgri>q[;4C6)9l.D<V/o>Yr;9tI4q7F2Vk[EZD9m8%k8qS#MUgZFAT/+X&c>ZakEt-!tIC@gq7p=,@:&"fuR?9TKhl$^"^,@C[&CBoS#=R9q$`uOH:#JSJ9<W:p15N\sY?eMAc-TfVUQCf[/aU7Q<+W&cX!Gg5QIU/.fucFm:(bne,@%k0<G*A&jUu)&=dF=S^2O(/T;7N!1"hV*7RChA=/aUheSb/jG#CKc/_f;X(jRo.*8Mg=Ih\Oh@5uQu:VmJml*(fb1#VW`5t>P:&Gm=.MEs`m+]EUC$a"&A7V[4&1(O+/U5tc%5l8bfgduMA7WcZLS/+L8KGT=PZX]or?B?!uj1:N/iq<5oqK2W)9>[j2YeFBE.YV?a;6JT9Q4NVaHp2:S]C*Y[u"D`JYPMk(OUXco6M[L(>@Y58HVo85]"55<n&T0HJOk!N-,QmPlFjWD]R'acb;QGO![lac[s)?XtR.?N'ae(!#%[/$O0^<q#:-:!!5#^Yc+q1Gi1@C=S]=(`^A8mFp['?G60sRthIolIN'VAu5FhcfZf/QG"1R`Q25(?i\KC4#^p(-nO-j%L*Xa(C,)gChDno6`*pRMWEi/GSmFTTK>IG+o_]R.!G(9muq-BFa8ETq]Ipg#U)AK2f>//o4/S?>UdM"B);/a.)]9Kd\TSI[X3Z=U2f//"o0>`-h5:!aHeD^"pG1@4>4BtrUnbQ\p&f=niu3sje\cKZtRhg)ldr?b(YV+-RL1_Dd!GjKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f
&4-XGKFgHU+bUCn#U/qprr=Mu.$a~>endstream
endobj
4 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 80 /Length 4649 /Subtype /Image
/Type /XObject /Width 400
>>
stream
Gb"/jGB=Qg)obtD(hlkN"f0(5:+-aT<2CgqGS',YK-J92</AC'U_i9g+@S\\U&raT$(ug!$4%UW-.5X!OX),AnUAfpn!Ei.ZhN>CmL:6LTAL_NG:0`'1[i!JbWE/;*Y0EI&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j4Eeer2aX7P8Gl-m;mrV!9$er./Dr&!ITgFGgA]g5nB?j\gC<`5,n"5+/EDb62SNDgA(+`SG0AE<rQgcQJG2U/e+R8qjL(>AddU%7a,tE0=*^(Ec[;Xqd5T\EOW]FtK0RnAjQS4C.$PtF;o^lOY2JjA(q!>?5rnj<G\:)&,dhUp&iW]f.:o[KoXBDknr:%Vf^;G^:].Qgi,t!Cq7>hqh;qnpMFKAOW-;:r<^AH9g>e+lD6ps06kbI-F(^N'<gi-Gcfoqm`D<`e/XBDl/[X1QKi:_Nlme*"2?`R7BfZQ0Yn][CW7>_eACt5OcbEjk(IV6iiD9J4s/kRp`OHAta@uhcDp$1sUc^m;JVdmc-<LpXGp$)c(Hk;,Z7Z;:iTQi5(W(^OWj5YQ"X&GpV4SR^\.hEaCrqG<">P%b#odXg*ftJs\9L;@@2JoU%\UmIb7]_:X"BnAg8PVo7"#l=q;QoL`\om<c?*C)#T0=:[KaS]?>+g+\Q7Q1-1hhO`F6CiVjuIe^m/?\904&j@l.#kH4Fk1D;,Pn,s$FCkgKq>QM?oqqM4!SS5Q<QQX'N=qdO.h^m&2`sM\[n]^V[k)d\U9,\%pa.@q.TDm$L"eIRL!nb*BjG)?7Bqo-V?%d\TU3:EfTg^\mXEJ,E_-&<#c6bEkHfgiJN=njqoeR5#87\R3+#(G8t>*fUYEeuW#*!X5lBcX+"oeKmkS@.(h*0mT1nrUS,bYIsE5U03`ScpJ<e8mf>^IB)[\rqY`>H2da;>5Fp[LW%H??G5X>9I<PAY[@K\1i1gkRchBYhS[+bJ,aqhT)A3+5BlPN)'4Kg>8jeZbo1@@Rl5+ugph@\]QnSZaRRJ1c_/-=odWtJ379>AFMnp+G4!`KBR8dD=K*Pa.$qbpOj^:tQl,H?<^#prXH0"q[r1$MSsu`#-;Bq^d`.=is1np^^aDK96L*.(Hg9*0T*>V9Q`\n^`L]5>i\sQ)AOEM\?DBt!8#7Y0SiiDsB23^-W`?+JZ!P-=ienX,kdX6M.]EUBN#=ETZtP"4E0#kk.nX&MZXupQJK6dnON]"CP^nh:H5s_#io8rs2G=?r4"L]CJn3h!L8UnMn3&`J&s32P.9WsPPW!4%+F!3AJ[djYeuWV-Z>A4"8En[*=5Y$_-RU/bi:`*I1]ICNmogd^&>KHo<_nI&aUW0Z4F*r.YHH(N0/])H@AVt,Qj$t>c%"]+(Grh2@hqR\Kp>)Z"qC's<2ica_Zur<m^u*Y/Q8MTPR<N]nt9$(DsPuVc&uYY%\d#Y-N4c2<Xd/X\=C<^SfiC5&Xk4B%BU5u_1MuTN[G%fFL=0<a#bps%YP9MDr1EZ\)4&m]`M"L$&W*k$l5Y3iR$k4ldeYQkic^(a@KBro!2iME/?\=G3i$/_H*tt(cQ?&U`e+$N@567Stob?D:u4k4BLb^V?V:WLl&4sV)/U+,U1BRd9b&3,)e"jWF"OBU.>-Q2/AL<_pP5LNT<5uc'*A?hCXJ.kFHh"?b\4MT7?jNW:Z<';^CI[++A`7OE8^;3KaGt`tU2'.D<$$(8lJ$qXeKD*.IYNhqsqO(qjtQ7I&aVcqsCYCu`NpZ:0D]`fV90Y<_!ZI:XrsMu4G<QsOrnk)*8Si:<@U^<s54-76lfcp@@u'Ae03>pR/S`Z+]D^@_i>hXTXH<?g3fN.t`mHFrIk+[^u,Nj?A=n'Rm8Z?>Qg<A$#ni\E8Ed[UQk3^PU.?M3aB)jcO&2:>*\=T.d1+*ZF@"?A5D+<XGJ21+oBV+^@U
l)1.3B,E[O-k`eb[<f-3'8ScY12"q)NH?`MY$JYo98TG?o]]l2K>E8\PZb2+R`274i=_&+Z-TjqgJjbPoZE^@ah=VW647kCR58IohRMC(*CR(BT4n*g4pe*Q*FX(Ze.='Up?^1I6=]+CR-!_%#,)!P&(,gr)C9gt'd@U<[_Mh<bGWalRRZ:i#nmA)7@_#-gU80l>7[.BXW9QFj@HU`+`>?^il-h`Cm[n-7Qu"^R/N<pp-\T5XtaG+fY:'fp1.8-b7N1^\)2X5)8a8-&-1WqRO?!W,`XY#S-!.OM%I+5h25bRQ9Y.]eP[9P9!<'"`J%WLB$Hd$-E,<4N*a'd,.Y1#h7D;b:aKfuJfOZ2&A>q)/q=rVY'\h65$\b8K9Z?3pKR74:;W&Vrb0'RUnf8X,*n!Vo@(0TeZW?;S5&\%q=Edol##.]0ta!D>-S?B@9uVpR#`p(A3D=o2%[df2_8e$4'f1)NRB<l#/epT=HLX<p$1'cRu'Req!dUQ^M`WY%C7F7G4"#B&i77,rq+Z8\EqkQUW6mqn`1?2:<99X*Ib9nLECu$5p=q2dRJff._W-+(3b(Yrlpuq[pdum$V%>TH'-m?jdWZp&qR5i[E?3(7'GO%!UQIuh95N^kDD#a!lS/(+69W4/mZ%*VQC%uHIj\7D6iFIm5:M9YL]ma?a!d!FZRM2L8hJQ&\WdViAY@$CLtG[U/u!RSi'E`c5,!URlB=%P(1u[;9l7Q6M'9A^A>u+KnAe0>frXsk/mMom5)EOm'BeFBCOfX;l@XR`#.>7PVnNg$Ar0C2iBc2!hXl2M:?(bVG3Z?oZE^@ah:&r%'`jC6A5c$G<'9m%\d#1i<%XtiOYApFZ#ngUWnJtG/]%:$X.K=GW,cd(,X]VC#UfU)`BPA]i2*s(/L,')$%Fg"HB/6o9V+;QA(poeWD(H%'PK'i*'^M$-GEjj5ZsajEKE@7)e`B&V9IbT7,k5*08(&Z-#ERJlX!nJ6-,qOtU0+)>0FG+$Y3Z0!U5:(0dnE2>ldh:HuO;nY0Pm-Nb)*J,HR6XB5,?i``NJa]m"YMA3m7nYoRq4LCiWU8$(:YI&.rTieR/pt(60)sl<&(qm64bGjc,Wid`T\j#uS,;!3t#ZWX@Qb]H6m+0<mq"pT?/gh,$$MM]52_Qf@HL!0M.BgGYRaS8&f<<*5L9LAQhWCm'9YO$4=E`:QN3t-8WYjUQ_:E*b:=22WPC.B]jq;$$D\t?-7XMIQbD+4-gUCs0@Q;HpGa*a;+>:1;qsHNtRn/+a^TqR>@.Xea;s?o];:@&[UK4L#Bgp,F,RsE=He%6J^-n8]*f2^ig*%<Ho%<Bl^t<YGaN-n_khWk[Q9Js,*5hZBeUD3LcIOkYKViFO>WUU9]mE:;]ttcMZNm[=\K]o`.A)aeHbb.4k%p-knF1D'??PP_$'uAW<n&=urVQ?PaH<5kR52PVqQ%ASi>.YfGi'1*4F,A`4oAd^jGb*;,,KJMg7lSZNi\i-]QnS9fD@keR"@(a"19:*>e>/RJdS>U2U)kn??q^kNZ$^Bf?AOeYG1M#F69N)YKHC81t4$<='G]b*BVjAL@1)g&>WXcmq%"$FN(@d[u(<&)jh7^9q42j;/'(@*qZ9#1_hV7kg;bg1;[eiWMc>NH^cj+,)L?e-tC8Ul?"$BW,("fP"k2kTgOS\'#P]UR$afb6UO5'fV1fm!:>pa-mFu0fN>Vk8V,EUiAp`*k=7lnd6j#G?@gXj^]4:[lhRS7^\h!<=Ohc'UIUA;HcM*b-SH\Up<+TaZX2<A99=J]8a]hLl'35R1/-l(io8r/DFn:Ul4p7&\[%C"A]pCUpdeZ(I(:I`"K>JrHeBK!>nJAaY?kLL/gnU,#h^q""dFhRG&6H/a5UeX7Z<FFkZgN[Qi[]b[qUYmnY9pRZKapTW57tpDaub.C],WPGf!!siWZ[$Gc?(sK1T!bnhJ6m\oc&$UX6PlA0M$=gp2k0U'g^N_FXLaSPN%91<W='3St=UOZ_!6o;:`G7>p6g
FeM-UOA/J@Q7cIs89qr*N`gtc.gV7W@Pgu35H&1[il-gG6ps9s12"o1k*p:dX^3kuci>4)9#`+:[3-;KGd.!54*Cm-Y<5R+fnq"UN/<B'K;7<YI,ljoR\iis3%@XTHKF\iIdi7K^8P2@-It.qj05blIf9,65(+?-)OUrW];+]CXb/IHrbtO&cAE>ejI8lFT$382/`"$_QOh#2/DLjqoXT1J5gUR^:Q73Z/%*Q8G$9Be]Pl[k/.(FEb(9d)@N&6Z]ZhSo`8H7)_IDWLQ('dTVL5SII6X+!=b>6UY]Ahta^E[MKH*pf9IX>_4DG/tD:u3@Q=+'LrO%cbH8T["^qG*h2K%:eK3(85o6G^0<BC>e=,qT0_iZI$F6Ci^qWb,K`6fP]$4[.MF'Y6B(&,:Gh,TCP2$t[aqq^Lo&44Hf_5)pDh0RQRF/\'r*qi?.M@`+%+QqN0=0@MG=&Q81PV9AI^:8FXigm1m+bV6r>dtn(*3_sE%hF_WLlbEq6:+#iY$HCPCI\XR.7d!#(c,btV+R!a:h@tE4Z#"&=0Gs$*@i:d&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+bUCn#U+j463n`f&4-XGKFgHU+lr@e*ukCHmf~>endstream
endobj
5 0 obj
<<
/Contents 9 0 R /MediaBox [ 0 0 612 792 ] /Parent 8 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.41b05a9cf8679f0fe6e7c30c9462b767 3 0 R /FormXob.94284ebb61fac7951963d5746d1b193a 4 0 R
>>
>> /Rotate 0 /Trans <<
>>
/Type /Page
>>
endobj
6 0 obj
<<
/PageMode /UseNone /Pages 8 0 R /Type /Catalog
>>
endobj
7 0 obj
<<
/Author (anonymous) /CreationDate (D:20260126172022+01'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20260126172022+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (unspecified) /Title (untitled) /Trapped /False
>>
endobj
8 0 obj
<<
/Count 1 /Kids [ 5 0 R ] /Type /Pages
>>
endobj
9 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 300
>>
stream
Gas3-b=]`-&-h'@TAj4XMk6`hV:j"r+Qu/5_JPH2[*jl@3?0-u[(9'GR:-/*/ft>A_=nj;i7d2;EpsDoJr<OBhVlHiq4E/El7+06*H?(h_eGnqiS:>Dgn0>N^CGqOd65m'$2XdN[8"CN<R^<p;O;.QTL>"4'o-s=`lHc!JpSi8$*d@]6l&@V%Q+V`W6/nPEL_rB?OF1iZbk.;Ju<];RLo@-9lO$dQ,9&`I`%EM@\dBr0Lf$$+R^&+/ncK?;0=7o:`];ceF"uKA7ETdrT"0YNT=QC"`>/@%I83@M@]K&@Nk~>endstream
endobj
xref
0 10
0000000000 65535 f
0000000073 00000 n
0000000104 00000 n
0000000211 00000 n
0000004431 00000 n
0000009270 00000 n
0000009574 00000 n
0000009642 00000 n
0000009938 00000 n
0000009997 00000 n
trailer
<<
/ID
[<60f7c7338a7d1cfd54f86e6a06e41602><60f7c7338a7d1cfd54f86e6a06e41602>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)
/Info 7 0 R
/Root 6 0 R
/Size 10
>>
startxref
10387
%%EOF
File diff suppressed because one or more lines are too long
@@ -0,0 +1,223 @@
"""
Unit tests for DocxConverterWithOCR.
For each DOCX test file: convert with a mock OCR service, then compare the
full output string against the expected snapshot.
OCR block format used by the converter:
*[Image OCR]
MOCK_OCR_TEXT_12345
[End OCR]*
"""
import sys
from pathlib import Path
from typing import Any
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from markitdown_ocr._ocr_service import OCRResult # noqa: E402
from markitdown_ocr._docx_converter_with_ocr import ( # noqa: E402
DocxConverterWithOCR,
)
from markitdown import StreamInfo # noqa: E402
TEST_DATA_DIR = Path(__file__).parent / "ocr_test_data"
_MOCK_TEXT = "MOCK_OCR_TEXT_12345"
class MockOCRService:
def extract_text( # noqa: ANN101
self, image_stream: Any, **kwargs: Any
) -> OCRResult:
return OCRResult(text=_MOCK_TEXT, backend_used="mock")
@pytest.fixture(scope="module")
def svc() -> MockOCRService:
return MockOCRService()
def _convert(filename: str, ocr_service: MockOCRService) -> str:
path = TEST_DATA_DIR / filename
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = DocxConverterWithOCR()
with open(path, "rb") as f:
return converter.convert(
f, StreamInfo(extension=".docx"), ocr_service=ocr_service
).text_content
# ---------------------------------------------------------------------------
# docx_image_start.docx
# ---------------------------------------------------------------------------
def test_docx_image_start(svc: MockOCRService) -> None:
expected = (
"Document with Image at Start\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"This is the main content after the header image.\n\n"
"More text content here."
)
assert _convert("docx_image_start.docx", svc) == expected
# ---------------------------------------------------------------------------
# docx_image_middle.docx
# ---------------------------------------------------------------------------
def test_docx_image_middle(svc: MockOCRService) -> None:
expected = (
"# Introduction\n\n"
"This is the introduction section.\n\n"
"We will see an image below.\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"# Analysis\n\n"
"This section comes after the image."
)
assert _convert("docx_image_middle.docx", svc) == expected
# ---------------------------------------------------------------------------
# docx_image_end.docx
# ---------------------------------------------------------------------------
def test_docx_image_end(svc: MockOCRService) -> None:
expected = (
"Report\n\n"
"Main findings of the report.\n\n"
"Details and analysis.\n\n"
"Recommendations.\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("docx_image_end.docx", svc) == expected
# ---------------------------------------------------------------------------
# docx_multiple_images.docx
# ---------------------------------------------------------------------------
def test_docx_multiple_images(svc: MockOCRService) -> None:
expected = (
"Multi-Image Document\n\n"
"First section\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"Second section with another image\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"Conclusion"
)
assert _convert("docx_multiple_images.docx", svc) == expected
# ---------------------------------------------------------------------------
# docx_multipage.docx
# ---------------------------------------------------------------------------
def test_docx_multipage(svc: MockOCRService) -> None:
expected = (
"# Page 1 - Mixed Content\n\n"
"This is the first paragraph on page 1.\n\n"
"BEFORE IMAGE: Important content appears here.\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"AFTER IMAGE: This content follows the image.\n\n"
"More text on page 1.\n\n"
"# Page 2 - Image at End\n\n"
"Content on page 2.\n\n"
"Multiple paragraphs of text.\n\n"
"Building up to the image...\n\n"
"Final paragraph before image.\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"# Page 3 - Image at Start\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"Content that follows the header image.\n\n"
"AFTER IMAGE: This text is after the image."
)
assert _convert("docx_multipage.docx", svc) == expected
# ---------------------------------------------------------------------------
# docx_complex_layout.docx
# ---------------------------------------------------------------------------
def test_docx_complex_layout(svc: MockOCRService) -> None:
expected = (
"Complex Document\n\n"
"| | |\n"
"| --- | --- |\n"
"| Feature | Status |\n"
"| Authentication | Active |\n"
"| Encryption | Enabled |\n\n"
"Security notice:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("docx_complex_layout.docx", svc) == expected
# ---------------------------------------------------------------------------
# _inject_placeholders — internal unit tests (no file I/O)
# ---------------------------------------------------------------------------
def test_inject_placeholders_single_image() -> None:
converter = DocxConverterWithOCR()
html = "<p>Before</p><img src='x.png'/><p>After</p>"
result_html, texts = converter._inject_placeholders(html, {"rId1": "TEXT"})
assert "<img" not in result_html
assert "MARKITDOWNOCRBLOCK0" in result_html
assert texts == ["TEXT"]
def test_inject_placeholders_two_images_sequential_tokens() -> None:
converter = DocxConverterWithOCR()
html = "<img src='a.png'/><p>Mid</p><img src='b.png'/>"
result_html, texts = converter._inject_placeholders(
html, {"rId1": "FIRST", "rId2": "SECOND"}
)
assert "MARKITDOWNOCRBLOCK0" in result_html
assert "MARKITDOWNOCRBLOCK1" in result_html
assert result_html.index("MARKITDOWNOCRBLOCK0") < result_html.index(
"MARKITDOWNOCRBLOCK1"
)
assert len(texts) == 2
def test_inject_placeholders_no_img_tag_appends_at_end() -> None:
converter = DocxConverterWithOCR()
html = "<p>No images</p>"
result_html, texts = converter._inject_placeholders(html, {"rId1": "ORPHAN"})
assert "MARKITDOWNOCRBLOCK0" in result_html
assert texts == ["ORPHAN"]
def test_inject_placeholders_empty_map_leaves_html_unchanged() -> None:
converter = DocxConverterWithOCR()
html = "<p>Content</p><img src='pic.jpg'/>"
result_html, texts = converter._inject_placeholders(html, {})
assert result_html == html
assert texts == []
# ---------------------------------------------------------------------------
# No OCR service — no OCR tags emitted
# ---------------------------------------------------------------------------
def test_docx_no_ocr_service_no_tags() -> None:
path = TEST_DATA_DIR / "docx_image_middle.docx"
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = DocxConverterWithOCR()
with open(path, "rb") as f:
md = converter.convert(f, StreamInfo(extension=".docx")).text_content
assert "*[Image OCR]" not in md
assert "[End OCR]*" not in md
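The `_inject_placeholders` tests above pin down three behaviors: each `<img>` tag is replaced by a sequential `MARKITDOWNOCRBLOCK<n>` token, OCR texts with no matching `<img>` are appended at the end, and an empty OCR map leaves the HTML untouched. A minimal sketch of logic satisfying those assertions — the real converter's implementation is not shown here and may differ:

```python
import re

_TOKEN = "MARKITDOWNOCRBLOCK"  # token prefix asserted by the tests above

def inject_placeholders(html, ocr_texts):
    """Replace each <img> tag with a sequential token; orphan OCR texts
    (more texts than <img> tags) get tokens appended at the end."""
    if not ocr_texts:
        return html, []  # empty map: HTML unchanged, no texts collected
    texts = list(ocr_texts.values())  # dicts preserve insertion order
    pieces, collected = [], []
    idx = pos = 0
    for m in re.finditer(r"<img\b[^>]*>", html):
        if idx >= len(texts):
            break
        pieces.append(html[pos:m.start()])
        pieces.append(f"{_TOKEN}{idx}")
        collected.append(texts[idx])
        idx += 1
        pos = m.end()
    pieces.append(html[pos:])
    # Remaining texts have no matching <img>: append their tokens at the end.
    for text in texts[idx:]:
        pieces.append(f"{_TOKEN}{idx}")
        collected.append(text)
        idx += 1
    return "".join(pieces), collected
```

This mirrors the assertions only; token numbering restarts at 0 per call, as the two-image test implies.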
@@ -0,0 +1,234 @@
"""
Unit tests for PdfConverterWithOCR.
For each PDF test file: convert with a mock OCR service, then compare the
full output string against the expected snapshot.
OCR block format used by the converter:
*[Image OCR]
MOCK_OCR_TEXT_12345
[End OCR]*
"""
import io
import sys
from pathlib import Path
from typing import Any
from unittest.mock import MagicMock, patch
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from markitdown_ocr._ocr_service import OCRResult # noqa: E402
from markitdown_ocr._pdf_converter_with_ocr import ( # noqa: E402
PdfConverterWithOCR,
)
from markitdown import StreamInfo # noqa: E402
TEST_DATA_DIR = Path(__file__).parent / "ocr_test_data"
_MOCK_TEXT = "MOCK_OCR_TEXT_12345"
_OCR_BLOCK = f"*[Image OCR]\n{_MOCK_TEXT}\n[End OCR]*"
_PAGE_1_SCANNED = f"## Page 1\n\n\n\n\n{_OCR_BLOCK}"
class MockOCRService:
def extract_text(
self, # noqa: ANN101
image_stream: Any,
**kwargs: Any,
) -> OCRResult:
return OCRResult(text=_MOCK_TEXT, backend_used="mock")
@pytest.fixture(scope="module")
def svc() -> MockOCRService:
return MockOCRService()
def _convert(filename: str, ocr_service: MockOCRService) -> str:
path = TEST_DATA_DIR / filename
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = PdfConverterWithOCR()
with open(path, "rb") as f:
return converter.convert(
f, StreamInfo(extension=".pdf"), ocr_service=ocr_service
).text_content
# ---------------------------------------------------------------------------
# pdf_image_start.pdf
# ---------------------------------------------------------------------------
def test_pdf_image_start(svc: MockOCRService) -> None:
expected = (
"## Page 1\n\n\n\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n\n"
"This is text BEFORE the image.\n\n"
"The image should appear above this text.\n\n"
"This is more content after the image."
)
assert _convert("pdf_image_start.pdf", svc) == expected
# ---------------------------------------------------------------------------
# pdf_image_middle.pdf
# ---------------------------------------------------------------------------
def test_pdf_image_middle(svc: MockOCRService) -> None:
expected = (
"## Page 1\n\n\n"
"Section 1: Introduction\n\n"
"This document contains an image in the middle.\n\n"
"Here is some introductory text.\n\n\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n\n"
"Section 2: Details\n\n"
"This text appears AFTER the image."
)
assert _convert("pdf_image_middle.pdf", svc) == expected
# ---------------------------------------------------------------------------
# pdf_image_end.pdf
# ---------------------------------------------------------------------------
def test_pdf_image_end(svc: MockOCRService) -> None:
expected = (
"## Page 1\n\n\n"
"Main Content\n\n"
"This is the main text content.\n\n"
"The image will appear at the end.\n\n"
"Keep reading...\n\n\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("pdf_image_end.pdf", svc) == expected
# ---------------------------------------------------------------------------
# pdf_multiple_images.pdf
# ---------------------------------------------------------------------------
def test_pdf_multiple_images(svc: MockOCRService) -> None:
expected = (
"## Page 1\n\n\n"
"Document with Multiple Images\n\n\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n\n"
"Text between first and second image.\n\n\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n\n"
"Final text after all images."
)
assert _convert("pdf_multiple_images.pdf", svc) == expected
# ---------------------------------------------------------------------------
# pdf_complex_layout.pdf
# ---------------------------------------------------------------------------
def test_pdf_complex_layout(svc: MockOCRService) -> None:
expected = (
"## Page 1\n\n\n"
"Complex Layout Document\n\n"
"Table:\n\n"
"ItemQuantity\n\n\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n\n"
"Widget A5"
)
assert _convert("pdf_complex_layout.pdf", svc) == expected
# ---------------------------------------------------------------------------
# pdf_multipage.pdf — pdfplumber/pdfminer fail (EOF); PyMuPDF fallback used
# ---------------------------------------------------------------------------
def test_pdf_multipage(svc: MockOCRService) -> None:
# pdfplumber cannot open this file (Unexpected EOF), so _ocr_full_pages
# falls back to PyMuPDF for page rendering. Each page becomes one OCR block.
expected = (
f"## Page 1\n\n\n{_OCR_BLOCK}\n\n\n"
f"## Page 2\n\n\n{_OCR_BLOCK}\n\n\n"
f"## Page 3\n\n\n{_OCR_BLOCK}"
)
assert _convert("pdf_multipage.pdf", svc) == expected
# ---------------------------------------------------------------------------
# pdf_scanned_*.pdf — raster-only pages → full-page OCR
# ---------------------------------------------------------------------------
def test_pdf_scanned_invoice(svc: MockOCRService) -> None:
assert _convert("pdf_scanned_invoice.pdf", svc) == _PAGE_1_SCANNED
def test_pdf_scanned_meeting_minutes(svc: MockOCRService) -> None:
assert _convert("pdf_scanned_meeting_minutes.pdf", svc) == _PAGE_1_SCANNED
def test_pdf_scanned_minimal(svc: MockOCRService) -> None:
assert _convert("pdf_scanned_minimal.pdf", svc) == _PAGE_1_SCANNED
def test_pdf_scanned_sales_report(svc: MockOCRService) -> None:
assert _convert("pdf_scanned_sales_report.pdf", svc) == _PAGE_1_SCANNED
def test_pdf_scanned_report(svc: MockOCRService) -> None:
expected = (
f"{_PAGE_1_SCANNED}\n\n\n\n"
f"## Page 2\n\n\n\n\n{_OCR_BLOCK}\n\n\n\n"
f"## Page 3\n\n\n\n\n{_OCR_BLOCK}"
)
assert _convert("pdf_scanned_report.pdf", svc) == expected
# ---------------------------------------------------------------------------
# Scanned PDF fallback path (pdfplumber finds no text → full-page OCR)
# ---------------------------------------------------------------------------
def test_pdf_scanned_fallback_format(svc: MockOCRService) -> None:
"""_ocr_full_pages emits *[Image OCR]...[End OCR]* for each page."""
path = TEST_DATA_DIR / "pdf_image_start.pdf"
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = PdfConverterWithOCR()
with patch("pdfplumber.open") as mock_plumber:
mock_pdf = MagicMock()
mock_page = MagicMock()
mock_page.page_number = 1
mock_pdf.pages = [mock_page]
mock_pdf.__enter__.return_value = mock_pdf
mock_plumber.return_value = mock_pdf
with open(path, "rb") as f:
md = converter._ocr_full_pages(io.BytesIO(f.read()), svc)
expected = f"## Page 1\n\n\n{_OCR_BLOCK}"
assert (
md == expected
), f"_ocr_full_pages must produce:\n{expected!r}\nActual:\n{md!r}"
# ---------------------------------------------------------------------------
# No OCR service — no OCR tags emitted
# ---------------------------------------------------------------------------
def test_pdf_no_ocr_service_no_tags() -> None:
path = TEST_DATA_DIR / "pdf_image_middle.pdf"
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = PdfConverterWithOCR()
with open(path, "rb") as f:
md = converter.convert(f, StreamInfo(extension=".pdf")).text_content
assert "*[Image OCR]" not in md
assert "[End OCR]*" not in md
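The scanned-PDF fallback exercised above hinges on detecting pages with no extractable text. A minimal sketch of that decision, assuming the converter keys off whitespace-only extraction results; `needs_full_page_ocr` is a hypothetical helper, the real logic lives inside `_ocr_full_pages`:

```python
# Hypothetical helper sketching the fallback decision; not part of the
# package API.
def needs_full_page_ocr(page_texts: list) -> bool:
    """Return True when no page yields extractable text (raster-only PDF)."""
    return all(not text.strip() for text in page_texts)


# A scanned PDF extracts as empty or whitespace-only pages:
assert needs_full_page_ocr(["", "  \n"]) is True
# A born-digital PDF has at least one page with real text:
assert needs_full_page_ocr(["Hello", ""]) is False
```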
@@ -0,0 +1,148 @@
"""
Unit tests for PptxConverterWithOCR.
For each PPTX test file: convert with a mock OCR service, then compare the
full output string against the expected snapshot.
OCR block format used by the converter:
*[Image OCR]
MOCK_OCR_TEXT_12345
[End OCR]*
Note: PPTX slide text uses literal backslash-n (\\n) sequences from the
underlying PPTX converter template; OCR blocks use real newlines.
"""
import sys
from pathlib import Path
from typing import Any
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from markitdown_ocr._ocr_service import OCRResult # noqa: E402
from markitdown_ocr._pptx_converter_with_ocr import ( # noqa: E402
PptxConverterWithOCR,
)
from markitdown import StreamInfo # noqa: E402
TEST_DATA_DIR = Path(__file__).parent / "ocr_test_data"
_MOCK_TEXT = "MOCK_OCR_TEXT_12345"
_OCR_BLOCK = f"*[Image OCR]\n{_MOCK_TEXT}\n[End OCR]*"
class MockOCRService:
def extract_text(
self, # noqa: ANN101
image_stream: Any,
**kwargs: Any,
) -> OCRResult:
return OCRResult(text=_MOCK_TEXT, backend_used="mock")
@pytest.fixture(scope="module")
def svc() -> MockOCRService:
return MockOCRService()
def _convert(filename: str, ocr_service: MockOCRService) -> str:
path = TEST_DATA_DIR / filename
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = PptxConverterWithOCR()
with open(path, "rb") as f:
return converter.convert(
f, StreamInfo(extension=".pptx"), ocr_service=ocr_service
).text_content
# ---------------------------------------------------------------------------
# pptx_image_start.pptx
# ---------------------------------------------------------------------------
def test_pptx_image_start(svc: MockOCRService) -> None:
# Slide 1: title "Welcome" followed by an image
expected = (
"\\n\\n<!-- Slide number: 1 -->\\n# Welcome\\n\\n"
"\n*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("pptx_image_start.pptx", svc) == expected
# ---------------------------------------------------------------------------
# pptx_image_middle.pptx
# ---------------------------------------------------------------------------
def test_pptx_image_middle(svc: MockOCRService) -> None:
# Slide 1: Introduction | Slide 2: Architecture + image | Slide 3: Conclusion # noqa: E501
expected = (
"\\n\\n<!-- Slide number: 1 -->\\n# Introduction"
"\\n\\n\\n\\n<!-- Slide number: 2 -->\\n# Architecture\\n\\n"
"\n*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
"\\n\\n<!-- Slide number: 3 -->\\n# Conclusion\\n\\n"
)
assert _convert("pptx_image_middle.pptx", svc) == expected
# ---------------------------------------------------------------------------
# pptx_image_end.pptx
# ---------------------------------------------------------------------------
def test_pptx_image_end(svc: MockOCRService) -> None:
# Slide 1: Presentation | Slide 2: Thank You + image
expected = (
"\\n\\n<!-- Slide number: 1 -->\\n# Presentation"
"\\n\\n\\n\\n<!-- Slide number: 2 -->\\n# Thank You\\n\\n"
"\n*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("pptx_image_end.pptx", svc) == expected
# ---------------------------------------------------------------------------
# pptx_multiple_images.pptx
# ---------------------------------------------------------------------------
def test_pptx_multiple_images(svc: MockOCRService) -> None:
# Slide 1: two images, no title text
expected = (
"\\n\\n<!-- Slide number: 1 -->\\n# \\n"
"\n*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
"\n\n*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("pptx_multiple_images.pptx", svc) == expected
# ---------------------------------------------------------------------------
# pptx_complex_layout.pptx
# ---------------------------------------------------------------------------
def test_pptx_complex_layout(svc: MockOCRService) -> None:
expected = (
"\\n\\n<!-- Slide number: 1 -->\\n# Product Comparison"
"\\n\\nOur products lead the market\\n"
"\n*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("pptx_complex_layout.pptx", svc) == expected
# ---------------------------------------------------------------------------
# No OCR service — no OCR tags emitted
# ---------------------------------------------------------------------------
def test_pptx_no_ocr_service_no_tags() -> None:
path = TEST_DATA_DIR / "pptx_image_middle.pptx"
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = PptxConverterWithOCR()
with open(path, "rb") as f:
md = converter.convert(f, StreamInfo(extension=".pptx")).text_content
assert "*[Image OCR]" not in md
assert "[End OCR]*" not in md
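Every snapshot in this file embeds the same OCR block shape documented in the module docstring. A sketch of that format as a function; `format_ocr_block` is a hypothetical helper, not part of the package API:

```python
# Hypothetical helper mirroring the OCR block format these snapshots assert.
def format_ocr_block(text: str) -> str:
    return f"*[Image OCR]\n{text}\n[End OCR]*"


assert format_ocr_block("MOCK_OCR_TEXT_12345") == (
    "*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
```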
@@ -0,0 +1,249 @@
"""
Unit tests for XlsxConverterWithOCR.
For each XLSX test file: convert with a mock OCR service, then compare the
full output string against the expected snapshot.
OCR block format used by the converter:
*[Image OCR]
MOCK_OCR_TEXT_12345
[End OCR]*
Images are grouped at the end of each sheet under:
### Images in this sheet:
"""
import sys
from pathlib import Path
from typing import Any
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from markitdown_ocr._ocr_service import OCRResult # noqa: E402
from markitdown_ocr._xlsx_converter_with_ocr import ( # noqa: E402
XlsxConverterWithOCR,
)
from markitdown import StreamInfo # noqa: E402
TEST_DATA_DIR = Path(__file__).parent / "ocr_test_data"
_MOCK_TEXT = "MOCK_OCR_TEXT_12345"
_OCR_BLOCK = f"*[Image OCR]\n{_MOCK_TEXT}\n[End OCR]*"
_IMG_SECTION = "### Images in this sheet:"
class MockOCRService:
def extract_text(
self, # noqa: ANN101
image_stream: Any,
**kwargs: Any,
) -> OCRResult:
return OCRResult(text=_MOCK_TEXT, backend_used="mock")
@pytest.fixture(scope="module")
def svc() -> MockOCRService:
return MockOCRService()
def _convert(filename: str, ocr_service: MockOCRService) -> str:
path = TEST_DATA_DIR / filename
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = XlsxConverterWithOCR()
with open(path, "rb") as f:
return converter.convert(
f, StreamInfo(extension=".xlsx"), ocr_service=ocr_service
).text_content
# ---------------------------------------------------------------------------
# xlsx_image_start.xlsx
# ---------------------------------------------------------------------------
def test_xlsx_image_start(svc: MockOCRService) -> None:
expected = (
"## Sales Q1\n\n"
"| Product | Sales |\n"
"| --- | --- |\n"
"| Widget A | 100 |\n"
"| Widget B | 150 |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Forecast Q2\n\n"
"| Projected Sales | Unnamed: 1 |\n"
"| --- | --- |\n"
"| Widget A | 120 |\n"
"| Widget B | 180 |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("xlsx_image_start.xlsx", svc) == expected
# ---------------------------------------------------------------------------
# xlsx_image_middle.xlsx
# ---------------------------------------------------------------------------
def test_xlsx_image_middle(svc: MockOCRService) -> None:
expected = (
"## Revenue\n\n"
"| Q1 Report | Unnamed: 1 |\n"
"| --- | --- |\n"
"| NaN | NaN |\n"
"| Revenue | $50,000 |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| Profit Margin | 40% |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Expenses\n\n"
"| Expense Breakdown | Unnamed: 1 |\n"
"| --- | --- |\n"
"| NaN | NaN |\n"
"| Expenses | $30,000 |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| Savings | $5,000 |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("xlsx_image_middle.xlsx", svc) == expected
# ---------------------------------------------------------------------------
# xlsx_image_end.xlsx
# ---------------------------------------------------------------------------
def test_xlsx_image_end(svc: MockOCRService) -> None:
expected = (
"## Sheet\n\n"
"| Financial Summary | Unnamed: 1 |\n"
"| --- | --- |\n"
"| Total Revenue | $500,000 |\n"
"| Total Expenses | $300,000 |\n"
"| Net Profit | $200,000 |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| Signature: | NaN |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Budget\n\n"
"| Budget Allocation | Unnamed: 1 |\n"
"| --- | --- |\n"
"| Marketing | $100,000 |\n"
"| R&D | $150,000 |\n"
"| Operations | $50,000 |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| NaN | NaN |\n"
"| Approved: | NaN |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("xlsx_image_end.xlsx", svc) == expected
# ---------------------------------------------------------------------------
# xlsx_multiple_images.xlsx
# ---------------------------------------------------------------------------
def test_xlsx_multiple_images(svc: MockOCRService) -> None:
expected = (
"## Overview\n\n"
"| Dashboard |\n"
"| --- |\n"
"| Status: Active |\n"
"| NaN |\n"
"| NaN |\n"
"| NaN |\n"
"| NaN |\n"
"| Performance Summary |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Details\n\n"
"| Detailed Metrics |\n"
"| --- |\n"
"| System Health |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Summary\n\n"
"| Quarter Summary |\n"
"| --- |\n"
"| Overall Performance |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("xlsx_multiple_images.xlsx", svc) == expected
# ---------------------------------------------------------------------------
# xlsx_complex_layout.xlsx
# ---------------------------------------------------------------------------
def test_xlsx_complex_layout(svc: MockOCRService) -> None:
expected = (
"## Complex Report\n\n"
"| Annual Report 2024 | Unnamed: 1 |\n"
"| --- | --- |\n"
"| NaN | NaN |\n"
"| Month | Sales |\n"
"| Jan | 1000 |\n"
"| Feb | 1200 |\n"
"| NaN | NaN |\n"
"| Total | 2200 |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Customers\n\n"
"| Customer Metrics | Unnamed: 1 |\n"
"| --- | --- |\n"
"| NaN | NaN |\n"
"| New Customers | 250 |\n"
"| Retention Rate | 92% |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*\n\n"
"## Regions\n\n"
"| Regional Breakdown | Unnamed: 1 |\n"
"| --- | --- |\n"
"| NaN | NaN |\n"
"| Region | Revenue |\n"
"| North | $800K |\n"
"| South | $600K |\n\n"
"### Images in this sheet:\n\n"
"*[Image OCR]\nMOCK_OCR_TEXT_12345\n[End OCR]*"
)
assert _convert("xlsx_complex_layout.xlsx", svc) == expected
# ---------------------------------------------------------------------------
# No OCR service — no OCR tags emitted
# ---------------------------------------------------------------------------
def test_xlsx_no_ocr_service_no_tags() -> None:
path = TEST_DATA_DIR / "xlsx_image_middle.xlsx"
if not path.exists():
pytest.skip(f"Test file not found: {path}")
converter = XlsxConverterWithOCR()
with open(path, "rb") as f:
md = converter.convert(f, StreamInfo(extension=".xlsx")).text_content
assert "*[Image OCR]" not in md
assert "[End OCR]*" not in md
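The three `*_no_ocr_service_no_tags` tests (PDF, PPTX, XLSX) assert the same pair of negatives. A small shared helper would keep them in sync; this is a hypothetical consolidation sketch, not code in the package (the real tests stay separate so optional-dependency skips remain per-format):

```python
# Hypothetical shared assertion; not part of the test suite as written.
def assert_no_ocr_tags(markdown: str) -> None:
    """Fail if any OCR delimiter leaked into the converted output."""
    assert "*[Image OCR]" not in markdown, "unexpected OCR open tag"
    assert "[End OCR]*" not in markdown, "unexpected OCR close tag"


# Passes on plain converter output:
assert_no_ocr_tags("## Sheet\n\n| A | B |")
```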
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
-__version__ = "0.1.5"
+__version__ = "0.1.6b1"