Compare commits
49 Commits
| SHA1 |
|---|
| 7fdaefb724 |
| 251dddcf0c |
| dde250a456 |
| 3d4fe3cdcc |
| 447c047731 |
| 8a9d8f1593 |
| 17365654c9 |
| 59eb60f8cb |
| 459d462f29 |
| c3f6cb356c |
| 0c4d3945a0 |
| f8b60b5403 |
| 16ca285d30 |
| b81a387616 |
| ea1a3dfb60 |
| b6e5da8874 |
| fb1ad24833 |
| 1178c2e211 |
| 9278119bb3 |
| da7bcea527 |
| 3bfb821c09 |
| 62b72284fe |
| 1dd3c83339 |
| 9dc982a3b1 |
| effde4767b |
| 04bf831209 |
| 9fd680c366 |
| 38261fd31c |
| 131f0c7739 |
| 56f7579ce2 |
| cb421cf9ea |
| 39e7252940 |
| bbcf876b18 |
| 041be54471 |
| ebe2684b3d |
| 8576f1d915 |
| 3fcd48cdfc |
| 9e067c42b6 |
| 9a951055f0 |
| 73b9d57312 |
| 3ca57986ef |
| c1f9a323ee |
| e928b43afb |
| 2ffe6ea591 |
| efc55b260d |
| 52432bd228 |
| c0a511ecff |
| cd6aa41361 |
| 716f74dcb9 |
@@ -5,7 +5,7 @@ jobs:
   pre-commit:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
@@ -5,7 +5,7 @@ jobs:
   tests:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - uses: actions/setup-python@v5
         with:
           python-version: |
@@ -52,6 +52,7 @@ coverage.xml
 .hypothesis/
 .pytest_cache/
 cover/
+.test-logs/
 
 # Translations
 *.mo
@@ -164,3 +165,4 @@ cython_debug/
 #.idea/
 src/.DS_Store
 .DS_Store
+.cursorrules
@@ -4,14 +4,18 @@
|
||||

|
||||
[](https://github.com/microsoft/autogen)
|
||||
|
||||
> [!TIP]
|
||||
> MarkItDown now offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See [markitdown-mcp](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp) for more information.
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Breaking changes from 0.0.1 to 0.1.0:
|
||||
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]~=0.1.0a1'` to have backward-compatible behavior.
|
||||
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
|
||||
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, which also accepted text file-like objects, like io.StringIO.
|
||||
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
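Because `convert_stream()` no longer accepts text streams, text held in something like `io.StringIO` needs to be encoded first. The following is a minimal sketch, assuming UTF-8 input; the helper name `to_binary_stream` is hypothetical and is not part of MarkItDown.

```python
import io


def to_binary_stream(text_stream: io.TextIOBase, encoding: str = "utf-8") -> io.BytesIO:
    """Read a text file-like object and return its contents as a binary stream."""
    return io.BytesIO(text_stream.read().encode(encoding))


text_stream = io.StringIO("# Title\n\nSome text.")
binary_stream = to_binary_stream(text_stream)

# The binary stream could then be handed to MarkItDown, for example:
#   MarkItDown().convert_stream(binary_stream)
# That call is left as a comment so this sketch runs without markitdown installed.
print(binary_stream.getvalue())
```

Files opened with `open(path, "rb")` are already binary and need no conversion.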
|
||||
|
||||
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.). While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
|
||||
|
||||
At present, MarkItDown supports:
|
||||
MarkItDown currently supports the conversion from:
|
||||
|
||||
- PDF
|
||||
- PowerPoint
|
||||
@@ -35,14 +39,39 @@ responses unprompted. This suggests that they have been trained on vast amounts
|
||||
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
|
||||
are also highly token-efficient.
|
||||
|
||||
## Prerequisites
|
||||
MarkItDown requires Python 3.10 or higher. It is recommended to use a virtual environment to avoid dependency conflicts.
|
||||
|
||||
With the standard Python installation, you can create and activate a virtual environment using the following commands:
|
||||
|
||||
```bash
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate
|
||||
```
|
||||
|
||||
If using `uv`, you can create a virtual environment with:
|
||||
|
||||
```bash
|
||||
uv venv --python=3.12 .venv
|
||||
source .venv/bin/activate
|
||||
# NOTE: Be sure to use 'uv pip install' rather than just 'pip install' to install packages in this virtual environment
|
||||
```
|
||||
|
||||
If you are using Anaconda, you can create a virtual environment with:
|
||||
|
||||
```bash
|
||||
conda create -n markitdown python=3.12
|
||||
conda activate markitdown
|
||||
```
|
||||
|
||||
## Installation
|
||||
|
||||
To install MarkItDown, use pip: `pip install 'markitdown[all]~=0.1.0a1'`. Alternatively, you can install it from the source:
|
||||
To install MarkItDown, use pip: `pip install 'markitdown[all]'`. Alternatively, you can install it from the source:
|
||||
|
||||
```bash
|
||||
git clone git@github.com:microsoft/markitdown.git
|
||||
cd markitdown
|
||||
pip install -e packages/markitdown[all]
|
||||
pip install -e 'packages/markitdown[all]'
|
||||
```
|
||||
|
||||
## Usage
|
||||
@@ -69,7 +98,7 @@ cat path-to-file.pdf | markitdown
|
||||
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:
|
||||
|
||||
```bash
|
||||
pip install markitdown[pdf, docx, pptx]
|
||||
pip install 'markitdown[pdf, docx, pptx]'
|
||||
```
|
||||
|
||||
will install only the dependencies for PDF, DOCX, and PPTX files.
|
||||
@@ -135,14 +164,14 @@ result = md.convert("test.pdf")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
|
||||
To use Large Language Models for image descriptions (currently only for pptx and image files), provide `llm_client` and `llm_model`:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI()
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
|
||||
result = md.convert("example.jpg")
|
||||
print(result.text_content)
|
||||
```
|
||||
@@ -170,7 +199,7 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
|
||||
|
||||
### How to Contribute
|
||||
|
||||
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
|
||||
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.
|
||||
|
||||
<div align="center">
|
||||
|
||||
|
||||
@@ -0,0 +1,28 @@
|
||||
FROM python:3.13-slim-bullseye
|
||||
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
ENV EXIFTOOL_PATH=/usr/bin/exiftool
|
||||
ENV FFMPEG_PATH=/usr/bin/ffmpeg
|
||||
ENV MARKITDOWN_ENABLE_PLUGINS=True
|
||||
|
||||
# Runtime dependency
|
||||
# NOTE: Add any additional MarkItDown plugins here
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
ffmpeg \
|
||||
exiftool
|
||||
|
||||
# Cleanup
|
||||
RUN rm -rf /var/lib/apt/lists/*
|
||||
|
||||
COPY . /app
|
||||
RUN pip --no-cache-dir install /app
|
||||
|
||||
WORKDIR /workdir
|
||||
|
||||
# Default USERID and GROUPID
|
||||
ARG USERID=nobody
|
||||
ARG GROUPID=nogroup
|
||||
|
||||
USER $USERID:$GROUPID
|
||||
|
||||
ENTRYPOINT [ "markitdown-mcp" ]
|
||||
@@ -0,0 +1,138 @@
|
||||
# MarkItDown-MCP
|
||||
|
||||
[](https://pypi.org/project/markitdown-mcp/)
|
||||

|
||||
[](https://github.com/microsoft/autogen)
|
||||
|
||||
The `markitdown-mcp` package provides a lightweight STDIO, Streamable HTTP, and SSE MCP server for calling MarkItDown.
|
||||
|
||||
It exposes one tool: `convert_to_markdown(uri)`, where uri can be any `http:`, `https:`, `file:`, or `data:` URI.
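To illustrate the URI forms the tool accepts, the sketch below builds a `file:` URI and a `data:` URI using only the Python standard library. The helper names and the example path are hypothetical, and the MCP server itself is not called.

```python
import base64
from urllib.parse import quote


def file_uri(absolute_path: str) -> str:
    """Build a file: URI from an absolute POSIX path, percent-encoding as needed."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return "file://" + quote(absolute_path)


def data_uri(payload: bytes, mime_type: str = "text/plain") -> str:
    """Build a base64-encoded data: URI for a small in-memory payload."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


print(file_uri("/workdir/example.txt"))
print(data_uri(b"Hello, MarkItDown!"))
```

Either string could then be passed as the `uri` argument of `convert_to_markdown`.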
|
||||
|
||||
## Installation
|
||||
|
||||
To install the package, use pip:
|
||||
|
||||
```bash
|
||||
pip install markitdown-mcp
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
To run the MCP server using STDIO (the default), use the following command:
|
||||
|
||||
|
||||
```bash
|
||||
markitdown-mcp
|
||||
```
|
||||
|
||||
To run the MCP server using Streamable HTTP and SSE, use the following command:
|
||||
|
||||
```bash
|
||||
markitdown-mcp --http --host 127.0.0.1 --port 3001
|
||||
```
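The flags in the command above resolve roughly as sketched below. This is a standalone approximation modelled on the argument handling that appears later in this diff (defaults `127.0.0.1` and `3001`). The function name `resolve_transport` is hypothetical, and no server is started.

```python
import argparse
from typing import Optional, Sequence, Tuple


def resolve_transport(argv: Sequence[str]) -> Tuple[str, Optional[str], Optional[int]]:
    """Return (transport, host, port) for a markitdown-mcp style command line."""
    parser = argparse.ArgumentParser(description="Sketch of the CLI flags")
    parser.add_argument("--http", action="store_true")
    parser.add_argument("--sse", action="store_true")  # deprecated alias for --http
    parser.add_argument("--host", default=None)
    parser.add_argument("--port", type=int, default=None)
    args = parser.parse_args(list(argv))

    use_http = args.http or args.sse
    if not use_http and (args.host or args.port):
        # parser.error() prints a usage message and exits with status 2.
        parser.error("--host and --port are only valid together with --http")

    if use_http:
        return ("http", args.host or "127.0.0.1", args.port or 3001)
    return ("stdio", None, None)


print(resolve_transport(["--http"]))
print(resolve_transport(["--http", "--host", "127.0.0.1", "--port", "3001"]))
print(resolve_transport([]))
```

Passing `--host` or `--port` without `--http` ends the process with a usage error instead of silently ignoring the values.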
|
||||
|
||||
## Running in Docker
|
||||
|
||||
To run `markitdown-mcp` in Docker, build the Docker image using the provided Dockerfile:
|
||||
```bash
|
||||
docker build -t markitdown-mcp:latest .
|
||||
```
|
||||
|
||||
And run it using:
|
||||
```bash
|
||||
docker run -it --rm markitdown-mcp:latest
|
||||
```
|
||||
This will be sufficient for remote URIs. To access local files, you need to mount the local directory into the container. For example, if you want to access files in `/home/user/data`, you can run:
|
||||
|
||||
```bash
|
||||
docker run -it --rm -v /home/user/data:/workdir markitdown-mcp:latest
|
||||
```
|
||||
|
||||
Once mounted, all files under `/home/user/data` will be accessible under `/workdir` in the container. For example, if you have a file `example.txt` in `/home/user/data`, it will be accessible in the container at `/workdir/example.txt`.
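To make the path mapping concrete, the sketch below translates a host path under the mounted directory into the path, and the corresponding `file:` URI, seen inside the container. The helper name `container_path` is hypothetical; the directories are the ones used in the example above.

```python
from pathlib import PurePosixPath


def container_path(
    host_path: str,
    host_dir: str = "/home/user/data",
    mount_point: str = "/workdir",
) -> str:
    """Map a host path under host_dir to its location under mount_point."""
    # relative_to() raises ValueError if host_path is not inside host_dir.
    relative = PurePosixPath(host_path).relative_to(host_dir)
    return str(PurePosixPath(mount_point) / relative)


path_in_container = container_path("/home/user/data/example.txt")
print(path_in_container)
print("file://" + path_in_container)
```

Paths outside the mounted directory raise `ValueError`, which mirrors the fact that the container cannot see them.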
|
||||
|
||||
## Accessing from Claude Desktop
|
||||
|
||||
It is recommended to use the Docker image when running the MCP server for Claude Desktop.
|
||||
|
||||
Follow [these instructions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.
|
||||
|
||||
Edit it to include the following JSON entry:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"markitdown": {
|
||||
"command": "docker",
|
||||
"args": [
|
||||
"run",
|
||||
"--rm",
|
||||
"-i",
|
||||
"markitdown-mcp:latest"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
If you want to mount a directory, adjust it accordingly:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"markitdown": {
|
||||
"command": "docker",
|
||||
"args": [
|
||||
"run",
|
||||
"--rm",
|
||||
"-i",
|
||||
"-v",
|
||||
"/home/user/data:/workdir",
|
||||
"markitdown-mcp:latest"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
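If you would rather generate this entry than edit it by hand, a sketch along the following lines may help. The function name `docker_server_entry` is hypothetical; it only builds the JSON structure shown above and does not locate or write `claude_desktop_config.json`.

```python
import json
from typing import Optional


def docker_server_entry(
    image: str = "markitdown-mcp:latest",
    host_dir: Optional[str] = None,
) -> dict:
    """Build the mcpServers entry, optionally mounting host_dir at /workdir."""
    args = ["run", "--rm", "-i"]
    if host_dir is not None:
        args += ["-v", f"{host_dir}:/workdir"]
    args.append(image)
    return {"mcpServers": {"markitdown": {"command": "docker", "args": args}}}


print(json.dumps(docker_server_entry(host_dir="/home/user/data"), indent=2))
```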
|
||||
|
||||
## Debugging
|
||||
|
||||
To debug the MCP server you can use the `mcpinspector` tool.
|
||||
|
||||
```bash
|
||||
npx @modelcontextprotocol/inspector
|
||||
```
|
||||
|
||||
You can then connect to the inspector through the specified host and port (e.g., `http://localhost:5173/`).
|
||||
|
||||
If using STDIO:
|
||||
* select `STDIO` as the transport type,
|
||||
* input `markitdown-mcp` as the command, and
|
||||
* click `Connect`
|
||||
|
||||
If using Streamable HTTP:
|
||||
* select `Streamable HTTP` as the transport type,
|
||||
* input `http://127.0.0.1:3001/mcp` as the URL, and
|
||||
* click `Connect`
|
||||
|
||||
If using SSE:
|
||||
* select `SSE` as the transport type,
|
||||
* input `http://127.0.0.1:3001/sse` as the URL, and
|
||||
* click `Connect`
|
||||
|
||||
Finally:
|
||||
* click the `Tools` tab,
|
||||
* click `List Tools`,
|
||||
* click `convert_to_markdown`, and
|
||||
* run the tool on any valid URI.
|
||||
|
||||
## Security Considerations
|
||||
|
||||
The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
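As one way to act on this recommendation, a wrapper script could refuse to start the server on a non-loopback address. The check below is a minimal sketch using the standard `ipaddress` module; the helper name `is_loopback_host` is hypothetical and is not part of `markitdown-mcp`.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """Return True if host is 'localhost' or a loopback IP address literal."""
    if host.strip().lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name); treat it as non-loopback.
        return False


for candidate in ("127.0.0.1", "::1", "0.0.0.0", "example.com"):
    print(candidate, is_loopback_host(candidate))
```

Binding to `0.0.0.0` exposes the unauthenticated server on every network interface, so the sketch treats it as unsafe.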
|
||||
|
||||
## Trademarks
|
||||
|
||||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
|
||||
trademarks or logos is subject to and must follow
|
||||
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
|
||||
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
|
||||
Any use of third-party trademarks or logos are subject to those third-party's policies.
|
||||
@@ -0,0 +1,69 @@
|
||||
[build-system]
|
||||
requires = ["hatchling"]
|
||||
build-backend = "hatchling.build"
|
||||
|
||||
[project]
|
||||
name = "markitdown-mcp"
|
||||
dynamic = ["version"]
|
||||
description = 'An MCP server for the "markitdown" library.'
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.10"
|
||||
license = "MIT"
|
||||
keywords = []
|
||||
authors = [
|
||||
{ name = "Adam Fourney", email = "adamfo@microsoft.com" },
|
||||
]
|
||||
classifiers = [
|
||||
"Development Status :: 4 - Beta",
|
||||
"Programming Language :: Python",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
"Programming Language :: Python :: 3.11",
|
||||
"Programming Language :: Python :: 3.12",
|
||||
"Programming Language :: Python :: 3.13",
|
||||
"Programming Language :: Python :: Implementation :: CPython",
|
||||
"Programming Language :: Python :: Implementation :: PyPy",
|
||||
]
|
||||
dependencies = [
|
||||
"mcp~=1.8.0",
|
||||
"markitdown[all]>=0.1.1,<0.2.0",
|
||||
]
|
||||
|
||||
[project.urls]
|
||||
Documentation = "https://github.com/microsoft/markitdown#readme"
|
||||
Issues = "https://github.com/microsoft/markitdown/issues"
|
||||
Source = "https://github.com/microsoft/markitdown"
|
||||
|
||||
[tool.hatch.version]
|
||||
path = "src/markitdown_mcp/__about__.py"
|
||||
|
||||
[project.scripts]
|
||||
markitdown-mcp = "markitdown_mcp.__main__:main"
|
||||
|
||||
[tool.hatch.envs.types]
|
||||
extra-dependencies = [
|
||||
"mypy>=1.0.0",
|
||||
]
|
||||
[tool.hatch.envs.types.scripts]
|
||||
check = "mypy --install-types --non-interactive {args:src/markitdown_mcp tests}"
|
||||
|
||||
[tool.coverage.run]
|
||||
source_pkgs = ["markitdown-mcp", "tests"]
|
||||
branch = true
|
||||
parallel = true
|
||||
omit = [
|
||||
"src/markitdown_mcp/__about__.py",
|
||||
]
|
||||
|
||||
[tool.coverage.paths]
|
||||
markitdown-mcp = ["src/markitdown_mcp", "*/markitdown-mcp/src/markitdown_mcp"]
|
||||
tests = ["tests", "*/markitdown-mcp/tests"]
|
||||
|
||||
[tool.coverage.report]
|
||||
exclude_lines = [
|
||||
"no cov",
|
||||
"if __name__ == .__main__.:",
|
||||
"if TYPE_CHECKING:",
|
||||
]
|
||||
|
||||
[tool.hatch.build.targets.sdist]
|
||||
only-include = ["src/markitdown_mcp"]
|
||||
@@ -0,0 +1,4 @@
|
||||
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
|
||||
#
|
||||
# SPDX-License-Identifier: MIT
|
||||
__version__ = "0.0.1a4"
|
||||
@@ -0,0 +1,9 @@
|
||||
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
|
||||
#
|
||||
# SPDX-License-Identifier: MIT
|
||||
|
||||
from .__about__ import __version__
|
||||
|
||||
__all__ = [
|
||||
"__version__",
|
||||
]
|
||||
@@ -0,0 +1,127 @@
|
||||
import contextlib
|
||||
import sys
|
||||
import os
|
||||
from collections.abc import AsyncIterator
|
||||
from mcp.server.fastmcp import FastMCP
|
||||
from starlette.applications import Starlette
|
||||
from mcp.server.sse import SseServerTransport
|
||||
from starlette.requests import Request
|
||||
from starlette.routing import Mount, Route
|
||||
from starlette.types import Receive, Scope, Send
|
||||
from mcp.server import Server
|
||||
from mcp.server.streamable_http_manager import StreamableHTTPSessionManager
|
||||
from markitdown import MarkItDown
|
||||
import uvicorn
|
||||
|
||||
# Initialize FastMCP server for MarkItDown (SSE)
|
||||
mcp = FastMCP("markitdown")
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def convert_to_markdown(uri: str) -> str:
|
||||
"""Convert a resource described by an http:, https:, file: or data: URI to markdown"""
|
||||
return MarkItDown(enable_plugins=check_plugins_enabled()).convert_uri(uri).markdown
|
||||
|
||||
|
||||
def check_plugins_enabled() -> bool:
|
||||
return os.getenv("MARKITDOWN_ENABLE_PLUGINS", "false").strip().lower() in (
|
||||
"true",
|
||||
"1",
|
||||
"yes",
|
||||
)
|
||||
|
||||
|
||||
def create_starlette_app(mcp_server: Server, *, debug: bool = False) -> Starlette:
|
||||
sse = SseServerTransport("/messages/")
|
||||
session_manager = StreamableHTTPSessionManager(
|
||||
app=mcp_server,
|
||||
event_store=None,
|
||||
json_response=True,
|
||||
stateless=True,
|
||||
)
|
||||
|
||||
async def handle_sse(request: Request) -> None:
|
||||
async with sse.connect_sse(
|
||||
request.scope,
|
||||
request.receive,
|
||||
request._send,
|
||||
) as (read_stream, write_stream):
|
||||
await mcp_server.run(
|
||||
read_stream,
|
||||
write_stream,
|
||||
mcp_server.create_initialization_options(),
|
||||
)
|
||||
|
||||
async def handle_streamable_http(
|
||||
scope: Scope, receive: Receive, send: Send
|
||||
) -> None:
|
||||
await session_manager.handle_request(scope, receive, send)
|
||||
|
||||
@contextlib.asynccontextmanager
|
||||
async def lifespan(app: Starlette) -> AsyncIterator[None]:
|
||||
"""Context manager for session manager."""
|
||||
async with session_manager.run():
|
||||
print("Application started with StreamableHTTP session manager!")
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
print("Application shutting down...")
|
||||
|
||||
return Starlette(
|
||||
debug=debug,
|
||||
routes=[
|
||||
Route("/sse", endpoint=handle_sse),
|
||||
Mount("/mcp", app=handle_streamable_http),
|
||||
Mount("/messages/", app=sse.handle_post_message),
|
||||
],
|
||||
lifespan=lifespan,
|
||||
)
|
||||
|
||||
|
||||
# Main entry point
|
||||
def main():
|
||||
import argparse
|
||||
|
||||
mcp_server = mcp._mcp_server
|
||||
|
||||
parser = argparse.ArgumentParser(description="Run a MarkItDown MCP server")
|
||||
|
||||
parser.add_argument(
|
||||
"--http",
|
||||
action="store_true",
|
||||
help="Run the server with Streamable HTTP and SSE transport rather than STDIO (default: False)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--sse",
|
||||
action="store_true",
|
||||
help="(Deprecated) An alias for --http (default: False)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--host", default=None, help="Host to bind to (default: 127.0.0.1)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--port", type=int, default=None, help="Port to listen on (default: 3001)"
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
use_http = args.http or args.sse
|
||||
|
||||
if not use_http and (args.host or args.port):
|
||||
parser.error(
|
||||
"Host and port arguments are only valid when using streamable HTTP or SSE transport (see: --http)."
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
if use_http:
|
||||
starlette_app = create_starlette_app(mcp_server, debug=True)
|
||||
uvicorn.run(
|
||||
starlette_app,
|
||||
host=args.host if args.host else "127.0.0.1",
|
||||
port=args.port if args.port else 3001,
|
||||
)
|
||||
else:
|
||||
mcp.run()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
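The `check_plugins_enabled()` function in the file above accepts only a few spellings of "true" for `MARKITDOWN_ENABLE_PLUGINS`. The standalone sketch below reproduces that parsing against an explicit mapping so the accepted values are easy to see. The function name `plugins_enabled` and its `env` parameter are hypothetical.

```python
import os
from typing import Mapping


def plugins_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only for 'true', '1' or 'yes' (case-insensitive, trimmed)."""
    value = env.get("MARKITDOWN_ENABLE_PLUGINS", "false")
    return value.strip().lower() in ("true", "1", "yes")


# The Dockerfile in this diff sets the variable to "True", which is accepted.
print(plugins_enabled({"MARKITDOWN_ENABLE_PLUGINS": "True"}))
# Values outside the accepted set, such as "on", leave plugins disabled.
print(plugins_enabled({"MARKITDOWN_ENABLE_PLUGINS": "on"}))
```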
|
||||
@@ -0,0 +1,3 @@
|
||||
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
|
||||
#
|
||||
# SPDX-License-Identifier: MIT
|
||||
@@ -1,7 +1,7 @@
|
||||
# MarkItDown Sample Plugin
|
||||
|
||||
[](https://pypi.org/project/markitdown/)
|
||||

|
||||
[](https://pypi.org/project/markitdown-sample-plugin/)
|
||||

|
||||
[](https://github.com/microsoft/autogen)
|
||||
|
||||
|
||||
|
||||
@@ -1,6 +1,5 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import os
|
||||
import pytest
|
||||
|
||||
from markitdown import MarkItDown, StreamInfo
|
||||
from markitdown_sample_plugin import RtfConverter
|
||||
|
||||
@@ -0,0 +1,232 @@
|
||||
# THIRD-PARTY SOFTWARE NOTICES AND INFORMATION
|
||||
|
||||
**Do Not Translate or Localize**
|
||||
|
||||
This project incorporates components from the projects listed below. The original copyright notices and the licenses
|
||||
under which MarkItDown received such components are set forth below. MarkItDown reserves all rights not expressly
|
||||
granted herein, whether by implication, estoppel or otherwise.
|
||||
|
||||
1.dwml (https://github.com/xiilei/dwml)
|
||||
|
||||
dwml NOTICES AND INFORMATION BEGIN HERE
|
||||
|
||||
-----------------------------------------
|
||||
|
||||
NOTE 1: What follows is a verbatim copy of dwml's LICENSE file, as it appeared on March 28th, 2025 - including
|
||||
placeholders for the copyright owner and year.
|
||||
|
||||
NOTE 2: The Apache License, Version 2.0, requires that modifications to the dwml source code be documented.
|
||||
The following section summarizes these changes. The full details are available in the MarkItDown source code
|
||||
repository under PR #1160 (https://github.com/microsoft/markitdown/pull/1160)
|
||||
|
||||
This project incorporates `dwml/latex_dict.py` and `dwml/omml.py` files without any additional logic modifications (which
|
||||
lives in `packages/markitdown/src/markitdown/converter_utils/docx/math` location). However, we have reformatted the code
|
||||
according to `black` code formatter. From `tests/docx.py` file, we have used `DOCXML_ROOT` XML namespaces and the rest of
|
||||
the file is not used.
|
||||
|
||||
-----------------------------------------
|
||||
|
||||
Apache License
|
||||
Version 2.0, January 2004
|
||||
http://www.apache.org/licenses/
|
||||
|
||||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
||||
|
||||
1. Definitions.
|
||||
|
||||
"License" shall mean the terms and conditions for use, reproduction,
|
||||
and distribution as defined by Sections 1 through 9 of this document.
|
||||
|
||||
"Licensor" shall mean the copyright owner or entity authorized by
|
||||
the copyright owner that is granting the License.
|
||||
|
||||
"Legal Entity" shall mean the union of the acting entity and all
|
||||
other entities that control, are controlled by, or are under common
|
||||
control with that entity. For the purposes of this definition,
|
||||
"control" means (i) the power, direct or indirect, to cause the
|
||||
direction or management of such entity, whether by contract or
|
||||
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
||||
outstanding shares, or (iii) beneficial ownership of such entity.
|
||||
|
||||
"You" (or "Your") shall mean an individual or Legal Entity
|
||||
exercising permissions granted by this License.
|
||||
|
||||
"Source" form shall mean the preferred form for making modifications,
|
||||
including but not limited to software source code, documentation
|
||||
source, and configuration files.
|
||||
|
||||
"Object" form shall mean any form resulting from mechanical
|
||||
transformation or translation of a Source form, including but
|
||||
not limited to compiled object code, generated documentation,
|
||||
and conversions to other media types.
|
||||
|
||||
"Work" shall mean the work of authorship, whether in Source or
|
||||
Object form, made available under the License, as indicated by a
|
||||
copyright notice that is included in or attached to the work
|
||||
(an example is provided in the Appendix below).
|
||||
|
||||
"Derivative Works" shall mean any work, whether in Source or Object
|
||||
form, that is based on (or derived from) the Work and for which the
|
||||
editorial revisions, annotations, elaborations, or other modifications
|
||||
represent, as a whole, an original work of authorship. For the purposes
|
||||
of this License, Derivative Works shall not include works that remain
|
||||
separable from, or merely link (or bind by name) to the interfaces of,
|
||||
the Work and Derivative Works thereof.
|
||||
|
||||
"Contribution" shall mean any work of authorship, including
|
||||
the original version of the Work and any modifications or additions
|
||||
to that Work or Derivative Works thereof, that is intentionally
|
||||
submitted to Licensor for inclusion in the Work by the copyright owner
|
||||
or by an individual or Legal Entity authorized to submit on behalf of
|
||||
the copyright owner. For the purposes of this definition, "submitted"
|
||||
means any form of electronic, verbal, or written communication sent
|
||||
to the Licensor or its representatives, including but not limited to
|
||||
communication on electronic mailing lists, source code control systems,
|
||||
and issue tracking systems that are managed by, or on behalf of, the
|
||||
Licensor for the purpose of discussing and improving the Work, but
|
||||
excluding communication that is conspicuously marked or otherwise
|
||||
designated in writing by the copyright owner as "Not a Contribution."
|
||||
|
||||
"Contributor" shall mean Licensor and any individual or Legal Entity
|
||||
on behalf of whom a Contribution has been received by Licensor and
|
||||
subsequently incorporated within the Work.
|
||||
|
||||
2. Grant of Copyright License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
copyright license to reproduce, prepare Derivative Works of,
|
||||
publicly display, publicly perform, sublicense, and distribute the
|
||||
Work and such Derivative Works in Source or Object form.
|
||||
|
||||
3. Grant of Patent License. Subject to the terms and conditions of
|
||||
this License, each Contributor hereby grants to You a perpetual,
|
||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
||||
(except as stated in this section) patent license to make, have made,
|
||||
use, offer to sell, sell, import, and otherwise transfer the Work,
|
||||
where such license applies only to those patent claims licensable
|
||||
by such Contributor that are necessarily infringed by their
|
||||
Contribution(s) alone or by combination of their Contribution(s)
|
||||
with the Work to which such Contribution(s) was submitted. If You
|
||||
institute patent litigation against any entity (including a
|
||||
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
||||
or a Contribution incorporated within the Work constitutes direct
|
||||
or contributory patent infringement, then any patent licenses
|
||||
granted to You under this License for that Work shall terminate
|
||||
as of the date such litigation is filed.
|
||||
|
||||
4. Redistribution. You may reproduce and distribute copies of the
|
||||
Work or Derivative Works thereof in any medium, with or without
|
||||
modifications, and in Source or Object form, provided that You
|
||||
meet the following conditions:
|
||||
|
||||
(a) You must give any other recipients of the Work or
|
||||
Derivative Works a copy of this License; and
|
||||
|
||||
(b) You must cause any modified files to carry prominent notices
|
||||
stating that You changed the files; and
|
||||
|
||||
(c) You must retain, in the Source form of any Derivative Works
|
||||
that You distribute, all copyright, patent, trademark, and
|
||||
attribution notices from the Source form of the Work,
|
||||
excluding those notices that do not pertain to any part of
|
||||
the Derivative Works; and
|
||||
|
||||
(d) If the Work includes a "NOTICE" text file as part of its
|
||||
distribution, then any Derivative Works that You distribute must
|
||||
include a readable copy of the attribution notices contained
|
||||
within such NOTICE file, excluding those notices that do not
|
||||
pertain to any part of the Derivative Works, in at least one
|
||||
of the following places: within a NOTICE text file distributed
|
||||
as part of the Derivative Works; within the Source form or
|
||||
documentation, if provided along with the Derivative Works; or,
|
||||
within a display generated by the Derivative Works, if and
|
||||
wherever such third-party notices normally appear. The contents
|
||||
of the NOTICE file are for informational purposes only and
|
||||
do not modify the License. You may add Your own attribution
|
||||
notices within Derivative Works that You distribute, alongside
|
||||
or as an addendum to the NOTICE text from the Work, provided
|
||||
that such additional attribution notices cannot be construed
|
||||
as modifying the License.
|
||||
|
||||
You may add Your own copyright statement to Your modifications and
|
||||
may provide additional or different license terms and conditions
|
||||
for use, reproduction, or distribution of Your modifications, or
|
||||
for any such Derivative Works as a whole, provided Your use,
|
||||
reproduction, and distribution of the Work otherwise complies with
|
||||
the conditions stated in this License.
|
||||
|
||||
5. Submission of Contributions. Unless You explicitly state otherwise,
|
||||
any Contribution intentionally submitted for inclusion in the Work
|
||||
by You to the Licensor shall be under the terms and conditions of
|
||||
this License, without any additional terms or conditions.
|
||||
Notwithstanding the above, nothing herein shall supersede or modify
|
||||
the terms of any separate license agreement you may have executed
|
||||
with Licensor regarding such Contributions.
|
||||
|
||||
6. Trademarks. This License does not grant permission to use the trade
|
||||
names, trademarks, service marks, or product names of the Licensor,
|
||||
except as required for reasonable and customary use in describing the
|
||||
origin of the Work and reproducing the content of the NOTICE file.
|
||||
|
||||
7. Disclaimer of Warranty. Unless required by applicable law or
|
||||
agreed to in writing, Licensor provides the Work (and each
|
||||
Contributor provides its Contributions) on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
||||
implied, including, without limitation, any warranties or conditions
|
||||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
||||
PARTICULAR PURPOSE. You are solely responsible for determining the
|
||||
appropriateness of using or redistributing the Work and assume any
|
||||
risks associated with Your exercise of permissions under this License.
|
||||
|
||||
8. Limitation of Liability. In no event and under no legal theory,
|
||||
whether in tort (including negligence), contract, or otherwise,
|
||||
unless required by applicable law (such as deliberate and grossly
|
||||
negligent acts) or agreed to in writing, shall any Contributor be
|
||||
liable to You for damages, including any direct, indirect, special,
|
||||
incidental, or consequential damages of any character arising as a
|
||||
result of this License or out of the use or inability to use the
|
||||
Work (including but not limited to damages for loss of goodwill,
|
||||
work stoppage, computer failure or malfunction, or any and all
|
||||
other commercial damages or losses), even if such Contributor
|
||||
has been advised of the possibility of such damages.
|
||||
|
||||
9. Accepting Warranty or Additional Liability. While redistributing
|
||||
the Work or Derivative Works thereof, You may choose to offer,
|
||||
and charge a fee for, acceptance of support, warranty, indemnity,
|
||||
or other liability obligations and/or rights consistent with this
|
||||
License. However, in accepting such obligations, You may act only
|
||||
on Your own behalf and on Your sole responsibility, not on behalf
|
||||
of any other Contributor, and only if You agree to indemnify,
|
||||
defend, and hold each Contributor harmless for any liability
|
||||
incurred by, or claims asserted against, such Contributor by reason
|
||||
of your accepting any such warranty or additional liability.
|
||||
|
||||
END OF TERMS AND CONDITIONS
|
||||
|
||||
APPENDIX: How to apply the Apache License to your work.
|
||||
|
||||
To apply the Apache License to your work, attach the following
|
||||
boilerplate notice, with the fields enclosed by brackets "{}"
|
||||
replaced with your own identifying information. (Don't include
|
||||
the brackets!) The text should be enclosed in the appropriate
|
||||
comment syntax for the file format. We also recommend that a
|
||||
file or class name and description of purpose be included on the
|
||||
same "printed page" as the copyright notice for easier
|
||||
identification within third-party archives.
|
||||
|
||||
Copyright {yyyy} {name of copyright owner}
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
|
||||
-----------------------------------------
|
||||
END OF dwml NOTICES AND INFORMATION
|
||||
@@ -27,30 +27,34 @@ dependencies = [
|
||||
"beautifulsoup4",
|
||||
"requests",
|
||||
"markdownify",
|
||||
"magika>=0.6.1rc3",
|
||||
"magika~=0.6.1",
|
||||
"charset-normalizer",
|
||||
"defusedxml",
|
||||
"onnxruntime<=1.20.1; sys_platform == 'win32'",
|
||||
]
|
||||
|
||||
[project.optional-dependencies]
|
||||
all = [
|
||||
"python-pptx",
|
||||
"mammoth",
|
||||
"mammoth~=1.11.0",
|
||||
"pandas",
|
||||
"openpyxl",
|
||||
"xlrd",
|
||||
"pdfminer.six",
|
||||
"lxml",
|
||||
"pdfminer.six>=20251230",
|
||||
"pdfplumber>=0.11.9",
|
||||
"olefile",
|
||||
"pydub",
|
||||
"SpeechRecognition",
|
||||
"youtube-transcript-api",
|
||||
"youtube-transcript-api~=1.0.0",
|
||||
"azure-ai-documentintelligence",
|
||||
"azure-identity"
|
||||
"azure-identity",
|
||||
]
|
||||
pptx = ["python-pptx"]
|
||||
docx = ["mammoth"]
|
||||
docx = ["mammoth~=1.11.0", "lxml"]
|
||||
xlsx = ["pandas", "openpyxl"]
|
||||
xls = ["pandas", "xlrd"]
|
||||
pdf = ["pdfminer.six"]
|
||||
pdf = ["pdfminer.six>=20251230", "pdfplumber>=0.11.9"]
|
||||
outlook = ["olefile"]
|
||||
audio-transcription = ["pydub", "SpeechRecognition"]
|
||||
youtube-transcription = ["youtube-transcript-api"]
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
|
||||
#
|
||||
# SPDX-License-Identifier: MIT
|
||||
__version__ = "0.1.0a4"
|
||||
__version__ = "0.1.5b1"
|
||||
|
||||
@@ -33,13 +33,13 @@ def main():
|
||||
OR
|
||||
|
||||
markitdown < example.pdf
|
||||
|
||||
|
||||
OR to save to a file use
|
||||
|
||||
|
||||
markitdown example.pdf -o example.md
|
||||
|
||||
|
||||
OR
|
||||
|
||||
|
||||
markitdown example.pdf > example.md
|
||||
"""
|
||||
).strip(),
|
||||
@@ -104,6 +104,12 @@ def main():
|
||||
help="List installed 3rd-party plugins. Plugins are loaded when using the -p or --use-plugin option.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--keep-data-uris",
|
||||
action="store_true",
|
||||
help="Keep data URIs (like base64-encoded images) in the output. By default, data URIs are truncated.",
|
||||
)
|
||||
|
||||
parser.add_argument("filename", nargs="?")
|
||||
args = parser.parse_args()
|
||||
|
||||
@@ -181,9 +187,15 @@ def main():
|
||||
markitdown = MarkItDown(enable_plugins=args.use_plugins)
|
||||
|
||||
if args.filename is None:
|
||||
result = markitdown.convert_stream(sys.stdin.buffer, stream_info=stream_info)
|
||||
result = markitdown.convert_stream(
|
||||
sys.stdin.buffer,
|
||||
stream_info=stream_info,
|
||||
keep_data_uris=args.keep_data_uris,
|
||||
)
|
||||
else:
|
||||
result = markitdown.convert(args.filename, stream_info=stream_info)
|
||||
result = markitdown.convert(
|
||||
args.filename, stream_info=stream_info, keep_data_uris=args.keep_data_uris
|
||||
)
|
||||
|
||||
_handle_output(args, result)
|
||||
|
||||
@@ -192,9 +204,14 @@ def _handle_output(args, result: DocumentConverterResult):
|
||||
"""Handle output to stdout or file"""
|
||||
if args.output:
|
||||
with open(args.output, "w", encoding="utf-8") as f:
|
||||
f.write(result.text_content)
|
||||
f.write(result.markdown)
|
||||
else:
|
||||
print(result.text_content)
|
||||
# Handle stdout encoding errors more gracefully
|
||||
print(
|
||||
result.markdown.encode(sys.stdout.encoding, errors="replace").decode(
|
||||
sys.stdout.encoding
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def _exit_with_error(message: str):
|
||||
|
||||
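The stdout branch of `_handle_output` in the hunk above encodes the Markdown with `errors="replace"` and decodes it again, so characters the console encoding cannot represent become `?` instead of raising `UnicodeEncodeError`. A minimal sketch of that round trip follows. The helper name `to_printable` and its explicit `encoding` argument are assumptions for illustration, standing in for `sys.stdout.encoding`.

```python
def to_printable(text: str, encoding: str) -> str:
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


sample = "café → ok"

# An ASCII console cannot show 'é' or '→'; each becomes a single '?'.
ascii_safe = to_printable(sample, "ascii")

# A UTF-8 console can represent everything, so the text is unchanged.
utf8_safe = to_printable(sample, "utf-8")

print(ascii_safe)
print(utf8_safe)
```

The file branch is unaffected because it opens the output with `encoding="utf-8"` explicitly.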
@@ -1,7 +1,4 @@
|
||||
import os
|
||||
import tempfile
|
||||
from warnings import warn
|
||||
from typing import Any, Union, BinaryIO, Optional, List
|
||||
from typing import Any, BinaryIO, Optional
|
||||
from ._stream_info import StreamInfo
|
||||
|
||||
|
||||
@@ -72,7 +69,7 @@ class DocumentConverter:
|
||||
data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
|
||||
file_stream.seek(cur_pos) # Reset the position to the original position
|
||||
|
||||
Prameters:
|
||||
Parameters:
|
||||
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
|
||||
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
|
||||
- kwargs: Additional keyword arguments for the converter.
|
||||
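The docstring in this hunk asks converters to leave the stream position where they found it after peeking. One way to honour that contract is sketched below. The helper name `peek` is hypothetical and is not part of the `DocumentConverter` API shown in the diff.

```python
import io
from typing import BinaryIO


def peek(file_stream: BinaryIO, size: int = 100) -> bytes:
    """Read up to `size` bytes, then restore the original stream position."""
    cur_pos = file_stream.tell()
    try:
        return file_stream.read(size)
    finally:
        # Reset even if read() raises, so later converters see the same bytes.
        file_stream.seek(cur_pos)


stream = io.BytesIO(b"%PDF-1.7 rest of file")
header = peek(stream, 5)
position_after = stream.tell()
print(header, position_after)
```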
@@ -93,7 +90,7 @@ class DocumentConverter:
|
||||
"""
|
||||
Convert a document to Markdown text.
|
||||
|
||||
Prameters:
|
||||
Parameters:
|
||||
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
|
||||
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
|
||||
- kwargs: Additional keyword arguments for the converter.
|
||||
|
||||
@@ -1,16 +1,13 @@
|
||||
import copy
|
||||
import mimetypes
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import shutil
|
||||
import tempfile
|
||||
import warnings
|
||||
import traceback
|
||||
import io
|
||||
from dataclasses import dataclass
|
||||
from importlib.metadata import entry_points
|
||||
from typing import Any, List, Optional, Union, BinaryIO
|
||||
from typing import Any, List, Dict, Optional, Union, BinaryIO
|
||||
from pathlib import Path
|
||||
from urllib.parse import urlparse
|
||||
from warnings import warn
|
||||
@@ -20,6 +17,7 @@ import charset_normalizer
|
||||
import codecs
|
||||
|
||||
from ._stream_info import StreamInfo
|
||||
from ._uri_utils import parse_data_uri, file_uri_to_path
|
||||
|
||||
from .converters import (
|
||||
PlainTextConverter,
|
||||
@@ -40,6 +38,7 @@ from .converters import (
|
||||
ZipConverter,
|
||||
EpubConverter,
|
||||
DocumentIntelligenceConverter,
|
||||
CsvConverter,
|
||||
)
|
||||
|
||||
from ._base_converter import DocumentConverter, DocumentConverterResult
|
||||
@@ -116,6 +115,7 @@ class MarkItDown:
|
||||
# TODO - remove these (see enable_builtins)
|
||||
self._llm_client: Any = None
|
||||
self._llm_model: Union[str | None] = None
|
||||
self._llm_prompt: Union[str | None] = None
|
||||
self._exiftool_path: Union[str | None] = None
|
||||
self._style_map: Union[str | None] = None
|
||||
|
||||
@@ -140,6 +140,7 @@ class MarkItDown:
|
||||
# TODO: Move these into converter constructors
|
||||
self._llm_client = kwargs.get("llm_client")
|
||||
self._llm_model = kwargs.get("llm_model")
|
||||
self._llm_prompt = kwargs.get("llm_prompt")
|
||||
self._exiftool_path = kwargs.get("exiftool_path")
|
||||
self._style_map = kwargs.get("style_map")
|
||||
|
||||
@@ -193,12 +194,28 @@ class MarkItDown:
|
||||
self.register_converter(PdfConverter())
|
||||
self.register_converter(OutlookMsgConverter())
|
||||
self.register_converter(EpubConverter())
|
||||
self.register_converter(CsvConverter())
|
||||
|
||||
# Register Document Intelligence converter at the top of the stack if endpoint is provided
|
||||
docintel_endpoint = kwargs.get("docintel_endpoint")
|
||||
if docintel_endpoint is not None:
|
||||
docintel_args: Dict[str, Any] = {}
|
||||
docintel_args["endpoint"] = docintel_endpoint
|
||||
|
||||
docintel_credential = kwargs.get("docintel_credential")
|
||||
if docintel_credential is not None:
|
||||
docintel_args["credential"] = docintel_credential
|
||||
|
||||
docintel_types = kwargs.get("docintel_file_types")
|
||||
if docintel_types is not None:
|
||||
docintel_args["file_types"] = docintel_types
|
||||
|
||||
docintel_version = kwargs.get("docintel_api_version")
|
||||
if docintel_version is not None:
|
||||
docintel_args["api_version"] = docintel_version
|
||||
|
||||
self.register_converter(
|
||||
DocumentIntelligenceConverter(endpoint=docintel_endpoint)
|
||||
DocumentIntelligenceConverter(**docintel_args),
|
||||
)
|
||||
|
||||
self._builtins_enabled = True
|
||||
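The hunk above forwards only those `docintel_*` options that the caller actually supplied, so the converter's own defaults apply to the rest. The same idea can be written as a table-driven helper. This is a sketch under assumptions: `build_docintel_args` is a hypothetical name, and the endpoint URL is a placeholder.

```python
from typing import Any, Dict

# Maps MarkItDown keyword names (from the diff) to converter argument names.
_DOCINTEL_KEYS = {
    "docintel_endpoint": "endpoint",
    "docintel_credential": "credential",
    "docintel_file_types": "file_types",
    "docintel_api_version": "api_version",
}


def build_docintel_args(**kwargs: Any) -> Dict[str, Any]:
    """Collect only the options that were provided (i.e. are not None)."""
    return {
        target: kwargs[source]
        for source, target in _DOCINTEL_KEYS.items()
        if kwargs.get(source) is not None
    }


docintel_args = build_docintel_args(
    docintel_endpoint="https://example.invalid/",  # placeholder endpoint
    docintel_api_version=None,  # omitted, so the default would apply
)
print(docintel_args)
```

As in the diff, the caller would still check that an endpoint was given before registering the converter.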
@@ -242,9 +259,10 @@ class MarkItDown:
|
||||
# Local path or url
|
||||
if isinstance(source, str):
|
||||
if (
|
||||
source.startswith("http://")
|
||||
or source.startswith("https://")
|
||||
or source.startswith("file://")
|
||||
source.startswith("http:")
|
||||
or source.startswith("https:")
|
||||
or source.startswith("file:")
|
||||
or source.startswith("data:")
|
||||
):
|
||||
# Rename the url argument to mock_url
|
||||
# (Deprecated -- use stream_info)
|
||||
@@ -253,7 +271,7 @@ class MarkItDown:
|
||||
_kwargs["mock_url"] = _kwargs["url"]
|
||||
del _kwargs["url"]
|
||||
|
||||
return self.convert_url(source, stream_info=stream_info, **_kwargs)
|
||||
return self.convert_uri(source, stream_info=stream_info, **_kwargs)
|
||||
else:
|
||||
return self.convert_local(source, stream_info=stream_info, **kwargs)
|
||||
# Path object
|
||||
@@ -363,22 +381,80 @@ class MarkItDown:
|
||||
url: str,
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
file_extension: Optional[str] = None,
|
||||
mock_url: Optional[str] = None,
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
"""Alias for convert_uri()"""
|
||||
# convert_url will likely be deprecated in the future in favor of convert_uri
|
||||
return self.convert_uri(
|
||||
url,
|
||||
stream_info=stream_info,
|
||||
file_extension=file_extension,
|
||||
mock_url=mock_url,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def convert_uri(
|
||||
self,
|
||||
uri: str,
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
file_extension: Optional[str] = None, # Deprecated -- use stream_info
|
||||
mock_url: Optional[
|
||||
str
|
||||
] = None, # Mock the request as if it came from a different URL
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult: # TODO: fix kwargs type
|
||||
# Send a HTTP request to the URL
|
||||
response = self._requests_session.get(url, stream=True)
|
||||
response.raise_for_status()
|
||||
return self.convert_response(
|
||||
response,
|
||||
stream_info=stream_info,
|
||||
file_extension=file_extension,
|
||||
url=mock_url,
|
||||
**kwargs,
|
||||
)
|
||||
) -> DocumentConverterResult:
|
||||
uri = uri.strip()
|
||||
|
||||
# File URIs
|
||||
if uri.startswith("file:"):
|
||||
netloc, path = file_uri_to_path(uri)
|
||||
if netloc and netloc != "localhost":
|
||||
raise ValueError(
|
||||
f"Unsupported file URI: {uri}. Netloc must be empty or localhost."
|
||||
)
|
||||
return self.convert_local(
|
||||
path,
|
||||
stream_info=stream_info,
|
||||
file_extension=file_extension,
|
||||
url=mock_url,
|
||||
**kwargs,
|
||||
)
|
||||
# Data URIs
|
||||
elif uri.startswith("data:"):
|
||||
mimetype, attributes, data = parse_data_uri(uri)
|
||||
|
||||
base_guess = StreamInfo(
|
||||
mimetype=mimetype,
|
||||
charset=attributes.get("charset"),
|
||||
)
|
||||
if stream_info is not None:
|
||||
base_guess = base_guess.copy_and_update(stream_info)
|
||||
|
||||
return self.convert_stream(
|
||||
io.BytesIO(data),
|
||||
stream_info=base_guess,
|
||||
file_extension=file_extension,
|
||||
url=mock_url,
|
||||
**kwargs,
|
||||
)
|
||||
# HTTP/HTTPS URIs
|
||||
elif uri.startswith("http:") or uri.startswith("https:"):
|
||||
response = self._requests_session.get(uri, stream=True)
|
||||
response.raise_for_status()
|
||||
return self.convert_response(
|
||||
response,
|
||||
stream_info=stream_info,
|
||||
file_extension=file_extension,
|
||||
url=mock_url,
|
||||
**kwargs,
|
||||
)
|
||||
else:
|
||||
raise ValueError(
|
||||
f"Unsupported URI scheme: {uri.split(':')[0]}. Supported schemes are: file:, data:, http:, https:"
|
||||
)
|
||||
|
||||
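`convert_uri` above dispatches on the URI prefix after stripping whitespace, and rejects anything that is not `file:`, `data:`, `http:` or `https:`. The branching can be illustrated in isolation as follows. `uri_kind` is a hypothetical helper that only classifies the URI and performs no conversion.

```python
def uri_kind(uri: str) -> str:
    """Classify a URI the same way the branches in convert_uri are ordered."""
    uri = uri.strip()
    if uri.startswith("file:"):
        return "file"
    if uri.startswith("data:"):
        return "data"
    if uri.startswith("http:") or uri.startswith("https:"):
        return "http"
    raise ValueError(f"Unsupported URI scheme: {uri.split(':')[0]}")


kinds = [
    uri_kind("file:///tmp/report.docx"),
    uri_kind("  data:text/plain,hello  "),  # surrounding whitespace is stripped
    uri_kind("https://example.com/page.html"),
]
print(kinds)

try:
    uri_kind("ftp://example.com/file.txt")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```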
def convert_response(
|
||||
self,
|
||||
@@ -474,7 +550,7 @@ class MarkItDown:
|
||||
# Sanity check -- make sure the cur_pos is still the same
|
||||
assert (
|
||||
cur_pos == file_stream.tell()
|
||||
), f"File stream position should NOT change between guess iterations"
|
||||
), "File stream position should NOT change between guess iterations"
|
||||
|
||||
_kwargs = {k: v for k, v in kwargs.items()}
|
||||
|
||||
@@ -485,6 +561,9 @@ class MarkItDown:
|
||||
if "llm_model" not in _kwargs and self._llm_model is not None:
|
||||
_kwargs["llm_model"] = self._llm_model
|
||||
|
||||
if "llm_prompt" not in _kwargs and self._llm_prompt is not None:
|
||||
_kwargs["llm_prompt"] = self._llm_prompt
|
||||
|
||||
if "style_map" not in _kwargs and self._style_map is not None:
|
||||
_kwargs["style_map"] = self._style_map
|
||||
|
||||
@@ -541,7 +620,7 @@ class MarkItDown:
|
||||
|
||||
# Nothing can handle it!
|
||||
raise UnsupportedFormatException(
|
||||
f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
|
||||
"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
|
||||
)
|
||||
|
||||
def register_page_converter(self, converter: DocumentConverter) -> None:
|
||||
|
||||
@@ -0,0 +1,52 @@
|
||||
import base64
|
||||
import os
|
||||
from typing import Tuple, Dict
|
||||
from urllib.request import url2pathname
|
||||
from urllib.parse import urlparse, unquote_to_bytes
|
||||
|
||||
|
||||
def file_uri_to_path(file_uri: str) -> Tuple[str | None, str]:
|
||||
"""Convert a file URI to a local file path"""
|
||||
parsed = urlparse(file_uri)
|
||||
if parsed.scheme != "file":
|
||||
raise ValueError(f"Not a file URL: {file_uri}")
|
||||
|
||||
netloc = parsed.netloc if parsed.netloc else None
|
||||
path = os.path.abspath(url2pathname(parsed.path))
|
||||
return netloc, path
|
||||
|
||||
|
||||
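`file_uri_to_path` relies on two standard-library calls: `urlparse` to split off the network location and `url2pathname` to undo percent-encoding. The sketch below runs the same calls on sample URIs. The paths are made up for illustration, and the absolute path printed depends on the platform.

```python
import os
from urllib.parse import urlparse
from urllib.request import url2pathname

parsed = urlparse("file://localhost/tmp/my%20notes.txt")

# An empty netloc is normalised to None, as in the function above.
netloc = parsed.netloc if parsed.netloc else None

# url2pathname decodes %20 back into a space.
path = os.path.abspath(url2pathname(parsed.path))

# The common three-slash form carries no network location at all.
local_netloc = urlparse("file:///tmp/my%20notes.txt").netloc

print(netloc, path, repr(local_netloc))
```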
def parse_data_uri(uri: str) -> Tuple[str | None, Dict[str, str], bytes]:
|
||||
if not uri.startswith("data:"):
|
||||
raise ValueError("Not a data URI")
|
||||
|
||||
header, _, data = uri.partition(",")
|
||||
if not _:
|
||||
raise ValueError("Malformed data URI, missing ',' separator")
|
||||
|
||||
meta = header[5:] # Strip 'data:'
|
||||
parts = meta.split(";")
|
||||
|
||||
is_base64 = False
|
||||
# Ends with base64?
|
||||
if parts[-1] == "base64":
|
||||
parts.pop()
|
||||
is_base64 = True
|
||||
|
||||
mime_type = None # Normally this would default to text/plain but we won't assume
|
||||
if len(parts) and len(parts[0]) > 0:
|
||||
# First part is the mime type
|
||||
mime_type = parts.pop(0)
|
||||
|
||||
attributes: Dict[str, str] = {}
|
||||
for part in parts:
|
||||
# Handle key=value pairs in the middle
|
||||
if "=" in part:
|
||||
key, value = part.split("=", 1)
|
||||
attributes[key] = value
|
||||
elif len(part) > 0:
|
||||
attributes[part] = ""
|
||||
|
||||
content = base64.b64decode(data) if is_base64 else unquote_to_bytes(data)
|
||||
|
||||
return mime_type, attributes, content
|
||||
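`parse_data_uri` decodes the payload in one of two ways: base64 when the header ends in `;base64`, percent-decoding otherwise. A reduced sketch of just that payload step is shown below. `decode_data_uri_payload` is a hypothetical name, and the sketch leaves out the MIME type and attribute handling present in the diff.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri_payload(uri: str) -> bytes:
    """Return the decoded bytes carried by a data: URI."""
    header, sep, data = uri.partition(",")
    if not uri.startswith("data:") or not sep:
        raise ValueError("Not a well-formed data URI")
    # The encoding marker, if present, is the last ';'-separated header part.
    is_base64 = header.split(";")[-1] == "base64"
    return base64.b64decode(data) if is_base64 else unquote_to_bytes(data)


b64_payload = decode_data_uri_payload("data:text/plain;charset=utf-8;base64,SGVsbG8=")
pct_payload = decode_data_uri_payload("data:text/plain,Hello%2C%20World")
print(b64_payload, pct_payload)
```

Only the first comma separates header from data, which is why the percent-encoded comma in the second example survives into the payload.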
@@ -0,0 +1,273 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
|
||||
On 25/03/2025
|
||||
"""
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
|
||||
|
||||
BLANK = ""
|
||||
BACKSLASH = "\\"
|
||||
ALN = "&"
|
||||
|
||||
CHR = {
|
||||
# Unicode : Latex Math Symbols
|
||||
# Top accents
|
||||
"\u0300": "\\grave{{{0}}}",
|
||||
"\u0301": "\\acute{{{0}}}",
|
||||
"\u0302": "\\hat{{{0}}}",
|
||||
"\u0303": "\\tilde{{{0}}}",
|
||||
"\u0304": "\\bar{{{0}}}",
|
||||
"\u0305": "\\overbar{{{0}}}",
|
||||
"\u0306": "\\breve{{{0}}}",
|
||||
"\u0307": "\\dot{{{0}}}",
|
||||
"\u0308": "\\ddot{{{0}}}",
|
||||
"\u0309": "\\ovhook{{{0}}}",
|
||||
"\u030a": "\\ocirc{{{0}}}",
|
||||
"\u030c": "\\check{{{0}}}",
|
||||
"\u0310": "\\candra{{{0}}}",
|
||||
"\u0312": "\\oturnedcomma{{{0}}}",
|
||||
"\u0315": "\\ocommatopright{{{0}}}",
|
||||
"\u031a": "\\droang{{{0}}}",
|
||||
"\u0338": "\\not{{{0}}}",
|
||||
"\u20d0": "\\leftharpoonaccent{{{0}}}",
|
||||
"\u20d1": "\\rightharpoonaccent{{{0}}}",
|
||||
"\u20d2": "\\vertoverlay{{{0}}}",
|
||||
"\u20d6": "\\overleftarrow{{{0}}}",
|
||||
"\u20d7": "\\vec{{{0}}}",
|
||||
"\u20db": "\\dddot{{{0}}}",
|
||||
"\u20dc": "\\ddddot{{{0}}}",
|
||||
"\u20e1": "\\overleftrightarrow{{{0}}}",
|
||||
"\u20e7": "\\annuity{{{0}}}",
|
||||
"\u20e9": "\\widebridgeabove{{{0}}}",
|
||||
"\u20f0": "\\asteraccent{{{0}}}",
|
||||
# Bottom accents
|
||||
"\u0330": "\\wideutilde{{{0}}}",
|
||||
"\u0331": "\\underbar{{{0}}}",
|
||||
"\u20e8": "\\threeunderdot{{{0}}}",
|
||||
"\u20ec": "\\underrightharpoondown{{{0}}}",
|
||||
"\u20ed": "\\underleftharpoondown{{{0}}}",
|
||||
"\u20ee": "\\underleftarrow{{{0}}}",
|
||||
"\u20ef": "\\underrightarrow{{{0}}}",
|
||||
# Over | group
|
||||
"\u23b4": "\\overbracket{{{0}}}",
|
||||
"\u23dc": "\\overparen{{{0}}}",
|
||||
"\u23de": "\\overbrace{{{0}}}",
|
||||
# Under| group
|
||||
"\u23b5": "\\underbracket{{{0}}}",
|
||||
"\u23dd": "\\underparen{{{0}}}",
|
||||
"\u23df": "\\underbrace{{{0}}}",
|
||||
}
|
||||
|
||||
CHR_BO = {
|
||||
# Big operators,
|
||||
"\u2140": "\\Bbbsum",
|
||||
"\u220f": "\\prod",
|
||||
"\u2210": "\\coprod",
|
||||
"\u2211": "\\sum",
|
||||
"\u222b": "\\int",
|
||||
"\u22c0": "\\bigwedge",
|
||||
"\u22c1": "\\bigvee",
|
||||
"\u22c2": "\\bigcap",
|
||||
"\u22c3": "\\bigcup",
|
||||
"\u2a00": "\\bigodot",
|
||||
"\u2a01": "\\bigoplus",
|
||||
"\u2a02": "\\bigotimes",
|
||||
}
|
||||
|
||||
T = {
|
||||
"\u2192": "\\rightarrow ",
|
||||
# Greek letters
|
||||
"\U0001d6fc": "\\alpha ",
|
||||
"\U0001d6fd": "\\beta ",
|
||||
"\U0001d6fe": "\\gamma ",
|
||||
"\U0001d6ff": "\\delta ",
|
||||
"\U0001d700": "\\epsilon ",
|
||||
"\U0001d701": "\\zeta ",
|
||||
"\U0001d702": "\\eta ",
|
||||
"\U0001d703": "\\theta ",
|
||||
"\U0001d704": "\\iota ",
|
||||
"\U0001d705": "\\kappa ",
|
||||
"\U0001d706": "\\lambda ",
|
||||
"\U0001d707": "\\mu ",
|
||||
"\U0001d708": "\\nu ",
|
||||
"\U0001d709": "\\xi ",
|
||||
"\U0001d70a": "\\omicron ",
|
||||
"\U0001d70b": "\\pi ",
|
||||
"\U0001d70c": "\\rho ",
|
||||
"\U0001d70d": "\\varsigma ",
|
||||
"\U0001d70e": "\\sigma ",
|
||||
"\U0001d70f": "\\tau ",
|
||||
"\U0001d710": "\\upsilon ",
|
||||
"\U0001d711": "\\phi ",
|
||||
"\U0001d712": "\\chi ",
|
||||
"\U0001d713": "\\psi ",
|
||||
"\U0001d714": "\\omega ",
|
||||
"\U0001d715": "\\partial ",
|
||||
"\U0001d716": "\\varepsilon ",
|
||||
"\U0001d717": "\\vartheta ",
|
||||
"\U0001d718": "\\varkappa ",
|
||||
"\U0001d719": "\\varphi ",
|
||||
"\U0001d71a": "\\varrho ",
|
||||
"\U0001d71b": "\\varpi ",
|
||||
# Relation symbols
|
||||
"\u2190": "\\leftarrow ",
|
||||
"\u2191": "\\uparrow ",
|
||||
"\u2192": "\\rightarrow ",
|
||||
"\u2193": "\\downarrow ",
|
||||
"\u2194": "\\leftrightarrow ",
|
||||
"\u2195": "\\updownarrow ",
|
||||
"\u2196": "\\nwarrow ",
|
||||
"\u2197": "\\nearrow ",
|
||||
"\u2198": "\\searrow ",
|
||||
"\u2199": "\\swarrow ",
|
||||
"\u22ee": "\\vdots ",
|
||||
"\u22ef": "\\cdots ",
|
||||
"\u22f0": "\\adots ",
|
||||
"\u22f1": "\\ddots ",
|
||||
"\u2260": "\\ne ",
|
||||
"\u2264": "\\leq ",
|
||||
"\u2265": "\\geq ",
|
||||
"\u2266": "\\leqq ",
|
||||
"\u2267": "\\geqq ",
|
||||
"\u2268": "\\lneqq ",
|
||||
"\u2269": "\\gneqq ",
|
||||
"\u226a": "\\ll ",
|
||||
"\u226b": "\\gg ",
|
||||
"\u2208": "\\in ",
|
||||
"\u2209": "\\notin ",
|
||||
"\u220b": "\\ni ",
|
||||
"\u220c": "\\nni ",
|
||||
# Ordinary symbols
|
||||
"\u221e": "\\infty ",
|
||||
# Binary relations
|
||||
"\u00b1": "\\pm ",
|
||||
"\u2213": "\\mp ",
|
||||
# Italic, Latin, uppercase
|
||||
"\U0001d434": "A",
|
||||
"\U0001d435": "B",
|
||||
"\U0001d436": "C",
|
||||
"\U0001d437": "D",
|
||||
"\U0001d438": "E",
|
||||
"\U0001d439": "F",
|
||||
"\U0001d43a": "G",
|
||||
"\U0001d43b": "H",
|
||||
"\U0001d43c": "I",
|
||||
"\U0001d43d": "J",
|
||||
"\U0001d43e": "K",
|
||||
"\U0001d43f": "L",
|
||||
"\U0001d440": "M",
|
||||
"\U0001d441": "N",
|
||||
"\U0001d442": "O",
|
||||
"\U0001d443": "P",
|
||||
"\U0001d444": "Q",
|
||||
"\U0001d445": "R",
|
||||
"\U0001d446": "S",
|
||||
"\U0001d447": "T",
|
||||
"\U0001d448": "U",
|
||||
"\U0001d449": "V",
|
||||
"\U0001d44a": "W",
|
||||
"\U0001d44b": "X",
|
||||
"\U0001d44c": "Y",
|
||||
"\U0001d44d": "Z",
|
||||
# Italic, Latin, lowercase
|
||||
"\U0001d44e": "a",
|
||||
"\U0001d44f": "b",
|
||||
"\U0001d450": "c",
|
||||
"\U0001d451": "d",
|
||||
"\U0001d452": "e",
|
||||
"\U0001d453": "f",
|
||||
"\U0001d454": "g",
|
||||
"\U0001d456": "i",
|
||||
"\U0001d457": "j",
|
||||
"\U0001d458": "k",
|
||||
"\U0001d459": "l",
|
||||
"\U0001d45a": "m",
|
||||
"\U0001d45b": "n",
|
||||
"\U0001d45c": "o",
|
||||
"\U0001d45d": "p",
|
||||
"\U0001d45e": "q",
|
||||
"\U0001d45f": "r",
|
||||
"\U0001d460": "s",
|
||||
"\U0001d461": "t",
|
||||
"\U0001d462": "u",
|
||||
"\U0001d463": "v",
|
||||
"\U0001d464": "w",
|
||||
"\U0001d465": "x",
|
||||
"\U0001d466": "y",
|
||||
"\U0001d467": "z",
|
||||
}
|
||||
|
||||
FUNC = {
|
||||
"sin": "\\sin({fe})",
|
||||
"cos": "\\cos({fe})",
|
||||
"tan": "\\tan({fe})",
|
||||
"arcsin": "\\arcsin({fe})",
|
||||
"arccos": "\\arccos({fe})",
|
||||
"arctan": "\\arctan({fe})",
|
||||
"arccot": "\\arccot({fe})",
|
||||
"sinh": "\\sinh({fe})",
|
||||
"cosh": "\\cosh({fe})",
|
||||
"tanh": "\\tanh({fe})",
|
||||
"coth": "\\coth({fe})",
|
||||
"sec": "\\sec({fe})",
|
||||
"csc": "\\csc({fe})",
|
||||
}
|
||||
|
||||
FUNC_PLACE = "{fe}"
|
||||
|
||||
BRK = "\\\\"
|
||||
|
||||
CHR_DEFAULT = {
|
||||
"ACC_VAL": "\\hat{{{0}}}",
|
||||
}
|
||||
|
||||
POS = {
|
||||
"top": "\\overline{{{0}}}", # not sure
|
||||
"bot": "\\underline{{{0}}}",
|
||||
}
|
||||
|
||||
POS_DEFAULT = {
|
||||
"BAR_VAL": "\\overline{{{0}}}",
|
||||
}
|
||||
|
||||
SUB = "_{{{0}}}"
|
||||
|
||||
SUP = "^{{{0}}}"
|
||||
|
||||
F = {
|
||||
"bar": "\\frac{{{num}}}{{{den}}}",
|
||||
"skw": r"^{{{num}}}/_{{{den}}}",
|
||||
"noBar": "\\genfrac{{}}{{}}{{0pt}}{{}}{{{num}}}{{{den}}}",
|
||||
"lin": "{{{num}}}/{{{den}}}",
|
||||
}
|
||||
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
|
||||
|
||||
D = "\\left{left}{text}\\right{right}"
|
||||
|
||||
D_DEFAULT = {
|
||||
"left": "(",
|
||||
"right": ")",
|
||||
"null": ".",
|
||||
}
|
||||
|
||||
RAD = "\\sqrt[{deg}]{{{text}}}"
|
||||
|
||||
RAD_DEFAULT = "\\sqrt{{{text}}}"
|
||||
|
||||
ARR = "\\begin{{array}}{{c}}{text}\\end{{array}}"
|
||||
|
||||
LIM_FUNC = {
|
||||
"lim": "\\lim_{{{lim}}}",
|
||||
"max": "\\max_{{{lim}}}",
|
||||
"min": "\\min_{{{lim}}}",
|
||||
}
|
||||
|
||||
LIM_TO = ("\\rightarrow", "\\to")
|
||||
|
||||
LIM_UPP = "\\overset{{{lim}}}{{{text}}}"
|
||||
|
||||
M = "\\begin{{matrix}}{text}\\end{{matrix}}"
|
||||
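The templates in this file are `str.format` patterns, which is why the braces look tripled. `{{` and `}}` produce the literal braces LaTeX needs, and the innermost `{...}` is the replacement field. A short check with three of the templates copied from the file (under local names chosen for this sketch):

```python
# Templates copied from the dictionary above.
F_BAR = "\\frac{{{num}}}{{{den}}}"
SUB = "_{{{0}}}"
RAD = "\\sqrt[{deg}]{{{text}}}"

fraction = F_BAR.format(num="a", den="b")
subscript = SUB.format("i")
cube_root = RAD.format(deg="3", text="x")

print(fraction)
print(subscript)
print(cube_root)
```

Braces must stay balanced in every template. A stray single `}` makes `str.format` raise `ValueError`.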
@@ -0,0 +1,400 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
Office Math Markup Language (OMML)
|
||||
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
|
||||
On 25/03/2025
|
||||
"""
|
||||
|
||||
from defusedxml import ElementTree as ET
|
||||
|
||||
from .latex_dict import (
|
||||
CHARS,
|
||||
CHR,
|
||||
CHR_BO,
|
||||
CHR_DEFAULT,
|
||||
POS,
|
||||
POS_DEFAULT,
|
||||
SUB,
|
||||
SUP,
|
||||
F,
|
||||
F_DEFAULT,
|
||||
T,
|
||||
FUNC,
|
||||
D,
|
||||
D_DEFAULT,
|
||||
RAD,
|
||||
RAD_DEFAULT,
|
||||
ARR,
|
||||
LIM_FUNC,
|
||||
LIM_TO,
|
||||
LIM_UPP,
|
||||
M,
|
||||
BRK,
|
||||
BLANK,
|
||||
BACKSLASH,
|
||||
ALN,
|
||||
FUNC_PLACE,
|
||||
)
|
||||
|
||||
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"
|
||||
|
||||
|
||||
def load(stream):
|
||||
tree = ET.parse(stream)
|
||||
for omath in tree.findall(OMML_NS + "oMath"):
|
||||
yield oMath2Latex(omath)
|
||||
|
||||
|
||||
def load_string(string):
|
||||
root = ET.fromstring(string)
|
||||
for omath in root.findall(OMML_NS + "oMath"):
|
||||
yield oMath2Latex(omath)
|
||||
|
||||
|
||||
def escape_latex(strs):
|
||||
last = None
|
||||
new_chr = []
|
||||
strs = strs.replace(r"\\", "\\")
|
||||
for c in strs:
|
||||
if (c in CHARS) and (last != BACKSLASH):
|
||||
new_chr.append(BACKSLASH + c)
|
||||
else:
|
||||
new_chr.append(c)
|
||||
last = c
|
||||
return BLANK.join(new_chr)
|
||||
|
||||
|
||||
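`escape_latex` prefixes LaTeX special characters with a backslash unless the previous character is already a backslash. The standalone copy below exists only to show that behaviour. It repeats `CHARS` and `BACKSLASH` from `latex_dict.py` so that it runs on its own.

```python
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
BACKSLASH = "\\"


def escape_latex(strs: str) -> str:
    """Escape LaTeX special characters that are not already escaped."""
    last = None
    new_chr = []
    strs = strs.replace(r"\\", "\\")  # collapse doubled backslashes first
    for c in strs:
        if (c in CHARS) and (last != BACKSLASH):
            new_chr.append(BACKSLASH + c)
        else:
            new_chr.append(c)
        last = c
    return "".join(new_chr)


escaped = escape_latex("x_1 & 50%")
already_escaped = escape_latex("a\\_b")  # the input is a\_b
print(escaped)
print(already_escaped)
```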
def get_val(key, default=None, store=CHR):
|
||||
if key is not None:
|
||||
return key if not store else store.get(key, key)
|
||||
else:
|
||||
return default
|
||||
|
||||
|
||||
class Tag2Method(object):
|
||||
def call_method(self, elm, stag=None):
|
||||
getmethod = self.tag2meth.get
|
||||
if stag is None:
|
||||
stag = elm.tag.replace(OMML_NS, "")
|
||||
method = getmethod(stag)
|
||||
if method:
|
||||
return method(self, elm)
|
||||
else:
|
||||
return None
|
||||
|
||||
def process_children_list(self, elm, include=None):
|
||||
"""
|
||||
Process the children of elm and return an iterable.
|
||||
"""
|
||||
for _e in list(elm):
|
||||
if OMML_NS not in _e.tag:
|
||||
continue
|
||||
stag = _e.tag.replace(OMML_NS, "")
|
||||
if include and (stag not in include):
|
||||
continue
|
||||
t = self.call_method(_e, stag=stag)
|
||||
if t is None:
|
||||
t = self.process_unknow(_e, stag)
|
||||
if t is None:
|
||||
continue
|
||||
yield (stag, t, _e)
|
||||
|
||||
def process_children_dict(self, elm, include=None):
|
||||
"""
|
||||
Process the children of elm and return a dict.
|
||||
"""
|
||||
latex_chars = dict()
|
||||
for stag, t, e in self.process_children_list(elm, include):
|
||||
latex_chars[stag] = t
|
||||
return latex_chars
|
||||
|
||||
def process_children(self, elm, include=None):
|
||||
"""
|
||||
Process the children of elm and return a string.
|
||||
"""
|
||||
return BLANK.join(
|
||||
(
|
||||
t if not isinstance(t, Tag2Method) else str(t)
|
||||
for stag, t, e in self.process_children_list(elm, include)
|
||||
)
|
||||
)
|
||||
|
||||
def process_unknow(self, elm, stag):
|
||||
return None
|
||||
|
||||
|
||||
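`Tag2Method` is a small dispatch table: the namespace is stripped from each child tag, the short tag is looked up in `tag2meth`, and children with no handler are skipped. The toy renderer below shows the same pattern. The namespace `urn:example`, the tags `b`, `i`, `u` and the class name `Renderer` are all made up for this sketch. It uses the standard-library `xml.etree.ElementTree` on a trusted literal, whereas the diff uses `defusedxml`.

```python
import xml.etree.ElementTree as ET

NS = "{urn:example}"  # hypothetical namespace for the sketch


class Renderer:
    def do_b(self, elm):
        return "**" + (elm.text or "") + "**"

    def do_i(self, elm):
        return "*" + (elm.text or "") + "*"

    # Plain functions stored at class level, so they are called as method(self, elm).
    tag2meth = {"b": do_b, "i": do_i}

    def render(self, parent):
        parts = []
        for child in parent:
            if NS not in child.tag:
                continue  # ignore elements from other namespaces
            method = self.tag2meth.get(child.tag.replace(NS, ""))
            if method:
                parts.append(method(self, child))
        return "".join(parts)


root = ET.fromstring('<r xmlns="urn:example"><b>x</b><u>skip</u><i>y</i></r>')
rendered = Renderer().render(root)
print(rendered)
```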
class Pr(Tag2Method):
|
||||
text = ""
|
||||
|
||||
__val_tags = ("chr", "pos", "begChr", "endChr", "type")
|
||||
|
||||
__innerdict = None # can't use the __dict__
|
||||
|
||||
""" common properties of element"""
|
||||
|
||||
def __init__(self, elm):
|
||||
self.__innerdict = {}
|
||||
self.text = self.process_children(elm)
|
||||
|
||||
def __str__(self):
|
||||
return self.text
|
||||
|
||||
def __unicode__(self):
|
||||
return self.__str__()
|
||||
|
||||
def __getattr__(self, name):
|
||||
return self.__innerdict.get(name, None)
|
||||
|
||||
def do_brk(self, elm):
|
||||
self.__innerdict["brk"] = BRK
|
||||
return BRK
|
||||
|
||||
def do_common(self, elm):
|
||||
stag = elm.tag.replace(OMML_NS, "")
|
||||
if stag in self.__val_tags:
|
||||
t = elm.get("{0}val".format(OMML_NS))
|
||||
self.__innerdict[stag] = t
|
||||
return None
|
||||
|
||||
tag2meth = {
|
||||
"brk": do_brk,
|
||||
"chr": do_common,
|
||||
"pos": do_common,
|
||||
"begChr": do_common,
|
||||
"endChr": do_common,
|
||||
"type": do_common,
|
||||
}
|
||||
|
||||
|
||||
class oMath2Latex(Tag2Method):
|
||||
"""
|
||||
Convert oMath element of omml to latex
|
||||
"""
|
||||
|
||||
_t_dict = T
|
||||
|
||||
__direct_tags = ("box", "sSub", "sSup", "sSubSup", "num", "den", "deg", "e")
|
||||
|
||||
def __init__(self, element):
|
||||
self._latex = self.process_children(element)
|
||||
|
||||
def __str__(self):
|
||||
return self.latex
|
||||
|
||||
def __unicode__(self):
|
||||
return self.__str__()
|
||||
|
||||
def process_unknow(self, elm, stag):
|
||||
if stag in self.__direct_tags:
|
||||
return self.process_children(elm)
|
||||
elif stag[-2:] == "Pr":
|
||||
return Pr(elm)
|
||||
else:
|
||||
return None
|
||||
|
||||
@property
|
||||
def latex(self):
|
||||
return self._latex
|
||||
|
||||
def do_acc(self, elm):
|
||||
"""
|
||||
the accent function
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
latex_s = get_val(
|
||||
c_dict["accPr"].chr, default=CHR_DEFAULT.get("ACC_VAL"), store=CHR
|
||||
)
|
||||
return latex_s.format(c_dict["e"])
|
||||
|
||||
def do_bar(self, elm):
|
||||
"""
|
||||
the bar function
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
pr = c_dict["barPr"]
|
||||
latex_s = get_val(pr.pos, default=POS_DEFAULT.get("BAR_VAL"), store=POS)
|
||||
return pr.text + latex_s.format(c_dict["e"])
|
||||
|
||||
def do_d(self, elm):
|
||||
"""
|
||||
the delimiter object
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
pr = c_dict["dPr"]
|
||||
null = D_DEFAULT.get("null")
|
||||
s_val = get_val(pr.begChr, default=D_DEFAULT.get("left"), store=T)
|
||||
e_val = get_val(pr.endChr, default=D_DEFAULT.get("right"), store=T)
|
||||
return pr.text + D.format(
|
||||
left=null if not s_val else escape_latex(s_val),
|
||||
text=c_dict["e"],
|
||||
right=null if not e_val else escape_latex(e_val),
|
||||
)
|
||||
|
||||
def do_spre(self, elm):
|
||||
"""
|
||||
the Pre-Sub-Superscript object -- not supported yet
|
||||
"""
|
||||
pass
|
||||
|
||||
def do_sub(self, elm):
|
||||
text = self.process_children(elm)
|
||||
return SUB.format(text)
|
||||
|
||||
def do_sup(self, elm):
|
||||
text = self.process_children(elm)
|
||||
return SUP.format(text)
|
||||
|
||||
def do_f(self, elm):
|
||||
"""
|
||||
the fraction object
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
pr = c_dict["fPr"]
|
||||
latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
|
||||
return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
|
||||
|
||||
def do_func(self, elm):
|
||||
"""
|
||||
the Function-Apply object (Examples:sin cos)
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
func_name = c_dict.get("fName")
|
||||
return func_name.replace(FUNC_PLACE, c_dict.get("e"))
|
||||
|
||||
def do_fname(self, elm):
|
||||
"""
|
||||
the func name
|
||||
"""
|
||||
latex_chars = []
|
||||
for stag, t, e in self.process_children_list(elm):
|
||||
if stag == "r":
|
||||
if FUNC.get(t):
|
||||
latex_chars.append(FUNC[t])
|
||||
else:
|
||||
raise NotImplementedError("Not support func %s" % t)
|
||||
else:
|
||||
latex_chars.append(t)
|
||||
t = BLANK.join(latex_chars)
|
||||
return t if FUNC_PLACE in t else t + FUNC_PLACE # do_func will replace this
|
||||
|
||||
def do_groupchr(self, elm):
|
||||
"""
|
||||
the Group-Character object
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
pr = c_dict["groupChrPr"]
|
||||
latex_s = get_val(pr.chr)
|
||||
return pr.text + latex_s.format(c_dict["e"])
|
||||
|
||||
def do_rad(self, elm):
|
||||
"""
|
||||
the radical object
|
||||
"""
|
||||
c_dict = self.process_children_dict(elm)
|
||||
text = c_dict.get("e")
|
||||
deg_text = c_dict.get("deg")
|
||||
if deg_text:
|
||||
return RAD.format(deg=deg_text, text=text)
|
||||
else:
|
||||
return RAD_DEFAULT.format(text=text)
|
||||
|
||||
def do_eqarr(self, elm):
|
||||
"""
|
||||
the Array object
|
||||
"""
|
||||
return ARR.format(
|
||||
text=BRK.join(
|
||||
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
|
||||
)
|
||||
)
|
||||
|
||||
def do_limlow(self, elm):
|
||||
"""
|
||||
the Lower-Limit object
|
||||
"""
|
||||
t_dict = self.process_children_dict(elm, include=("e", "lim"))
|
||||
latex_s = LIM_FUNC.get(t_dict["e"])
|
||||
if not latex_s:
|
||||
raise NotImplementedError("Not support lim %s" % t_dict["e"])
|
||||
else:
|
||||
return latex_s.format(lim=t_dict.get("lim"))
|
||||
|
||||
def do_limupp(self, elm):
|
||||
"""
|
||||
the Upper-Limit object
|
||||
"""
|
||||
t_dict = self.process_children_dict(elm, include=("e", "lim"))
|
||||
return LIM_UPP.format(lim=t_dict.get("lim"), text=t_dict.get("e"))
|
||||
|
||||
def do_lim(self, elm):
|
||||
"""
|
||||
the lower limit of the limLow object and the upper limit of the limUpp function
|
||||
"""
|
||||
return self.process_children(elm).replace(LIM_TO[0], LIM_TO[1])
|
||||
|
||||
def do_m(self, elm):
|
||||
"""
|
||||
the Matrix object
|
||||
"""
|
||||
rows = []
|
||||
for stag, t, e in self.process_children_list(elm):
|
||||
if stag == "mPr":
|
||||
pass
|
||||
elif stag == "mr":
|
||||
rows.append(t)
|
||||
return M.format(text=BRK.join(rows))
|
||||
|
||||
def do_mr(self, elm):
|
||||
"""
|
||||
a single row of the matrix m
|
||||
"""
|
||||
return ALN.join(
|
||||
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
|
||||
)
|
||||
|
||||
def do_nary(self, elm):
|
||||
"""
|
||||
the n-ary object
|
||||
"""
|
||||
res = []
|
||||
bo = ""
|
||||
for stag, t, e in self.process_children_list(elm):
|
||||
if stag == "naryPr":
|
||||
bo = get_val(t.chr, store=CHR_BO)
|
||||
else:
|
||||
res.append(t)
|
||||
return bo + BLANK.join(res)
|
||||
|
||||
def do_r(self, elm):
|
||||
"""
|
||||
Get text from the 'r' element and try to convert it to LaTeX symbols
|
||||
@todo text style support , (sty)
|
||||
@todo \text (latex pure text support)
|
||||
"""
|
||||
_str = []
|
||||
for s in elm.findtext("./{0}t".format(OMML_NS)):
|
||||
# s = s if isinstance(s,unicode) else unicode(s,'utf-8')
|
||||
_str.append(self._t_dict.get(s, s))
|
||||
return escape_latex(BLANK.join(_str))
|
||||
|
||||
tag2meth = {
|
||||
"acc": do_acc,
|
||||
"r": do_r,
|
||||
"bar": do_bar,
|
||||
"sub": do_sub,
|
||||
"sup": do_sup,
|
||||
"f": do_f,
|
||||
"func": do_func,
|
||||
"fName": do_fname,
|
||||
"groupChr": do_groupchr,
|
||||
"d": do_d,
|
||||
"rad": do_rad,
|
||||
"eqArr": do_eqarr,
|
||||
"limLow": do_limlow,
|
||||
"limUpp": do_limupp,
|
||||
"lim": do_lim,
|
||||
"m": do_m,
|
||||
"mr": do_mr,
|
||||
"nary": do_nary,
|
||||
}
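The `tag2meth` mapping above acts as a dispatch table: the tag name selects the handler method. The following is a minimal sketch of that pattern only. `MiniConverter`, `call_method`, and the two template strings are hypothetical stand-ins, not the module's actual class or constants, and the handlers take plain strings instead of OMML elements.

```python
from typing import Optional

# Hypothetical templates; the real module defines its own SUB/SUP constants.
SUB = "_{{{0}}}"
SUP = "^{{{0}}}"


class MiniConverter:
    """Toy converter that dispatches on a tag name."""

    def do_sub(self, text: str) -> str:
        return SUB.format(text)

    def do_sup(self, text: str) -> str:
        return SUP.format(text)

    # Plain functions stored in a class-level dict, so they are
    # called with `self` passed explicitly.
    tag2meth = {"sub": do_sub, "sup": do_sup}

    def call_method(self, tag: str, text: str) -> Optional[str]:
        handler = self.tag2meth.get(tag)
        return handler(self, text) if handler else None
```

Because the dict holds unbound functions, adding support for a new tag in this sketch only needs a new `do_*` method and one dict entry.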
|
||||
@@ -0,0 +1,156 @@
|
||||
import zipfile
|
||||
from io import BytesIO
|
||||
from typing import BinaryIO
|
||||
from xml.etree import ElementTree as ET
|
||||
|
||||
from bs4 import BeautifulSoup, Tag
|
||||
|
||||
from .math.omml import OMML_NS, oMath2Latex
|
||||
|
||||
MATH_ROOT_TEMPLATE = "".join(
|
||||
(
|
||||
"<w:document ",
|
||||
'xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" ',
|
||||
'xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" ',
|
||||
'xmlns:o="urn:schemas-microsoft-com:office:office" ',
|
||||
'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" ',
|
||||
'xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" ',
|
||||
'xmlns:v="urn:schemas-microsoft-com:vml" ',
|
||||
'xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" ',
|
||||
'xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" ',
|
||||
'xmlns:w10="urn:schemas-microsoft-com:office:word" ',
|
||||
'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" ',
|
||||
'xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" ',
|
||||
'xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" ',
|
||||
'xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" ',
|
||||
'xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" ',
|
||||
'xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">',
|
||||
"{0}</w:document>",
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def _convert_omath_to_latex(tag: Tag) -> str:
|
||||
"""
|
||||
Converts an OMML (Office Math Markup Language) tag to LaTeX format.
|
||||
|
||||
Args:
|
||||
tag (Tag): A BeautifulSoup Tag object representing the OMML element.
|
||||
|
||||
Returns:
|
||||
str: The LaTeX representation of the OMML element.
|
||||
"""
|
||||
# Format the tag into a complete XML document string
|
||||
math_root = ET.fromstring(MATH_ROOT_TEMPLATE.format(str(tag)))
|
||||
# Find the 'oMath' element within the XML document
|
||||
math_element = math_root.find(OMML_NS + "oMath")
|
||||
# Convert the 'oMath' element to LaTeX using the oMath2Latex function
|
||||
latex = oMath2Latex(math_element).latex
|
||||
return latex
|
||||
|
||||
|
||||
def _get_omath_tag_replacement(tag: Tag, block: bool = False) -> Tag:
|
||||
"""
|
||||
Creates a replacement tag for an OMML (Office Math Markup Language) element.
|
||||
|
||||
Args:
|
||||
tag (Tag): A BeautifulSoup Tag object representing the "oMath" element.
|
||||
block (bool, optional): If True, the LaTeX will be wrapped in double dollar signs for block mode. Defaults to False.
|
||||
|
||||
Returns:
|
||||
Tag: A BeautifulSoup Tag object representing the replacement element.
|
||||
"""
|
||||
t_tag = Tag(name="w:t")
|
||||
t_tag.string = (
|
||||
f"$${_convert_omath_to_latex(tag)}$$"
|
||||
if block
|
||||
else f"${_convert_omath_to_latex(tag)}$"
|
||||
)
|
||||
r_tag = Tag(name="w:r")
|
||||
r_tag.append(t_tag)
|
||||
return r_tag
|
||||
|
||||
|
||||
def _replace_equations(tag: Tag):
|
||||
"""
|
||||
Replaces OMML (Office Math Markup Language) elements with their LaTeX equivalents.
|
||||
|
||||
Args:
|
||||
tag (Tag): A BeautifulSoup Tag object representing the OMML element. Could be either "oMathPara" or "oMath".
|
||||
|
||||
Raises:
|
||||
ValueError: If the tag is not supported.
|
||||
"""
|
||||
if tag.name == "oMathPara":
|
||||
# Create a new paragraph tag
|
||||
p_tag = Tag(name="w:p")
|
||||
# Replace each 'oMath' child tag with its LaTeX equivalent as block equations
|
||||
for child_tag in tag.find_all("oMath"):
|
||||
p_tag.append(_get_omath_tag_replacement(child_tag, block=True))
|
||||
# Replace the original 'oMathPara' tag with the new paragraph tag
|
||||
tag.replace_with(p_tag)
|
||||
elif tag.name == "oMath":
|
||||
# Replace the 'oMath' tag with its LaTeX equivalent as inline equation
|
||||
tag.replace_with(_get_omath_tag_replacement(tag, block=False))
|
||||
else:
|
||||
raise ValueError(f"Not supported tag: {tag.name}")
|
||||
|
||||
|
||||
def _pre_process_math(content: bytes) -> bytes:
|
||||
"""
|
||||
Pre-processes the math content of an XML file taken from a DOCX archive by converting OMML (Office Math Markup Language) elements to LaTeX.
|
||||
The pre-processed content can directly replace the original XML file inside the DOCX archive.
|
||||
|
||||
Args:
|
||||
content (bytes): The XML content of the DOCX file as bytes.
|
||||
|
||||
Returns:
|
||||
bytes: The processed content with OMML elements replaced by their LaTeX equivalents, encoded as bytes.
|
||||
"""
|
||||
soup = BeautifulSoup(content.decode(), features="xml")
|
||||
for tag in soup.find_all("oMathPara"):
|
||||
_replace_equations(tag)
|
||||
for tag in soup.find_all("oMath"):
|
||||
_replace_equations(tag)
|
||||
return str(soup).encode()
|
||||
|
||||
|
||||
def pre_process_docx(input_docx: BinaryIO) -> BinaryIO:
|
||||
"""
|
||||
Pre-processes a DOCX file with provided steps.
|
||||
|
||||
The process works by unzipping the DOCX file in memory, transforming specific XML files
|
||||
(such as converting OMML elements to LaTeX), and then zipping everything back into a
|
||||
DOCX file without writing to disk.
|
||||
|
||||
Args:
|
||||
input_docx (BinaryIO): A binary input stream representing the DOCX file.
|
||||
|
||||
Returns:
|
||||
BinaryIO: A binary output stream representing the processed DOCX file.
|
||||
"""
|
||||
output_docx = BytesIO()
|
||||
# The files that need to be pre-processed from .docx
|
||||
pre_process_enable_files = [
|
||||
"word/document.xml",
|
||||
"word/footnotes.xml",
|
||||
"word/endnotes.xml",
|
||||
]
|
||||
with zipfile.ZipFile(input_docx, mode="r") as zip_input:
|
||||
files = {name: zip_input.read(name) for name in zip_input.namelist()}
|
||||
with zipfile.ZipFile(output_docx, mode="w") as zip_output:
|
||||
zip_output.comment = zip_input.comment
|
||||
for name, content in files.items():
|
||||
if name in pre_process_enable_files:
|
||||
try:
|
||||
# Pre-process the content
|
||||
updated_content = _pre_process_math(content)
|
||||
# In the future, if there are more pre-processing steps, they can be added here
|
||||
zip_output.writestr(name, updated_content)
|
||||
except Exception:
|
||||
# If there is an error in processing the content, write the original content
|
||||
zip_output.writestr(name, content)
|
||||
else:
|
||||
zip_output.writestr(name, content)
|
||||
output_docx.seek(0)
|
||||
return output_docx
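The core of `pre_process_docx` is an in-memory zip rewrite: read every member, transform a chosen few, and write everything to a new archive. The sketch below isolates that step under stated assumptions. `rewrite_zip_members` and its `transforms` parameter are hypothetical names introduced here, and the transform is an arbitrary bytes-to-bytes callable rather than the OMML conversion.

```python
import zipfile
from io import BytesIO
from typing import BinaryIO, Callable, Dict


def rewrite_zip_members(
    src: BinaryIO, transforms: Dict[str, Callable[[bytes], bytes]]
) -> BytesIO:
    """Copy a zip archive in memory, transforming selected members."""
    out = BytesIO()
    with zipfile.ZipFile(src, mode="r") as zin:
        with zipfile.ZipFile(out, mode="w") as zout:
            for name in zin.namelist():
                data = zin.read(name)
                transform = transforms.get(name)
                if transform is not None:
                    try:
                        data = transform(data)
                    except Exception:
                        # Keep the original bytes if the transform fails,
                        # mirroring the fallback in pre_process_docx.
                        pass
                zout.writestr(name, data)
    out.seek(0)  # rewind so callers can read the archive from the start
    return out
```

As in the original function, a failing transform leaves that member unchanged, so a math-conversion error would not stop the rest of the document from converting.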
|
||||
@@ -17,8 +17,12 @@ from ._image_converter import ImageConverter
|
||||
from ._audio_converter import AudioConverter
|
||||
from ._outlook_msg_converter import OutlookMsgConverter
|
||||
from ._zip_converter import ZipConverter
|
||||
from ._doc_intel_converter import DocumentIntelligenceConverter
|
||||
from ._doc_intel_converter import (
|
||||
DocumentIntelligenceConverter,
|
||||
DocumentIntelligenceFileType,
|
||||
)
|
||||
from ._epub_converter import EpubConverter
|
||||
from ._csv_converter import CsvConverter
|
||||
|
||||
__all__ = [
|
||||
"PlainTextConverter",
|
||||
@@ -38,5 +42,7 @@ __all__ = [
|
||||
"OutlookMsgConverter",
|
||||
"ZipConverter",
|
||||
"DocumentIntelligenceConverter",
|
||||
"DocumentIntelligenceFileType",
|
||||
"EpubConverter",
|
||||
"CsvConverter",
|
||||
]
|
||||
|
||||
@@ -1,5 +1,4 @@
|
||||
import io
|
||||
from typing import Any, BinaryIO, Optional
|
||||
from typing import Any, BinaryIO
|
||||
|
||||
from ._exiftool import exiftool_metadata
|
||||
from ._transcribe_audio import transcribe_audio
|
||||
|
||||
@@ -1,9 +1,8 @@
|
||||
import io
|
||||
import re
|
||||
import base64
|
||||
import binascii
|
||||
from urllib.parse import parse_qs, urlparse
|
||||
from typing import Any, BinaryIO, Optional
|
||||
from typing import Any, BinaryIO
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
@@ -79,7 +78,7 @@ class BingSerpConverter(DocumentConverter):
|
||||
slug.extract()
|
||||
|
||||
# Parse the algorithmic results
|
||||
_markdownify = _CustomMarkdownify()
|
||||
_markdownify = _CustomMarkdownify(**kwargs)
|
||||
results = list()
|
||||
for result in soup.find_all(class_="b_algo"):
|
||||
if not hasattr(result, "find_all"):
|
||||
|
||||
@@ -0,0 +1,77 @@
|
||||
import csv
|
||||
import io
|
||||
from typing import BinaryIO, Any
|
||||
from charset_normalizer import from_bytes
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/csv",
|
||||
"application/csv",
|
||||
]
|
||||
ACCEPTED_FILE_EXTENSIONS = [".csv"]
|
||||
|
||||
|
||||
class CsvConverter(DocumentConverter):
|
||||
"""
|
||||
Converts CSV files to Markdown tables.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Read the file content
|
||||
if stream_info.charset:
|
||||
content = file_stream.read().decode(stream_info.charset)
|
||||
else:
|
||||
content = str(from_bytes(file_stream.read()).best())
|
||||
|
||||
# Parse CSV content
|
||||
reader = csv.reader(io.StringIO(content))
|
||||
rows = list(reader)
|
||||
|
||||
if not rows:
|
||||
return DocumentConverterResult(markdown="")
|
||||
|
||||
# Create markdown table
|
||||
markdown_table = []
|
||||
|
||||
# Add header row
|
||||
markdown_table.append("| " + " | ".join(rows[0]) + " |")
|
||||
|
||||
# Add separator row
|
||||
markdown_table.append("| " + " | ".join(["---"] * len(rows[0])) + " |")
|
||||
|
||||
# Add data rows
|
||||
for row in rows[1:]:
|
||||
# Make sure row has the same number of columns as header
|
||||
while len(row) < len(rows[0]):
|
||||
row.append("")
|
||||
# Truncate if row has more columns than header
|
||||
row = row[: len(rows[0])]
|
||||
markdown_table.append("| " + " | ".join(row) + " |")
|
||||
|
||||
result = "\n".join(markdown_table)
|
||||
|
||||
return DocumentConverterResult(markdown=result)
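The table-building logic in `convert` can be exercised without the converter classes. The sketch below is a condensed restatement of those steps (header row, separator row, pad or truncate each data row to the header width). `csv_to_markdown` is a hypothetical helper name. Like the code above, it does not escape `|` characters inside cells.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, using the first row as header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Pad short rows and truncate long ones to the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("name,qty\nwidget,3\nbolt\n"))
```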
|
||||
@@ -1,12 +1,12 @@
|
||||
import sys
|
||||
import re
|
||||
|
||||
import os
|
||||
from typing import BinaryIO, Any, List
|
||||
from enum import Enum
|
||||
|
||||
from ._html_converter import HtmlConverter
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
from .._exceptions import MissingDependencyException
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
@@ -18,49 +18,113 @@ try:
|
||||
AnalyzeResult,
|
||||
DocumentAnalysisFeature,
|
||||
)
|
||||
from azure.core.credentials import AzureKeyCredential, TokenCredential
|
||||
from azure.identity import DefaultAzureCredential
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
# Define these types for type hinting when the package is not available
|
||||
class AzureKeyCredential:
|
||||
pass
|
||||
|
||||
class TokenCredential:
|
||||
pass
|
||||
|
||||
class DocumentIntelligenceClient:
|
||||
pass
|
||||
|
||||
class AnalyzeDocumentRequest:
|
||||
pass
|
||||
|
||||
class AnalyzeResult:
|
||||
pass
|
||||
|
||||
class DocumentAnalysisFeature:
|
||||
pass
|
||||
|
||||
class DefaultAzureCredential:
|
||||
pass
|
||||
|
||||
|
||||
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
|
||||
# This constant is a temporary fix until the bug is resolved.
|
||||
CONTENT_FORMAT = "markdown"
|
||||
|
||||
|
||||
OFFICE_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
"application/vnd.openxmlformats-officedocument.presentationml",
|
||||
"application/xhtml",
|
||||
"text/html",
|
||||
]
|
||||
class DocumentIntelligenceFileType(str, Enum):
|
||||
"""Enum of file types supported by the Document Intelligence Converter."""
|
||||
|
||||
OTHER_MIME_TYPE_PREFIXES = [
|
||||
"application/pdf",
|
||||
"application/x-pdf",
|
||||
"text/html",
|
||||
"image/",
|
||||
]
|
||||
# No OCR
|
||||
DOCX = "docx"
|
||||
PPTX = "pptx"
|
||||
XLSX = "xlsx"
|
||||
HTML = "html"
|
||||
# OCR
|
||||
PDF = "pdf"
|
||||
JPEG = "jpeg"
|
||||
PNG = "png"
|
||||
BMP = "bmp"
|
||||
TIFF = "tiff"
|
||||
|
||||
OFFICE_FILE_EXTENSIONS = [
|
||||
".docx",
|
||||
".xlsx",
|
||||
".pptx",
|
||||
".html",
|
||||
".htm",
|
||||
]
|
||||
|
||||
OTHER_FILE_EXTENSIONS = [
|
||||
".pdf",
|
||||
".jpeg",
|
||||
".jpg",
|
||||
".png",
|
||||
".bmp",
|
||||
".tiff",
|
||||
".heif",
|
||||
]
|
||||
def _get_mime_type_prefixes(types: List[DocumentIntelligenceFileType]) -> List[str]:
|
||||
"""Get the MIME type prefixes for the given file types."""
|
||||
prefixes: List[str] = []
|
||||
for type_ in types:
|
||||
if type_ == DocumentIntelligenceFileType.DOCX:
|
||||
prefixes.append(
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
||||
)
|
||||
elif type_ == DocumentIntelligenceFileType.PPTX:
|
||||
prefixes.append(
|
||||
"application/vnd.openxmlformats-officedocument.presentationml"
|
||||
)
|
||||
elif type_ == DocumentIntelligenceFileType.XLSX:
|
||||
prefixes.append(
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
|
||||
)
|
||||
elif type_ == DocumentIntelligenceFileType.HTML:
|
||||
prefixes.append("text/html")
|
||||
prefixes.append("application/xhtml+xml")
|
||||
elif type_ == DocumentIntelligenceFileType.PDF:
|
||||
prefixes.append("application/pdf")
|
||||
prefixes.append("application/x-pdf")
|
||||
elif type_ == DocumentIntelligenceFileType.JPEG:
|
||||
prefixes.append("image/jpeg")
|
||||
elif type_ == DocumentIntelligenceFileType.PNG:
|
||||
prefixes.append("image/png")
|
||||
elif type_ == DocumentIntelligenceFileType.BMP:
|
||||
prefixes.append("image/bmp")
|
||||
elif type_ == DocumentIntelligenceFileType.TIFF:
|
||||
prefixes.append("image/tiff")
|
||||
return prefixes
|
||||
|
||||
|
||||
def _get_file_extensions(types: List[DocumentIntelligenceFileType]) -> List[str]:
|
||||
"""Get the file extensions for the given file types."""
|
||||
extensions: List[str] = []
|
||||
for type_ in types:
|
||||
if type_ == DocumentIntelligenceFileType.DOCX:
|
||||
extensions.append(".docx")
|
||||
elif type_ == DocumentIntelligenceFileType.PPTX:
|
||||
extensions.append(".pptx")
|
||||
elif type_ == DocumentIntelligenceFileType.XLSX:
|
||||
extensions.append(".xlsx")
|
||||
elif type_ == DocumentIntelligenceFileType.PDF:
|
||||
extensions.append(".pdf")
|
||||
elif type_ == DocumentIntelligenceFileType.JPEG:
|
||||
extensions.append(".jpg")
|
||||
extensions.append(".jpeg")
|
||||
elif type_ == DocumentIntelligenceFileType.PNG:
|
||||
extensions.append(".png")
|
||||
elif type_ == DocumentIntelligenceFileType.BMP:
|
||||
extensions.append(".bmp")
|
||||
elif type_ == DocumentIntelligenceFileType.TIFF:
|
||||
extensions.append(".tiff")
|
||||
elif type_ == DocumentIntelligenceFileType.HTML:
|
||||
extensions.append(".html")
|
||||
return extensions
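The two helpers above map each file type to its MIME prefixes and extensions through `if`/`elif` chains. One possible alternative, shown here only as a sketch, is a dictionary lookup. The `FileType` enum and `_EXTENSIONS` table below are hypothetical and cover just three types; they are not part of the diff.

```python
from enum import Enum
from typing import Dict, List


class FileType(str, Enum):
    """Hypothetical, reduced stand-in for DocumentIntelligenceFileType."""

    DOCX = "docx"
    JPEG = "jpeg"
    HTML = "html"


# One entry per file type; a type may map to several extensions.
_EXTENSIONS: Dict[FileType, List[str]] = {
    FileType.DOCX: [".docx"],
    FileType.JPEG: [".jpg", ".jpeg"],
    FileType.HTML: [".html"],
}


def get_file_extensions(types: List[FileType]) -> List[str]:
    """Collect extensions for the given types, preserving input order."""
    extensions: List[str] = []
    for type_ in types:
        extensions.extend(_EXTENSIONS.get(type_, []))
    return extensions
```

Because the enum subclasses `str`, a value such as `"docx"` read from configuration can be turned into a member with `FileType("docx")`.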
|
||||
|
||||
|
||||
class DocumentIntelligenceConverter(DocumentConverter):
|
||||
@@ -71,8 +135,30 @@ class DocumentIntelligenceConverter(DocumentConverter):
|
||||
*,
|
||||
endpoint: str,
|
||||
api_version: str = "2024-07-31-preview",
|
||||
credential: AzureKeyCredential | TokenCredential | None = None,
|
||||
file_types: List[DocumentIntelligenceFileType] = [
|
||||
DocumentIntelligenceFileType.DOCX,
|
||||
DocumentIntelligenceFileType.PPTX,
|
||||
DocumentIntelligenceFileType.XLSX,
|
||||
DocumentIntelligenceFileType.PDF,
|
||||
DocumentIntelligenceFileType.JPEG,
|
||||
DocumentIntelligenceFileType.PNG,
|
||||
DocumentIntelligenceFileType.BMP,
|
||||
DocumentIntelligenceFileType.TIFF,
|
||||
],
|
||||
):
|
||||
"""
|
||||
Initialize the DocumentIntelligenceConverter.
|
||||
|
||||
Args:
|
||||
endpoint (str): The endpoint for the Document Intelligence service.
|
||||
api_version (str): The API version to use. Defaults to "2024-07-31-preview".
|
||||
credential (AzureKeyCredential | TokenCredential | None): The credential to use for authentication.
|
||||
file_types (List[DocumentIntelligenceFileType]): The file types to accept. Defaults to all supported file types.
|
||||
"""
|
||||
|
||||
super().__init__()
|
||||
self._file_types = file_types
|
||||
|
||||
# Raise an error if the dependencies are not available.
|
||||
# This is different than other converters since this one isn't even instantiated
|
||||
@@ -86,12 +172,18 @@ class DocumentIntelligenceConverter(DocumentConverter):
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
if credential is None:
|
||||
if os.environ.get("AZURE_API_KEY") is None:
|
||||
credential = DefaultAzureCredential()
|
||||
else:
|
||||
credential = AzureKeyCredential(os.environ["AZURE_API_KEY"])
|
||||
|
||||
self.endpoint = endpoint
|
||||
self.api_version = api_version
|
||||
self.doc_intel_client = DocumentIntelligenceClient(
|
||||
endpoint=self.endpoint,
|
||||
api_version=self.api_version,
|
||||
credential=DefaultAzureCredential(),
|
||||
credential=credential,
|
||||
)
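The constructor resolves its credential in a fixed order: an explicit `credential` argument first, then the `AZURE_API_KEY` environment variable, then `DefaultAzureCredential()`. The sketch below reproduces only that ordering. `KeyCredential`, `DefaultCredential`, and `resolve_credential` are hypothetical stand-ins so the example runs without the Azure SDK, and the `env` parameter is added here purely to make the lookup testable.

```python
import os
from typing import Mapping, Optional, Union


class KeyCredential:
    """Stand-in for azure.core.credentials.AzureKeyCredential."""

    def __init__(self, key: str) -> None:
        self.key = key


class DefaultCredential:
    """Stand-in for azure.identity.DefaultAzureCredential."""


Credential = Union[KeyCredential, DefaultCredential]


def resolve_credential(
    credential: Optional[Credential] = None,
    env: Mapping[str, str] = os.environ,
) -> Credential:
    # 1. An explicitly supplied credential always wins.
    if credential is not None:
        return credential
    # 2. Otherwise use an API key from the environment, if one is set.
    api_key = env.get("AZURE_API_KEY")
    if api_key is not None:
        return KeyCredential(api_key)
    # 3. Fall back to the default credential chain.
    return DefaultCredential()
```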
|
||||
|
||||
def accepts(
|
||||
@@ -103,10 +195,10 @@ class DocumentIntelligenceConverter(DocumentConverter):
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in OFFICE_FILE_EXTENSIONS + OTHER_FILE_EXTENSIONS:
|
||||
if extension in _get_file_extensions(self._file_types):
|
||||
return True
|
||||
|
||||
for prefix in OFFICE_MIME_TYPE_PREFIXES + OTHER_MIME_TYPE_PREFIXES:
|
||||
for prefix in _get_mime_type_prefixes(self._file_types):
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
@@ -121,10 +213,18 @@ class DocumentIntelligenceConverter(DocumentConverter):
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in OFFICE_FILE_EXTENSIONS:
|
||||
# Types that don't support OCR
|
||||
no_ocr_types = [
|
||||
DocumentIntelligenceFileType.DOCX,
|
||||
DocumentIntelligenceFileType.PPTX,
|
||||
DocumentIntelligenceFileType.XLSX,
|
||||
DocumentIntelligenceFileType.HTML,
|
||||
]
|
||||
|
||||
if extension in _get_file_extensions(no_ocr_types):
|
||||
return []
|
||||
|
||||
for prefix in OFFICE_MIME_TYPE_PREFIXES:
|
||||
for prefix in _get_mime_type_prefixes(no_ocr_types):
|
||||
if mimetype.startswith(prefix):
|
||||
return []
|
||||
|
||||
|
||||
@@ -1,9 +1,12 @@
|
||||
import sys
|
||||
import io
|
||||
from warnings import warn
|
||||
|
||||
from typing import BinaryIO, Any
|
||||
|
||||
from ._html_converter import HtmlConverter
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from ..converter_utils.docx.pre_process import pre_process_docx
|
||||
from .._base_converter import DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
@@ -12,6 +15,7 @@ from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import mammoth
|
||||
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
@@ -72,6 +76,8 @@ class DocxConverter(HtmlConverter):
|
||||
)
|
||||
|
||||
style_map = kwargs.get("style_map", None)
|
||||
pre_process_stream = pre_process_docx(file_stream)
|
||||
return self._html_converter.convert_string(
|
||||
mammoth.convert_to_html(file_stream, style_map=style_map).value
|
||||
mammoth.convert_to_html(pre_process_stream, style_map=style_map).value,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@@ -1,11 +1,12 @@
|
||||
import os
|
||||
import zipfile
|
||||
import xml.dom.minidom as minidom
|
||||
from defusedxml import minidom
|
||||
from xml.dom.minidom import Document
|
||||
|
||||
from typing import BinaryIO, Any, Dict, List
|
||||
|
||||
from ._html_converter import HtmlConverter
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._base_converter import DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
@@ -128,7 +129,7 @@ class EpubConverter(HtmlConverter):
|
||||
markdown="\n\n".join(markdown_content), title=metadata["title"]
|
||||
)
|
||||
|
||||
def _get_text_from_node(self, dom: minidom.Document, tag_name: str) -> str | None:
|
||||
def _get_text_from_node(self, dom: Document, tag_name: str) -> str | None:
|
||||
"""Convenience function to extract a single occurrence of a tag (e.g., title)."""
|
||||
texts = self._get_all_texts_from_nodes(dom, tag_name)
|
||||
if len(texts) > 0:
|
||||
@@ -136,9 +137,7 @@ class EpubConverter(HtmlConverter):
|
||||
else:
|
||||
return None
|
||||
|
||||
def _get_all_texts_from_nodes(
|
||||
self, dom: minidom.Document, tag_name: str
|
||||
) -> List[str]:
|
||||
def _get_all_texts_from_nodes(self, dom: Document, tag_name: str) -> List[str]:
|
||||
"""Helper function to extract all occurrences of a tag (e.g., multiple authors)."""
|
||||
texts: List[str] = []
|
||||
for node in dom.getElementsByTagName(tag_name):
|
||||
|
||||
@@ -1,11 +1,11 @@
|
||||
import json
|
||||
import subprocess
|
||||
import locale
|
||||
import sys
|
||||
import shutil
|
||||
import os
|
||||
import warnings
|
||||
from typing import BinaryIO, Any, Union
|
||||
import subprocess
|
||||
from typing import Any, BinaryIO, Union
|
||||
|
||||
|
||||
def _parse_version(version: str) -> tuple:
|
||||
return tuple(map(int, (version.split("."))))
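Parsing the version into a tuple of integers matters because version strings do not compare correctly as text. The sketch below illustrates this with the 12.24 minimum that the check in `exiftool_metadata` uses. `parse_version` restates the helper above, and `is_new_enough` is a hypothetical name added for the example.

```python
def parse_version(version: str) -> tuple:
    """Turn '12.24' into (12, 24); raises ValueError on non-numeric parts."""
    return tuple(map(int, version.split(".")))


MIN_VERSION = (12, 24)


def is_new_enough(version: str) -> bool:
    # Tuples compare element by element, so (12, 9) < (12, 24).
    return parse_version(version) >= MIN_VERSION


# String comparison gets this wrong: "12.9" sorts after "12.24".
assert "12.9" > "12.24"
assert not is_new_enough("12.9")
assert is_new_enough("12.24")
```

A version string with a non-numeric part makes `int()` raise `ValueError`, which is why the calling code catches that exception alongside `CalledProcessError`.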
|
||||
|
||||
|
||||
def exiftool_metadata(
|
||||
@@ -17,6 +17,24 @@ def exiftool_metadata(
|
||||
if not exiftool_path:
|
||||
return {}
|
||||
|
||||
# Verify exiftool version
|
||||
try:
|
||||
version_output = subprocess.run(
|
||||
[exiftool_path, "-ver"],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
check=True,
|
||||
).stdout.strip()
|
||||
version = _parse_version(version_output)
|
||||
min_version = (12, 24)
|
||||
if version < min_version:
|
||||
raise RuntimeError(
|
||||
f"ExifTool version {version_output} is vulnerable to CVE-2021-22204. "
|
||||
"Please upgrade to version 12.24 or later."
|
||||
)
|
||||
except (subprocess.CalledProcessError, ValueError) as e:
|
||||
raise RuntimeError("Failed to verify ExifTool version.") from e
|
||||
|
||||
# Run exiftool
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
|
||||
@@ -56,9 +56,9 @@ class HtmlConverter(DocumentConverter):
|
||||
body_elm = soup.find("body")
|
||||
webpage_text = ""
|
||||
if body_elm:
|
||||
webpage_text = _CustomMarkdownify().convert_soup(body_elm)
|
||||
webpage_text = _CustomMarkdownify(**kwargs).convert_soup(body_elm)
|
||||
else:
|
||||
webpage_text = _CustomMarkdownify().convert_soup(soup)
|
||||
webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)
|
||||
|
||||
assert isinstance(webpage_text, str)
|
||||
|
||||
|
||||
@@ -50,8 +50,6 @@ class IpynbConverter(DocumentConverter):
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Parse and convert the notebook
|
||||
result = None
|
||||
|
||||
encoding = stream_info.charset or "utf-8"
|
||||
notebook_content = file_stream.read().decode(encoding=encoding)
|
||||
return self._convert(json.loads(notebook_content))
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
from typing import BinaryIO, Any, Union
|
||||
from typing import BinaryIO, Union
|
||||
import base64
|
||||
import mimetypes
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
@@ -17,6 +17,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
|
||||
|
||||
def __init__(self, **options: Any):
|
||||
options["heading_style"] = options.get("heading_style", markdownify.ATX)
|
||||
options["keep_data_uris"] = options.get("keep_data_uris", False)
|
||||
# Explicitly cast options to the expected type if necessary
|
||||
super().__init__(**options)
|
||||
|
||||
@@ -91,9 +92,11 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
|
||||
"""Same as usual converter, but removes data URIs"""
|
||||
|
||||
alt = el.attrs.get("alt", None) or ""
|
||||
src = el.attrs.get("src", None) or ""
|
||||
src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
|
||||
title = el.attrs.get("title", None) or ""
|
||||
title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
|
||||
# Remove all line breaks from alt
|
||||
alt = alt.replace("\n", " ")
|
||||
if (
|
||||
convert_as_inline
|
||||
and el.parent.name not in self.options["keep_inline_images_in"]
|
||||
@@ -101,10 +104,23 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
|
||||
return alt
|
||||
|
||||
# Remove dataURIs
|
||||
if src.startswith("data:"):
|
||||
if src.startswith("data:") and not self.options["keep_data_uris"]:
|
||||
src = src.split(",")[0] + "..."
|
||||
|
||||
return "![%s](%s%s)" % (alt, src, title_part)
|
||||
|
||||
def convert_input(
|
||||
self,
|
||||
el: Any,
|
||||
text: str,
|
||||
convert_as_inline: Optional[bool] = False,
|
||||
**kwargs,
|
||||
) -> str:
|
||||
"""Convert checkboxes to Markdown [x]/[ ] syntax."""
|
||||
|
||||
if el.get("type") == "checkbox":
|
||||
return "[x] " if el.has_attr("checked") else "[ ] "
|
||||
return ""
|
||||
|
||||
def convert_soup(self, soup: Any) -> str:
|
||||
return super().convert_soup(soup) # type: ignore
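The new `keep_data_uris` option controls whether an embedded image payload survives conversion. The sketch below isolates that one decision as a standalone function. `shorten_data_uri` is a hypothetical name, and the real logic lives inside `convert_img` and reads the flag from `self.options`.

```python
def shorten_data_uri(src: str, keep_data_uris: bool = False) -> str:
    """Truncate a data URI to its header unless the caller opts to keep it."""
    if src.startswith("data:") and not keep_data_uris:
        # Keep only the part before the first comma (media type and encoding).
        return src.split(",")[0] + "..."
    return src


uri = "data:image/png;base64,iVBORw0KGgo="
print(shorten_data_uri(uri))                       # data:image/png;base64...
print(shorten_data_uri(uri, keep_data_uris=True))  # unchanged
```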
|
||||
|
||||
@@ -1,23 +1,69 @@
|
||||
import sys
|
||||
import io
|
||||
|
||||
import re
|
||||
from typing import BinaryIO, Any
|
||||
|
||||
|
||||
from ._html_converter import HtmlConverter
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
# Pattern for MasterFormat-style partial numbering (e.g., ".1", ".2", ".10")
|
||||
PARTIAL_NUMBERING_PATTERN = re.compile(r"^\.\d+$")
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
|
||||
def _merge_partial_numbering_lines(text: str) -> str:
|
||||
"""
|
||||
Post-process extracted text to merge MasterFormat-style partial numbering
|
||||
with the following text line.
|
||||
|
||||
MasterFormat documents use partial numbering like:
|
||||
.1 The intent of this Request for Proposal...
|
||||
.2 Available information relative to...
|
||||
|
||||
Some PDF extractors split these into separate lines:
|
||||
.1
|
||||
The intent of this Request for Proposal...
|
||||
|
||||
This function merges them back together.
|
||||
"""
|
||||
lines = text.split("\n")
|
||||
result_lines: list[str] = []
|
||||
i = 0
|
||||
|
||||
while i < len(lines):
|
||||
line = lines[i]
|
||||
stripped = line.strip()
|
||||
|
||||
# Check if this line is ONLY a partial numbering
|
||||
if PARTIAL_NUMBERING_PATTERN.match(stripped):
|
||||
# Look for the next non-empty line to merge with
|
||||
j = i + 1
|
||||
while j < len(lines) and not lines[j].strip():
|
||||
j += 1
|
||||
|
||||
if j < len(lines):
|
||||
# Merge the partial numbering with the next line
|
||||
next_line = lines[j].strip()
|
||||
result_lines.append(f"{stripped} {next_line}")
|
||||
i = j + 1 # Skip past the merged line
|
||||
else:
|
||||
# No next line to merge with, keep as is
|
||||
result_lines.append(line)
|
||||
i += 1
|
||||
else:
|
||||
result_lines.append(line)
|
||||
i += 1
|
||||
|
||||
return "\n".join(result_lines)
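The merge step depends on `PARTIAL_NUMBERING_PATTERN` matching only lines that consist of a partial number and nothing else. The short check below, using sample strings made up for illustration, shows which stripped lines the pattern accepts.

```python
import re

PARTIAL_NUMBERING_PATTERN = re.compile(r"^\.\d+$")

samples = [".1", ".10", "  .2  ", "1.1", ".1 The intent", "."]

# Lines are stripped before matching, as in _merge_partial_numbering_lines.
matches = [s for s in samples if PARTIAL_NUMBERING_PATTERN.match(s.strip())]
print(matches)
```

Only `.1`, `.10`, and the padded `.2` match. A line that already carries its text (`.1 The intent`) or a full section number (`1.1`) is left alone.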
|
||||
|
||||
|
||||
# Load dependencies
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import pdfminer
|
||||
import pdfminer.high_level
|
||||
import pdfplumber
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
|
||||
@@ -29,16 +75,388 @@ ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
ACCEPTED_FILE_EXTENSIONS = [".pdf"]
|
||||
|
||||
|
||||
def _to_markdown_table(table: list[list[str]], include_separator: bool = True) -> str:
|
||||
"""Convert a 2D list (rows/columns) into a nicely aligned Markdown table.
|
||||
|
||||
Args:
|
||||
table: 2D list of cell values
|
||||
include_separator: If True, include header separator row (standard markdown).
|
||||
If False, output simple pipe-separated rows.
|
||||
"""
|
||||
if not table:
|
||||
return ""
|
||||
|
||||
# Normalize None → ""
|
||||
table = [[cell if cell is not None else "" for cell in row] for row in table]
|
||||
|
||||
# Filter out empty rows
|
||||
table = [row for row in table if any(cell.strip() for cell in row)]
|
||||
|
||||
if not table:
|
||||
return ""
|
||||
|
||||
# Column widths
|
||||
col_widths = [max(len(str(cell)) for cell in col) for col in zip(*table)]
|
||||
|
||||
def fmt_row(row: list[str]) -> str:
|
||||
return (
|
||||
"|"
|
||||
+ "|".join(str(cell).ljust(width) for cell, width in zip(row, col_widths))
|
||||
+ "|"
|
||||
)
|
||||
|
||||
if include_separator:
|
||||
header, *rows = table
|
||||
md = [fmt_row(header)]
|
||||
md.append("|" + "|".join("-" * w for w in col_widths) + "|")
|
||||
for row in rows:
|
||||
md.append(fmt_row(row))
|
||||
else:
|
||||
md = [fmt_row(row) for row in table]
|
||||
|
||||
return "\n".join(md)
|
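As a rough illustration of the alignment logic in `_to_markdown_table`, the sketch below applies the same width calculation (`zip(*table)` plus `ljust`) to a small made-up table. The sample data and the standalone `fmt_row` helper are illustrative only.

```python
# Hypothetical sample data; the first row is treated as the header.
table = [["Item", "Qty"], ["Widget", "2"], ["Bolt", "10"]]

# Widest cell per column, computed by transposing the rows
col_widths = [max(len(cell) for cell in col) for col in zip(*table)]


def fmt_row(row):
    # Pad every cell to its column width so the pipes line up
    return "|" + "|".join(cell.ljust(w) for cell, w in zip(row, col_widths)) + "|"


lines = [fmt_row(table[0])]
lines.append("|" + "|".join("-" * w for w in col_widths) + "|")
lines.extend(fmt_row(row) for row in table[1:])
rendered = "\n".join(lines)
print(rendered)
```

Because `zip` stops at the shortest input, rows of unequal length would silently lose their trailing cells, so this approach seems to assume rectangular input.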
||||
|
||||
|
||||
def _extract_form_content_from_words(page: Any) -> str | None:
|
||||
"""
|
||||
Extract form-style content from a PDF page by analyzing word positions.
|
||||
This handles borderless forms/tables where words are aligned in columns.
|
||||
|
||||
Returns markdown with proper table formatting:
|
||||
- Tables have pipe-separated columns with header separator rows
|
||||
- Non-table content is rendered as plain text
|
||||
|
||||
Returns None if the page doesn't appear to be a form-style document,
|
||||
indicating that pdfminer should be used instead for better text spacing.
|
||||
"""
|
||||
words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
|
||||
if not words:
|
||||
return None
|
||||
|
||||
# Group words by their Y position (rows)
|
||||
y_tolerance = 5
|
||||
rows_by_y: dict[float, list[dict]] = {}
|
||||
for word in words:
|
||||
y_key = round(word["top"] / y_tolerance) * y_tolerance
|
||||
if y_key not in rows_by_y:
|
||||
rows_by_y[y_key] = []
|
||||
rows_by_y[y_key].append(word)
|
||||
|
||||
# Sort rows by Y position
|
||||
sorted_y_keys = sorted(rows_by_y.keys())
|
||||
page_width = page.width if hasattr(page, "width") else 612
|
||||
|
||||
# First pass: analyze each row
|
||||
row_info: list[dict] = []
|
||||
for y_key in sorted_y_keys:
|
||||
row_words = sorted(rows_by_y[y_key], key=lambda w: w["x0"])
|
||||
if not row_words:
|
||||
continue
|
||||
|
||||
first_x0 = row_words[0]["x0"]
|
||||
last_x1 = row_words[-1]["x1"]
|
||||
line_width = last_x1 - first_x0
|
||||
combined_text = " ".join(w["text"] for w in row_words)
|
||||
|
||||
# Count distinct x-position groups (columns)
|
||||
x_positions = [w["x0"] for w in row_words]
|
||||
x_groups: list[float] = []
|
||||
for x in sorted(x_positions):
|
||||
if not x_groups or x - x_groups[-1] > 50:
|
||||
x_groups.append(x)
|
||||
|
||||
# Determine row type
|
||||
is_paragraph = line_width > page_width * 0.55 and len(combined_text) > 60
|
||||
|
||||
# Check for MasterFormat-style partial numbering (e.g., ".1", ".2")
|
||||
# These should be treated as list items, not table rows
|
||||
has_partial_numbering = False
|
||||
if row_words:
|
||||
first_word = row_words[0]["text"].strip()
|
||||
if PARTIAL_NUMBERING_PATTERN.match(first_word):
|
||||
has_partial_numbering = True
|
||||
|
||||
row_info.append(
|
||||
{
|
||||
"y_key": y_key,
|
||||
"words": row_words,
|
||||
"text": combined_text,
|
||||
"x_groups": x_groups,
|
||||
"is_paragraph": is_paragraph,
|
||||
"num_columns": len(x_groups),
|
||||
"has_partial_numbering": has_partial_numbering,
|
||||
}
|
||||
)
|
||||
|
||||
# Collect ALL x-positions from rows with 3+ columns (table-like rows)
|
||||
# This gives us the global column structure
|
||||
all_table_x_positions: list[float] = []
|
||||
for info in row_info:
|
||||
if info["num_columns"] >= 3 and not info["is_paragraph"]:
|
||||
all_table_x_positions.extend(info["x_groups"])
|
||||
|
||||
if not all_table_x_positions:
|
||||
return None
|
||||
|
||||
# Compute global column boundaries
|
||||
all_table_x_positions.sort()
|
||||
global_columns: list[float] = []
|
||||
for x in all_table_x_positions:
|
||||
if not global_columns or x - global_columns[-1] > 30:
|
||||
global_columns.append(x)
|
||||
|
||||
# Too many columns suggests dense text, not a form
|
||||
if len(global_columns) > 8:
|
||||
return None
|
||||
|
||||
# Now classify each row as table row or not
|
||||
# A row is a table row if it has words that align with 2+ of the global columns
|
||||
for info in row_info:
|
||||
if info["is_paragraph"]:
|
||||
info["is_table_row"] = False
|
||||
continue
|
||||
|
||||
# Rows with partial numbering (e.g., ".1", ".2") are list items, not table rows
|
||||
if info["has_partial_numbering"]:
|
||||
info["is_table_row"] = False
|
||||
continue
|
||||
|
||||
# Count how many global columns this row's words align with
|
||||
aligned_columns: set[int] = set()
|
||||
for word in info["words"]:
|
||||
word_x = word["x0"]
|
||||
for col_idx, col_x in enumerate(global_columns):
|
||||
if abs(word_x - col_x) < 40:
|
||||
aligned_columns.add(col_idx)
|
||||
break
|
||||
|
||||
# If row uses 2+ of the established columns, it's a table row
|
||||
info["is_table_row"] = len(aligned_columns) >= 2
|
||||
|
||||
# Find table regions (consecutive table rows)
|
||||
table_regions: list[tuple[int, int]] = [] # (start_idx, end_idx)
|
||||
i = 0
|
||||
while i < len(row_info):
|
||||
if row_info[i]["is_table_row"]:
|
||||
start_idx = i
|
||||
while i < len(row_info) and row_info[i]["is_table_row"]:
|
||||
i += 1
|
||||
end_idx = i
|
||||
table_regions.append((start_idx, end_idx))
|
||||
else:
|
||||
i += 1
|
||||
|
||||
# Check if enough rows are table rows (at least 20%)
|
||||
total_table_rows = sum(end - start for start, end in table_regions)
|
||||
if len(row_info) > 0 and total_table_rows / len(row_info) < 0.2:
|
||||
return None
|
||||
|
||||
# Build output - collect table data first, then format with proper column widths
|
||||
result_lines: list[str] = []
|
||||
num_cols = len(global_columns)
|
||||
|
||||
# Helper function to extract cells from a row
|
||||
def extract_cells(info: dict) -> list[str]:
|
||||
cells: list[str] = ["" for _ in range(num_cols)]
|
||||
for word in info["words"]:
|
||||
word_x = word["x0"]
|
||||
# Find the correct column using boundary ranges
|
||||
assigned_col = num_cols - 1 # Default to last column
|
||||
for col_idx in range(num_cols - 1):
|
||||
col_end = global_columns[col_idx + 1]
|
||||
if word_x < col_end - 20:
|
||||
assigned_col = col_idx
|
||||
break
|
||||
if cells[assigned_col]:
|
||||
cells[assigned_col] += " " + word["text"]
|
||||
else:
|
||||
cells[assigned_col] = word["text"]
|
||||
return cells
|
||||
|
||||
# Process rows, collecting table data for proper formatting
|
||||
idx = 0
|
||||
while idx < len(row_info):
|
||||
info = row_info[idx]
|
||||
|
||||
# Check if this row starts a table region
|
||||
table_region = None
|
||||
for start, end in table_regions:
|
||||
if idx == start:
|
||||
table_region = (start, end)
|
||||
break
|
||||
|
||||
if table_region:
|
||||
start, end = table_region
|
||||
# Collect all rows in this table
|
||||
table_data: list[list[str]] = []
|
||||
for table_idx in range(start, end):
|
||||
cells = extract_cells(row_info[table_idx])
|
||||
table_data.append(cells)
|
||||
|
||||
# Calculate column widths for this table
|
||||
if table_data:
|
||||
col_widths = [
|
||||
max(len(row[col]) for row in table_data) for col in range(num_cols)
|
||||
]
|
||||
# Ensure minimum width of 3 for separator dashes
|
||||
col_widths = [max(w, 3) for w in col_widths]
|
||||
|
||||
# Format header row
|
||||
header = table_data[0]
|
||||
header_str = (
|
||||
"| "
|
||||
+ " | ".join(
|
||||
cell.ljust(col_widths[i]) for i, cell in enumerate(header)
|
||||
)
|
||||
+ " |"
|
||||
)
|
||||
result_lines.append(header_str)
|
||||
|
||||
# Format separator row
|
||||
separator = (
|
||||
"| "
|
||||
+ " | ".join("-" * col_widths[i] for i in range(num_cols))
|
||||
+ " |"
|
||||
)
|
||||
result_lines.append(separator)
|
||||
|
||||
# Format data rows
|
||||
for row in table_data[1:]:
|
||||
row_str = (
|
||||
"| "
|
||||
+ " | ".join(
|
||||
cell.ljust(col_widths[i]) for i, cell in enumerate(row)
|
||||
)
|
||||
+ " |"
|
||||
)
|
||||
result_lines.append(row_str)
|
||||
|
||||
idx = end # Skip to end of table region
|
||||
else:
|
||||
# Check if we're inside a table region (not at start)
|
||||
in_table = False
|
||||
for start, end in table_regions:
|
||||
if start < idx < end:
|
||||
in_table = True
|
||||
break
|
||||
|
||||
if not in_table:
|
||||
# Non-table content
|
||||
result_lines.append(info["text"])
|
||||
idx += 1
|
||||
|
||||
return "\n".join(result_lines)
|
||||
|
||||
|
||||
def _extract_tables_from_words(page: Any) -> list[list[list[str]]]:
|
||||
"""
|
||||
Extract tables from a PDF page by analyzing word positions.
|
||||
This handles borderless tables where words are aligned in columns.
|
||||
|
||||
This function is designed for structured tabular data (like invoices),
|
||||
not for multi-column text layouts in scientific documents.
|
||||
"""
|
||||
words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
|
||||
if not words:
|
||||
return []
|
||||
|
||||
# Group words by their Y position (rows)
|
||||
y_tolerance = 5
|
||||
rows_by_y: dict[float, list[dict]] = {}
|
||||
for word in words:
|
||||
y_key = round(word["top"] / y_tolerance) * y_tolerance
|
||||
if y_key not in rows_by_y:
|
||||
rows_by_y[y_key] = []
|
||||
rows_by_y[y_key].append(word)
|
||||
|
||||
# Sort rows by Y position
|
||||
sorted_y_keys = sorted(rows_by_y.keys())
|
||||
|
||||
# Find potential column boundaries by analyzing x positions across all rows
|
||||
all_x_positions = []
|
||||
for words_in_row in rows_by_y.values():
|
||||
for word in words_in_row:
|
||||
all_x_positions.append(word["x0"])
|
||||
|
||||
if not all_x_positions:
|
||||
return []
|
||||
|
||||
# Cluster x positions to find column starts
|
||||
all_x_positions.sort()
|
||||
x_tolerance_col = 20
|
||||
column_starts: list[float] = []
|
||||
for x in all_x_positions:
|
||||
if not column_starts or x - column_starts[-1] > x_tolerance_col:
|
||||
column_starts.append(x)
|
||||
|
||||
# Need at least 3 columns but not too many (likely text layout, not table)
|
||||
if len(column_starts) < 3 or len(column_starts) > 10:
|
||||
return []
|
||||
|
||||
# Find rows that span multiple columns (potential table rows)
|
||||
table_rows = []
|
||||
for y_key in sorted_y_keys:
|
||||
words_in_row = sorted(rows_by_y[y_key], key=lambda w: w["x0"])
|
||||
|
||||
# Assign words to columns
|
||||
row_data = [""] * len(column_starts)
|
||||
for word in words_in_row:
|
||||
# Find the closest column
|
||||
best_col = 0
|
||||
min_dist = float("inf")
|
||||
for i, col_x in enumerate(column_starts):
|
||||
dist = abs(word["x0"] - col_x)
|
||||
if dist < min_dist:
|
||||
min_dist = dist
|
||||
best_col = i
|
||||
|
||||
if row_data[best_col]:
|
||||
row_data[best_col] += " " + word["text"]
|
||||
else:
|
||||
row_data[best_col] = word["text"]
|
||||
|
||||
# Only include rows that have content in multiple columns
|
||||
non_empty = sum(1 for cell in row_data if cell.strip())
|
||||
if non_empty >= 2:
|
||||
table_rows.append(row_data)
|
||||
|
||||
# Validate table quality - tables should have:
|
||||
# 1. Enough rows (at least 3 including header)
|
||||
# 2. Short cell content (tables have concise data, not paragraphs)
|
||||
# 3. Consistent structure across rows
|
||||
if len(table_rows) < 3:
|
||||
return []
|
||||
|
||||
# Check if cells contain short, structured data (not long text)
|
||||
long_cell_count = 0
|
||||
total_cell_count = 0
|
||||
for row in table_rows:
|
||||
for cell in row:
|
||||
if cell.strip():
|
||||
total_cell_count += 1
|
||||
# If cell has more than 30 chars, it's likely prose text
|
||||
if len(cell.strip()) > 30:
|
||||
long_cell_count += 1
|
||||
|
||||
# If more than 30% of cells are long, this is probably not a table
|
||||
if total_cell_count > 0 and long_cell_count / total_cell_count > 0.3:
|
||||
return []
|
||||
|
||||
return [table_rows]
|
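The column detection in `_extract_tables_from_words` comes down to two small steps: gap-based clustering of `x0` values, then nearest-start assignment. The sketch below isolates both, with made-up coordinates and hypothetical helper names (`cluster_starts`, `nearest_column`).

```python
def cluster_starts(xs, gap):
    """Keep an x position as a new column start when it lies more than `gap` past the last start."""
    starts = []
    for x in sorted(xs):
        if not starts or x - starts[-1] > gap:
            starts.append(x)
    return starts


def nearest_column(x0, starts):
    """Index of the column start closest to x0 (first one wins on ties)."""
    return min(range(len(starts)), key=lambda i: abs(x0 - starts[i]))


# Made-up word x0 values from three visually aligned columns
xs = [72.0, 74.5, 200.0, 203.0, 410.0, 75.0, 198.0]
starts = cluster_starts(xs, gap=20)
print(starts)
print(nearest_column(205.0, starts))
```

Each position is compared with the last recorded start rather than with the previous position, so a long run of closely spaced words will eventually open a new column once it drifts more than `gap` from that start.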
||||
|
||||
|
||||
class PdfConverter(DocumentConverter):
|
||||
"""
|
||||
Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.
|
||||
Converts PDFs to Markdown.
|
||||
Supports extracting tables into aligned Markdown format (via pdfplumber).
|
||||
Falls back to pdfminer if pdfplumber is missing or fails.
|
||||
"""
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
**kwargs: Any,
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
@@ -56,9 +474,8 @@ class PdfConverter(DocumentConverter):
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
# Check the dependencies
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
@@ -66,13 +483,58 @@ class PdfConverter(DocumentConverter):
|
||||
extension=".pdf",
|
||||
feature="pdf",
|
||||
)
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
) from _dependency_exc_info[1].with_traceback(
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
) # type: ignore[union-attr]
|
||||
|
||||
assert isinstance(file_stream, io.IOBase) # for mypy
|
||||
return DocumentConverterResult(
|
||||
markdown=pdfminer.high_level.extract_text(file_stream),
|
||||
)
|
||||
assert isinstance(file_stream, io.IOBase)
|
||||
|
||||
markdown_chunks: list[str] = []
|
||||
|
||||
# Read file stream into BytesIO for compatibility with pdfplumber
|
||||
pdf_bytes = io.BytesIO(file_stream.read())
|
||||
|
||||
try:
|
||||
# Track how many pages are form-style vs plain text
|
||||
form_pages = 0
|
||||
plain_pages = 0
|
||||
|
||||
with pdfplumber.open(pdf_bytes) as pdf:
|
||||
for page in pdf.pages:
|
||||
# Try form-style word position extraction
|
||||
page_content = _extract_form_content_from_words(page)
|
||||
|
||||
# If extraction returns None, this page is not form-style
|
||||
if page_content is None:
|
||||
plain_pages += 1
|
||||
# Extract text using pdfplumber's basic extraction for this page
|
||||
text = page.extract_text()
|
||||
if text and text.strip():
|
||||
markdown_chunks.append(text.strip())
|
||||
else:
|
||||
form_pages += 1
|
||||
if page_content.strip():
|
||||
markdown_chunks.append(page_content)
|
||||
|
||||
# If most pages are plain text, use pdfminer for better text handling
|
||||
if plain_pages > form_pages and plain_pages > 0:
|
||||
pdf_bytes.seek(0)
|
||||
markdown = pdfminer.high_level.extract_text(pdf_bytes)
|
||||
else:
|
||||
# Build markdown from chunks
|
||||
markdown = "\n\n".join(markdown_chunks).strip()
|
||||
|
||||
except Exception:
|
||||
# Fallback if pdfplumber fails
|
||||
pdf_bytes.seek(0)
|
||||
markdown = pdfminer.high_level.extract_text(pdf_bytes)
|
||||
|
||||
# Fallback if still empty
|
||||
if not markdown:
|
||||
pdf_bytes.seek(0)
|
||||
markdown = pdfminer.high_level.extract_text(pdf_bytes)
|
||||
|
||||
# Post-process to merge MasterFormat-style partial numbering with following text
|
||||
markdown = _merge_partial_numbering_lines(markdown)
|
||||
|
||||
return DocumentConverterResult(markdown=markdown)
|
||||
|
||||
@@ -9,7 +9,7 @@ from .._stream_info import StreamInfo
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import mammoth
|
||||
import mammoth # noqa: F401
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
@@ -17,12 +17,16 @@ except ImportError:
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/",
|
||||
"application/json",
|
||||
"application/markdown",
|
||||
]
|
||||
|
||||
# Mimetypes to ignore (commonly confused extensions)
|
||||
IGNORE_MIME_TYPE_PREFIXES = [
|
||||
"text/vnd.in3d.spot",  # .spo which is confused with xls, doc, etc.
|
||||
"text/vnd.graphviz", # .dot which is confused with xls, doc, etc.
|
||||
ACCEPTED_FILE_EXTENSIONS = [
|
||||
".txt",
|
||||
".text",
|
||||
".md",
|
||||
".markdown",
|
||||
".json",
|
||||
".jsonl",
|
||||
]
|
||||
|
||||
|
||||
@@ -38,9 +42,14 @@ class PlainTextConverter(DocumentConverter):
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
for prefix in IGNORE_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return False
|
||||
# If we have a charset, we can safely assume it's text
|
||||
# With Magika in the earlier stages, this handles most cases
|
||||
if stream_info.charset is not None:
|
||||
return True
|
||||
|
||||
# Otherwise, check the mimetype and extension
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
|
||||
@@ -140,13 +140,20 @@ class PptxConverter(DocumentConverter):
|
||||
alt_text = re.sub(r"[\r\n\[\]]", " ", alt_text)
|
||||
alt_text = re.sub(r"\s+", " ", alt_text).strip()
|
||||
|
||||
# A placeholder name
|
||||
filename = re.sub(r"\W", "", shape.name) + ".jpg"
|
||||
md_content += "\n\n"
|
||||
# If keep_data_uris is True, use base64 encoding for images
|
||||
if kwargs.get("keep_data_uris", False):
|
||||
blob = shape.image.blob
|
||||
content_type = shape.image.content_type or "image/png"
|
||||
b64_string = base64.b64encode(blob).decode("utf-8")
|
||||
md_content += f"\n\n"
|
||||
else:
|
||||
# A placeholder name
|
||||
filename = re.sub(r"\W", "", shape.name) + ".jpg"
|
||||
md_content += "\n\n"
|
||||
|
||||
# Tables
|
||||
if self._is_table(shape):
|
||||
md_content += self._convert_table_to_markdown(shape.table)
|
||||
md_content += self._convert_table_to_markdown(shape.table, **kwargs)
|
||||
|
||||
# Charts
|
||||
if shape.has_chart:
|
||||
@@ -161,11 +168,23 @@ class PptxConverter(DocumentConverter):
|
||||
|
||||
# Group Shapes
|
||||
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
|
||||
sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
|
||||
sorted_shapes = sorted(
|
||||
shape.shapes,
|
||||
key=lambda x: (
|
||||
float("-inf") if not x.top else x.top,
|
||||
float("-inf") if not x.left else x.left,
|
||||
),
|
||||
)
|
||||
for subshape in sorted_shapes:
|
||||
get_shape_content(subshape, **kwargs)
|
||||
|
||||
sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
|
||||
sorted_shapes = sorted(
|
||||
slide.shapes,
|
||||
key=lambda x: (
|
||||
float("-inf") if not x.top else x.top,
|
||||
float("-inf") if not x.left else x.left,
|
||||
),
|
||||
)
|
||||
for shape in sorted_shapes:
|
||||
get_shape_content(shape, **kwargs)
|
||||
|
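The switch from `attrgetter("top", "left")` to a lambda key presumably guards against shapes whose position is `None`. A minimal sketch of the difference, using `SimpleNamespace` objects as stand-ins for pptx shapes (the shape names and coordinates are invented):

```python
from operator import attrgetter
from types import SimpleNamespace

# Stand-ins for shapes; one has no explicit position
shapes = [
    SimpleNamespace(name="body", top=300, left=50),
    SimpleNamespace(name="no_pos", top=None, left=None),
    SimpleNamespace(name="title", top=100, left=50),
]


def position_key(shape):
    # Missing (or zero) coordinates sort first via negative infinity
    return (
        float("-inf") if not shape.top else shape.top,
        float("-inf") if not shape.left else shape.left,
    )


ordered = [s.name for s in sorted(shapes, key=position_key)]
print(ordered)

# The plain attrgetter key fails once None has to be compared with an int
try:
    sorted(shapes, key=attrgetter("top", "left"))
    attrgetter_failed = False
except TypeError:
    attrgetter_failed = True
print(attrgetter_failed)
```

Since `not 0` is also true, a coordinate of exactly `0` is treated the same as a missing one and sorts before any negative offset.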
||||
@@ -193,7 +212,7 @@ class PptxConverter(DocumentConverter):
|
||||
return True
|
||||
return False
|
||||
|
||||
def _convert_table_to_markdown(self, table):
|
||||
def _convert_table_to_markdown(self, table, **kwargs):
|
||||
# Write the table as HTML, then convert it to Markdown
|
||||
html_table = "<html><body><table>"
|
||||
first_row = True
|
||||
@@ -208,7 +227,10 @@ class PptxConverter(DocumentConverter):
|
||||
first_row = False
|
||||
html_table += "</table></body></html>"
|
||||
|
||||
return self._html_converter.convert_string(html_table).markdown.strip() + "\n"
|
||||
return (
|
||||
self._html_converter.convert_string(html_table, **kwargs).markdown.strip()
|
||||
+ "\n"
|
||||
)
|
||||
|
||||
def _convert_chart_to_markdown(self, chart):
|
||||
try:
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
from xml.dom import minidom
|
||||
from defusedxml import minidom
|
||||
from xml.dom.minidom import Document, Element
|
||||
from typing import BinaryIO, Any, Union
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
@@ -28,6 +29,10 @@ CANDIDATE_FILE_EXTENSIONS = [
|
||||
class RssConverter(DocumentConverter):
|
||||
"""Convert RSS / Atom type to markdown"""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._kwargs = {}
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
@@ -82,6 +87,7 @@ class RssConverter(DocumentConverter):
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
self._kwargs = kwargs
|
||||
doc = minidom.parse(file_stream)
|
||||
feed_type = self._feed_type(doc)
|
||||
|
||||
@@ -92,7 +98,7 @@ class RssConverter(DocumentConverter):
|
||||
else:
|
||||
raise ValueError("Unknown feed type")
|
||||
|
||||
def _parse_atom_type(self, doc: minidom.Document) -> DocumentConverterResult:
|
||||
def _parse_atom_type(self, doc: Document) -> DocumentConverterResult:
|
||||
"""Parse the type of an Atom feed.
|
||||
|
||||
Returns None if the feed type is not recognized or something goes wrong.
|
||||
@@ -124,7 +130,7 @@ class RssConverter(DocumentConverter):
|
||||
title=title,
|
||||
)
|
||||
|
||||
def _parse_rss_type(self, doc: minidom.Document) -> DocumentConverterResult:
|
||||
def _parse_rss_type(self, doc: Document) -> DocumentConverterResult:
|
||||
"""Parse the type of an RSS feed.
|
||||
|
||||
Returns None if the feed type is not recognized or something goes wrong.
|
||||
@@ -166,12 +172,12 @@ class RssConverter(DocumentConverter):
|
||||
try:
|
||||
# using bs4 because many RSS feeds have HTML-styled content
|
||||
soup = BeautifulSoup(content, "html.parser")
|
||||
return _CustomMarkdownify().convert_soup(soup)
|
||||
return _CustomMarkdownify(**self._kwargs).convert_soup(soup)
|
||||
except BaseException as _:
|
||||
return content
|
||||
|
||||
def _get_data_by_tag_name(
|
||||
self, element: minidom.Element, tag_name: str
|
||||
self, element: Element, tag_name: str
|
||||
) -> Union[str, None]:
|
||||
"""Get data from first child element with the given tag name.
|
||||
Returns None when no such element is found.
|
||||
|
||||
@@ -7,20 +7,14 @@ from .._exceptions import MissingDependencyException
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
# Suppress some deprecation warnings from the speech_recognition library
|
||||
# Suppress some warnings on library import
|
||||
import warnings
|
||||
|
||||
warnings.filterwarnings(
|
||||
"ignore", category=DeprecationWarning, module="speech_recognition"
|
||||
)
|
||||
warnings.filterwarnings(
|
||||
"ignore",
|
||||
category=SyntaxWarning,
|
||||
module="pydub", # TODO: Migrate away from pydub
|
||||
)
|
||||
import speech_recognition as sr
|
||||
|
||||
import pydub
|
||||
with warnings.catch_warnings():
|
||||
warnings.filterwarnings("ignore", category=DeprecationWarning)
|
||||
warnings.filterwarnings("ignore", category=SyntaxWarning)
|
||||
import speech_recognition as sr
|
||||
import pydub
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
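Wrapping the imports in `warnings.catch_warnings()` keeps the filters local to the import, whereas a bare `warnings.filterwarnings(...)` call changes the process-wide filter list. A small sketch of that scoping, using only the standard library:

```python
import warnings

# Snapshot the global filter list before entering the context manager
before = list(warnings.filters)

with warnings.catch_warnings():
    # These filters apply only inside the block
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=SyntaxWarning)
    # ... a noisy import would go here ...

# On exit the previous filter list is restored
after = list(warnings.filters)
print(before == after)
```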
||||
@@ -1,7 +1,6 @@
|
||||
import io
|
||||
import re
|
||||
import bs4
|
||||
from typing import Any, BinaryIO, Optional
|
||||
from typing import Any, BinaryIO
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
@@ -76,11 +75,11 @@ class WikipediaConverter(DocumentConverter):
|
||||
main_title = title_elm.string
|
||||
|
||||
# Convert the page
|
||||
webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify().convert_soup(
|
||||
body_elm
|
||||
)
|
||||
webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify(
|
||||
**kwargs
|
||||
).convert_soup(body_elm)
|
||||
else:
|
||||
webpage_text = _CustomMarkdownify().convert_soup(soup)
|
||||
webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=webpage_text,
|
||||
|
||||
@@ -10,14 +10,14 @@ from .._stream_info import StreamInfo
|
||||
_xlsx_dependency_exc_info = None
|
||||
try:
|
||||
import pandas as pd
|
||||
import openpyxl
|
||||
import openpyxl # noqa: F401
|
||||
except ImportError:
|
||||
_xlsx_dependency_exc_info = sys.exc_info()
|
||||
|
||||
_xls_dependency_exc_info = None
|
||||
try:
|
||||
import pandas as pd
|
||||
import xlrd
|
||||
import pandas as pd # noqa: F811
|
||||
import xlrd # noqa: F401
|
||||
except ImportError:
|
||||
_xls_dependency_exc_info = sys.exc_info()
|
||||
|
||||
@@ -86,7 +86,9 @@ class XlsxConverter(DocumentConverter):
|
||||
md_content += f"## {s}\n"
|
||||
html_content = sheets[s].to_html(index=False)
|
||||
md_content += (
|
||||
self._html_converter.convert_string(html_content).markdown.strip()
|
||||
self._html_converter.convert_string(
|
||||
html_content, **kwargs
|
||||
).markdown.strip()
|
||||
+ "\n\n"
|
||||
)
|
||||
|
||||
@@ -146,7 +148,9 @@ class XlsConverter(DocumentConverter):
|
||||
md_content += f"## {s}\n"
|
||||
html_content = sheets[s].to_html(index=False)
|
||||
md_content += (
|
||||
self._html_converter.convert_string(html_content).markdown.strip()
|
||||
self._html_converter.convert_string(
|
||||
html_content, **kwargs
|
||||
).markdown.strip()
|
||||
+ "\n\n"
|
||||
)
|
||||
|
||||
|
||||
@@ -1,25 +1,22 @@
|
||||
import sys
|
||||
import json
|
||||
import time
|
||||
import io
|
||||
import re
|
||||
import bs4
|
||||
import warnings
|
||||
from typing import Any, BinaryIO, Optional, Dict, List, Union
|
||||
from typing import Any, BinaryIO, Dict, List, Union
|
||||
from urllib.parse import parse_qs, urlparse, unquote
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from ._markdownify import _CustomMarkdownify
|
||||
|
||||
# Optional YouTube transcription support
|
||||
try:
|
||||
warnings.filterwarnings(
|
||||
"ignore",
|
||||
category=SyntaxWarning,
|
||||
module="youtube_transcript_api", # Patch submitted to youtube-transcript-api
|
||||
)
|
||||
from youtube_transcript_api import YouTubeTranscriptApi
|
||||
# Suppress some warnings on library import
|
||||
import warnings
|
||||
|
||||
with warnings.catch_warnings():
|
||||
warnings.filterwarnings("ignore", category=SyntaxWarning)
|
||||
# Patch submitted upstream to fix the SyntaxWarning
|
||||
from youtube_transcript_api import YouTubeTranscriptApi
|
||||
|
||||
IS_YOUTUBE_TRANSCRIPT_CAPABLE = True
|
||||
except ModuleNotFoundError:
|
||||
@@ -148,32 +145,46 @@ class YouTubeConverter(DocumentConverter):
|
||||
webpage_text += f"\n### Description\n{description}\n"
|
||||
|
||||
if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
|
||||
ytt_api = YouTubeTranscriptApi()
|
||||
transcript_text = ""
|
||||
parsed_url = urlparse(stream_info.url) # type: ignore
|
||||
params = parse_qs(parsed_url.query) # type: ignore
|
||||
if "v" in params and params["v"][0]:
|
||||
video_id = str(params["v"][0])
|
||||
transcript_list = ytt_api.list(video_id)
|
||||
languages = ["en"]
|
||||
for transcript in transcript_list:
|
||||
languages.append(transcript.language_code)
|
||||
break
|
||||
try:
|
||||
youtube_transcript_languages = kwargs.get(
|
||||
"youtube_transcript_languages", ("en",)
|
||||
"youtube_transcript_languages", languages
|
||||
)
|
||||
# Retry the transcript fetching operation
|
||||
transcript = self._retry_operation(
|
||||
lambda: YouTubeTranscriptApi.get_transcript(
|
||||
lambda: ytt_api.fetch(
|
||||
video_id, languages=youtube_transcript_languages
|
||||
),
|
||||
retries=3, # Retry 3 times
|
||||
delay=2, # 2 seconds delay between retries
|
||||
)
|
||||
|
||||
if transcript:
|
||||
transcript_text = " ".join(
|
||||
[part["text"] for part in transcript]
|
||||
[part.text for part in transcript]
|
||||
) # type: ignore
|
||||
# Alternative formatting:
|
||||
# formatter = TextFormatter()
|
||||
# formatter.format_transcript(transcript)
|
||||
except Exception as e:
|
||||
print(f"Error fetching transcript: {e}")
|
||||
# No transcript available
|
||||
if len(languages) == 1:
|
||||
print(f"Error fetching transcript: {e}")
|
||||
else:
|
||||
# Translate transcript into first kwarg
|
||||
transcript = (
|
||||
transcript_list.find_transcript(languages)
|
||||
.translate(youtube_transcript_languages[0])
|
||||
.fetch()
|
||||
)
|
||||
transcript_text = " ".join([part.text for part in transcript])
|
||||
if transcript_text:
|
||||
webpage_text += f"\n### Transcript\n{transcript_text}\n"
|
||||
|
||||
|
||||
@@ -1,4 +1,3 @@
|
||||
import sys
|
||||
import zipfile
|
||||
import io
|
||||
import os
|
||||
|
||||
@@ -25,8 +25,11 @@ GENERAL_TEST_VECTORS = [
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"data:image/png;base64...",
|
||||
],
|
||||
must_not_include=[
|
||||
"data:image/png;base64,iVBORw0KGgoAAAANSU",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.xlsx",
|
||||
@@ -65,8 +68,9 @@ GENERAL_TEST_VECTORS = [
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
|
||||
"2003", # chart value
|
||||
"",
|
||||
],
|
||||
must_not_include=[],
|
||||
must_not_include=["data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE"],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_outlook_msg.msg",
|
||||
@@ -140,10 +144,11 @@ GENERAL_TEST_VECTORS = [
|
||||
charset="cp932",
|
||||
url=None,
|
||||
must_include=[
|
||||
"名前,年齢,住所",
|
||||
"佐藤太郎,30,東京",
|
||||
"三木英子,25,大阪",
|
||||
"髙橋淳,35,名古屋",
|
||||
"| 名前 | 年齢 | 住所 |",
|
||||
"| --- | --- | --- |",
|
||||
"| 佐藤太郎 | 30 | 東京 |",
|
||||
"| 三木英子 | 25 | 大阪 |",
|
||||
"| 髙橋淳 | 35 | 名古屋 |",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
@@ -230,3 +235,45 @@ GENERAL_TEST_VECTORS = [
|
||||
must_not_include=[],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
DATA_URI_TEST_VECTORS = [
|
||||
FileTestVector(
|
||||
filename="test.docx",
|
||||
mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
|
||||
"49e168b7-d2ae-407f-a055-2167576f39a1",
|
||||
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"data:image/png;base64,iVBORw0KGgoAAAANSU",
|
||||
],
|
||||
must_not_include=[
|
||||
"data:image/png;base64...",
|
||||
],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.pptx",
|
||||
mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
|
||||
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
|
||||
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
|
||||
"1b92870d-e3b5-4e65-8153-919f4ff45592",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
|
||||
"2003", # chart value
|
||||
"![This phrase of the caption is Human-written.]", # image caption
|
||||
"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE",
|
||||
],
|
||||
must_not_include=[
|
||||
"",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
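The `data:image/png;base64,iVBORw0KGgo...` prefix expected by these vectors is what base64 encoding of the PNG file signature produces. The sketch below builds a data URI the way the `keep_data_uris` branch of the PPTX converter does, using only the 8-byte PNG signature as a stand-in for a real image blob:

```python
import base64

# The fixed 8-byte PNG signature, used here in place of a full image
blob = b"\x89PNG\r\n\x1a\n"
content_type = "image/png"

b64_string = base64.b64encode(blob).decode("utf-8")
data_uri = f"data:{content_type};base64,{b64_string}"
print(data_uri)
```

Any valid PNG therefore yields a URI that begins with `data:image/png;base64,iVBORw0KGgo`, which is why the vectors can check for a fixed prefix without embedding the whole image.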
||||
@@ -1,6 +1,5 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import subprocess
|
||||
import pytest
|
||||
from markitdown import __version__
|
||||
|
||||
# This file contains CLI tests that are not directly tested by the FileTestVectors.
|
||||
@@ -24,8 +23,8 @@ def test_invalid_flag() -> None:
    assert result.returncode != 0, f"CLI exited with error: {result.stderr}"
    assert (
        "unrecognized arguments" in result.stderr
    ), f"Expected 'unrecognized arguments' to appear in STDERR"
    assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"
    ), "Expected 'unrecognized arguments' to appear in STDERR"
    assert "SYNTAX" in result.stderr, "Expected 'SYNTAX' to appear in STDERR"


if __name__ == "__main__":

@@ -7,16 +7,17 @@ import locale
from typing import List

if __name__ == "__main__":
    from _test_vectors import GENERAL_TEST_VECTORS, FileTestVector
    from _test_vectors import (
        GENERAL_TEST_VECTORS,
        DATA_URI_TEST_VECTORS,
        FileTestVector,
    )
else:
    from ._test_vectors import GENERAL_TEST_VECTORS, FileTestVector

from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
    FileConversionException,
    StreamInfo,
)
    from ._test_vectors import (
        GENERAL_TEST_VECTORS,
        DATA_URI_TEST_VECTORS,
        FileTestVector,
    )

skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
@@ -132,8 +133,6 @@ def test_convert_url(shared_tmp_dir, test_vector):
    """Test the conversion of a stream with no stream info."""
    # Note: tmp_dir is not used here, but is needed to match the signature

    markitdown = MarkItDown()

    time.sleep(1)  # Ensure we don't hit rate limits
    result = subprocess.run(
        ["python", "-m", "markitdown", TEST_FILES_URL + "/" + test_vector.filename],
@@ -149,13 +148,46 @@ def test_convert_url(shared_tmp_dir, test_vector):
        assert test_string not in stdout


@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
def test_output_to_file_with_data_uris(shared_tmp_dir, test_vector) -> None:
    """Test CLI functionality when keep_data_uris is enabled"""

    output_file = os.path.join(shared_tmp_dir, test_vector.filename + ".output")
    result = subprocess.run(
        [
            "python",
            "-m",
            "markitdown",
            "--keep-data-uris",
            "-o",
            output_file,
            os.path.join(TEST_FILES_DIR, test_vector.filename),
        ],
        capture_output=True,
        text=True,
    )

    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
    assert os.path.exists(output_file), f"Output file not created: {output_file}"

    with open(output_file, "r") as f:
        output_data = f.read()
        for test_string in test_vector.must_include:
            assert test_string in output_data
        for test_string in test_vector.must_not_include:
            assert test_string not in output_data

    os.remove(output_file)
    assert not os.path.exists(output_file), f"Output file not deleted: {output_file}"


if __name__ == "__main__":
    import sys
    import tempfile

    """Runs this file's tests from the command line."""

    with tempfile.TemporaryDirectory() as tmp_dir:
        # General tests
        for test_function in [
            test_output_to_stdout,
            test_output_to_file,
@@ -169,4 +201,17 @@ if __name__ == "__main__":
                )
                test_function(tmp_dir, test_vector)
                print("OK")

        # Data URI tests
        for test_function in [
            test_output_to_file_with_data_uris,
        ]:
            for test_vector in DATA_URI_TEST_VECTORS:
                print(
                    f"Running {test_function.__name__} on {test_vector.filename}...",
                    end="",
                )
                test_function(tmp_dir, test_vector)
                print("OK")

    print("All tests passed!")

@@ -0,0 +1,26 @@
import io
from markitdown.converters._doc_intel_converter import (
    DocumentIntelligenceConverter,
    DocumentIntelligenceFileType,
)
from markitdown._stream_info import StreamInfo


def _make_converter(file_types):
    conv = DocumentIntelligenceConverter.__new__(DocumentIntelligenceConverter)
    conv._file_types = file_types
    return conv


def test_docintel_accepts_html_extension():
    conv = _make_converter([DocumentIntelligenceFileType.HTML])
    stream_info = StreamInfo(mimetype=None, extension=".html")
    assert conv.accepts(io.BytesIO(b""), stream_info)


def test_docintel_accepts_html_mimetype():
    conv = _make_converter([DocumentIntelligenceFileType.HTML])
    stream_info = StreamInfo(mimetype="text/html", extension=None)
    assert conv.accepts(io.BytesIO(b""), stream_info)
    stream_info = StreamInfo(mimetype="application/xhtml+xml", extension=None)
    assert conv.accepts(io.BytesIO(b""), stream_info)
Binary file not shown.
@@ -0,0 +1,97 @@
%PDF-1.4
%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R /F2 3 0 R /F3 4 0 R /F4 5 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BaseFont /Helvetica-Bold /Encoding /WinAnsiEncoding /Name /F2 /Subtype /Type1 /Type /Font
>>
endobj
4 0 obj
<<
/BaseFont /Courier /Encoding /WinAnsiEncoding /Name /F3 /Subtype /Type1 /Type /Font
>>
endobj
5 0 obj
<<
/BaseFont /Courier-Bold /Encoding /WinAnsiEncoding /Name /F4 /Subtype /Type1 /Type /Font
>>
endobj
6 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 70 /Length 4491 /Subtype /Image
/Type /XObject /Width 200
>>
stream
Gb"/lq,^Nc)M\9OkX:DBZ5YT>'!op&`0lHCEL`PXM2DFT$QuCdPsfSJ4#%$gW49i\1e&eZ\Acg:2bhaPc+^Q$/Shs#,1&Qu>83CBh729[%A$M]]Z8KL]Cu-OpO.1`\pCPboa2!3#sCC+4Yg#.W)"\K&i)doY5WCH.J-G0E$]%J"BRoZ88okcKEP@C7S%JEA:t(e6:OLb-"MZ3=$fAIE$]%J"BRoZ88okcKEP@C7S%JEA:t(e6:OKV=+u3F\<8ml>&.H@]EH:@TAeD]);ilGV3,!4G2kP0XRN"#r/]G!Kqo1c9Redk1d%Hh[t82=;lMkgMFr4J2#lTY[siCuNDGT]h^te4%1fj10r$&D--;(2UPbQ(Ze:VUL"2G=%qPZVOlc,tegG*BO,:$mI%mCAJ^8q2gWSn>)Ui.KQ!A5_(Z;l?_%(Xb28AGIU0SY?bVp%B6=$P2*I!1?W8WL>aAVWc$%-nk=D8ZjE$stW]LJ-LqIj9qZoFV/lj$Uf)=b`nfl*ANkt_qpb2t'P`;D#\h6o+g'M+"j4Sf:d#_jjZdTS>mnQJm^7>(S3Bq"m(kH5i3;0`YK<5e.0$k"4XA4UX)rfXH+2OamR360'cX$&"Dp"DdSkh^Q;?+H)fBc@YMp?\]UZuQ*lgt@kDS'dARs/]`Roa+]eXm%&TJ4e[RCKq6kQ:5Zpa[hLTEE,M/UR\C]SS.K'HJL7F)F5Ts63hWCKs+aTqi3TN!,P7#o$@'a?^`M.9&=d,$WJ*\b^*N1fR$JFV.s`.W*WfCgS\9h@C2/uC<b(1#bB73rR3UmcP"%)_DZ#=)TA1KkfUBT8F;=Yoc[BR6[ZkS\Y$n/.@mf!9WK1Hj]o3[f>DiFrdD&a`iRQ\df2(Xc53$=i@@upbP,MJ@.sDgm%`^.8+5u//"Hhn%^kIb$5o\2B7%r>DD\NZ:L;otY0]:)'6l[M<&ctoM"($Q_XW1'!4OB>g/3dF]mD],eF*&&'itQ;2e$/VWZ/QmdogQ0&d7ePkDGP[PZkk8TtUWkaJYa$Q)c6I+l<preqG)K\U>pY5H])D-lHdp52<d:Isd8)X0&b+pKUugDNb2NIW.aD_PLN/i(r9N&<3?,2br'%?gT'_i;n9VUeeM#>ko&@JS]_pP&PR0@L`i*pbXB"rgcI`#'#>-Njfe@8+ZC7hHU>Qm2oCj$^ATs<7[sZ@5*,@qslQ\p1m1#p6XrGL'F^?ok?+\fDe1,#<0n78&1'&KK/85E7IuiRklZ$<tM_`dQfdI@$2&Sj]k*=&V0n1RVEA-6,(U`EJ9674P)af%Z>l=<9-YKka`e!9&oUk]3u7Y);o;!X[W(:<D3M@"gDG)[;",CR0eT^r/fRnB+ob8JM\tp9\JoC\=uqFPX9nU`?:Y,eJf!6E95r@KI_iekuY/-+j6DIrXFQW>i@m+VqQff?4r\fn@@4QXN6[dWdtV8:B`3X2:bgH!rR8-r^sf)#EN%"`F/q0Heh_C7H6l'.@I3l<Jr.Q!as3DB-9V*+/'h,_<T8?^2u*t.p$h8d%"Dd\P<5M`MEg>7W_M8qB0Sd$'o&pWH!XFNS.JRZ%[WY$N:rl5tLIb;#&1u\'nOCIB]161$bC,Uuf;ZK6dl()epY<39X_LaOAXU:WAZiYqZk5Tq+hN`O"QdZ7[jLdf^cf`?9i4T#=]JO$'0fC4#Y=^M%VOouL.PZ6/V;r+XoF1Ls*YXu`6'4?,j47_u_U=.T*IX0ed;@5JN=Qlc\gS!W?r;#%jA)NSUh\`='l:HWsF<K@<`EqO<,Ht[H.@PGU,p6$s&YbEgb;cfG9YK,6Fh]@t(EUD@78Ob6ui6[#pIBoZn<UH)N"PrIeP3Y!qa-k8bP>_rQ_q'7l3]f,=As5FN;rm6/&IWa@[9HFr3YtU'N%=ZPr)s`!!&o!IbLLsBH7VnG&"_&hSn3;Nr^mpSZk^2i_aD8<g*:f)-)1j6@3KSHbb_c1PDAXpnkGE:H3Fs0m?uXff0>H]^Oi^Wq(2*ak3>^mA^!FkG4$-Vq(BH"
U+YSR;%(5j(bnT,&RrR1d]\O5_42^f/Xa:4msf,Oms&5F6()XE"p6mS/Yc\Ga&`hC/3XdsM;'cTMl(uV@DiFY5AA_VWS4T'&^D<.7.S,`B?:^&!Q[ZVaCi0^$[E#=Xt_;^\;l;M#]`$4;sLf^6$u5)gpB'7TO-@*HXbXF]H[ID>=n%&8-T:f)&:]?hm\RN5/B5sdW)PM[X@>2Jq/jQ%m%0Pk%J`<],L8cn3_`B)dE8ng`*C-";2ro.7o.B2:3d$4r#LEqr#8((eIunkG+V@25V=%+_!:Z/,UQca-*F<WEc]8T!rP(2A>g9*GW$LPXTF:ER(d"o8oX31"!B8VtcoXnY$9m&'30Um_8lI0aO.LY_m,l[3VfKlBY*[!$I#3=:"_\lGs6U"<,-kF4HFN$6[_SQpG)7_H3mnKRB'C9#q8EY(Vaqi(D&r$*Jr?OPiaP#RRYeN0)sia9W*TKT)#N9#q8EY(Vaqi(D&r$*Jr?OPiaP#k^2Zeu*-9hp_.0@)f6Em^OKjQe!N6R'K"tCBm#IB,_F0s/@WUCJX)`/6>DF6M_)Kk.s>$UZPX\r/#8I8q/RiHL2liUAA0Yf`%Ld=068s:8FSq=\;#qPO.M6gNJ<fcJ_B7NfVHbGb(Z#%=9b&C:^$<Xe!_fU:_#gd*5iB@GeEYKud7*YgG3W,jj8\q'/4@l'Y-g\)rX-m;H77/R]Zb%A!YN(tcDfkrJ&o7($5o=b[,R0&h>j\UuBER`krd-YJajg!X[eDSKG#rM875C*#IQWm+O@aOf*TU2U('gKn4DA1c#r\17+;`V`#Kgpm8RC[Ee$Ac&[r[O5Rd>4]?HeBRku2;R,jekQW/J-QCRW&^t>C\3e,:[W(TT!T:2JEND71MtHndISe*l)E$(\gO^@6Z"5Q<4O)uM0mY8/2q_>,e#m"SaP0HaXDih_2UgJZH0.io.<EHam+)Ba'CJ3$,Ve\2W\TOCDf>!]XNj5RmJj5Qb:1EKuS*^?b+;s)^?6l>JE0QrGSUpouIlZac9kX41>cja=/SDr<cA`ZGg.-dcfEr]UahTX\[1!?g.Q+Bc7gR:Jn+!PFgH?<j_Cphp%2cK2o3EmuaRlLL,1SFd.EYSt<=je7H>"[/Y4/S?T:Ij;Z5Vm!N#O9oGbTmC/.8"?)WA+NO#l!d__>E^^P^^5SPK3f#a-$jDf&=>7p3h\GbV.%e"aa`Nr8:-((uUWjP?5h<QpA5fPW)4^9]4ZVP3n=aOl2n\TH8,2g(<%d(S%%7D#]P?GegI?jC3uo^tm=sN6Bp.;hI!R:5>*[4qX1g^KT=J\eG*=-r!s:Xj;^b_=6X9pT)tmOPN+SP\Y9d?!CS[k5Z"1@+N1#RUIImf:>$etSOp!A$%9Npa7^LW;"'>&DLN*LT#=""p0WOpjXfI:C##@j@lDSU=Xe\&,@B>QE*Z4EfD\=1"h4F8$&QPkBC:BBC"p$N?^/:S9o*P:eF4j>`WT<EG,f\ln6T<>&V*-UR?1+=iZ&Y:YJ)Y_O1Q!(3MOc:&,lEK0KT>gBrM!Oo7g$R0Z=@n</A>l[op[I#4)k.C3f6pb.hq#9_d636^F1(A%Us@pB(WNXIQ#TKD-`)0%k[fj0?XEDjbhd[m6#LpLW.'sd9_sqo4)9,(HjMXDMbKZE5`!!P4XRB@/4TISu+m2&%RiFj^;=JGE6IZZZ3Iq9u;tP9Ze)spB!TH+!k0kjm?9TaEqM'"N]I#K68.sENFpG-:BpL"k5<Kf;Cll]p\0.VBgJXYV7bkGELTah(>RWGsjL14<<sE:X3JW]):L"_a3kcBH&.(2Ui6d<sTVSq6Hb2oGX:N_j+E^^?#]Brd-YC/q4&j48+1kK<)&.;T[(aqqDC]Z:"7NhLIO5&BeU-SXc>m&CBpaKs!LDjG%ZThfU"?o'^kNBIDSK`iDtti"$XI@8UFlc6'SU-*bNc0g0qZU.!3k5:g&J5%_GC<>>?gMnQALpYLh_[CK2YgZMfbcG=f9!FE+GUto#YS.Ms0oqhppBVFIgKikS7
3+[l77Q]oKP@6W$g?K-&@CV\IU2l9TjknpF,mi!B#31b$WjNK[_SbrBad@)rY(?tsa<O6[4Wd_rrCQDK9Fn>?noi;W9O[OS+IIT/JfXc1#&UNmnp:7_c6/aW-".*cBVjc)"DgbY/sJI7I:YTo4n^?6D*J='=Z.q\rms\C-WjI*igXS;Vfo_`Y\`_c8K6r=SE5*WPmG+p6'k&&0eDn]e<kl$//^-6n?[M0Wg:@`F!W*m'jM%_,JfY,&JA=T)'Qh]O:`+1#oOo&Q&lRj>R;8k_3L)o&mP_\+Z6EUKR,;fQ&lT(!/=AJY_cg)ZFd_iV[)$d`W(]Q<7PapAWM]!Y4qoZ_Z9e(Tts:W\F3=&Fc,Mp(`=cJUG]7+gkp\4)_AlnCa6gYn$_U<4&\@+ERR^ZIFpqbN%;:=3k=P0?fC_:0&Ug9TYUE[kI#Bq;+mc:C\/7H:]h-hq^[P:`jZ\paVNd3BCYF4eruZ)J"F0tYQ"^`hf0;~>endstream
endobj
7 0 obj
<<
/Contents 11 0 R /MediaBox [ 0 0 216 792 ] /Parent 10 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.2a351979d8c75d073b2ea4bfb74718f9 6 0 R
>>
>> /Rotate 0 /Trans <<

>>
/Type /Page
>>
endobj
8 0 obj
<<
/PageMode /UseNone /Pages 10 0 R /Type /Catalog
>>
endobj
9 0 obj
<<
/Author (\(anonymous\)) /CreationDate (D:20251205104951+01'00') /Creator (\(unspecified\)) /Keywords () /ModDate (D:20251205104951+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (\(unspecified\)) /Title (\(anonymous\)) /Trapped /False
>>
endobj
10 0 obj
<<
/Count 1 /Kids [ 7 0 R ] /Type /Pages
>>
endobj
11 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 1981
>>
stream
GauI8D/\/e&:hOi=55KJ8;^%$X%4*F0ZR4[TQ-.JlIl=8,25emeWsH3DOUP#SK,Tt>!%\QTf-@>5)IUB4PgI,!l6hOHqp".7kqY?VEc$;f0G$C+?kE+IduY+D6B[cY<`?1&+Bf$SSWa.DI28'F?:CpG_mY"TO]hkXiFku&",h"4G/8GlCSK3UDT`3W#q'Qc;4>t6\tbspa(/l?]"D>nboQo,(,[\*-A;J=Ru^j=[Nu\:iMk7q<+PGo*gWpT4_C)j)t7Oc[5MlffWrhj!99;,0?]r3R(ns^B^I*KQ#f[([aS1g);Q.rrBep&6)sVJs=\.1^pkCa(tBfECI75_;C0LC.)*n<3;@eFZT<Brd%CYO%*fRCl_%R2PLtnF>0lg_SFKN$O.X\o%U7_58YJ,X[`p,%PUL^1]EgT]T\4*B3hOrEZA:[=ui88pZGht%klE^OC$2=@$GiMoTO<eR\C2;10r\O.%_7=c.`)*@0N>,CLh>2ZDq?"(LrLS)ajJ<DG(:N]5MPuT)E6J8.)!Ud7D>Je7M&(V1i'>Z@qg3/WJ@PpL4nr1qU$V_#jqpM+M<[H(LE:pW6uTQK`^P%Q'T&[*Y1\T_7.O:+n^Y)3+d56\hGsIrnqB85q[4L1WG#"Uo.d]Zr"jG`qiT=AU8b5]p3(N2IfnWHVO&$rnNe6[_$[o(m=2Sq-C[bbNOS,qIb?:TGYHhQjjcCe!9%*cuscgU*Eea^B?#^HtoE9p(jd_GR#E1\LTm:MVS5e]+<LZRQ]0^iZNnTs\Iq5l6H+?80%j?IX^UR28jY=Vr!:#Jf=D$QdR#4X6Q%Z^E6hq4[p#rHu/mN!PgeQCn_hEI9M3(_b.pAid<e0?KUakkhL+Sq1A+!S(V0h+OF8nn#&[1+6p+D5^<OP@s\HS\itN&+apX>;a8<=<fVm-cM(u=Q31UTuXZiRNk/X^o=e`8?ha=(l,J:AYGq&k'81mIs68U9).dPb@tBY#5s6I1(;=p-FNV8JLO6-b%6]BUEO#*P:YXQ,+2T8!GeA%OFD*^H'I<qo6Q\KGcc$9<-q;oCHo2eF[F/t'nG3p8bj]<0qUd^A*Un<D3^J]5-EeYaHZL0]dZ(aldZ?U]EL]o1@j;L=l_$._u&B5KRKtY900d3EO'mY6&5WB,D7o]o+7,#h.[N58L+)*ks!_/dIq7L<$Q/>:Ym/3(NJmP3]c2J81f'[9A229?.>nW.Y"uioK$/X(RLnTFa0nhiu#_V(M%6pL-[3&IEZO^iW'pcgSC4%cs*UfWL8=h@<SF-6Ml.SK#\%/6pL;XKVP08]+YR4.1^h^3g+iL6p2jNFKi##9N\7TS:EE')be-k57a"IMZoUV#>]Eoq_3uS@i.]*ai31P3"'$;S,Q%'"=$Vq"-_!pX<>bA]?nd=dOpZa\$"k!cpI9L2SO3gBd7]TKi)s)3F/ADnb^3N)iF')M[\Fq1\^PAl):!YpJp!B\s/2LkZr*`(o%fTO.qa?N[7\P_Dj!-3=OAO3]DtNKn-R1Hc#$?$h;RW[;B(k%DrOQ4lA(kZ`8[,.5E\H/%&9RE1k-pKk^'?Wseh?':/9RD3&&Xf\j=9;Sdd#l!b5li$Q.?@#FtJ9r"D*THt>o=+h,ei<3VCI\e<F.YjMklHmQ252@7%.$dl?FX[.5Ru_<cOnWObGU$sud3sn0Nm?VSip=&P_8=3<b5l"NFchqcT66k!jof;<?"kHRj[>Q+FB(V85-;*H\(QoM*>m2@9WKA`dV0$2F]lQ!^cKY?-F<<RYBD9:P`&#<:DJpSA0L]L_Q`8=5'6'r`p_54_;lcH+H=)4\l8B7YE#pX>K&Mf4jEn:L@C'pmu(T(NAo?onFtPTH*Mah:OJII8OF6<oMipM;1-5S+stnT,o"n]+UmpI>_OX,SeHS'86i`]nN=oW_Hjm1lb%agT!)1^3rJWom\/,?BYNjThVR,cQ'opa8Q#<G9<qeSN'GRRO*(AC(K$'<9uMACIm=MV?Mk2Q*3P"~>endstream
endobj
xref
0 12
0000000000 65535 f
0000000073 00000 n
0000000134 00000 n
0000000241 00000 n
0000000353 00000 n
0000000458 00000 n
0000000568 00000 n
0000005249 00000 n
0000005507 00000 n
0000005576 00000 n
0000005859 00000 n
0000005919 00000 n
trailer
<<
/ID
[<4800d64fefba4dd902e51197c7da4e88><4800d64fefba4dd902e51197c7da4e88>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)

/Info 9 0 R
/Root 8 0 R
/Size 12
>>
startxref
7992
%%EOF
Binary file not shown.
File diff suppressed because one or more lines are too long
Binary file not shown.
@@ -0,0 +1,74 @@
%PDF-1.3
%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R /F2 3 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BaseFont /Helvetica-Bold /Encoding /WinAnsiEncoding /Name /F2 /Subtype /Type1 /Type /Font
>>
endobj
4 0 obj
<<
/Contents 8 0 R /MediaBox [ 0 0 612 792 ] /Parent 7 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
>> /Rotate 0 /Trans <<

>>
/Type /Page
>>
endobj
5 0 obj
<<
/PageMode /UseNone /Pages 7 0 R /Type /Catalog
>>
endobj
6 0 obj
<<
/Author (anonymous) /CreationDate (D:20260108192537+01'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20260108192537+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (unspecified) /Title (untitled) /Trapped /False
>>
endobj
7 0 obj
<<
/Count 1 /Kids [ 4 0 R ] /Type /Pages
>>
endobj
8 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 670
>>
stream
Gat$td;IYl'Rf-pcJpsZ/27V[H_WEoW#\5sVS2I3Jt]?;R+`$Ms*f.>6<=3APUNhTmQL<9F,pFup'KGk=TR,7^>/u!#kAE+l;?UQ8Fg(+-O>;^54HWJ*kXdl'VdsI]Y^$-G(GWPR)iGMeWbg3)F'+jfWpCb"rU?d?8?q_r!E2N'0sM)J>=XD.jgunBuga\Wi4MX$WV/b)1F@bC8Nj8(0*)"ZK06BSqlu1$[^37A;/aK=mfgqg$&i),2OH&%^\"B1%B\dd_V>$5OtPri4rcEe3LoBUeL6QAPnpQr+R-t0f]ZSYc?BTAKQ?A&+J#J*N*=6;'?@Cp*>auj0",hDS3bH4[hVs3O="&bk&U@>+8c1&c2iDg6R*%q%iEZq'-!FNSB8#C*'po69R8$S(:.=-$N6'!_[1/jV<$@V3Z_"gd!g!MJMT)mTUN4cWjUQQj]HT_m]0*R=YgTmcl@k>*b/SBce9?.m,bEi#?PI:=r_6G.auM&FtP,>O7T%Z<$f#=g6(2+d@;8?"$8cdI38ZZ>hq5b2_pQY:M\.Kod,pl)ZX7a7Gc'Mf_'SB1X3*L[-51a8`h4)KjJQjLfm/3TIeQY?2+?^.r^HNafjHp<5,1M=W'N>8sb=dB#FC5M`7L91"BC@CfEckPe`M5O:#!Fj$K]s(Gs8rW$>H7gK~>endstream
endobj
xref
0 9
0000000000 65535 f
0000000073 00000 n
0000000114 00000 n
0000000221 00000 n
0000000333 00000 n
0000000526 00000 n
0000000594 00000 n
0000000890 00000 n
0000000949 00000 n
trailer
<<
/ID
[<5467fcd5093f18002be6af3fb13ce6c3><5467fcd5093f18002be6af3fb13ce6c3>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)

/Info 6 0 R
/Root 5 0 R
/Size 9
>>
startxref
1709
%%EOF
Binary file not shown.
Binary file not shown.
@@ -1,9 +1,12 @@
#!/usr/bin/env python3 -m pytest
import io
import os
import re
import shutil
import openai
import pytest
from unittest.mock import MagicMock

from markitdown._uri_utils import parse_data_uri, file_uri_to_path

from markitdown import (
    MarkItDown,
@@ -176,9 +179,80 @@ def test_stream_info_operations() -> None:
    assert updated_stream_info.url == "url.1"


def test_docx_comments() -> None:
    markitdown = MarkItDown()
def test_data_uris() -> None:
    # Test basic parsing of data URIs
    data_uri = "data:text/plain;base64,SGVsbG8sIFdvcmxkIQ=="
    mime_type, attributes, data = parse_data_uri(data_uri)
    assert mime_type == "text/plain"
    assert len(attributes) == 0
    assert data == b"Hello, World!"

    data_uri = "data:base64,SGVsbG8sIFdvcmxkIQ=="
    mime_type, attributes, data = parse_data_uri(data_uri)
    assert mime_type is None
    assert len(attributes) == 0
    assert data == b"Hello, World!"

    data_uri = "data:text/plain;charset=utf-8;base64,SGVsbG8sIFdvcmxkIQ=="
    mime_type, attributes, data = parse_data_uri(data_uri)
    assert mime_type == "text/plain"
    assert len(attributes) == 1
    assert attributes["charset"] == "utf-8"
    assert data == b"Hello, World!"

    data_uri = "data:,Hello%2C%20World%21"
    mime_type, attributes, data = parse_data_uri(data_uri)
    assert mime_type is None
    assert len(attributes) == 0
    assert data == b"Hello, World!"

    data_uri = "data:text/plain,Hello%2C%20World%21"
    mime_type, attributes, data = parse_data_uri(data_uri)
    assert mime_type == "text/plain"
    assert len(attributes) == 0
    assert data == b"Hello, World!"

    data_uri = "data:text/plain;charset=utf-8,Hello%2C%20World%21"
    mime_type, attributes, data = parse_data_uri(data_uri)
    assert mime_type == "text/plain"
    assert len(attributes) == 1
    assert attributes["charset"] == "utf-8"
    assert data == b"Hello, World!"


def test_file_uris() -> None:
    # Test file URI with an empty host
    file_uri = "file:///path/to/file.txt"
    netloc, path = file_uri_to_path(file_uri)
    assert netloc is None
    assert path == "/path/to/file.txt"

    # Test file URI with no host
    file_uri = "file:/path/to/file.txt"
    netloc, path = file_uri_to_path(file_uri)
    assert netloc is None
    assert path == "/path/to/file.txt"

    # Test file URI with localhost
    file_uri = "file://localhost/path/to/file.txt"
    netloc, path = file_uri_to_path(file_uri)
    assert netloc == "localhost"
    assert path == "/path/to/file.txt"

    # Test file URI with query parameters
    file_uri = "file:///path/to/file.txt?param=value"
    netloc, path = file_uri_to_path(file_uri)
    assert netloc is None
    assert path == "/path/to/file.txt"

    # Test file URI with fragment
    file_uri = "file:///path/to/file.txt#fragment"
    netloc, path = file_uri_to_path(file_uri)
    assert netloc is None
    assert path == "/path/to/file.txt"


def test_docx_comments() -> None:
    # Test DOCX processing, with comments and setting style_map on init
    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
    result = markitdown_with_style_map.convert(
@@ -187,6 +261,19 @@ def test_docx_comments() -> None:
    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)


def test_docx_equations() -> None:
    markitdown = MarkItDown()
    docx_file = os.path.join(TEST_FILES_DIR, "equations.docx")
    result = markitdown.convert(docx_file)

    # Check for inline equation m=1 (wrapped with single $) is present
    assert "$m=1$" in result.text_content, "Inline equation $m=1$ not found"

    # Find block equations wrapped with double $$ and check if they are present
    block_equations = re.findall(r"\$\$(.+?)\$\$", result.text_content)
    assert block_equations, "No block equations found in the document."


def test_input_as_strings() -> None:
    markitdown = MarkItDown()

@@ -201,6 +288,47 @@ def test_input_as_strings() -> None:
    assert "# Test" in result.text_content


def test_doc_rlink() -> None:
    # Test for: CVE-2025-11849
    markitdown = MarkItDown()

    # Document with rlink
    docx_file = os.path.join(TEST_FILES_DIR, "rlink.docx")

    # Directory containing the target rlink file
    rlink_tmp_dir = os.path.abspath(os.sep + "tmp")

    # Ensure the tmp directory exists
    if not os.path.exists(rlink_tmp_dir):
        pytest.skip(f"Skipping rlink test; {rlink_tmp_dir} directory does not exist.")
        return

    rlink_file_path = os.path.join(rlink_tmp_dir, "test_rlink.txt")
    rlink_content = "de658225-569e-4e3d-9ed2-cfb6abf927fc"
    b64_prefix = (
        "ZGU2NTgyMjUtNTY5ZS00ZTNkLTllZDItY2ZiNmFiZjk"  # base64 prefix of rlink_content
    )

    if os.path.exists(rlink_file_path):
        with open(rlink_file_path, "r", encoding="utf-8") as f:
            existing_content = f.read()
        if existing_content != rlink_content:
            raise ValueError(
                f"Existing {rlink_file_path} content does not match expected content."
            )
    else:
        with open(rlink_file_path, "w", encoding="utf-8") as f:
            f.write(rlink_content)

    try:
        result = markitdown.convert(docx_file, keep_data_uris=True).text_content
        assert (
            b64_prefix not in result
        )  # Make sure the target file was NOT embedded in the output
    finally:
        os.remove(rlink_file_path)


@pytest.mark.skipif(
    skip_remote,
    reason="do not run tests that query external urls",
@@ -214,9 +342,9 @@ def test_markitdown_remote() -> None:
        assert test_string in result.text_content

    # Youtube
    result = markitdown.convert(YOUTUBE_TEST_URL)
    for test_string in YOUTUBE_TEST_STRINGS:
        assert test_string in result.text_content
    # result = markitdown.convert(YOUTUBE_TEST_URL)
    # for test_string in YOUTUBE_TEST_STRINGS:
    #     assert test_string in result.text_content


@pytest.mark.skipif(
@@ -284,6 +412,50 @@ def test_markitdown_exiftool() -> None:
        assert target in result.text_content


def test_markitdown_llm_parameters() -> None:
    """Test that LLM parameters are correctly passed to the client."""
    mock_client = MagicMock()
    mock_response = MagicMock()
    mock_response.choices = [
        MagicMock(
            message=MagicMock(
                content="Test caption with red circle and blue square 5bda1dd6"
            )
        )
    ]
    mock_client.chat.completions.create.return_value = mock_response

    test_prompt = "You are a professional test prompt."
    markitdown = MarkItDown(
        llm_client=mock_client, llm_model="gpt-4o", llm_prompt=test_prompt
    )

    # Test image file
    markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))

    # Verify the prompt was passed to the OpenAI API
    assert mock_client.chat.completions.create.called
    call_args = mock_client.chat.completions.create.call_args
    messages = call_args[1]["messages"]
    assert len(messages) == 1
    assert messages[0]["content"][0]["text"] == test_prompt

    # Reset the mock for the next test
    mock_client.chat.completions.create.reset_mock()

    # TODO: may only use one test after the llm caption method duplicate has been removed:
    # https://github.com/microsoft/markitdown/pull/1254
    # Test PPTX file
    markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))

    # Verify the prompt was passed to the OpenAI API for PPTX images too
    assert mock_client.chat.completions.create.called
    call_args = mock_client.chat.completions.create.call_args
    messages = call_args[1]["messages"]
    assert len(messages) == 1
    assert messages[0]["content"][0]["text"] == test_prompt


@pytest.mark.skipif(
    skip_llm,
    reason="do not run llm tests without a key",
@@ -314,12 +486,16 @@ if __name__ == "__main__":
    """Runs this file's tests from the command line."""
    for test in [
        test_stream_info_operations,
        test_data_uris,
        test_file_uris,
        test_docx_comments,
        test_input_as_strings,
        test_markitdown_remote,
        test_speech_transcription,
        test_exceptions,
        test_doc_rlink,
        test_markitdown_exiftool,
        test_markitdown_llm_parameters,
        test_markitdown_llm,
    ]:
        print(f"Running {test.__name__}...", end="")

@@ -2,18 +2,17 @@
import os
import time
import pytest
import codecs
import base64

from pathlib import Path

if __name__ == "__main__":
    from _test_vectors import GENERAL_TEST_VECTORS
    from _test_vectors import GENERAL_TEST_VECTORS, DATA_URI_TEST_VECTORS
else:
    from ._test_vectors import GENERAL_TEST_VECTORS
    from ._test_vectors import GENERAL_TEST_VECTORS, DATA_URI_TEST_VECTORS

from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
    FileConversionException,
    StreamInfo,
)

@@ -108,8 +107,8 @@ def test_convert_stream_without_hints(test_vector):
    reason="do not run tests that query external urls",
)
@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_url(test_vector):
    """Test the conversion of a stream with no stream info."""
def test_convert_http_uri(test_vector):
    """Test the conversion of an HTTP:// or HTTPS:// URI."""
    markitdown = MarkItDown()

    time.sleep(1)  # Ensure we don't hit rate limits
@@ -124,16 +123,94 @@ def test_convert_url(test_vector):
        assert string not in result.markdown


if __name__ == "__main__":
    import sys
@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_file_uri(test_vector):
    """Test the conversion of a file:// URI."""
    markitdown = MarkItDown()

    result = markitdown.convert(
        Path(os.path.join(TEST_FILES_DIR, test_vector.filename)).as_uri(),
        url=test_vector.url,
    )
    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_data_uri(test_vector):
    """Test the conversion of a data URI."""
    markitdown = MarkItDown()

    data = ""
    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        data = base64.b64encode(stream.read()).decode("utf-8")
    mimetype = test_vector.mimetype
    data_uri = f"data:{mimetype};base64,{data}"

    result = markitdown.convert(
        data_uri,
        url=test_vector.url,
    )
    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
def test_convert_keep_data_uris(test_vector):
    """Test API functionality when keep_data_uris is enabled"""
    markitdown = MarkItDown()

    # Test local file conversion
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, test_vector.filename),
        keep_data_uris=True,
        url=test_vector.url,
    )

    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
def test_convert_stream_keep_data_uris(test_vector):
    """Test the conversion of a stream with no stream info."""
    markitdown = MarkItDown()

    stream_info = StreamInfo(
        extension=os.path.splitext(test_vector.filename)[1],
        mimetype=test_vector.mimetype,
        charset=test_vector.charset,
    )

    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        result = markitdown.convert(
            stream, stream_info=stream_info, keep_data_uris=True, url=test_vector.url
        )

    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


if __name__ == "__main__":
    """Runs this file's tests from the command line."""

    # General tests
    for test_function in [
        test_guess_stream_info,
        test_convert_local,
        test_convert_stream_with_hints,
        test_convert_stream_without_hints,
        test_convert_url,
        test_convert_http_uri,
        test_convert_file_uri,
        test_convert_data_uri,
    ]:
        for test_vector in GENERAL_TEST_VECTORS:
            print(
@@ -141,4 +218,17 @@ if __name__ == "__main__":
|
||||
)
|
||||
test_function(test_vector)
|
||||
print("OK")
|
||||
|
||||
# Data URI tests
|
||||
for test_function in [
|
||||
test_convert_keep_data_uris,
|
||||
test_convert_stream_keep_data_uris,
|
||||
]:
|
||||
for test_vector in DATA_URI_TEST_VECTORS:
|
||||
print(
|
||||
f"Running {test_function.__name__} on {test_vector.filename}...", end=""
|
||||
)
|
||||
test_function(test_vector)
|
||||
print("OK")
|
||||
|
||||
print("All tests passed!")
|
||||
|
||||
@@ -0,0 +1,171 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
"""Tests for MasterFormat-style partial numbering in PDF conversion."""
|
||||
|
||||
import os
|
||||
import re
|
||||
import pytest
|
||||
|
||||
from markitdown import MarkItDown
|
||||
from markitdown.converters._pdf_converter import PARTIAL_NUMBERING_PATTERN
|
||||
|
||||
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
|
||||
|
||||
|
||||
class TestMasterFormatPartialNumbering:
|
||||
"""Test handling of MasterFormat-style partial numbering (.1, .2, etc.)."""
|
||||
|
||||
    def test_partial_numbering_pattern_regex(self):
        """Test that the partial numbering regex pattern correctly matches."""

        # Should match partial numbering patterns
        assert PARTIAL_NUMBERING_PATTERN.match(".1") is not None
        assert PARTIAL_NUMBERING_PATTERN.match(".2") is not None
        assert PARTIAL_NUMBERING_PATTERN.match(".10") is not None
        assert PARTIAL_NUMBERING_PATTERN.match(".99") is not None

        # Should NOT match other patterns
        assert PARTIAL_NUMBERING_PATTERN.match("1.") is None
        assert PARTIAL_NUMBERING_PATTERN.match("1.2") is None
        assert PARTIAL_NUMBERING_PATTERN.match(".1.2") is None
        assert PARTIAL_NUMBERING_PATTERN.match("text") is None
        assert PARTIAL_NUMBERING_PATTERN.match(".a") is None
        assert PARTIAL_NUMBERING_PATTERN.match("") is None

    def test_masterformat_partial_numbering_not_split(self):
        """Test that MasterFormat partial numbering stays with associated text.

        MasterFormat documents use partial numbering like:
            .1 The intent of this Request for Proposal...
            .2 Available information relative to...

        These should NOT be split into separate table columns, but kept
        as coherent text lines with the number followed by its description.
        """
        pdf_path = os.path.join(TEST_FILES_DIR, "masterformat_partial_numbering.pdf")

        markitdown = MarkItDown()
        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Partial numberings should NOT appear isolated on their own lines
        # If they're isolated, it means the parser incorrectly split them from their text
        lines = text_content.split("\n")
        isolated_numberings = []
        for line in lines:
            stripped = line.strip()
            # Check if line contains ONLY a partial numbering (with possible whitespace/pipes)
            cleaned = stripped.replace("|", "").strip()
            if cleaned in [".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", ".10"]:
                isolated_numberings.append(stripped)

        assert len(isolated_numberings) == 0, (
            f"Partial numberings should not be isolated from their text. "
            f"Found isolated: {isolated_numberings}"
        )

        # Verify that partial numberings appear WITH following text on the same line
        # Look for patterns like ".1 The intent" or ".1 Some text"
        partial_with_text = re.findall(r"\.\d+\s+\w+", text_content)
        assert (
            len(partial_with_text) > 0
        ), "Expected to find partial numberings followed by text on the same line"

    def test_masterformat_content_preserved(self):
        """Test that MasterFormat document content is fully preserved."""
        pdf_path = os.path.join(TEST_FILES_DIR, "masterformat_partial_numbering.pdf")

        markitdown = MarkItDown()
        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Verify key content from the MasterFormat document is preserved
        expected_content = [
            "RFP for Construction Management Services",
            "Section 00 00 43",
            "Instructions to Respondents",
            "Ken Sargent House",
            "INTENT",
            "Request for Proposal",
            "KEN SARGENT HOUSE",
            "GRANDE PRAIRIE, ALBERTA",
            "Section 00 00 45",
        ]

        for content in expected_content:
            assert (
                content in text_content
            ), f"Expected content '{content}' not found in extracted text"

        # Verify partial numbering is followed by text on the same line
        # .1 should be followed by "The intent" on the same line
        assert re.search(
            r"\.1\s+The intent", text_content
        ), "Partial numbering .1 should be followed by 'The intent' text"

        # .2 should be followed by "Available information" on the same line
        assert re.search(
            r"\.2\s+Available information", text_content
        ), "Partial numbering .2 should be followed by 'Available information' text"

        # Ensure text content is not empty and has reasonable length
        assert (
            len(text_content.strip()) > 100
        ), "MasterFormat document should have substantial text content"

    def test_merge_partial_numbering_with_empty_lines_between(self):
        """Test that partial numberings merge correctly even with empty lines between.

        When PDF extractors produce output like:
            .1

            The intent of this Request...

        The merge logic should still combine them properly.
        """
        pdf_path = os.path.join(TEST_FILES_DIR, "masterformat_partial_numbering.pdf")

        markitdown = MarkItDown()
        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # The merged result should have .1 and .2 followed by text
        # Check that we don't have patterns like ".1\n\nThe intent" (unmerged)
        lines = text_content.split("\n")

        for i, line in enumerate(lines):
            stripped = line.strip()
            # If we find an isolated partial numbering, the merge failed
            if stripped in [".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8"]:
                # Check if next non-empty line exists and wasn't merged
                for j in range(i + 1, min(i + 3, len(lines))):
                    if lines[j].strip():
                        pytest.fail(
                            f"Partial numbering '{stripped}' on line {i} was not "
                            f"merged with following text '{lines[j].strip()[:30]}...'"
                        )
                        break

    def test_multiple_partial_numberings_all_merged(self):
        """Test that all partial numberings in a document are properly merged."""
        pdf_path = os.path.join(TEST_FILES_DIR, "masterformat_partial_numbering.pdf")

        markitdown = MarkItDown()
        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Count occurrences of merged partial numberings (number followed by text)
        merged_count = len(re.findall(r"\.\d+\s+[A-Za-z]", text_content))

        # Count isolated partial numberings (number alone on a line)
        isolated_count = 0
        for line in text_content.split("\n"):
            stripped = line.strip()
            if re.match(r"^\.\d+$", stripped):
                isolated_count += 1

        assert (
            merged_count >= 2
        ), f"Expected at least 2 merged partial numberings, found {merged_count}"
        assert (
            isolated_count == 0
        ), f"Found {isolated_count} isolated partial numberings that weren't merged"
||||
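The tests above pin down what `PARTIAL_NUMBERING_PATTERN` must accept and reject, and what the merged output should look like, but the pattern and the merge logic live in `_pdf_converter` and are not shown in this part of the diff. The following is only a sketch under stated assumptions: `ASSUMED_PARTIAL_NUMBERING` is a stand-in pattern that happens to satisfy the assertions, and `merge_partial_numbering` is a hypothetical helper, not the converter's actual implementation.

```python
import re

# Assumed stand-in for the converter's PARTIAL_NUMBERING_PATTERN (not shown in this diff).
ASSUMED_PARTIAL_NUMBERING = re.compile(r"^\.\d+$")


def merge_partial_numbering(lines):
    """Hypothetical helper: join a lone '.N' line with the next non-empty line."""
    merged = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if ASSUMED_PARTIAL_NUMBERING.match(stripped):
            # Skip any blank lines between the numbering and its text.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                merged.append(f"{stripped} {lines[j].strip()}")
                i = j + 1
                continue
        # Not a lone numbering (or nothing follows it): keep the line as-is.
        merged.append(lines[i])
        i += 1
    return merged


print(merge_partial_numbering([".1", "", "The intent of this Request...", ".2 Available"]))
```

A numbering that is already followed by text on the same line (".2 Available") does not match the anchored pattern, so it passes through unchanged.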
@@ -0,0 +1,871 @@
#!/usr/bin/env python3 -m pytest
"""Tests for PDF table extraction functionality."""
import os
import re
import pytest

from markitdown import MarkItDown

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")


# --- Helper Functions ---
def validate_strings(result, expected_strings, exclude_strings=None):
    """Validate presence or absence of specific strings."""
    text_content = result.text_content.replace("\\", "")
    for string in expected_strings:
        assert string in text_content, f"Expected string not found: {string}"
    if exclude_strings:
        for string in exclude_strings:
            assert string not in text_content, f"Excluded string found: {string}"


def validate_markdown_table(result, expected_headers, expected_data_samples):
    """Validate that a markdown table exists with expected headers and data."""
    text_content = result.text_content

    # Check for markdown table structure (| header | header |)
    assert "|" in text_content, "No markdown table markers found"

    # Check headers are present
    for header in expected_headers:
        assert header in text_content, f"Expected table header not found: {header}"

    # Check some data values are present
    for data in expected_data_samples:
        assert data in text_content, f"Expected table data not found: {data}"


def extract_markdown_tables(text_content):
    """
    Extract all markdown tables from text content.
    Returns a list of tables, where each table is a list of rows,
    and each row is a list of cell values.
    """
    tables = []
    lines = text_content.split("\n")
    current_table = []
    in_table = False

    for line in lines:
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            # Skip separator rows (contain only dashes and pipes)
            if re.match(r"^\|[\s\-|]+\|$", line):
                continue
            # Parse cells from the row
            cells = [cell.strip() for cell in line.split("|")[1:-1]]
            current_table.append(cells)
            in_table = True
        else:
            if in_table and current_table:
                tables.append(current_table)
                current_table = []
            in_table = False

    # Don't forget the last table
    if current_table:
        tables.append(current_table)

    return tables


def validate_table_structure(table):
    """
    Validate that a table has consistent structure:
    - All rows have the same number of columns
    - Has at least a header row and one data row
    """
    if not table:
        return False, "Table is empty"

    if len(table) < 2:
        return False, "Table should have at least header and one data row"

    num_cols = len(table[0])
    if num_cols < 2:
        return False, f"Table should have at least 2 columns, found {num_cols}"

    for i, row in enumerate(table):
        if len(row) != num_cols:
            return False, f"Row {i} has {len(row)} columns, expected {num_cols}"

    return True, "Table structure is valid"

||||
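The separator check in `extract_markdown_tables` depends on a single regular expression, so it may help to see which rows it treats as separators. The sketch below reuses that exact pattern; the sample rows and the `is_separator_row` name are made up for illustration and are not part of the test suite.

```python
import re

# Same pattern that extract_markdown_tables uses to skip separator rows.
SEPARATOR_ROW = re.compile(r"^\|[\s\-|]+\|$")


def is_separator_row(line):
    """Illustrative helper: True when the row would be skipped as a separator."""
    return bool(SEPARATOR_ROW.match(line.strip()))


samples = [
    "| --- | --- |",    # plain separator row
    "|---|---|",        # compact separator row
    "| :--- | ---: |",  # alignment colons are outside the character class
    "| a | b |",        # ordinary data row
    "|  |  |",          # row of empty cells: only whitespace and pipes
]

for row in samples:
    print(repr(row), is_separator_row(row))
```

Two consequences follow from the character class: a row whose cells are all empty is skipped like a separator, and a separator with alignment colons is parsed as a data row.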
class TestPdfTableExtraction:
    """Test PDF table extraction with various PDF types."""

    @pytest.fixture
    def markitdown(self):
        """Create MarkItDown instance."""
        return MarkItDown()

    def test_borderless_table_extraction(self, markitdown):
        """Test extraction of borderless tables from SPARSE inventory PDF.

        Expected output structure:
        - Header: INVENTORY RECONCILIATION REPORT with Report ID, Warehouse, Date, Prepared By
        - Pipe-separated rows with inventory data
        - Text section: Variance Analysis with Summary Statistics
        - More pipe-separated rows with extended inventory review
        - Footer: Recommendations section
        """
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Validate document header content
        expected_strings = [
            "INVENTORY RECONCILIATION REPORT",
            "Report ID: SPARSE-2024-INV-1234",
            "Warehouse: Distribution Center East",
            "Report Date: 2024-11-15",
            "Prepared By: Sarah Martinez",
        ]
        validate_strings(result, expected_strings)

        # Validate pipe-separated format is used
        assert "|" in text_content, "Should have pipe separators for form-style data"

        # --- Validate First Table Data (Inventory Variance) ---
        # Validate table headers are present
        first_table_headers = [
            "Product Code",
            "Location",
            "Expected",
            "Actual",
            "Variance",
            "Status",
        ]
        for header in first_table_headers:
            assert header in text_content, f"Should contain header '{header}'"

        # Validate first table has all expected SKUs
        first_table_skus = ["SKU-8847", "SKU-9201", "SKU-4563", "SKU-7728"]
        for sku in first_table_skus:
            assert sku in text_content, f"Should contain {sku}"

        # Validate first table has correct status values
        expected_statuses = ["OK", "CRITICAL"]
        for status in expected_statuses:
            assert status in text_content, f"Should contain status '{status}'"

        # Validate first table has location codes
        expected_locations = ["A-12", "B-07", "C-15", "D-22", "A-08"]
        for loc in expected_locations:
            assert loc in text_content, f"Should contain location '{loc}'"

        # --- Validate Second Table Data (Extended Inventory Review) ---
        # Validate second table headers
        second_table_headers = [
            "Category",
            "Unit Cost",
            "Total Value",
            "Last Audit",
            "Notes",
        ]
        for header in second_table_headers:
            assert header in text_content, f"Should contain header '{header}'"

        # Validate second table has all expected SKUs (10 products)
        second_table_skus = [
            "SKU-8847",
            "SKU-9201",
            "SKU-4563",
            "SKU-7728",
            "SKU-3345",
            "SKU-5512",
            "SKU-6678",
            "SKU-7789",
            "SKU-2234",
            "SKU-1123",
        ]
        for sku in second_table_skus:
            assert sku in text_content, f"Should contain {sku}"

        # Validate second table has categories
        expected_categories = ["Electronics", "Hardware", "Software", "Accessories"]
        for category in expected_categories:
            assert category in text_content, f"Should contain category '{category}'"

        # Validate second table has cost values (spot check)
        expected_costs = ["$45.00", "$32.50", "$120.00", "$15.75"]
        for cost in expected_costs:
            assert cost in text_content, f"Should contain cost '{cost}'"

        # Validate second table has note values
        expected_notes = ["Verified", "Critical", "Pending"]
        for note in expected_notes:
            assert note in text_content, f"Should contain note '{note}'"

        # --- Validate Analysis Text Section ---
        analysis_strings = [
            "Variance Analysis:",
            "Summary Statistics:",
            "Total Variance Cost: $4,287.50",
            "Critical Items: 1",
            "Overall Accuracy: 97.2%",
            "Recommendations:",
        ]
        validate_strings(result, analysis_strings)

        # --- Validate Document Structure Order ---
        # Verify sections appear in correct order
        # Note: Using flexible patterns since column merging may occur based on gap detection
        import re

        header_pos = text_content.find("INVENTORY RECONCILIATION REPORT")
        # Look for Product Code header - may be in same column as Location or separate
        first_table_match = re.search(r"\|\s*Product Code", text_content)
        variance_pos = text_content.find("Variance Analysis:")
        extended_review_pos = text_content.find("Extended Inventory Review:")
        # Second table - look for SKU entries after extended review section
        # The table may not have pipes on every row due to paragraph detection
        second_table_pos = -1
        if extended_review_pos != -1:
            # Look for either "| Product Code" or "Product Code" as table header
            second_table_match = re.search(
                r"Product Code.*Category", text_content[extended_review_pos:]
            )
            if second_table_match:
                # Adjust position to be relative to full text
                second_table_pos = extended_review_pos + second_table_match.start()
        recommendations_pos = text_content.find("Recommendations:")

        positions = {
            "header": header_pos,
            "first_table": first_table_match.start() if first_table_match else -1,
            "variance_analysis": variance_pos,
            "extended_review": extended_review_pos,
            "second_table": second_table_pos,
            "recommendations": recommendations_pos,
        }

        # All sections should be found
        for name, pos in positions.items():
            assert pos != -1, f"Section '{name}' not found in output"

        # Verify correct order
        assert (
            positions["header"] < positions["first_table"]
        ), "Header should come before first table"
        assert (
            positions["first_table"] < positions["variance_analysis"]
        ), "First table should come before Variance Analysis"
        assert (
            positions["variance_analysis"] < positions["extended_review"]
        ), "Variance Analysis should come before Extended Review"
        assert (
            positions["extended_review"] < positions["second_table"]
        ), "Extended Review should come before second table"
        assert (
            positions["second_table"] < positions["recommendations"]
        ), "Second table should come before Recommendations"

    def test_borderless_table_no_duplication(self, markitdown):
        """Test that borderless table content is not duplicated excessively."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Count occurrences of unique table data - should not be excessively duplicated
        # SKU-8847 appears in both tables, plus possibly once in summary text
        sku_count = text_content.count("SKU-8847")
        # Should appear at most 4 times (2 tables + minor text references), not more
        assert (
            sku_count <= 4
        ), f"SKU-8847 appears too many times ({sku_count}), suggests duplication issue"

    def test_borderless_table_correct_position(self, markitdown):
        """Test that tables appear in correct positions relative to text."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Verify content order - header should come before table content, which should come before analysis
        header_pos = text_content.find("Prepared By: Sarah Martinez")
        # Look for Product Code in any pipe-separated format
        product_code_pos = text_content.find("Product Code")
        variance_pos = text_content.find("Variance Analysis:")

        assert header_pos != -1, "Header should be found"
        assert product_code_pos != -1, "Product Code should be found"
        assert variance_pos != -1, "Variance Analysis should be found"

        assert (
            header_pos < product_code_pos < variance_pos
        ), "Product data should appear between header and Variance Analysis"

        # Second table content should appear after "Extended Inventory Review"
        extended_review_pos = text_content.find("Extended Inventory Review:")
        # Look for Category header which is in second table
        category_pos = text_content.find("Category")
        recommendations_pos = text_content.find("Recommendations:")

        if (
            extended_review_pos != -1
            and category_pos != -1
            and recommendations_pos != -1
        ):
            # Find Category position after Extended Inventory Review
            category_after_review = text_content.find("Category", extended_review_pos)
            if category_after_review != -1:
                assert (
                    extended_review_pos < category_after_review < recommendations_pos
                ), "Extended review table should appear between Extended Inventory Review and Recommendations"

    def test_receipt_pdf_extraction(self, markitdown):
        """Test extraction of receipt PDF (no tables, formatted text).

        Expected output structure:
        - Store header: TECHMART ELECTRONICS with address
        - Transaction info: Store #, date, TXN, Cashier, Register
        - Line items: 6 products with prices and member discounts
        - Totals: Subtotal, Member Discount, Sales Tax, Rewards, TOTAL
        - Payment info: Visa Card, Auth, Ref
        - Rewards member info: Name, ID, Points
        - Return policy and footer
        """
        pdf_path = os.path.join(
            TEST_FILES_DIR, "RECEIPT-2024-TXN-98765_retail_purchase.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # --- Validate Store Header ---
        store_header = [
            "TECHMART ELECTRONICS",
            "4567 Innovation Blvd",
            "San Francisco, CA 94103",
            "(415) 555-0199",
        ]
        validate_strings(result, store_header)

        # --- Validate Transaction Info ---
        transaction_info = [
            "Store #0342 - Downtown SF",
            "11/23/2024",
            "TXN: TXN-98765-2024",
            "Cashier: Emily Rodriguez",
            "Register: POS-07",
        ]
        validate_strings(result, transaction_info)

        # --- Validate Line Items (6 products) ---
        line_items = [
            # Product 1: Headphones
            "Wireless Noise-Cancelling",
            "Headphones - Premium Black",
            "AUDIO-5521",
            "$349.99",
            "$299.99",
            # Product 2: USB-C Hub
            "USB-C Hub 7-in-1 Adapter",
            "ACC-8834",
            "$79.99",
            "$159.98",
            # Product 3: Portable SSD
            "Portable SSD 2TB",
            "STOR-2241",
            "$289.00",
            "$260.00",
            # Product 4: Wireless Mouse
            "Ergonomic Wireless Mouse",
            "ACC-9012",
            "$59.99",
            # Product 5: Screen Cleaning Kit
            "Screen Cleaning Kit",
            "CARE-1156",
            "$12.99",
            "$38.97",
            # Product 6: HDMI Cable
            "HDMI 2.1 Cable 6ft",
            "CABLE-7789",
            "$24.99",
            "$44.98",
        ]
        validate_strings(result, line_items)

        # --- Validate Totals ---
        totals = [
            "SUBTOTAL",
            "$863.91",
            "Member Discount",
            "Sales Tax (8.5%)",
            "$66.23",
            "Rewards Applied",
            "-$25.00",
            "TOTAL",
            "$821.14",
        ]
        validate_strings(result, totals)

        # --- Validate Payment Info ---
        payment_info = [
            "PAYMENT METHOD",
            "Visa Card ending in 4782",
            "Auth: 847392",
            "REF-20241123-98765",
        ]
        validate_strings(result, payment_info)

        # --- Validate Rewards Member Info ---
        rewards_info = [
            "REWARDS MEMBER",
            "Sarah Mitchell",
            "ID: TM-447821",
            "Points Earned: 821",
            "Total Points: 3,247",
        ]
        validate_strings(result, rewards_info)

        # --- Validate Return Policy & Footer ---
        footer_info = [
            "RETURN POLICY",
            "Returns within 30 days",
            "Receipt required",
            "Thank you for shopping!",
            "www.techmart.example.com",
        ]
        validate_strings(result, footer_info)

        # --- Validate Document Structure Order ---
        positions = {
            "store_header": text_content.find("TECHMART ELECTRONICS"),
            "transaction": text_content.find("TXN: TXN-98765-2024"),
            "first_item": text_content.find("Wireless Noise-Cancelling"),
            "subtotal": text_content.find("SUBTOTAL"),
            "total": text_content.find("TOTAL"),
            "payment": text_content.find("PAYMENT METHOD"),
            "rewards": text_content.find("REWARDS MEMBER"),
            "return_policy": text_content.find("RETURN POLICY"),
        }

        # All sections should be found
        for name, pos in positions.items():
            assert pos != -1, f"Section '{name}' not found in output"

        # Verify correct order
        assert (
            positions["store_header"] < positions["transaction"]
        ), "Store header should come before transaction"
        assert (
            positions["transaction"] < positions["first_item"]
        ), "Transaction should come before items"
        assert (
            positions["first_item"] < positions["subtotal"]
        ), "Items should come before subtotal"
        assert (
            positions["subtotal"] < positions["total"]
        ), "Subtotal should come before total"
        assert (
            positions["total"] < positions["payment"]
        ), "Total should come before payment"
        assert (
            positions["payment"] < positions["rewards"]
        ), "Payment should come before rewards"
        assert (
            positions["rewards"] < positions["return_policy"]
        ), "Rewards should come before return policy"

||||
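One detail of the order check in `test_receipt_pdf_extraction` is worth noting: `str.find("TOTAL")` returns the first occurrence of the substring, and "SUBTOTAL" itself contains "TOTAL", so the "subtotal before total" comparison can hold without the standalone TOTAL line ever being located. The sketch below uses a made-up receipt snippet (not the real converter output) to show the difference between the plain substring search and a word-boundary search; whether the test should adopt the latter is a judgment call left to the authors.

```python
import re

# Made-up receipt text for illustration; not the actual converter output.
sample = "SUBTOTAL $863.91\nSales Tax (8.5%) $66.23\nTOTAL $821.14\n"

# find() hits the "TOTAL" inside "SUBTOTAL" (offset 3), not the TOTAL line.
naive_pos = sample.find("TOTAL")

# A word-boundary search skips "SUBTOTAL" and lands on the standalone word.
match = re.search(r"\bTOTAL\b", sample)
standalone_pos = match.start() if match else -1

print(naive_pos, standalone_pos)
```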
    def test_multipage_invoice_extraction(self, markitdown):
        """Test extraction of multipage invoice PDF with form-style layout.

        Expected output: Pipe-separated format with clear cell boundaries.
        Form data should be extracted with pipes indicating column separations.
        """
        pdf_path = os.path.join(TEST_FILES_DIR, "REPAIR-2022-INV-001_multipage.pdf")

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Validate basic content is extracted
        expected_strings = [
            "ZAVA AUTO REPAIR",
            "Collision Repair",
            "Redmond, WA",
            "Gabriel Diaz",
            "Jeep",
            "Grand Cherokee",
            "Parts",
            "Body Labor",
            "Paint Labor",
            "GRAND TOTAL",
            # Second page content
            "Bruce Wayne",
            "Batmobile",
        ]
        validate_strings(result, expected_strings)

        # Validate pipe-separated table format
        # Form-style documents should use pipes to separate cells
        assert "|" in text_content, "Form-style PDF should contain pipe separators"

        # Validate key form fields are properly separated
        # These patterns check that label and value are in separate cells
        # Note: cells may have padding spaces for column alignment
        import re

        assert re.search(
            r"\| Insured name\s*\|", text_content
        ), "Insured name should be in its own cell"
        assert re.search(
            r"\| Gabriel Diaz\s*\|", text_content
        ), "Gabriel Diaz should be in its own cell"
        assert re.search(
            r"\| Year\s*\|", text_content
        ), "Year label should be in its own cell"
        assert re.search(
            r"\| 2022\s*\|", text_content
        ), "Year value should be in its own cell"

        # Validate table structure for estimate totals
        assert (
            re.search(r"\| Hours\s*\|", text_content) or "Hours |" in text_content
        ), "Hours column header should be present"
        assert (
            re.search(r"\| Rate\s*\|", text_content) or "Rate |" in text_content
        ), "Rate column header should be present"
        assert (
            re.search(r"\| Cost\s*\|", text_content) or "Cost |" in text_content
        ), "Cost column header should be present"

        # Validate numeric values are extracted
        assert "2,100" in text_content, "Parts cost should be extracted"
        assert "300" in text_content, "Body labor cost should be extracted"
        assert "225" in text_content, "Paint labor cost should be extracted"
        assert "5,738" in text_content, "Grand total should be extracted"

        # Validate second page content (Bruce Wayne invoice)
        assert "Bruce Wayne" in text_content, "Second page customer name"
        assert "Batmobile" in text_content, "Second page vehicle model"
        assert "211,522" in text_content, "Second page grand total"

        # Validate disclaimer text is NOT in table format (long paragraph)
        # The disclaimer should be extracted as plain text, not pipe-separated
        assert (
            "preliminary estimate" in text_content.lower()
        ), "Disclaimer text should be present"

    def test_academic_pdf_extraction(self, markitdown):
        """Test extraction of academic paper PDF (scientific document).

        Expected output: Plain text without tables or pipe characters.
        Scientific documents should be extracted as flowing text with proper spacing,
        not misinterpreted as tables.
        """
        pdf_path = os.path.join(TEST_FILES_DIR, "test.pdf")

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Validate academic paper content with proper spacing
        expected_strings = [
            "Introduction",
            "Large language models",  # Should have proper spacing, not "Largelanguagemodels"
            "agents",
            "multi-agent",  # Should be properly hyphenated
        ]
        validate_strings(result, expected_strings)

        # Validate proper text formatting (words separated by spaces)
        assert "LLMs" in text_content, "Should contain 'LLMs' acronym"
        assert "reasoning" in text_content, "Should contain 'reasoning'"
        assert "observations" in text_content, "Should contain 'observations'"

        # Ensure content is not empty and has proper length
        assert len(text_content) > 1000, "Academic PDF should have substantial content"

        # Scientific documents should NOT have tables or pipe characters
        assert (
            "|" not in text_content
        ), "Scientific document should not contain pipe characters (no tables)"

        # Verify no markdown tables were extracted
        tables = extract_markdown_tables(text_content)
        assert (
            len(tables) == 0
        ), f"Scientific document should have no tables, found {len(tables)}"

        # Verify text is properly formatted with spaces between words
        # Check that common phrases are NOT joined together (which would indicate bad extraction)
        assert (
            "Largelanguagemodels" not in text_content
        ), "Text should have proper spacing, not joined words"
        assert (
            "multiagentconversations" not in text_content.lower()
        ), "Text should have proper spacing between words"

    def test_scanned_pdf_handling(self, markitdown):
        """Test handling of scanned/image-based PDF (no text layer).

        Expected output: Empty - scanned PDFs without OCR have no text layer.
        """
        pdf_path = os.path.join(
            TEST_FILES_DIR, "MEDRPT-2024-PAT-3847_medical_report_scan.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)

        # Scanned PDFs without OCR have no text layer, so extraction should be empty
        assert (
            result is not None
        ), "Converter should return a result even for scanned PDFs"
        assert result.text_content is not None, "text_content should not be None"

        # Verify extraction is empty (no text layer in scanned PDF)
        assert (
            result.text_content.strip() == ""
        ), f"Scanned PDF should have empty extraction, got: '{result.text_content[:100]}...'"


class TestPdfTableMarkdownFormat:
    """Test that extracted tables have proper markdown formatting."""

    @pytest.fixture
    def markitdown(self):
        """Create MarkItDown instance."""
        return MarkItDown()

    def test_markdown_table_has_pipe_format(self, markitdown):
        """Test that form-style PDFs have pipe-separated format."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Find rows with pipes
        lines = text_content.split("\n")
        pipe_rows = [
            line for line in lines if line.startswith("|") and line.endswith("|")
        ]

        assert len(pipe_rows) > 0, "Should have pipe-separated rows"

        # Check that Product Code appears in a pipe-separated row
        product_code_found = any("Product Code" in row for row in pipe_rows)
        assert product_code_found, "Product Code should be in pipe-separated format"

    def test_markdown_table_columns_have_pipes(self, markitdown):
        """Test that form-style PDF columns are separated with pipes."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Find table rows and verify column structure
        lines = text_content.split("\n")
        table_rows = [
            line for line in lines if line.startswith("|") and line.endswith("|")
        ]

        assert len(table_rows) > 0, "Should have markdown table rows"

        # Check that at least some rows have multiple columns (pipes)
        multi_col_rows = [row for row in table_rows if row.count("|") >= 3]
        assert (
            len(multi_col_rows) > 5
        ), f"Should have rows with multiple columns, found {len(multi_col_rows)}"


class TestPdfTableStructureConsistency:
    """Test that extracted tables have consistent structure across all PDF types."""

    @pytest.fixture
    def markitdown(self):
        """Create MarkItDown instance."""
        return MarkItDown()

    def test_borderless_table_structure(self, markitdown):
        """Test that borderless table PDF has pipe-separated structure."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Should have pipe-separated content
        assert "|" in text_content, "Borderless table PDF should have pipe separators"

        # Check that key content is present
        assert "Product Code" in text_content, "Should contain Product Code"
        assert "SKU-8847" in text_content, "Should contain first SKU"
        assert "SKU-9201" in text_content, "Should contain second SKU"

    def test_multipage_invoice_table_structure(self, markitdown):
        """Test that multipage invoice PDF has pipe-separated format."""
        pdf_path = os.path.join(TEST_FILES_DIR, "REPAIR-2022-INV-001_multipage.pdf")

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Should have pipe-separated content
        assert "|" in text_content, "Invoice PDF should have pipe separators"

        # Find rows with pipes
        lines = text_content.split("\n")
        pipe_rows = [
            line for line in lines if line.startswith("|") and line.endswith("|")
        ]

        assert (
            len(pipe_rows) > 10
        ), f"Should have multiple pipe-separated rows, found {len(pipe_rows)}"

        # Check that some rows have multiple columns
        multi_col_rows = [row for row in pipe_rows if row.count("|") >= 4]
        assert len(multi_col_rows) > 5, "Should have rows with 3+ columns"

    def test_receipt_has_no_tables(self, markitdown):
        """Test that receipt PDF doesn't incorrectly extract tables from formatted text."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "RECEIPT-2024-TXN-98765_retail_purchase.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        tables = extract_markdown_tables(result.text_content)

        # Receipt should not have markdown tables extracted
        # (it's formatted text, not tabular data)
        # If tables are extracted, they should be minimal/empty
        total_table_rows = sum(len(t) for t in tables)
        assert (
            total_table_rows < 5
        ), f"Receipt should not have significant tables, found {total_table_rows} rows"

    def test_scanned_pdf_no_tables(self, markitdown):
        """Test that scanned PDF has empty extraction and no tables."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "MEDRPT-2024-PAT-3847_medical_report_scan.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)

        # Scanned PDF with no text layer should have empty extraction
        assert (
            result.text_content.strip() == ""
        ), "Scanned PDF should have empty extraction"

        tables = extract_markdown_tables(result.text_content)

        # Scanned PDF with no text layer should have no tables
        assert len(tables) == 0, "Scanned PDF should have no extracted tables"

    def test_all_pdfs_table_rows_consistent(self, markitdown):
        """Test that all PDF tables have rows with pipe-separated content.

        Note: With gap-based column detection, rows may have different column counts
        depending on how content is spaced in the PDF. What's important is that each
        row has pipe separators and the content is readable.
        """
        pdf_files = [
            "SPARSE-2024-INV-1234_borderless_table.pdf",
            "REPAIR-2022-INV-001_multipage.pdf",
            "RECEIPT-2024-TXN-98765_retail_purchase.pdf",
            "test.pdf",
        ]

        for pdf_file in pdf_files:
            pdf_path = os.path.join(TEST_FILES_DIR, pdf_file)
            if not os.path.exists(pdf_path):
                continue

            result = markitdown.convert(pdf_path)
            tables = extract_markdown_tables(result.text_content)

            for table_idx, table in enumerate(tables):
                if not table:
                    continue

                # Verify each row has at least one column (pipe-separated content)
                for row_idx, row in enumerate(table):
                    assert (
                        len(row) >= 1
                    ), f"{pdf_file}: Table {table_idx}, row {row_idx} has no columns"

                    # Verify the row has non-empty content
                    row_content = " ".join(cell.strip() for cell in row)
                    assert (
                        len(row_content.strip()) > 0
                    ), f"{pdf_file}: Table {table_idx}, row {row_idx} is empty"

    def test_borderless_table_data_integrity(self, markitdown):
        """Test that borderless table extraction preserves data integrity."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        tables = extract_markdown_tables(result.text_content)

        assert len(tables) >= 2, "Should have at least 2 tables"

        # Check first table has expected SKU data
        first_table = tables[0]
        table_text = str(first_table)
        assert "SKU-8847" in table_text, "First table should contain SKU-8847"
        assert "SKU-9201" in table_text, "First table should contain SKU-9201"

        # Check second table has expected category data
        second_table = tables[1]
        table_text = str(second_table)
        assert "Electronics" in table_text, "Second table should contain Electronics"
        assert "Hardware" in table_text, "Second table should contain Hardware"
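Several of the tests above repeat the same two checks: keep only lines that start and end with `|`, then count pipes to estimate the number of columns (a row with N cells has N + 1 pipes, which is why `>= 4` pipes corresponds to "3+ columns"). A minimal sketch of how that logic could be factored out follows. The helper names `find_pipe_rows` and `count_columns` are hypothetical and are not part of this change set, and the sketch does not handle escaped pipes inside cells.

```python
from typing import List


def find_pipe_rows(text: str) -> List[str]:
    """Return lines that look like markdown table rows (start and end with '|')."""
    return [
        line
        for line in text.split("\n")
        if line.startswith("|") and line.endswith("|")
    ]


def count_columns(row: str) -> int:
    """Estimate the cell count of a pipe-delimited row.

    N cells are bounded by N + 1 pipes, so subtract one from the pipe count.
    Escaped pipes inside a cell are not handled (assumption: none occur).
    """
    return max(row.count("|") - 1, 0)


# Illustrative input only; this is not output from any of the test PDFs.
sample = "Invoice\n| Product Code | Qty |\n| SKU-8847 | 2 |\nTotal: 2"
rows = find_pipe_rows(sample)
column_counts = [count_columns(row) for row in rows]
```

With helpers like these, a check such as `row.count("|") >= 4` could be written as `count_columns(row) >= 3`, which states the intent ("3+ columns") directly.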