95 Commits

Author SHA1 Message Date
afourney 63cbbd9de6 Updated warning about binding to non-local interfaces. (#1653) 2026-03-30 10:17:52 -07:00
lesyk a6c8ac46a6 Fix O(n) memory growth in PDF conversion by calling page.close() afte… (#1612)
* Fix O(n) memory growth in PDF conversion by calling page.close() after each page

* Refactor PDF memory optimization tests for improved readability and consistency

* Add memory benchmarking tests for PDF conversion with page.close() fix

* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code

* Bump version to 0.1.6b2 in __about__.py

* Update PDF conversion tests to include mimetype in StreamInfo
2026-03-16 10:35:24 -07:00
lesyk c6308dc822 [MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update READMEs

* Refactor import statements for consistency and improve formatting in converter and test files
2026-03-10 09:17:17 -07:00
afourney 4a5340f93b Bump version for release. (#1564) 2026-02-20 11:40:57 -08:00
Bas Nijholt 6b0fd15e60 Remove onnxruntime<=1.20.1 Windows pin (#1551) 2026-02-16 15:05:37 -08:00
afourney 2b6ec9f315 Add text/markdown to Accept header (#1554) 2026-02-13 11:53:01 -08:00
lesyk c83de14a9c [MS] Extend table support for wide tables (#1552)
* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests
2026-02-13 10:45:39 -08:00
lesyk 7fdaefb724 Fix: PDF parsing doesn't support partially numbered lists (#1525)
* Fix: PDF parsing doesn't support partially numbered lists

* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file

* Refactor: Improve assertion formatting in partial numbering tests
2026-01-08 15:15:22 -08:00
lesyk 251dddcf0c [MS] Update PDF table extraction to support aligned Markdown (#1499)
* Added PDF table extraction feature with aligned Markdown (#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>
2026-01-07 16:38:45 -08:00
afourney dde250a456 Bump versions of mammoth and pdfminer.six (#1492)
* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.
2025-12-01 10:11:24 -08:00
afourney 3d4fe3cdcc Upgrade mammoth to 1.11.0 (#1452) 2025-10-20 16:07:39 -07:00
afourney 447c047731 Test if mammoth resolves rlinks. (#1451) 2025-10-20 15:54:05 -07:00
Meirna 8a9d8f1593 feat: add checkbox support to Markdown converter (#1208)
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
2025-08-26 15:30:47 -07:00
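
A minimal sketch of the checkbox mapping described in the commit above, for illustration only. It is not the project's converter: the standard-library `html.parser` stands in for the real HTML-to-Markdown pipeline, and the names `CheckboxToMarkdown` and `checkboxes_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Hypothetical helper: emits [ ] or [x] for checkbox inputs, passes text through."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        # Only <input type=checkbox> is mapped; other inputs are dropped.
        if (attr_map.get("type") or "").lower() == "checkbox":
            # A bare `checked` attribute is reported with value None,
            # so test for the key rather than the value.
            self.parts.append("[x]" if "checked" in attr_map else "[ ]")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html):
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Under these assumptions, `checkboxes_to_markdown('<input type=checkbox checked> Done')` yields `[x] Done`, and an unchecked input yields `[ ]` followed by its text.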
Richard Ye 17365654c9 Handle PPTX shapes where position is None (#1161)
* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front
2025-08-26 15:28:17 -07:00
Yuzhong Zhang 59eb60f8cb fix docx parse error(\n in alt) (#1163) 2025-08-26 15:20:17 -07:00
Dmitry 459d462f29 docs: correct minor typos (#1173) 2025-08-26 15:15:23 -07:00
Noah Zhu c3f6cb356c Adding support for data-src Attribute (#1226)
* support for data-src
2025-08-26 15:11:53 -07:00
Ebrahim Tayabali 0c4d3945a0 Update README.md (#1191)
Fix: Subtle spelling mistake fixed.
2025-08-26 15:07:27 -07:00
Utkarsh kumar f8b60b5403 Update README.md (#1350)
ISSUE #1339
2025-08-26 15:02:56 -07:00
Stefan Rink b81a387616 fix: correctly pass custom llm prompt parameter (#1319)
* fix: correctly pass custom llm prompt parameter
2025-08-26 14:51:10 -07:00
safen0s ea1a3dfb60 Add HTML support to DocumentIntelligenceConverter (#1352) 2025-08-26 14:34:43 -07:00
t3tra fb1ad24833 Ensure safe ExifTool usage: require >= 12.24 (#1399)
* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------
2025-08-26 14:25:13 -07:00
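
The version gate named in the commit above could look roughly like the following sketch. It assumes the version string has the `major.minor` form printed by `exiftool -ver`, with a two-digit minor part (for example `12.24`). The helper names and the constant are hypothetical and are not taken from the repository.

```python
import re

# Minimum version required by the commit title (>= 12.24).
MIN_EXIFTOOL_VERSION = (12, 24)


def parse_exiftool_version(output):
    """Parse 'major.minor' text (e.g. the output of `exiftool -ver`) into a tuple of ints."""
    match = re.match(r"\s*(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"Unrecognized ExifTool version string: {output!r}")
    return int(match.group(1)), int(match.group(2))


def is_safe_exiftool_version(output):
    """Return True when the reported version meets the minimum."""
    return parse_exiftool_version(output) >= MIN_EXIFTOOL_VERSION
```

Comparing integer tuples rather than floats or raw strings avoids surprises such as `"9.99" > "12.24"` under string ordering.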
JonahDelman 1178c2e211 Fixed documentation typos in _base_converter.py (#1393) 2025-08-26 14:23:10 -07:00
afourney 9278119bb3 Resolved an issue with linked images in docx [mammoth] (#1405) 2025-08-26 14:20:29 -07:00
afourney 3bfb821c09 Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273)
* Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ

* Update the Dockerfile to enable plugins. No plugins are installed by default.
2025-06-03 09:35:33 -07:00
Tomasz Kalinowski 62b72284fe pin onnxruntime on Windows (#1274)
closes #1266
2025-05-28 13:13:51 -07:00
afourney 1dd3c83339 Promoting 0.1.2a1 to 0.1.2 (#1272) 2025-05-28 10:04:42 -07:00
afourney 9dc982a3b1 Small changes to favor streamable HTTP over deprecated SSE (#1264) 2025-05-23 11:39:41 -07:00
afourney effde4767b Preparing a pre-release of 0.1.2 (#1260) 2025-05-21 15:24:56 -07:00
rtpacks 04bf831209 docs: fix typos (#1201) 2025-05-21 15:12:22 -07:00
Betula-L 9fd680c366 support streamable http mcp (#1245)
Co-authored-by: luhualin
2025-05-21 14:34:50 -07:00
Yi-Cheng Wang 131f0c7739 feat: add Document Intelligence API version selection via kwargs (#1253)
Co-authored-by: Yi-Cheng Wang <yicheng.wang@heph-ai.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:22:08 -07:00
JoshClark-git 56f7579ce2 FIX YouTube transcript errors (#1241)
* FIX YouTube transcript errors

* Fixed formatting.

---------

Co-authored-by: Josh <jca351@sfu.ca>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:17:57 -07:00
t3tra cb421cf9ea Chore: Make linter happy (#1256)
* refactor: remove unused imports

* fix: replace NotImplemented with NotImplementedError

* refactor: resolve E722 (do not use bare 'except')

* refactor: remove unused variable

* refactor: remove unused imports

* refactor: ignore unused imports that will be used in the future

* refactor: resolve W293 (blank line contains whitespace)

* refactor: resolve F541 (f-string is missing placeholders)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:02:16 -07:00
kira-offgrid 39e7252940 fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse-packages-markitdown-src-markitdown-converter_utils-docx-math-omml.py (#1251) 2025-05-21 09:57:21 -07:00
afourney bbcf876b18 Switched from the stdlib minidom parser to defusedxml. (#1259) 2025-05-21 09:47:14 -07:00
createcentury 041be54471 Update README.md (#1187)
updated subtle misspelling.
2025-04-13 09:31:40 -07:00
Turdıbek 8576f1d915 Add CSV to Markdown table conversion - fixes #1144 (#1176)
* feat: Add CSV to Markdown table converter

- Add new CsvConverter class to convert CSV files to Markdown tables
- Support text/csv and application/csv MIME types
- Preserve table structure with headers and data rows
- Handle edge cases like empty cells and mismatched columns
- Fix Azure Document Intelligence dependency handling
- Register CsvConverter in MarkItDown class

----

Thanks also to @benny123tw who submitted a very similar PR in #1171
2025-04-13 09:19:00 -07:00
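
As a rough illustration of the behavior listed in the commit above (header row, data rows, empty cells, mismatched column counts), here is a standard-library sketch. It is not the `CsvConverter` from the repository. The function name `csv_to_markdown` is hypothetical, and padding or truncating each row to the header width is an assumed policy.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table; the first row is treated as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Pad short rows with empty cells and truncate long ones,
        # so every line has exactly as many columns as the header.
        padded = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(padded) + " |")
    return "\n".join(lines)
```

The sketch does not escape `|` characters inside cells; a real converter would need to handle that case.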
Sathindu 3fcd48cdfc feat: render math equations in .docx documents (#1160)
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
2025-03-28 15:36:38 -07:00
afourney 9e067c42b6 Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151)
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
2025-03-26 10:44:11 -07:00
afourney 73b9d57312 Update badges (#1157)
* Update badges in subpackages.
2025-03-25 14:52:24 -07:00
afourney 3ca57986ef Basic SSE MCP Server for MarkItDown (#1155)
* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop
* Pin mcp version.
2025-03-25 14:38:22 -07:00
afourney c1f9a323ee Bump version. (#1154) 2025-03-24 23:26:30 -07:00
afourney e928b43afb convert_url renamed to convert_uri, and now handles data and file URIs (#1153) 2025-03-24 21:43:04 -07:00
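
For readers unfamiliar with data URIs, the sketch below shows one way to split a `data:` URI into a MIME type and raw bytes using only the standard library. It illustrates the URI form, not the `convert_uri` implementation. The function name `parse_data_uri` is hypothetical.

```python
import base64
from urllib.parse import unquote_to_bytes


def parse_data_uri(uri):
    """Split 'data:[<mimetype>][;base64],<payload>' into (mimetype or None, bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("Not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("Malformed data URI: missing ','")
    parts = header.split(";")
    mimetype = parts[0] or None
    # Parameters follow the MIME type; ';base64' marks a base64 payload,
    # otherwise the payload is percent-encoded.
    if "base64" in parts[1:]:
        data = base64.b64decode(payload)
    else:
        data = unquote_to_bytes(payload)
    return mimetype, data
```

For example, `parse_data_uri("data:text/plain;base64,SGVsbG8=")` returns `("text/plain", b"Hello")`.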
afourney 2ffe6ea591 Bump version. (#1150) 2025-03-22 11:21:32 -07:00
afourney efc55b260d Bump version and resolve a console encoding error. (#1149) 2025-03-21 09:27:25 -07:00
Yuzhong Zhang 52432bd228 Add support for preserving base64 encoded images (#1140)
* Optionally preserve base64 strings in Markdown output from _CustomMarkdownify and pptx
* Add parameter support to other converters
* fix linter
* Use *kwarg to pass the keep_data_uri parameter.
* Add module cli vector tests
* Fixed formatting, and adjusted tests.
2025-03-20 18:50:23 -07:00
afourney c0a511ecff Updated docx file to include an image. (#1146) 2025-03-20 12:25:56 -07:00
afourney cd6aa41361 Adjust warning filters and update dependencies (#1143)
Adjusts warning filters to be more contextual
Updates dependencies for magika and youtube-transcript-api
Updates the version to 0.1.0a5 in __about__.py
2025-03-19 22:09:14 -07:00
afourney 716f74dcb9 Consider anything with a charset as plain text-convertible. (#1142) 2025-03-19 20:46:35 -07:00