298 Commits

Author SHA1 Message Date
lesyk 7fdaefb724 Fix: PDF parsing doesn't support partially numbered lists (#1525)
* Fix: PDF parsing doesn't support partially numbered lists

* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file

* Refactor: Improve assertion formatting in partial numbering tests
v0.1.5b1
2026-01-08 15:15:22 -08:00
lesyk 251dddcf0c [MS] Update PDF table extraction to support aligned Markdown (#1499)
* Added PDF table extraction feature with aligned Markdown (#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>
2026-01-07 16:38:45 -08:00
afourney dde250a456 Bump versions of mammoth and pdfminer.six (#1492)
* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.
v0.1.4
2025-12-01 10:11:24 -08:00
afourney 3d4fe3cdcc Upgrade mammoth to 1.11.0 (#1452) 2025-10-20 16:07:39 -07:00
afourney 447c047731 Test if mammoth resolves rlinks. (#1451) 2025-10-20 15:54:05 -07:00
Meirna 8a9d8f1593 feat: add checkbox support to Markdown converter (#1208)
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
v0.1.3
2025-08-26 15:30:47 -07:00
Richard Ye 17365654c9 Handle PPTX shapes where position is None (#1161)
* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front
2025-08-26 15:28:17 -07:00
Yuzhong Zhang 59eb60f8cb fix docx parse error(\n in alt) (#1163) 2025-08-26 15:20:17 -07:00
Dmitry 459d462f29 docs: correct minor typos (#1173) 2025-08-26 15:15:23 -07:00
Noah Zhu c3f6cb356c Adding support for data-src Attribute (#1226)
* supportfordata-src
2025-08-26 15:11:53 -07:00
Ebrahim Tayabali 0c4d3945a0 Update README.md (#1191)
Fix: Subtle spelling mistake fixed.
2025-08-26 15:07:27 -07:00
Utkarsh kumar f8b60b5403 Update README.md (#1350)
ISSUE #1339
2025-08-26 15:02:56 -07:00
[W]DOS_ 16ca285d30 Update README.md (#1335)
Fix typo in README.md
2025-08-26 14:55:58 -07:00
Stefan Rink b81a387616 fix: correctly pass custom llm prompt parameter (#1319)
* fix: correctly pass custom llm prompt parameter
2025-08-26 14:51:10 -07:00
safen0s ea1a3dfb60 Add HTML support to DocumentIntelligenceConverter (#1352) 2025-08-26 14:34:43 -07:00
dependabot[bot] b6e5da8874 Bump actions/checkout from 4 to 5 (#1394)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
2025-08-26 14:27:38 -07:00
t3tra fb1ad24833 Ensure safe ExifTool usage: require >= 12.24 (#1399)
* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------
2025-08-26 14:25:13 -07:00
JonahDelman 1178c2e211 Fixed documentation typos in _base_converter.py (#1393) 2025-08-26 14:23:10 -07:00
afourney 9278119bb3 Resolved an issue with linked images in docx [mammoth] (#1405) 2025-08-26 14:20:29 -07:00
onefloid da7bcea527 docs: rephrase sentence (#1278) 2025-06-03 21:09:25 -07:00
afourney 3bfb821c09 Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273)
* Have the MarkItdown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ

* Update the Dockerfile to enable plugins. No puglins are installed by default.
2025-06-03 09:35:33 -07:00
Tomasz Kalinowski 62b72284fe pin onnxruntime on Windows (#1274)
closes #1266
2025-05-28 13:13:51 -07:00
afourney 1dd3c83339 Promoting 0.1.2a1 to 0.1.2 (#1272) v0.1.2 2025-05-28 10:04:42 -07:00
afourney 9dc982a3b1 Small changes to favor streamable HTTP over deprecated SSE (#1264) 2025-05-23 11:39:41 -07:00
afourney effde4767b Preparing a pre-release of 0.1.2 (#1260) v0.1.2a1 2025-05-21 15:24:56 -07:00
rtpacks 04bf831209 docs: fix typos (#1201) 2025-05-21 15:12:22 -07:00
Betula-L 9fd680c366 support streamable http mcp (#1245)
Co-authored-by: luhualin
2025-05-21 14:34:50 -07:00
一I 38261fd31c Update Python version requirement and add .cursorrules to .gitignore (#1249)
* update markdown
* Update and install Python version suggestions
* Update README with prerequisites.
---------

Co-authored-by: Lucas Liu <lucas@LucasdeMacBook-Pro.local>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:47:29 -07:00
Yi-Cheng Wang 131f0c7739 feat: add Document Intelligence API version selection via kwargs (#1253)
Co-authored-by: Yi-Cheng Wang <yicheng.wang@heph-ai.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:22:08 -07:00
JoshClark-git 56f7579ce2 FIX YouTube transcript errors (#1241)
* FIX YouTube transcript errors

* Fixed formatting.

---------

Co-authored-by: Josh <jca351@sfu.ca>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:17:57 -07:00
t3tra cb421cf9ea Chore: Make linter happy (#1256)
* refactor: remove unused imports

* fix: replace NotImplemented with NotImplementedError

* refactor: resolve E722 (do not use bare 'except')

* refactor: remove unused variable

* refactor: remove unused imports

* refactor: ignore unused imports that will be used in the future

* refactor: resolve W293 (blank line contains whitespace)

* refactor: resolve F541 (f-string is missing placeholders)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:02:16 -07:00
kira-offgrid 39e7252940 fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse-packages-markitdown-src-markitdown-converter_utils-docx-math-omml.py (#1251) 2025-05-21 09:57:21 -07:00
afourney bbcf876b18 Switched from the stdlib minidom parser to defusedxml. (#1259) 2025-05-21 09:47:14 -07:00
createcentury 041be54471 Update README.md (#1187)
updated subtle misspelling.
2025-04-13 09:31:40 -07:00
lentil32 ebe2684b3d chore: fix typo in README.md (#1175)
* chore: fix typo in README.md
2025-04-13 09:29:16 -07:00
Turdıbek 8576f1d915 Add CSV to Markdown table conversion - fixes #1144 (#1176)
* feat: Add CSV to Markdown table converter

- Add new CsvConverter class to convert CSV files to Markdown tables\n- Support text/csv and application/csv MIME types\n- Preserve table structure with headers and data rows\n- Handle edge cases like empty cells and mismatched columns\n- Fix Azure Document Intelligence dependency handling\n- Register CsvConverter in MarkItDown class

----

Thanks also to @benny123tw who submitted a very similar PR in #1171
2025-04-13 09:19:00 -07:00
Sathindu 3fcd48cdfc feat: render math equations in .docx documents (#1160)
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
2025-03-28 15:36:38 -07:00
afourney 9e067c42b6 Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151)
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
2025-03-26 10:44:11 -07:00
afourney 9a951055f0 Update readme to point to the mcp package. (#1158)
* Updated readme with link to the MCP package.
2025-03-25 15:00:04 -07:00
afourney 73b9d57312 Update badges (#1157)
* Update badges in subpackages.
2025-03-25 14:52:24 -07:00
afourney 3ca57986ef Basic SSE MCP Server for MarkItDown (#1155)
* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop
* Pin mcp version.
2025-03-25 14:38:22 -07:00
afourney c1f9a323ee Bump version. (#1154) v0.1.1 2025-03-24 23:26:30 -07:00
afourney e928b43afb convert_url renamed to convert_uri, and now handles data and file URIs (#1153) 2025-03-24 21:43:04 -07:00
afourney 2ffe6ea591 Bump version. (#1150) v0.1.0 2025-03-22 11:21:32 -07:00
afourney efc55b260d Bump version and resolve a console encoding error. (#1149) v0.1.0a6 2025-03-21 09:27:25 -07:00
Yuzhong Zhang 52432bd228 Add support for preserving base64 encoded images (#1140)
* optional reserve base64 string in markdown _CustomMarkdownify and pptx
* add other converter para support
* fix linter
* Use *kwarg to pass keep_data_uri para.
* Add module cli vector tests
* Fixed formatting, and adjusted tests.
2025-03-20 18:50:23 -07:00
afourney c0a511ecff Updated docx file to include an image. (#1146) 2025-03-20 12:25:56 -07:00
afourney cd6aa41361 Adjust warning filters and update dependencies (#1143)
Adjusts warning filters to be more contextual
Updates dependencies for magika and youtube-transcript-api
Updates the version to 0.1.0a5 in __about__.py
v0.1.0a5
2025-03-19 22:09:14 -07:00
afourney 716f74dcb9 Consider anything with a charset as plain text-convertible. (#1142) 2025-03-19 20:46:35 -07:00
afourney a93e0567e6 EPub Support. Adapted #123 to not use epublib. (#1131)
* Adapted #123 to not use epublib.
* Updated README.md
v0.1.0a4
2025-03-17 07:48:15 -07:00