markitdown

Author	SHA1	Message	Date
afourney	63cbbd9de6	Updated warning about binding to non-local interfaces. (#1653 )	2026-03-30 10:17:52 -07:00
lesyk	a6c8ac46a6	Fix O(n) memory growth in PDF conversion by calling page.close() afte… (#1612 ) * Fix O(n) memory growth in PDF conversion by calling page.close() after each page * Refactor PDF memory optimization tests for improved readability and consistency * Add memory benchmarking tests for PDF conversion with page.close() fix * Remove unnecessary blank lines in PDF memory optimization tests for cleaner code * Bump version to 0.1.6b2 in __about__.py * Update PDF conversion tests to include mimetype in StreamInfo	2026-03-16 10:35:24 -07:00
lesyk	c6308dc822	[MS] Add OCR layer service for embedded images and PDF scans (#1541 ) * Add OCR test data and implement tests for various document formats - Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy. * Enhance OCR functionality and validation in document converters - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction. * Add support for scanned PDFs with full-page OCR fallback and implement tests * Bump version to 0.1.6b1 in __about__.py * Refactor OCR services to support LLM Vision, update README and tests accordingly * Add OCR-enabled converters and ensure consistent OCR format across document types * Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters * Refactor exception imports for consistency across converters and tests * Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling * Bump version to 0.1.6b1 in __about__.py * Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing * Add comprehensive OCR test suite for various document formats - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms. * Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework. * Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs * Revert * Revert * Update REDMEs * Refactor import statements for consistency and improve formatting in converter and test files	2026-03-10 09:17:17 -07:00
afourney	4a5340f93b	Bump version for release. (#1564 )	2026-02-20 11:40:57 -08:00
Bas Nijholt	6b0fd15e60	Remove onnxruntime<=1.20.1 Windows pin (#1551 )	2026-02-16 15:05:37 -08:00
afourney	2b6ec9f315	Add text/markdown to Accept header (#1554 )	2026-02-13 11:53:01 -08:00
lesyk	c83de14a9c	[MS] Extend table support for wide tables (#1552 ) * feat: enhance PDF table extraction to support complex forms and add new test cases * feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases * fix: correct formatting and improve assertions in PDF table tests	2026-02-13 10:45:39 -08:00
lesyk	7fdaefb724	Fix: PDF parsing doesn't support partially numbered lists (#1525 ) * Fix: PDF parsing doesn't support partially numbered lists * Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file * Refactor: Improve assertion formatting in partial numbering tests	2026-01-08 15:15:22 -08:00
lesyk	251dddcf0c	[MS] Update PDF table extraction to support aligned Markdown (#1499 ) * Added PDF table extraction feature with aligned Markdown (#1419) * Add PDF test files and enhance extraction tests - Added a medical report scan PDF for testing scanned PDF handling. - Included a retail purchase receipt PDF to validate receipt extraction functionality. - Introduced a multipage invoice PDF to test extraction of complex invoice structures. - Added a borderless table PDF for testing inventory reconciliation report extraction. - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity. - Enhanced existing tests to validate the order and presence of extracted content across various PDF types. * fix: update dependencies for PDF processing and improve table extraction logic * Bumped version of pdfminer.six --------- Authored-by: Ashok <ashh010101@gmail.com>	2026-01-07 16:38:45 -08:00
afourney	dde250a456	Bump versions of mammoth and pdfminer.six (#1492 ) * Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.	2025-12-01 10:11:24 -08:00
afourney	3d4fe3cdcc	Upgrade mammoth to 1.11.0 (#1452 )	2025-10-20 16:07:39 -07:00
afourney	447c047731	Test if mammoth resolves rlinks. (#1451 )	2025-10-20 15:54:05 -07:00
Meirna	8a9d8f1593	feat: add checkbox support to Markdown converter (#1208 ) This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]). Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>	2025-08-26 15:30:47 -07:00
Richard Ye	17365654c9	Handle PPTX shapes where position is None (#1161 ) * Handle shapes where position is None * Fixed recursion error, and place no-coord shapes at front	2025-08-26 15:28:17 -07:00
Yuzhong Zhang	59eb60f8cb	fix docx parse error(\n in alt) (#1163 )	2025-08-26 15:20:17 -07:00
Dmitry	459d462f29	docs: correct minor typos (#1173 )	2025-08-26 15:15:23 -07:00
Noah Zhu	c3f6cb356c	Adding support for data-src Attribute (#1226 ) * supportfordata-src	2025-08-26 15:11:53 -07:00
Ebrahim Tayabali	0c4d3945a0	Update README.md (#1191 ) Fix: Subtle spelling mistake fixed.	2025-08-26 15:07:27 -07:00
Utkarsh kumar	f8b60b5403	Update README.md (#1350 ) ISSUE #1339	2025-08-26 15:02:56 -07:00
Stefan Rink	b81a387616	fix: correctly pass custom llm prompt parameter (#1319 ) * fix: correctly pass custom llm prompt parameter	2025-08-26 14:51:10 -07:00
safen0s	ea1a3dfb60	Add HTML support to DocumentIntelligenceConverter (#1352 )	2025-08-26 14:34:43 -07:00
t3tra	fb1ad24833	Ensure safe ExifTool usage: require >= 12.24 (#1399 ) * feat: add version verification for ExifTool to ensure security compliance * fix: improve ExifTool version verification	2025-08-26 14:25:13 -07:00
JonahDelman	1178c2e211	Fixed documentation typos in _base_converter.py (#1393 )	2025-08-26 14:23:10 -07:00
afourney	9278119bb3	Resolved an issue with linked images in docx [mammoth] (#1405 )	2025-08-26 14:20:29 -07:00
afourney	3bfb821c09	Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273 ) * Have the MarkItdown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ * Update the Dockerfile to enable plugins. No puglins are installed by default.	2025-06-03 09:35:33 -07:00
Tomasz Kalinowski	62b72284fe	pin onnxruntime on Windows (#1274 ) closes #1266	2025-05-28 13:13:51 -07:00
afourney	1dd3c83339	Promoting 0.1.2a1 to 0.1.2 (#1272 )	2025-05-28 10:04:42 -07:00
afourney	9dc982a3b1	Small changes to favor streamable HTTP over deprecated SSE (#1264 )	2025-05-23 11:39:41 -07:00
afourney	effde4767b	Preparing a pre-release of 0.1.2 (#1260 )	2025-05-21 15:24:56 -07:00
rtpacks	04bf831209	docs: fix typos (#1201 )	2025-05-21 15:12:22 -07:00
Betula-L	9fd680c366	support streamable http mcp (#1245 ) Co-authored-by: luhualin	2025-05-21 14:34:50 -07:00
Yi-Cheng Wang	131f0c7739	feat: add Document Intelligence API version selection via kwargs (#1253 ) Co-authored-by: Yi-Cheng Wang <yicheng.wang@heph-ai.com> Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:22:08 -07:00
JoshClark-git	56f7579ce2	FIX YouTube transcript errors (#1241 ) * FIX YouTube transcript errors * Fixed formatting. --------- Co-authored-by: Josh <jca351@sfu.ca> Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:17:57 -07:00
t3tra	cb421cf9ea	Chore: Make linter happy (#1256 ) * refactor: remove unused imports * fix: replace NotImplemented with NotImplementedError * refactor: resolve E722 (do not use bare 'except') * refactor: remove unused variable * refactor: remove unused imports * refactor: ignore unused imports that will be used in the future * refactor: resolve W293 (blank line contains whitespace) * refactor: resolve F541 (f-string is missing placeholders) --------- Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:02:16 -07:00
kira-offgrid	39e7252940	fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse-packages-markitdown-src-markitdown-converter_utils-docx-math-omml.py (#1251 )	2025-05-21 09:57:21 -07:00
afourney	bbcf876b18	Switched from the stdlib minidom parser to defusedxml. (#1259 )	2025-05-21 09:47:14 -07:00
createcentury	041be54471	Update README.md (#1187 ) updated subtle misspelling.	2025-04-13 09:31:40 -07:00
Turdıbek	8576f1d915	Add CSV to Markdown table conversion - fixes #1144 (#1176 ) * feat: Add CSV to Markdown table converter - Add new CsvConverter class to convert CSV files to Markdown tables - Support text/csv and application/csv MIME types - Preserve table structure with headers and data rows - Handle edge cases like empty cells and mismatched columns - Fix Azure Document Intelligence dependency handling - Register CsvConverter in MarkItDown class ---- Thanks also to @benny123tw who submitted a very similar PR in #1171	2025-04-13 09:19:00 -07:00
Sathindu	3fcd48cdfc	feat: render math equations in .docx documents (#1160 ) * feat: math equation rendering in .docx files * fix: import fix on .docx pre processing * test: add test cases for docx equation rendering * docs: add ThirdPartyNotices.md * refactor: reformatted with black	2025-03-28 15:36:38 -07:00
afourney	9e067c42b6	Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151 ) * Make it easier to use AzureKeyCredentials with Azure Doc Intelligence * Fixed mypy type error. * Added more fine-grained options over types. * Pass doc intel options further up the stack.	2025-03-26 10:44:11 -07:00
afourney	73b9d57312	Update badges (#1157 ) * Update badges in subpackages.	2025-03-25 14:52:24 -07:00
afourney	3ca57986ef	Basic SSE MCP Server for MarkItDown (#1155 ) * Added an initial minimal MCP server for MarkItDown * Added STDIO default option. * Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop * Pin mcp version.	2025-03-25 14:38:22 -07:00
afourney	c1f9a323ee	Bump version. (#1154 )	2025-03-24 23:26:30 -07:00
afourney	e928b43afb	convert_url renamed to convert_uri, and now handles data and file URIs (#1153 )	2025-03-24 21:43:04 -07:00
afourney	2ffe6ea591	Bump version. (#1150 )	2025-03-22 11:21:32 -07:00
afourney	efc55b260d	Bump version and resolve a console encoding error. (#1149 )	2025-03-21 09:27:25 -07:00
Yuzhong Zhang	52432bd228	Add support for preserving base64 encoded images (#1140 ) * optional reserve base64 string in markdown _CustomMarkdownify and pptx * add other converter para support * fix linter * Use kwarg to pass keep_data_uri para. Add module cli vector tests * Fixed formatting, and adjusted tests.	2025-03-20 18:50:23 -07:00
afourney	c0a511ecff	Updated docx file to include an image. (#1146 )	2025-03-20 12:25:56 -07:00
afourney	cd6aa41361	Adjust warning filters and update dependencies (#1143 ) Adjusts warning filters to be more contextual Updates dependencies for magika and youtube-transcript-api Updates the version to 0.1.0a5 in __about__.py	2025-03-19 22:09:14 -07:00
afourney	716f74dcb9	Consider anything with a charset as plain text-convertible. (#1142 )	2025-03-19 20:46:35 -07:00

95 Commits