markitdown

Author	SHA1	Message	Date
lesyk	c6308dc822	[MS] Add OCR layer service for embedded images and PDF scans (#1541) * Add OCR test data and implement tests for various document formats - Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy. * Enhance OCR functionality and validation in document converters - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction. * Add support for scanned PDFs with full-page OCR fallback and implement tests * Bump version to 0.1.6b1 in __about__.py * Refactor OCR services to support LLM Vision, update README and tests accordingly * Add OCR-enabled converters and ensure consistent OCR format across document types * Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters * Refactor exception imports for consistency across converters and tests * Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling * Bump version to 0.1.6b1 in __about__.py * Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing * Add comprehensive OCR test suite for various document formats - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms. * Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework. * Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs * Revert * Revert * Update REDMEs * Refactor import statements for consistency and improve formatting in converter and test files	2026-03-10 09:17:17 -07:00
[W]DOS_	16ca285d30	Update README.md (#1335) Fix typo in README.md	2025-08-26 14:55:58 -07:00
Stefan Rink	b81a387616	fix: correctly pass custom llm prompt parameter (#1319) * fix: correctly pass custom llm prompt parameter	2025-08-26 14:51:10 -07:00
onefloid	da7bcea527	docs: rephrase sentence (#1278)	2025-06-03 21:09:25 -07:00
一I	38261fd31c	Update Python version requirement and add .cursorrules to .gitignore (#1249) * update markdown * Update and install Python version suggestions * Update README with prerequisites. --------- Co-authored-by: Lucas Liu <lucas@LucasdeMacBook-Pro.local> Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:47:29 -07:00
lentil32	ebe2684b3d	chore: fix typo in README.md (#1175) * chore: fix typo in README.md	2025-04-13 09:29:16 -07:00
afourney	9a951055f0	Update readme to point to the mcp package. (#1158) * Updated readme with link to the MCP package.	2025-03-25 15:00:04 -07:00
afourney	2ffe6ea591	Bump version. (#1150)	2025-03-22 11:21:32 -07:00
afourney	a93e0567e6	EPub Support. Adapted #123 to not use epublib. (#1131) * Adapted #123 to not use epublib. * Updated README.md	2025-03-17 07:48:15 -07:00
Richard Ye	0229ff6cb7	feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order (#1104) * Sort PPTX shapes to be read in top-to-bottom, left-to-right order Referenced from https://github.com/ssine/pptx2md/blob/39bef65b312035baeade932aad8d221e37daae5f/pptx2md/parser.py#L249 * Update README.md * Fixed formatting. * Added missing import	2025-03-07 15:45:14 -08:00
Andrea Pietrobon	80baa5db18	fix(README): correct pip install command formatting (#1090) Added missing quotes around `markitdown[all]` in the installation command to ensure proper package resolution by pip.	2025-03-05 23:21:10 -08:00
Adam Fourney	00a65e8f8b	Fixed version in README.	2025-03-05 23:10:21 -08:00
afourney	e921497f79	Update converter API, user streams rather than file paths (#1088) * Updated DocumentConverter interface * Updated all DocumentConverter classes * Added support for various new audio files. * Updated sample plugin to new DocumentConverter interface. * Updated project README with notes about changes, and use-cases. * Updated DocumentConverter documentation. * Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple. --------- Co-authored-by: Kenny Zhang <kzhang678@gmail.com>	2025-03-05 21:16:55 -08:00
afourney	c5cd659f63	Exploring ways to allow Optional dependencies (#1079) * Enable optional dependencies. Starting with pptx. * Fix CLI tests.... have them install [all] * Added .docx to optional dependencies * Reuse error messages for missing dependencies. * Added xlsx and xls * Added pdfs * Added Ole files. * Updated READMEs, and finished remaining feature-categories. * Move OpenAI to hatch-test environment.	2025-03-03 09:06:19 -08:00
Nima Akbarzadeh	a394cc7c27	fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue (#1035) * fix: add error handling, refactor _findKey to use json.items() * fix: improve metadata and description extraction logic * fix: improve YouTube transcript extraction reliability * fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue * fix(readme): add youtube URLs as markitdown supports	2025-02-27 23:17:54 -08:00
afourney	c73afcffea	Cleanup and refactor, in preparation for plugin support. (#318) * Work started moving converters to individual files. * Significant cleanup and refactor. * Moved everything to a packages subfolder. * Added sample plugin. * Added instructions to the README.md * Bumped version, and added a note about compatibility.	2025-02-10 15:21:44 -08:00
KennyZhang1	bf6a15e9b5	Kennyzhang/docintel docs (#312) * updated docs to include doc intelligence * include reference to doc intel setup docs	2025-01-31 22:23:26 -08:00
afourney	265aea2edf	Removed the holiday away message from README.md (#266)	2025-01-06 09:06:21 -08:00
Ikko Eltociear Ashimine	125e206047	docs: update README.md (#182) faciliate -> facilitate	2024-12-21 01:51:30 -08:00
gagb	857a2d160d	Update README.md (#180)	2024-12-20 14:49:20 -08:00
Soulter	1123392306	fix: support -o param to avoid encoding issues (#116) * perf: cli supports -o param * doc: update README --------- Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-20 14:43:00 -08:00
gagb	7e6c36c5d4	docs: add contribution guidelines to README (#176)	2024-12-20 14:08:58 -08:00
gagb	c295dee5e4	Merge branch 'main' into patch-1	2024-12-19 13:22:51 -08:00
afourney	535147b2e8	Added holiday notice.	2024-12-19 11:11:54 -08:00
gagb	5c776bda70	Update README.md	2024-12-19 10:30:53 -08:00
gagb	423a01844a	Merge branch 'main' into patch-1	2024-12-19 10:30:10 -08:00
Petr@AP Consulting	b28f380a47	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-19 09:23:15 +01:00
gagb	a2743a5314	Add downloads badge	2024-12-18 14:26:36 -08:00
Petr@AP Consulting	f6e75c46d4	Update README.md I changed command for running script from Mac version (python3) to Windows version (python)	2024-12-18 21:17:47 +01:00
Petr@AP Consulting	f4471d96e2	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:08:10 +01:00
Petr@AP Consulting	088007338d	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:07:55 +01:00
Petr@AP Consulting	bb929629f3	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:05:36 +01:00
Petr@AP Consulting	233ba679b8	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:05:04 +01:00
gagb	46b7f043d3	Merge branch 'main' into patch-1	2024-12-18 11:57:57 -08:00
Petr@AP Consulting	224f1df0fc	Update README.md I collapsed section about batch processing as was suggested	2024-12-18 09:28:18 +01:00
gagb	524aa0da75	Update README.md	2024-12-17 17:25:40 -08:00
gagb	de1b54d79f	Update README.md	2024-12-17 17:25:13 -08:00
gagb	1e7806a7ac	Simplify	2024-12-17 17:21:39 -08:00
gagb	3bcf2bdae7	Update README.md	2024-12-17 16:54:17 -08:00
gagb	f1e399eee4	Merge branch 'main' into add-devcontainer-config	2024-12-17 16:50:32 -08:00
Petr@AP Consulting	f398f3d443	Update README.md I added description and script for batch of files processing	2024-12-17 10:26:09 +01:00
lumin	e0a30295ff	docs: update README with Devcontainer instructions Add instructions for using Dev to run tests. Remove the install script it is no longer needed. Update trademark section for clarity.	2024-12-17 17:04:31 +09:00
diya155	14bd8d319a	Update README.md	2024-12-17 09:16:40 +05:30
gagb	24b52b2b8f	Improve readme	2024-12-16 17:35:47 -08:00
gagb	09159aa04e	Merge branch 'main' into main	2024-12-16 17:24:47 -08:00
gagb	736e7d9a7e	Merge branch 'main' into patch-1	2024-12-16 16:53:58 -08:00
gagb	360c2dd95f	Merge branch 'main' into main	2024-12-16 16:35:50 -08:00
gagb	ae4669107c	Merge branch 'main' into main	2024-12-16 16:01:59 -08:00
afourney	fa1f496d51	Merge branch 'main' into patch-1	2024-12-16 14:18:20 -08:00
gagb	9e6a19987b	Merge branch 'main' into main	2024-12-16 13:51:39 -08:00

70 Commits