- app.py: Streamlit-based web UI for file-to-markdown conversion
- Dockerfile.ui: Docker image for UI (ffmpeg, ExifTool, port 8501)
- .dockerignore: whitelist app.py for Docker build
- CLAUDE.md: project instructions for Claude Code
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix O(n) memory growth in PDF conversion by calling page.close() after each page
* Refactor PDF memory optimization tests for improved readability and consistency
* Add memory benchmarking tests for PDF conversion with page.close() fix
* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code
* Bump version to 0.1.6b2 in __about__.py
* Update PDF conversion tests to include mimetype in StreamInfo
* Add OCR test data and implement tests for various document formats
- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
* Enhance OCR functionality and validation in document converters
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
* Add support for scanned PDFs with full-page OCR fallback and implement tests
* Bump version to 0.1.6b1 in __about__.py
* Refactor OCR services to support LLM Vision, update README and tests accordingly
* Add OCR-enabled converters and ensure consistent OCR format across document types
* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters
* Refactor exception imports for consistency across converters and tests
* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling
* Bump version to 0.1.6b1 in __about__.py
* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
* Add comprehensive OCR test suite for various document formats
- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.
* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency
- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs
* Revert
* Revert
* Update READMEs
* Refactor import statements for consistency and improve formatting in converter and test files
* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests
* Fix: PDF parsing doesn't support partially numbered lists
* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file
* Refactor: Improve assertion formatting in partial numbering tests
* Added PDF table extraction feature with aligned Markdown (#1419)
* Add PDF test files and enhance extraction tests
- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.
* fix: update dependencies for PDF processing and improve table extraction logic
* Bumped version of pdfminer.six
---------
Authored-by: Ashok <ashh010101@gmail.com>
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
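As a rough illustration of the checkbox mapping described above, here is a minimal sketch using only the standard library's `html.parser`. The class `CheckboxToMarkdown` and the function `checkboxes_to_markdown` are hypothetical names made up for this example; this is not the converter code from the change itself, which may be structured quite differently.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Hypothetical helper: emits text content, replacing checkbox inputs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        # Only checkbox inputs are converted; other inputs are dropped.
        if (attr_map.get("type") or "").lower() == "checkbox":
            # A bare `checked` attribute arrives as ("checked", None).
            self.parts.append("[x]" if "checked" in attr_map else "[ ]")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html):
    """Return the text of `html` with checkboxes rendered as [ ] or [x]."""
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For example, `checkboxes_to_markdown('<input type=checkbox checked> Done')` yields `[x] Done`, while an unchecked box yields `[ ]` followed by its label text.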
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
* Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ
* Update the Dockerfile to enable plugins. No plugins are installed by default.
* update markdown
* Update installation and Python version suggestions
* Update README with prerequisites.
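The environment-variable lookup mentioned above could look roughly like the sketch below. The helper name `plugins_enabled`, the default of `"false"`, and the set of accepted truthy strings are all assumptions made for illustration; the MCP server's actual parsing rules are not stated in this log.

```python
import os


def plugins_enabled(environ=None):
    """Hypothetical helper: interpret MARKITDOWN_ENABLE_PLUGINS as a boolean.

    `environ` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if environ is None else environ
    # Assumption: plugins stay disabled when the variable is unset.
    value = env.get("MARKITDOWN_ENABLE_PLUGINS", "false")
    # Assumption: these spellings count as "enabled".
    return value.strip().lower() in ("true", "1", "yes")
```

A server would then presumably pass the result on when constructing its converter, for instance as an `enable_plugins`-style flag.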
---------
Co-authored-by: Lucas Liu <lucas@LucasdeMacBook-Pro.local>
Co-authored-by: afourney <adamfo@microsoft.com>
* refactor: remove unused imports
* fix: replace NotImplemented with NotImplementedError
* refactor: resolve E722 (do not use bare 'except')
* refactor: remove unused variable
* refactor: remove unused imports
* refactor: ignore unused imports that will be used in the future
* refactor: resolve W293 (blank line contains whitespace)
* refactor: resolve F541 (f-string is missing placeholders)
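The lint fixes listed above follow well-known patterns. The sketch below shows the corrected form of three of them on made-up code; `BaseConverter`, `parse_int`, and `greeting` are hypothetical names and do not come from the repository.

```python
class BaseConverter:
    """Hypothetical base class used only to illustrate the fix."""

    def convert(self, data):
        # `raise NotImplemented` fails with a TypeError, because
        # NotImplemented is a sentinel value, not an exception class.
        raise NotImplementedError("Subclasses must implement convert()")


def parse_int(text):
    # E722: catch the specific exception instead of a bare `except:`,
    # which would also swallow KeyboardInterrupt and SystemExit.
    try:
        return int(text)
    except ValueError:
        return None


def greeting():
    # F541: a string without placeholders does not need the f prefix
    # (previously this might have been written as f"hello").
    return "hello"
```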
---------
Co-authored-by: afourney <adamfo@microsoft.com>
* feat: Add CSV to Markdown table converter
- Add new CsvConverter class to convert CSV files to Markdown tables
- Support text/csv and application/csv MIME types
- Preserve table structure with headers and data rows
- Handle edge cases like empty cells and mismatched columns
- Fix Azure Document Intelligence dependency handling
- Register CsvConverter in MarkItDown class
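To make the CSV-to-table behaviour above concrete, here is a minimal standalone sketch built on the standard `csv` module. The function name `csv_to_markdown` and the choice to pad short rows and truncate long ones to the header width are assumptions for illustration; the actual CsvConverter may handle mismatched columns differently.

```python
import csv
import io


def csv_to_markdown(text):
    """Hypothetical helper: render CSV text as a Markdown table."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in body:
        # Assumption: pad short rows with empty cells and drop extra cells
        # so every row matches the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Calling `csv_to_markdown("a,b\n1,2\n3\n")` produces a two-column table in which the last row's missing cell is left empty.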
----
Thanks also to @benny123tw who submitted a very similar PR in #1171
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile and updated the README accordingly. Also added instructions for Claude Desktop.
* Pin mcp version.