6 Commits

Author SHA1 Message Date
lesyk 251dddcf0c [MS] Update PDF table extraction to support aligned Markdown (#1499)
* Added PDF table extraction feature with aligned Markdown (#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>
2026-01-07 16:38:45 -08:00
一I 38261fd31c Update Python version requirement and add .cursorrules to .gitignore (#1249)
* update markdown
* Update and install Python version suggestions
* Update README with prerequisites.
---------

Co-authored-by: Lucas Liu <lucas@LucasdeMacBook-Pro.local>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:47:29 -07:00
Sugato Ray 6f3c762526 Merge branch 'main' into update_commandline_help 2024-12-18 17:50:07 -05:00
Sugato Ray 1384e80725 update .gitignore to exclude .vscode folder 2024-12-18 21:46:06 +00:00
Joel Esler 6e4caac70d Safeguard against path traversal for ZipConverter
fix: prevent path traversal vulnerabilities in ZipConverter

Added a secure check for path traversal vulnerabilities in the ZipConverter class.
Now validates extracted file paths using `os.path.commonprefix` to ensure all files
remain within the intended extraction directory. Raises a `ValueError` if a
path traversal attempt is detected.

- Normalized file paths using `os.path.normpath`.
- Added specific exception handling for `zipfile.BadZipFile` and traversal errors.
- Ensured cleanup of extracted files after processing when `cleanup_extracted` is enabled.
2024-12-18 13:12:55 -05:00
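
The path check described in the ZipConverter commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `safe_extract` and its parameters are hypothetical names, and the sketch uses `os.path.commonpath` in place of the `os.path.commonprefix` named in the message, because `commonprefix` compares strings character by character and would treat `/tmp/out` as a prefix of `/tmp/outside`.

```python
import os
import zipfile


def safe_extract(zip_path, dest_dir):
    """Extract zip_path into dest_dir, rejecting entries that escape it.

    Hypothetical helper for illustration; not the ZipConverter implementation.
    """
    root = os.path.abspath(dest_dir)
    extracted = []
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # Validate every member before writing anything to disk.
            for name in archive.namelist():
                # Normalize so that "a/../../b" collapses to its real target.
                target = os.path.normpath(os.path.join(root, name))
                # Assumption: commonpath is swapped in for commonprefix here.
                if os.path.commonpath([root, target]) != root:
                    raise ValueError(f"Path traversal detected: {name}")
            # All members are inside root, so extraction can proceed.
            for name in archive.namelist():
                extracted.append(archive.extract(name, root))
    except zipfile.BadZipFile as exc:
        # Surface corrupt archives with a message naming the file.
        raise zipfile.BadZipFile(f"Invalid zip file: {zip_path}") from exc
    return extracted
```

Validating all members before extracting any of them means a malicious archive is rejected without leaving partial output behind. Cleanup of extracted files (the `cleanup_extracted` option mentioned in the commit) is left out of this sketch.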
microsoft-github-operations[bot] f454a6d3c8 Initial commit 2024-11-13 19:56:40 +00:00