[MS] Add OCR layer service for embedded images and PDF scans (#1541)
* Add OCR test data and implement tests for various document formats
  - Created HTML file with multiple images for testing OCR extraction.
  - Added several PDF files with different layouts and image placements to validate OCR functionality.
  - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
  - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
  - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
* Enhance OCR functionality and validation in document converters
  - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
  - Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
  - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
  - Improve error handling and logging for better debugging during OCR extraction.
* Add support for scanned PDFs with full-page OCR fallback and implement tests
* Bump version to 0.1.6b1 in __about__.py
* Refactor OCR services to support LLM Vision, update README and tests accordingly
* Add OCR-enabled converters and ensure consistent OCR format across document types
* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters
* Refactor exception imports for consistency across converters and tests
* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling
* Bump version to 0.1.6b1 in __about__.py
* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
* Add comprehensive OCR test suite for various document formats
  - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
  - Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
  - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
  - Added expected OCR results for validation against ground truth.
  - Included tests for scanned documents to verify OCR fallback mechanisms.
* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency
  - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
  - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
  - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs
* Revert
* Revert
* Update READMEs
* Refactor import statements for consistency and improve formatting in converter and test files
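
The "full-page OCR fallback" for scanned PDFs mentioned above can be pictured roughly as follows. This is only a minimal sketch of the general idea, not the converter code from this commit: the function name `extract_with_ocr_fallback` and the `ocr_page` callback are hypothetical, and the real `PdfConverterWithOCR` may decide when to fall back differently.

```python
from typing import Callable, List, Sequence


def extract_with_ocr_fallback(
    page_texts: Sequence[str],
    ocr_page: Callable[[int], str],
) -> List[str]:
    """Return one text per page, using OCR only where no text layer exists.

    page_texts: text already extracted from each page's text layer.
    ocr_page:   hypothetical callback that OCRs the whole page at an index.
    """
    results: List[str] = []
    for index, text in enumerate(page_texts):
        if text.strip():
            # The page has a usable text layer, so keep it as-is.
            results.append(text)
        else:
            # Empty or whitespace-only page: treat it as a scan and OCR it.
            results.append(ocr_page(index))
    return results


if __name__ == "__main__":
    # Stand-in OCR callback for demonstration purposes only.
    fake_ocr = lambda i: f"[ocr page {i}]"
    print(extract_with_ocr_fallback(["Hello", "   "], fake_ocr))
```

Passing the OCR step as a callback keeps the sketch independent of any particular OCR backend, which also makes it easy to substitute a mock in tests.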
@@ -0,0 +1,79 @@
%PDF-1.3
%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
1 0 obj
<<
/F1 2 0 R
>>
endobj
2 0 obj
<<
/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
>>
endobj
3 0 obj
<<
/BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [ /ASCII85Decode /FlateDecode ] /Height 100 /Length 4720 /Subtype /Image
/Type /XObject /Width 500
>>
stream
Gb"0VH#OJ:qoA4A'Hn[Z$4u82K`j4ZR%PRX#F(qaK&V?K8<b&(;#dar'M8Oni!CfV<*>q$(4m``.P7J2`EVk#7#VtC6:(*s$)76DC7esMC3FfEP21e=J%f9>Up;d>4l&7WcZJp*DIk*ozzzzzzzzzzzzzzzzzzzzzzz!!#9dqY/lsQRl8pCtPt8mFiS/o[0Y;WL90Bj2R)5Y[N/Gfn0f!(^fES:Hsio97D?(Xn?5jq!"]KU?6X;&P"ZofW]f$p(q"VdAg3IP#\)UWg`=MO$9TD0>@3js(id(lnKeb'nVdJ7e;beRl:iu3cs8nIEn"+F\0s:]mE81*hAmoIah4b[;OfHn`%MZXAic<H:=+SS^nBEXOMIOI2D3Y<1k'jGjr"Mb7m#8djq#sB[J"W0CSVhD_EOgna,8=^]'+;Q/-Q29o7K`^M3`IrI+P7M:JgD-;8@g?QuaS(L!S'NU#&pVmQm&;+ooep7=EoT&nD(?bbsoCn2i[<[UJujmhj_M<u#m'ah(clITBmFV`P'C`bB@K_BHIO[n\%kN(^*?b\d]&LhJ.U*@&CM]/WZY.jbti9?:_k*Y>(J)7jH@XF(Q5('jYY[u"DM\[nu[VacM!sePdg%40X+-%@'f%P<0baG`E4oP$%Q/JgWmY\Um466eV$?Y)Q9)nh\c]Mq3f`',ShaFWth-oBcOd=q3cTY"+4B:C5D=,9ZUiS/hg5kVA4*K+[1fUQ5bbN_se_nC2++F"$]j*.j.s0:>;7?2OB:nigYB[?_a,Ulb<cmfe?*!8BYQ*mgYI_'G:#:9hc!%'nIu9\#K.OFt>Aq1i\Q1k(X_QRsW=DbSc%F'bgtZ/2>e+n:Kbn'oJ*sr;^;r/1Z+Vo?T!W4\7L>qdS!IH-WeB#r>ch5>%TRL&'caJA=^o?na4Y*tXgMf5H"N01jlPU;HhZ+Fp?gW#Ip5H[Y;u'ao8X`t6%]C0Gr*LD?+V"4C6Y<]^1GKRW1+$NV&M@2Zk6GD=d]J*qS.+7cN!h6O!e)cfIf(11iD*Y7*8G.!Fu"f5Q5o`Fk=$<gK"E?j,ZG(Q<S6H:\dHiQ8`a=>Zb+\X]m_AEjKB&FH%hV\FAFmKDpQr8j8Fc:%F6j4nD>GFYUI\FW`_flD4$L7>hqmCTh$Uf"^_*P&AuHA>O*GCsJP2O$EX=EQ9*O\8c#4&JZa4ARj7@85!.BB?cmQFmIWVr;9Tt>%neNS8ud$:Htthg-nk9OnPg[;$TI`>0jlL=-.a+_hJV!J^kl(gQd$+PUS\<me#jih7@a_WGZZ(.4K"nV+[.iR&jScFk0\%2K,Zhh0o%R)Nq/WF?l+\*Dd3krNhA#g[-\oC]W"'hnHd4_hLeQjHEBn&n644.4m-ZIetY!]Maa6"3/b.DR^j3BPL<$4Z,)s+'5XPm7A'P[OZn]2^N_0O[lEQSs)o18I6(3r!Qc+D!gf8aN/&O]Qq2:ob7jW;;79<m7q;t-1bA`T7?icP9s!jiRtHa:-1$`1XkWli8@t0Uu_-o6OtUq>"[U''if?a#-Wq5c7%ag\H8!"ALF'oU?-RQD7B?-GJ]">g6^'>CFf@5s8D[^<m$#K27ioe<`YMIH[l(oGML?\W`P:J[(9Uijd"PR!k=->Y$F-4CB"/,6\Z#s5Dlg2HM"F7VJkA+mbDo1>N-;k32'EW?6)(KY`B.a^\mY\PAK*gH+(7uU^F-po`(RMK06EPHHdD0;Bn\le,c[Y^V3lFa4(IC]mFtt!K@dP<o88m]iqJ+@)2E/k&hb!@XF*Gief89[^t6$$47L+p[?u]I+odK?/mT/=%`2+)fO@AogV9q*r!U1m1g?N]&Ti<e"o\RXmOcGj);^2<k\(R>\obnltl>ED4q/%Sh>o`U)Q4>YWf)aUl].\e=Ep??@2&sT>Dj(+.kQ]\905L.Bs_jQMg@#5Ad*SY8q<&o<LoLF#$U&]1eeYfg_3L=pCsBn2Zo8@8IZ+Xg>-8CKR8V]"=q/;I!IC.2@XQ-aimNpYWG+0>@4U4t=ghn%S,KY6
Ikm?-DX7<>iL_CI@3\nVAbo+bpOJCAW-_HQp]RX&=4gH(-^/Z@s5UCp8LSn\c))=o$*Q*<M^D^#4JMJtt=95Q%_uUo1-F7q-h)d`H"f/Jt%*3B9)\WKo2Erql0!qemE![bE'dH=+t)O0/c#?9IC9XEPfCe/A/_qsP1I:Gi93mAc2E?t738#p#IUW/S7Mm!5=<`aIfEM<Ys2>e&.Y0ZhJX-aq'tbOsIoYE.k=J%d;VB:aB<bI_rblEfBj^p1R=K*Hi'nV;J/+I,YT[]<3iSq0j"_fD5.GHQ9sRi?Hm3c3S-p!79pR,Q`;q!mC0XK\qU5$G/?oAYN4XKK9![O9M9;(JK$g@P<;orgiE)Wd/_e6&iR=?W_Xldsm><h,SP+L,3L,ntesb+_2WpLI=+=Q*1F"S(>qmjXiPmbHLOYe%@qT_T#OK#H+:rVJ+)kHmJLjHDqC"3ei1+I0_7b&fC4SN?G-:Hn<P`F4_m)X;X7>C^ac96r3OHOcmBeK9_BW&gqhjl7$/j4:&TqtBm]_@&#AP,Y']*Eo)'ONPAD?/7inL-[;Y?u405b]Ka3.k@s]eE(j,^\#rI[BQU.aF>#4B$F4/opmSM*F5.or8NVf4NVD[_hmc;1iLl9739e\*dBrn2.Z2*I#rOpSJRG>K>mOaX&`@A]5&)7CQehdOsNbC>B_D5cTCSXB*di92jW_WX!=EhTKNR%fVt.%Q<$j[iSM!0d6/k`CY(3)+XeI\r:.i,Kg1O$IGhnlT&gm[L0VDFcUFbqACJhEm'4S\ChfI]q:mQ"ZL[OBmJ_6*\'6h<$$-W"[^F9VSm/"@Ys!-Y/kBOeN9p]O$ui,lR;]Y'fWi?-gh&>)baIM5;iRQYFQq5Me#,uCc8P?h`0K;LQ;G)X')<<ZlIDq&LLPU^bo=&gcErEUfq:W`I+ft!GF'4,DQL*1IX_:-FmGd!pPJ9q(G?8P`t=nl4**0bl/9C1$Pk;?WMl,4lD^\UVMQ6bn$qBfs80*^[?ET$I!^-aH!XgKQFCSW7VR7m>c6G0oT/C6@E@rs_sQ]md5bQ1:uFPQQE5J6oF@\//u>D@rl6i0Db`9"CffLREne*h9ea#&lL)T6cQ+8d[]`uKp7-3LZ,`(u?(+fr>(mI*G/\k-YJ:uXP.0=t4*2mZ-eQ't.M\@&WZ\RXWp+0AS>cW33cqTe`:fXp?:r\D9j`52V-"$nNukEh^\mZGUTX8#HKq+^Sc;g;39(Dp^!D$\>%6A^eKD`,3r`Ehh.Y<QB$Hd6Dn\4V,K$O2eQ#]H'IHuY<'.PWh7M8sFIp`W<?e^(pf-sk`:cWX(0Rr5S=GEL-Ru1K?[mL]^3qom@^04`'ab3DaOa@KMi0rX@XE^O>K:6c7M&12fk$N'7q-hiZ)75?UbZH"N)8kd3WGbMX"P/.K^RR%.rqaT@h#u_AEs1)jILMOBks:._,[m*eQ/HMh/2T8\^k5p-Gbk4:UO]EUnspPj)`O0k>MR,M8scu:M"<+[OYCfCtV].VbWfJfu8q0hWVoO!s]=g_kY<KG3'BXc*o(K]QH6CDr&"TSejAI&r>p4`r`?FMo`BP7H7X"Km<=Xfhj^&%sjRIEf&?W*]uFIg@FfT]->7T*G\=GA%PJ<HdnOVSmLf.+KHmSZ$jZQ*BecC<(3<9grri,I2*)b1:SDnGU+d]Xg6MY8$T(:.4?Uk`u[BiGe0Oe2f<H[Ue1=Kh4riO*S$)8!@o+s?;W"![]<f%`Y5^Zc;'ok\LTFOfW\3I*DUf6[:**:Q92N&d_']\[d1Hqldmd(IV"Q.V:G^=4P&F/.]W6G(>cB1O#e,SV5<H8f\>>$gU<+7PBoDYDti\Up=RQ$_FGu!./Yu`]tGE[]1Xflr3@X<>TH,QPDG[k\(f5EN#4:dBj9Dtm'hE@kFId$O4XJRs#<fi\gX(P(\HEsYBB9>>%h8.(c#WXs*h$)D\#t'aEg:?XOs\SA`#9^5CU9:p<JsU>L#D+>hdhTPki]s+6c*j=3>f=V>+D"=D5gHfUbY*f!
X/5kZq(aU0s.TSSbiDpX_$Rm57L'='`/+(t;Rbo#i[rD<hl-MMd9XiU7<R]U8H\\)5nGm!GTqIWJoT^k&3K2m'gnqWfVra4/mgQ`:tY5PX.=H\ipm,paod-T=!9ie-6^rsOM%b,=gWO8LhMekGg^s513Ue4#ZT>@pYrm#1Im](Qt<AVg4pp1hYAJ<c+q=&_b]DnkJ,HRr<'>$A+9]t/CS)@C[llo,Jtei=]'$rSMO^!toPHeWb/:-J8LrU&LWJ)\^W(Lh_BC(Sq=]a:sW(9CiU>+L>jbY2:SMO\P;Zl(Q*_#4$"rP+Wa'D/kYlP9isqjdB"Z4^4CMs\(rfP=ldGLb]=a4-[40"Pb'F3QT9K-?3m2,`JlHL%\1Ij*8m=o$^)IJ``GG>8o)=:i+tB%*VOA&aJ4rTY`4<mp,AAS#lUS&fu(ONL&D/#q[E-aS3rEpZinTItFX7`LZA;mpPt<aK*M4`L.JdGj.pL%3[B<0^9=Js@ifcC6aG'JXr;^#lG4Z!G16L%!Kgc\r_t,%&S@[KH;Sbh`b-,X:&f!<?*P^juT/EcNB$W&AtP0_oZ%$360Lace*-_EY$)IJ\1l;I3ZnIJS%;Cu2i#mbPJcDt*f-eZa,X:4"Ls6%]AE=]p#qGtjbd%>DQu\n93U_d,;'5W.rd^OPtDft"Z(lDUVVUibkLA^[AGcHdpAzzzzzzzzzzzzzzzzzz!.Z!\J#>u+ci~>endstream
endobj
4 0 obj
<<
/Contents 8 0 R /MediaBox [ 0 0 612 792 ] /Parent 7 0 R /Resources <<
/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject <<
/FormXob.41b05a9cf8679f0fe6e7c30c9462b767 3 0 R
>>
>> /Rotate 0 /Trans <<

>>
/Type /Page
>>
endobj
5 0 obj
<<
/PageMode /UseNone /Pages 7 0 R /Type /Catalog
>>
endobj
6 0 obj
<<
/Author (anonymous) /CreationDate (D:20260126172022+01'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20260126172022+01'00') /Producer (ReportLab PDF Library - www.reportlab.com)
/Subject (unspecified) /Title (untitled) /Trapped /False
>>
endobj
7 0 obj
<<
/Count 1 /Kids [ 4 0 R ] /Type /Pages
>>
endobj
8 0 obj
<<
/Filter [ /ASCII85Decode /FlateDecode ] /Length 250
>>
stream
Gas2BZ&Z[T&4Ckp`KUTrY_02PMb#<CFN=Wfj',kM@19sp55uUe"pptDD)Los"F*-#r%7t"K39EA8f/'^$OO.*D:jQe'n<f:3Cq8'p9Rm8qll,u+[sQj[W6hrFQL%\7G?"sX/%4LXYeUkIBuT`A)Y3?=ouE3GIShId3E("2qqVte.E2,r_bJ%q1G(F,@9C<XiC-L`O1W5it(MP9X]^nj..r=,_#ecrj!ceT&ATWd4)p.7/d!C@/gP%;p#~>endstream
endobj
xref
0 9
0000000000 65535 f 
0000000073 00000 n 
0000000104 00000 n 
0000000211 00000 n 
0000005122 00000 n 
0000005378 00000 n 
0000005446 00000 n 
0000005742 00000 n 
0000005801 00000 n 
trailer
<<
/ID
[<38bd217c814ddf937f148e537dce51f8><38bd217c814ddf937f148e537dce51f8>]
% ReportLab generated PDF document -- digest (http://www.reportlab.com)

/Info 6 0 R
/Root 5 0 R
/Size 9
>>
startxref
6141
%%EOF
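
Both streams in this fixture declare `/Filter [ /ASCII85Decode /FlateDecode ]`, so their payloads are ASCII85 text wrapping zlib-compressed bytes. For anyone who wants to inspect such a stream by hand, the following is a minimal sketch using only the Python standard library. The helper names `decode_pdf_stream` and `encode_pdf_stream` are illustrative assumptions, not part of this repository, and the sketch assumes the stream text has already been cut out between `stream` and `endstream`.

```python
import base64
import zlib


def decode_pdf_stream(raw: bytes) -> bytes:
    """Apply the filters in the order the /Filter array lists them."""
    # PDF ASCII85 data ends with "~>" and normally has no leading "<~";
    # base64.a85decode(adobe=True) requires the trailer and tolerates
    # a missing leader.
    compressed = base64.a85decode(raw.strip(), adobe=True)
    # /FlateDecode is plain zlib/deflate.
    return zlib.decompress(compressed)


def encode_pdf_stream(data: bytes) -> bytes:
    """Inverse of decode_pdf_stream, used here only for a round trip."""
    return base64.a85encode(zlib.compress(data)) + b"~>"


if __name__ == "__main__":
    # Illustrative content-stream bytes, not taken from the fixture.
    sample = b"BT /F1 12 Tf (Hello) Tj ET"
    encoded = encode_pdf_stream(sample)
    assert decode_pdf_stream(encoded) == sample
    print(encoded)
```

The round trip only demonstrates the filter order. Whether the stream text as rendered on this page still decodes cleanly depends on how faithfully the viewer preserved the original bytes.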