fix: resolve TESSDATA_PREFIX path correctly for all Tesseract versions #2257
DhanushVarma-2 wants to merge 2 commits into CCExtractor:master
The format_rust CI failures are pre-existing on master.
Two bugs in init_ocr() in ocr.c:

1. The Tesseract 4/5 branch always blindly appended '/tessdata' to the path returned by probe_tessdata_location(). If TESSDATA_PREFIX was already set to a path ending in 'tessdata/', this caused a double-append, e.g. '/usr/share/tessdata/tessdata'.

2. The legacy Tesseract <4 branch passed tessdata_path raw to TessBaseAPIInit4 without appending 'tessdata' at all, causing Tesseract to look for eng.traineddata directly in e.g. '/usr/share/' instead of '/usr/share/tessdata/'.

Fix: normalize the path once before both branches. Detect whether the returned path already ends with 'tessdata' or 'tessdata/', handle Windows backslash separators, and use the resolved path in both Tesseract version branches. Add mprint diagnostic for the resolved path.

Fixes CCExtractor#1492
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit c8932da...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit c8932da...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
@cfsmp3 can you review it?
There were two bugs in init_ocr():
1. The Tesseract 4/5 branch always appended /tessdata to the probed path. If TESSDATA_PREFIX was already set to point at the tessdata dir itself, this doubled it: /usr/share/tessdata/tessdata.
2. The legacy Tesseract <4 branch passed the raw probed path to TessBaseAPIInit4 with no /tessdata appended at all, so Tesseract looked for /usr/share/eng.traineddata instead of /usr/share/tessdata/eng.traineddata. This is the exact error in #1492.
Fix: build the tessdata path once before both branches: check whether the path already ends with tessdata, and append it otherwise. Windows backslash separators are handled too. Both branches now use the same resolved path. Added an mprint line showing the resolved path to make future debugging easier.
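
For readers following along without the diff, the normalization described above could look roughly like the sketch below. This is not the code from this PR: `resolve_tessdata_dir` is a hypothetical helper name, and the choice to reuse a backslash separator whenever the probed path contains one is an assumption made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the PR's actual code): writes into `out` a path
 * whose last component is exactly one "tessdata", with no trailing separator.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
static int resolve_tessdata_dir(const char *probed, char *out, size_t out_size)
{
    static const char name[] = "tessdata";
    const size_t name_len = sizeof(name) - 1;
    size_t len;
    int n;

    assert(probed != NULL && out != NULL && out_size > 0);

    /* Drop trailing '/' or '\\', but keep a lone root separator. */
    len = strlen(probed);
    while (len > 1 && (probed[len - 1] == '/' || probed[len - 1] == '\\'))
        len--;

    /* Already resolved if the last path component is exactly "tessdata"
     * (so "/opt/mytessdata" does not count). */
    if (len >= name_len &&
        strncmp(probed + len - name_len, name, name_len) == 0 &&
        (len == name_len ||
         probed[len - name_len - 1] == '/' ||
         probed[len - name_len - 1] == '\\'))
    {
        n = snprintf(out, out_size, "%.*s", (int)len, probed);
    }
    else
    {
        /* Assumption: reuse '\\' if the probed path already uses it. */
        const char *sep = strchr(probed, '\\') != NULL ? "\\" : "/";
        /* No extra separator after an empty path or a lone root. */
        if (len == 0 || probed[len - 1] == '/' || probed[len - 1] == '\\')
            sep = "";
        n = snprintf(out, out_size, "%.*s%s%s", (int)len, probed, sep, name);
    }

    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With a helper like this, init_ocr() would resolve the path once into a local buffer and hand that same buffer to whichever Tesseract branch is compiled in, so neither branch appends anything itself.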