It seems that Foundry does not run the environment activation script
https://github.com/conda-forge/tesseract-feedstock/blob/main/recipe/activate.sh
which would otherwise set the TESSDATA_PREFIX environment variable automatically.
However, we can infer the value manually and pass it to the pytesseract API calls.
Define the following helper function:
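import os
from pathlib import Path


def _get_tessdata_directory_path():
    # Locate the root of the active environment.
    if 'PYSPARK_PYTHON' in os.environ:
        # The interpreter is expected two levels below the environment root
        # (e.g. <envroot>/bin/python), so go up twice.
        pyspark_python = Path(os.environ['PYSPARK_PYTHON'])
        env_root = pyspark_python.parent.parent
    elif 'CONDA_PREFIX' in os.environ:
        env_root = Path(os.environ['CONDA_PREFIX'])
    else:
        raise ValueError('Neither PYSPARK_PYTHON nor CONDA_PREFIX is set.')
    share_dir = env_root / 'share' / 'tessdata'
    # Raise instead of assert so the check still runs under `python -O`.
    if not share_dir.exists():
        raise FileNotFoundError(f'tessdata directory does not exist: {share_dir}')
    return str(share_dir)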
Then use it as shown in the following snippet:
tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
pytesseract.image_to_string(image, ..., config=tessdata_dir_config)
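To check the path logic outside of Foundry, here is a minimal sketch that takes the environment mapping as a parameter instead of reading os.environ directly. The names resolve_tessdata_dir and build_tessdata_config and the /opt/env path are hypothetical and only for illustration. The sketch assumes, as the helper above does, that PYSPARK_PYTHON sits two levels below the environment root, and it skips the existence check so it can run anywhere.

```python
from pathlib import Path


def resolve_tessdata_dir(environ):
    """Return <envroot>/share/tessdata for the given environment mapping.

    Hypothetical variant of the helper above; it does not touch the
    filesystem, so it can be exercised with made-up paths.
    """
    if "PYSPARK_PYTHON" in environ:
        # Assumption: the interpreter lives at <envroot>/bin/python.
        env_root = Path(environ["PYSPARK_PYTHON"]).parent.parent
    elif "CONDA_PREFIX" in environ:
        env_root = Path(environ["CONDA_PREFIX"])
    else:
        raise ValueError("Neither PYSPARK_PYTHON nor CONDA_PREFIX is set.")
    return env_root / "share" / "tessdata"


def build_tessdata_config(tessdata_dir):
    """Build the config string, quoting the path in case it contains spaces."""
    return f'--tessdata-dir "{tessdata_dir}"'


# Both variables lead to the same tessdata directory for the same environment.
from_pyspark = resolve_tessdata_dir({"PYSPARK_PYTHON": "/opt/env/bin/python"})
from_conda = resolve_tessdata_dir({"CONDA_PREFIX": "/opt/env"})
print(from_pyspark == from_conda)
print(build_tessdata_config(from_conda))
```

Passing os.environ as the argument gives the same lookup as the original helper, minus the check that the directory exists.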