detection of vertical texts (container BIC codes) with tesseract OCR fails

Question

I'm trying to use Tesseract Open Source OCR Engine for a text detection of intermodal (shipping) containers codes in BIC format. BTW, I'm using tesseract through pytesseract and I preprocess input photos with few standard opencv filtering (huge rescaling/blurring/denoising/binarization).

I tuned tesseract (version: tesseract 5.0.0-alpha-647-g4a00) in this way:

  config = (
    # only a set of characters
    ' -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' +

    # no language model
    ' -c load_system_dawg=0' +
    ' -c load_freq_dawg=0' +

    ' -c enable_new_segsearch=1' +

    ' -c language_model_penalty_non_freq_dict_word=1' +
    ' -c language_model_penalty_non_dict_word=1' +

    # select segmentation mode
    ' --psm 11'
  )

I got hopeful results when codes are horizontal-aligned, as in this case:

but I have issues with vertical texts, as in the example show in the photo:

In this last case tesseract doesn't produce an useful result. Why tesseract fails event if input image seems "good"? Any suggestion about how to improve engine recognition?

K41F4r K41F4r · Accepted Answer · 2020-05-22T16:01:07

Try running with --psm 5 :

pagesegmode values are:
  5 = Assume a single uniform block of vertically aligned text.

detection of vertical texts (container BIC codes) with tesseract OCR fails

1 Answers