Checkpoint error while training with the TensorFlow Object Detection API

Question

I have a problem with the training step in TensorFlow.

Specs:

tensorflow-gpu= 2.2.0

python= 3.7.9

cuda= 10.1

cudnn= 7.6.- (I don't remember exactly, but it works with this CUDA version).

models= ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8 and efficientdet_d7_coco17_tpu-32

reference: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html

When I start training, I get this error:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\absl\app.py", line 300, in run
    _run_main(main, args)
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "C:\TensorFlow\models\research\object_detection\model_lib_v2.py", line 578, in train_loop
    ckpt, manager_dir, max_to_keep=checkpoint_max_to_keep)
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 635, in __init__
    recovered_state = get_checkpoint_state(directory)
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 279, in get_checkpoint_state
    coord_checkpoint_filename)
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 320, in read_file_to_string
    return f.read()
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 116, in read
    self._preread_check()
  File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 79, in _preread_check
    self.__name, 1024 * 512)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 122: invalid start byte

My start-training command (I checked that these paths are correct):

python model_main_tf2.py --logtostderr --model_dir=pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8 --pipeline_config_path=pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8/pipeline.config

The checkpoint path and the pipeline path are both correct. I tried training with two different models and got the same error both times. How can I solve this problem?

I researched UTF-8 errors but couldn't find a solution. Thank you for helping. :))

Have you checked to make sure none of your annotation class names or image files have a Unicode character in them? It looks like 0xfe is "þ". - Brad Dwyer
I checked, but there are no Unicode characters. I think this error is about the model's checkpoint and the checkpoint path. But I checked those too, and there is no mistake. - nrllah0742
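
The check suggested in the comment above can be automated. This is only a minimal sketch: `find_non_ascii` is a hypothetical helper name, and the labels in the demo are made up for illustration. It also decodes the byte from the traceback as Latin-1 to show where the "þ" comes from.

```python
def find_non_ascii(names):
    """Return the names that contain at least one non-ASCII character."""
    # str.isascii() is available from Python 3.7 onwards.
    return [name for name in names if not name.isascii()]


if __name__ == "__main__":
    # Made-up class names, for illustration only.
    labels = ["car", "person", "ağaç"]
    print(find_non_ascii(labels))  # ['ağaç']

    # The offending byte from the traceback, interpreted as Latin-1.
    print(bytes([0xFE]).decode("latin-1"))  # þ
```

The same function can be fed file names, for example the result of `os.listdir()` on an image folder. An empty result, as the asker reports, suggests the cause lies elsewhere.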

Suxing Lyu · Accepted Answer · 2020-09-24T21:08:28

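I ran into the same issue and struggled with it overnight. I have the same environment as yours, and I tried a mask_rcnn model. In a nutshell, it can be fixed by changing the output_dir (the directory passed as --model_dir) so that it is no longer the pre-trained model folder itself.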

Before:

----mask_rcnn_model (output_dir)
--------checkpoint (folder)
--------saved_model (folder)
--------my_pipeline.config

After:

----output_dir
----mask_rcnn_model
--------checkpoint (folder)
--------saved_model (folder)
--------my_pipeline.config

This worked in my situation. Hint from: https://github.com/tensorflow/models/issues/8892
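
A likely reason the move helps, going by the traceback and not confirmed in this thread: the checkpoint manager reads a state file named `checkpoint` directly inside the output directory, and the pre-trained model folder already contains a folder with that name. The sketch below uses only the standard library to flag that collision before training starts. `find_checkpoint_collision` is a hypothetical helper, not part of the Object Detection API, and the demo runs on a throwaway directory.

```python
import os
import tempfile


def find_checkpoint_collision(model_dir):
    """Return the path of a 'checkpoint' directory inside model_dir, or None.

    Assumption: TensorFlow expects 'checkpoint' in the output directory to be
    a small text file, so a directory with that name is a likely problem.
    """
    candidate = os.path.join(model_dir, "checkpoint")
    return candidate if os.path.isdir(candidate) else None


if __name__ == "__main__":
    # Demo on a temporary directory that mimics the layouts shown above.
    with tempfile.TemporaryDirectory() as tmp:
        pretrained = os.path.join(tmp, "mask_rcnn_model")
        os.makedirs(os.path.join(pretrained, "checkpoint"))
        output_dir = os.path.join(tmp, "output_dir")
        os.makedirs(output_dir)

        print(find_checkpoint_collision(pretrained) is not None)  # True
        print(find_checkpoint_collision(output_dir) is None)  # True
```

If the helper reports a collision for the folder passed as --model_dir, pointing --model_dir at a separate folder, while the pipeline config's fine_tune_checkpoint still refers to the pre-trained checkpoint, matches the "After" layout.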
