2 votes

I'm trying to run the language model fine-tuning script (run_language_modeling.py) from the Hugging Face examples with my own tokenizer (I just added several tokens; see the comments in the code below). I have a problem loading the tokenizer, and I think the problem is with AutoTokenizer.from_pretrained('local/path/to/directory').

Code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# special_tokens = ['<HASHTAG>', '<URL>', '<AT_USER>', '<EMOTICON-HAPPY>', '<EMOTICON-SAD>']
# tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained('../twitter/twittertokenizer/')
tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')  # this line fails

Error Message:

OSError                                   Traceback (most recent call last)
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
    248                 resume_download=resume_download,
--> 249                 local_files_only=local_files_only,
    250             )

/z/huggingface_venv/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    265         # File, but it doesn't exist.
--> 266         raise EnvironmentError("file {} not found".format(url_or_filename))
    267     else:

OSError: file ../twitter/twittertokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-32-662067cb1297> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')

/z/huggingface_venv/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    190         config = kwargs.pop("config", None)
    191         if not isinstance(config, PretrainedConfig):
--> 192             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    193 
    194         if "bert-base-japanese" in pretrained_model_name_or_path:

/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    192         """
    193         config_dict, _ = PretrainedConfig.get_config_dict(
--> 194             pretrained_model_name_or_path, pretrained_config_archive_map=ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, **kwargs
    195         )
    196 

/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
    270                     )
    271                 )
--> 272             raise EnvironmentError(msg)
    273 
    274         except json.JSONDecodeError:

OSError: Can't load '../twitter/twittertokenizer/'. Make sure that:

- '../twitter/twittertokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'

- or '../twitter/twittertokenizer/' is the correct path to a directory containing a 'config.json' file

If I change AutoTokenizer to BertTokenizer, the code above works. I can also run the script without any problem if I load the tokenizer by shortcut name instead of by path. But the script run_language_modeling.py uses AutoTokenizer, and I'm looking for a way to get it running.

Any idea? Thanks!


2 Answers

2 votes

The problem is that you are giving AutoTokenizer nothing that indicates which tokenizer class to instantiate.

For reference, see the rules defined in the Hugging Face docs. Specifically, since you are using BERT:

contains bert: BertTokenizer (Bert model)

Otherwise, you have to specify the exact type yourself, as you mentioned.
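
Since your directory path contains nothing those rules can match on, the straightforward fix is to name the class yourself. A minimal sketch, using the path from your question:

from transformers import BertTokenizer

# BertTokenizer knows its own file format, so it only needs the files
# written by save_pretrained() (vocab.txt, special_tokens_map.json, ...);
# no config.json is required.
tokenizer = BertTokenizer.from_pretrained('../twitter/twittertokenizer/')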

1 vote

AutoTokenizer.from_pretrained fails if the specified path does not contain the model configuration file (config.json), which it needs solely to determine which tokenizer class to instantiate.
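
To see why the config is needed at all, this is roughly what AutoTokenizer.from_pretrained does first, as the traceback above shows (a simplified sketch, not the actual library code):

from transformers import AutoConfig, BertTokenizer

# AutoTokenizer first loads the config to learn the model type --
# this is the step that fails when config.json is missing...
config = AutoConfig.from_pretrained('../twitter/twittertokenizer/')
# ...and only then dispatches to the matching tokenizer class:
if config.model_type == 'bert':
    tokenizer = BertTokenizer.from_pretrained('../twitter/twittertokenizer/')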

In the context of run_language_modeling.py the usage of AutoTokenizer is buggy (or at least leaky).
There is no point in specifying the (optional) tokenizer_name parameter if it's identical to the model name or path, so to my understanding it is supposed to support exactly the case of a modified tokenizer. I also found this issue very confusing.

The best workaround that I have found is to add a config.json to the tokenizer directory with only the "missing" configuration:

{
  "model_type": "bert"
}
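
With that file in place, AutoConfig can resolve the model type, and AutoTokenizer instantiates the right class. A minimal sketch of the whole round trip, reusing the paths from the question:

import json
from transformers import AutoTokenizer

# Write the minimal config next to the saved tokenizer files.
with open('../twitter/twittertokenizer/config.json', 'w') as f:
    json.dump({'model_type': 'bert'}, f)

# AutoTokenizer now reads model_type from config.json
# and returns a BertTokenizer.
tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')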