I'm trying to run language model finetuning script (run_language_modeling.py) from huggingface examples with my own tokenizer(just added in several tokens, see the comments). I have problem loading the tokenizer. I think the problem is with AutoTokenizer.from_pretrained('local/path/to/directory').
Code:
from transformers import *
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# special_tokens = ['<HASHTAG>', '<URL>', '<AT_USER>', '<EMOTICON-HAPPY>', '<EMOTICON-SAD>']
# tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained('../twitter/twittertokenizer/')
tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
Error Message:
OSError Traceback (most recent call last)
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
248 resume_download=resume_download,
--> 249 local_files_only=local_files_only,
250 )
/z/huggingface_venv/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
265 # File, but it doesn't exist.
--> 266 raise EnvironmentError("file {} not found".format(url_or_filename))
267 else:
OSError: file ../twitter/twittertokenizer/config.json not found
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
<ipython-input-32-662067cb1297> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
/z/huggingface_venv/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
190 config = kwargs.pop("config", None)
191 if not isinstance(config, PretrainedConfig):
--> 192 config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
193
194 if "bert-base-japanese" in pretrained_model_name_or_path:
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
192 """
193 config_dict, _ = PretrainedConfig.get_config_dict(
--> 194 pretrained_model_name_or_path, pretrained_config_archive_map=ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, **kwargs
195 )
196
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
270 )
271 )
--> 272 raise EnvironmentError(msg)
273
274 except json.JSONDecodeError:
OSError: Can't load '../twitter/twittertokenizer/'. Make sure that:
- '../twitter/twittertokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or '../twitter/twittertokenizer/' is the correct path to a directory containing a 'config.json' file
If I change AutoTokenizer
to BertTokenizer
, the code above can work. Also I can run the script without any problem is I load by shortcut name instead of path. But in the script run_language_modeling.py it uses AutoTokenizer
. I'm looking for a way to get it running.
Any idea? Thanks!