
I'm dealing with a huge text dataset for content classification. I've implemented the DistilBERT model with the tokenizer from `DistilBertTokenizer.from_pretrained()`. This tokenizer is taking incredibly long to tokenize my text data: roughly 7 minutes for just 14k records, because it runs on my CPU.

Is there any way to force the tokenizer to run on my GPU?

This seems to be a duplicate of this question. – justanyphil

1 Answer


Tokenization is string manipulation: essentially a loop over a string with a bunch of if-else conditions and dictionary lookups. There is no way to speed this up using a GPU. A GPU can basically only do tensor multiplication and addition, so only problems that can be formulated as tensor operations can be accelerated on one.
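To illustrate why this is inherently CPU-bound work, here is a toy sketch of greedy longest-match-first subword splitting in the style of WordPiece (not the actual Huggingface implementation; the tiny vocabulary is made up for the example):

```python
# Hypothetical toy vocabulary; "##" marks word-internal subword pieces.
vocab = {"token": 1, "##ization": 2, "##ize": 3, "to": 4, "##ken": 5}

def wordpiece(word):
    """Greedy longest-match-first split, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocab.
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no subword matched at all
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("tokenization"))  # → ['token', '##ization']
```

Every step is a substring slice and a dictionary lookup — branchy, sequential work with no tensor structure a GPU could exploit.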

The default tokenizers in Huggingface Transformers are implemented in Python. There is a faster version implemented in Rust: you can get it either from the standalone Huggingface Tokenizers package, or, in newer versions of Transformers, as DistilBertTokenizerFast.