
I'm dealing with a huge text dataset for content classification. I've implemented the DistilBERT model with the tokenizer from `DistilBertTokenizer.from_pretrained()`. This tokenizer is taking incredibly long to tokenize my text data: roughly 7 minutes for just 14k records, because it runs on my CPU.

Is there any way to force the tokenizer to run on my GPU?

This seems to be a duplicate of this question. – justanyphil

1 Answer


Tokenization is string manipulation: essentially a loop over a string with a bunch of if-else conditions and dictionary lookups. There is no way to speed this up using a GPU. A GPU can basically only do tensor multiplication and addition, so only problems that can be formulated as tensor operations can be accelerated on one.
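To illustrate why this is inherently CPU-bound work, here is a toy sketch of greedy longest-match-first subword splitting in the style of WordPiece (not the actual Huggingface implementation; the tiny vocabulary is made up for the example):

```python
# Hypothetical toy vocabulary; "##" marks word-internal subword pieces.
vocab = {"token": 1, "##ization": 2, "##ize": 3, "to": 4, "##ken": 5}

def wordpiece(word):
    """Greedy longest-match-first split, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocab.
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no subword matched at all
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("tokenization"))  # → ['token', '##ization']
```

Every step is a substring slice and a dictionary lookup — branchy, sequential work with no tensor structure a GPU could exploit.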

The default tokenizers in Huggingface Transformers are implemented in Python. There is a faster version implemented in Rust: you can get it either from the standalone Huggingface Tokenizers package, or, in newer versions of Transformers, as DistilBertTokenizerFast.