Python Pandas NLTK Tokenize Column in Pandas Dataframe: expected string or bytes-like object

Question

I have the following sample data frame with a 'problem_definition' column:

ID  problem_definition  
1   cat, dog fish
2   turtle; cat; fish fish
3   hello book fish 
4   dog hello fish cat

I want to word tokenize the 'problem_definition' column.

Below is my code:

from nltk.tokenize import sent_tokenize, word_tokenize 
import pandas as pd 

df = pd.read_csv('log_page_nlp_subset.csv')

df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)

The code above gives me the following error:

TypeError: expected string or bytes-like object

sync11 sync11 · Accepted Answer · 2018-11-28T20:09:04

Use lambda inside apply:

df = pd.DataFrame({'TEXT':['cat, dog fish', 'turtle; cat; fish fish', 'hello book fish', 'dog hello fish cat']})
df

    TEXT
0   cat, dog fish
1   turtle; cat; fish fish
2   hello book fish
3   dog hello fish cat

df.TEXT.apply(lambda x: word_tokenize(x))

0                [cat, ,, dog, fish]
1    [turtle, ;, cat, ;, fish, fish]
2                [hello, book, fish]
3            [dog, hello, fish, cat]
Name: TEXT, dtype: object

If you also need to escape from punctuation then use:

df.TEXT.apply(lambda x: RegexpTokenizer(r'\w+').tokenize(x))
0             [cat, dog, fish]
1    [turtle, cat, fish, fish]
2          [hello, book, fish]
3      [dog, hello, fish, cat]
Name: TEXT, dtype: object

Python Pandas NLTK Tokenize Column in Pandas Dataframe: expected string or bytes-like object

2 Answers