
Good day,

I have a function that should lowercase and tokenize text and return the tokens. Here is the function:

import nltk

def preprocess_text(text):
    """A function to lowercase and tokenize text data."""
    # Lowercase the text
    lower_text = text.lower()

    # Tokenize the text into a list of words
    # (word_tokenize needs the NLTK 'punkt' tokenizer data to be downloaded)
    tokens = nltk.tokenize.word_tokenize(lower_text)

    return tokens
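For example, calling the function on a single string (a made-up sample here, just to illustrate) should return a list of tokens:

# Illustrative only: a hypothetical sample string
sample = "PFE bulls have reasons on"
print(preprocess_text(sample))
# ['pfe', 'bulls', 'have', 'reasons', 'on']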

I then want to apply the function to my actual text data, called data, which is a list of strings. I want to iterate over each string in data and apply the function to lowercase and tokenize it.

Finally, I want to append the tokenized words to a final list called tokenized_final, which should contain all the tokenized words. Here is the next bit of code:

# Final list with tokenized words
tokenized_final = []

# Iterating over each string in data
for x in data:
    # Calling the preprocess_text function
    token = preprocess_text(x)

    tokenized_final.append(token)  

However, when I do all this and print the list tokenized_final, it outputs a big list containing lists within it.

print(tokenized_final)

Output:
[['pfe', 'bulls', 'have', 'reasons', 'on'],
 ['to', 'pay', 'more', 'attention'],
 ['there', 'is', 'still']]

My desired output for tokenized_final is a single flat list like this:

['pfe', 'bulls', 'have', 'reasons', 'on', 'to', 'pay', 'more', 'attention', 'there', 'is', 'still']

Is there any way to fix the preprocess function, or the way I apply it to the data, to get the desired output? Help would truly be appreciated. Thanks in advance.


1 Answer


You just need to flatten the resulting list:

# Final list with tokenized words
tokenized_final = []

# Iterating over each string in data
for x in data:
    # Calling the preprocess_text function
    token = preprocess_text(x)

    tokenized_final.append(token) 

# Flatten the list of lists into a single list of tokens
flattened_tokenized_final = [token for sublist in tokenized_final for token in sublist]
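Alternatively, you could avoid building the nested list in the first place. Here is a minimal sketch (same data and preprocess_text as above) that uses list.extend() instead of append(), so the tokens are added item by item:

# Final flat list with tokenized words
tokenized_final = []

for x in data:
    # extend() adds each token individually instead of appending the whole sublist
    tokenized_final.extend(preprocess_text(x))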