Good day,
I have a function that should lowercase and tokenize text and return the tokens. Here is the function:
import nltk

def preprocess_text(text):
    """Lowercase and tokenize text data, returning a list of tokens."""
    # Lowercase the text
    lower_text = text.lower()
    # Tokenize the lowercased text into a list of words
    tokens = nltk.tokenize.word_tokenize(lower_text)
    return tokens
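For reference, here is how I call it on a single, made-up sample string (this assumes the NLTK tokenizer models are already downloaded, e.g. via nltk.download('punkt')):

sample = "PFE bulls have reasons"
print(preprocess_text(sample))
# ['pfe', 'bulls', 'have', 'reasons']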
I then want to apply this function to my actual text data, data, which is a list of strings. I want to iterate over each string in data, lowercase and tokenize it with the function, and append the resulting tokens to a final list called tokenized_final, which should end up containing all of the tokenized words. Here is the next bit of code:
# Final list of tokenized words
tokenized_final = []

# Iterate over each string in data
for x in data:
    # Call the preprocess_text function on the string
    tokens = preprocess_text(x)
    tokenized_final.append(tokens)
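For context, data looks something like this (a shortened, made-up sample that matches the output below):

data = ['PFE bulls have reasons on', 'to pay more attention', 'there is still']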
However, when I do all this and print tokenized_final, it outputs one big list with lists nested inside it:

print(tokenized_final)

Output:
[['pfe', 'bulls', 'have', 'reasons', 'on'],
['to', 'pay', 'more', 'attention'],
['there', 'is', 'still']]
My desired output for tokenized_final is a single flat list like this:

['pfe', 'bulls', 'have', 'reasons', 'on', 'to', 'pay', 'more', 'attention', 'there', 'is', 'still']
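One idea I have come across is to use list.extend instead of append (or to flatten the nested list afterwards with itertools.chain.from_iterable), roughly as in the sketch below, but I am not sure whether this is the right fix:

# Possible fix (sketch): extend adds the individual tokens,
# not the whole token list, so tokenized_final stays flat
tokenized_final = []
for x in data:
    tokenized_final.extend(preprocess_text(x))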
Is there a way to rectify the preprocess function, or to change how I apply it to the data, to get the desired output? Any help would truly be appreciated. Thanks in advance.