0
votes

I have the following dataset

                                   Text
country     file                          
US          file_US                The Dish: Lidia Bastianich shares Italian recipes ... - CBS News
            file_US                Blog - Tasty Yummies
            file_US                Acne Alternative Remedies: Manuka Honey, Tea Tree Oil ...
            file_US                Looking back at 10 years of Downtown Arts | Times Leader 

IT          filename_IT            Tornando indietro a ...
            filename_IT            Questo locale è molto consigliato per le famiglie
                                                                            ...                                 
            filename_IT            Ci si chiede dove poter andare a mangiare una pizza  Melanzana Capriccia ...
            filename_IT            Ideale per chi ama mangiare vegano
              

with country and file as index levels. I want to apply a function that removes stopwords based on the value of the country index:

def removing(sent):
    
    if df.loc['US','UK']:
        stop_words = stopwords.words('english')
    if df.loc['ES']:
        stop_words = stopwords.words('spanish')    
    
# (and so on)
                      
    c_text = []

    for i in sent.lower().split():
        if i not in stop_words:
            c_text.append(i)

    return(' '.join(c_text))

df['New_Column'] = df['Text'].astype(str)
df['New_Column'] = df['New_Column'].apply(removing)

Unfortunately I am getting this error:

----> 6     if df.loc['US']:
      7         stop_words = stopwords.words('english')
      8     if df.loc['ES']:

/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1477     def __nonzero__(self):
   1478         raise ValueError(
-> 1479             f"The truth value of a {type(self).__name__} is ambiguous. "
   1480             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1481         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

and I still don't understand how to fix it. Can you please tell me how I can run the code without getting the error?

3
Please provide a minimal reproducible example. - AMC
Some people just downvoted my answer without leaving a single word, so I will remove it. I hope you get the idea: don't use a for loop when you have pandas and NumPy. - BENY
@still_learning I know, no problem. I hope you already got the np.where method. - BENY
@still_learning First, that is not his answer; second, your problem is different from the one he linked. - BENY

3 Answers

2
votes
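# Assuming you have imported all the required libraries (e.g. from nltk.corpus import stopwords)
# Make a dictionary mapping country code to language
lang = {'UK': 'english', 'US': 'english', 'ES': 'spanish'}
# Assuming your DataFrame is df, with the country code as a single-level, unique index
for index, row in df.iterrows():
    stop_words = stopwords.words(lang[index])  # look the list up once per row, not once per word
    df.loc[index, 'Text'] = ' '.join([word for word in str(row['Text']).split(' ') if word not in stop_words])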

Updated answer:

import pandas as pd
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the corpus is not installed yet
ind = pd.MultiIndex.from_arrays([['ind', 'ind', 'ind', 'ind', 'aus', 'aus', 'aus', 'aus'],
                                 ['1', '2', '3', '4', '5', '6', '7', '8']])
df = pd.DataFrame(['he is boy'] * 8, index=ind, columns=['text'])
# Map the first index level to a stopword language (arbitrary choices for this demo)
lang = {'ind': 'spanish', 'aus': 'english'}
# Build each stopword set once instead of once per word
stop_sets = {key: set(stopwords.words(name)) for key, name in lang.items()}
for index, row in df.iterrows():
    # index is a (level 0, level 1) tuple; level 0 selects the language
    df.at[index, 'text'] = ' '.join(word for word in str(row['text']).split(' ') if word not in stop_sets[index[0]])

Before running loop:

[Screenshot: DataFrame before running the loop]

After running loop:

[Screenshot: DataFrame after running the loop]

Try adapting the example I used to your own data.
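As a possible alternative to `iterrows`, here is a minimal sketch that reads the country from the index level and cleans each sentence in a single pass. It also works when index labels repeat, as they do in the question. The small `STOP_WORDS` sets, the `LANG` mapping, and the `remove_stopwords` helper are hypothetical stand-ins for illustration; in practice the lists would come from `stopwords.words(...)`.

```python
import pandas as pd

# Hypothetical miniature stopword sets, standing in for
# nltk.corpus.stopwords.words('english') / ('italian').
STOP_WORDS = {
    "english": {"the", "at", "of", "is"},
    "italian": {"per", "le", "di", "è"},
}
# Hypothetical mapping from country code to stopword language
LANG = {"US": "english", "UK": "english", "IT": "italian"}

index = pd.MultiIndex.from_tuples(
    [("US", "file_US"), ("US", "file_US"), ("IT", "filename_IT")],
    names=["country", "file"],
)
df = pd.DataFrame(
    {
        "Text": [
            "Looking back at 10 years of Downtown Arts",
            "Blog - Tasty Yummies",
            "Questo locale è molto consigliato per le famiglie",
        ]
    },
    index=index,
)


def remove_stopwords(text, stop_words):
    """Lower-case the text and drop every token found in stop_words."""
    return " ".join(w for w in str(text).lower().split() if w not in stop_words)


# Pair each row's country (from the index) with its text, position by position
countries = df.index.get_level_values("country")
df["New_Column"] = [
    remove_stopwords(text, STOP_WORDS[LANG[country]])
    for country, text in zip(countries, df["Text"])
]
```

Because the new column is built as a plain list, no index alignment takes place, so duplicate `(country, file)` labels are not a problem.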

2
votes

Here is how you can use numpy.where():

import re

import pandas as pd
from numpy import where
from nltk.corpus import stopwords

df = pd.DataFrame(...)

# 'country' is an index level in the question, so read it from the index
country = df.index.get_level_values('country')

# Remove the English stopwords from the English sentences
is_english = country.isin(['US', 'UK'])
for w in stopwords.words('english'):
    df['Text'] = where(is_english,  # If the country is English-speaking
                       df['Text'].str.replace(rf'\b{re.escape(w)}\b', '', regex=True),  # Blank out the stopword where it is a whole word
                       df['Text'])

# Remove the Spanish stopwords from the Spanish sentences
for w in stopwords.words('spanish'):
    df['Text'] = where(country == 'ES',  # If the country is Spanish-speaking
                       df['Text'].str.replace(rf'\b{re.escape(w)}\b', '', regex=True),
                       df['Text'])
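
One thing to watch with `str.replace` for stopword removal: a literal replacement also removes the stopword where it occurs inside longer words. The sketch below illustrates the difference on a made-up sentence, using a tiny hypothetical stopword list in place of `stopwords.words('english')`.

```python
import re

import pandas as pd

s = pd.Series(["that cat is at the station"])

# Literal replacement also hits "at" inside "that", "cat" and "station"
literal = s.str.replace("at", "", regex=False)

# Hypothetical stopword list; a real one would come from nltk
stop_words = ["at", "is", "the"]
# \b anchors the match to word boundaries so only whole tokens are removed
pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b"
cleaned = (
    s.str.replace(pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)  # collapse the gaps left behind
    .str.strip()
)
```

Here `literal[0]` ends up as `"th c is  the stion"`, while `cleaned[0]` is `"that cat station"`.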
-1
votes

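Define your function so that it receives the whole row and looks up the country from it (here from the row's index label, since country is an index level in the question):

def removing(x):
    thecountry = x.name[0]  # x.name is the (country, file) index tuple
    if thecountry == "UK" or thecountry == "US":
        stop_words = stopwords.words("english")
    ... (etc)
    return ' '.join(w for w in str(x["Text"]).lower().split() if w not in stop_words)

And then df["filtered"] = df.apply(removing, axis=1)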
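
The row-wise idea above can be sketched as a runnable example. This assumes a `(country, file)` MultiIndex like the one in the question; the `STOP_WORDS` sets and `LANG` mapping are hypothetical placeholders for the nltk lists.

```python
import pandas as pd

# Hypothetical miniature stopword sets, standing in for nltk's lists
STOP_WORDS = {
    "english": {"the", "at", "of"},
    "italian": {"per", "chi", "di"},
}
# Hypothetical mapping from country code to stopword language
LANG = {"US": "english", "UK": "english", "IT": "italian"}

index = pd.MultiIndex.from_tuples(
    [("US", "a.txt"), ("IT", "b.txt")], names=["country", "file"]
)
df = pd.DataFrame(
    {
        "Text": [
            "Looking back at 10 years of Downtown Arts",
            "Ideale per chi ama mangiare vegano",
        ]
    },
    index=index,
)


def removing(row):
    # With axis=1, row.name is the row's index label: a (country, file) tuple
    country = row.name[0]
    stop_words = STOP_WORDS[LANG[country]]
    return " ".join(w for w in str(row["Text"]).lower().split() if w not in stop_words)


df["filtered"] = df.apply(removing, axis=1)
```

This avoids the `ValueError` from the question because the function compares a single country label per row. By contrast, `df.loc['US']` returns a whole DataFrame, which `if` cannot reduce to one True or False.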