1
votes

I ran into a problem with my WordCloud code in Python when running it on a large Arabic data set. This is my code:

from os import path
import codecs
from wordcloud import WordCloud
import arabic_reshaper
from bidi.algorithm import get_display
d = path.dirname(__file__)
f = codecs.open(path.join(d, 'C:/example.txt'), 'r', 'utf-8')
text = arabic_reshaper.reshape(f.read())
text = get_display(text)
wordcloud = WordCloud(font_path='arial',background_color='white', mode='RGB',width=1500,height=800).generate(text)
wordcloud.to_file("arabic_example.png")

And this is the error I get:

Traceback (most recent call last):

File "", line 1, in runfile('C:/Users/aam20/Desktop/python/codes/WordClouds/wordcloud_True.py', wdir='C:/Users/aam20/Desktop/python/codes/WordClouds')

File "C:\Users\aam20\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile execfile(filename, namespace)

File "C:\Users\aam20\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/aam20/Desktop/python/codes/WordClouds/wordcloud_True.py", line 28, in text = get_display(text)

File "C:\Users\aam20\Anaconda3\lib\site-packages\bidi\algorithm.py", line 648, in get_display resolve_implicit_levels(storage, debug)

File "C:\Users\aam20\Anaconda3\lib\site-packages\bidi\algorithm.py", line 466, in resolve_implicit_levels

'%s not allowed here' % _ch['type']

AssertionError: RLI not allowed here

Can someone help resolve this issue?

3
My data set is huge, about 17,000 rows, and the code fails on it, but if I run it on a small sample it works without any error. I also have another version of the code that can process the full data set, but the words come out mirrored; I will attach that code with its results. - Abdulrahman
I added the complete error text to my question. - Abdulrahman
If you look at the error message, the problem is not wordcloud but the bidi package: the error occurs on the line text = get_display(text), so you never reach the wordcloud call. I suspect that some word in your data set contains improperly encoded characters. When you truncate the data, you exclude that word (or those words). - Paul Brodersen
On a related note, you should preprocess your text (tokenize, filter out common, non-specific words, etc.) and select a much smaller subset of words that can actually be displayed at a readable font size. - Paul Brodersen
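One way to test the "bad character" suspicion from the comments is to scan the text for bidi isolate controls. The traceback names the bidi class RLI, which is the class of U+2067 RIGHT-TO-LEFT ISOLATE, so it seems likely that at least one such character is present. The sketch below is only an illustration: the helper name find_isolate_controls and the sample string are made up for this example and are not part of bidi or wordcloud.

```python
import unicodedata

# Bidi classes of the Unicode isolate controls (U+2066..U+2069).
ISOLATE_CLASSES = {"LRI", "RLI", "FSI", "PDI"}


def find_isolate_controls(text):
    """Return (index, code point, bidi class) for every isolate control in text."""
    hits = []
    for index, ch in enumerate(text):
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_CLASSES:
            hits.append((index, "U+%04X" % ord(ch), bidi_class))
    return hits


# Hypothetical sample: an Arabic phrase wrapped in RLI ... PDI.
sample = "\u2067مرحبا بالعالم\u2069"
for hit in find_isolate_controls(sample):
    print(hit)
```

Running this on the full file contents (the string returned by f.read()) should show where such characters sit, which makes it easier to decide whether to strip them or fix the source data.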

3 Answers

3
votes

I preprocessed the text with the function below before calling the reshaper, and it worked for me.

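import re

def removeWeirdChars(text):
    # Strip emoji, pictographs, and invisible formatting characters before reshaping.
    weirdPatterns = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u'\U00010000-\U0010ffff'
                               u"\u200d"  # zero width joiner
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\u3030"
                               u"\ufe0f"  # variation selector-16
                               u"\u2069"  # pop directional isolate (PDI)
                               u"\u2066"  # left-to-right isolate (LRI)
                               u"\u200c"  # zero width non-joiner
                               u"\u2068"  # first strong isolate (FSI)
                               u"\u2067"  # right-to-left isolate (RLI), the class named in the traceback
                               "]+", flags=re.UNICODE)
    return weirdPatterns.sub(r'', text)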
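If the broad pattern above removes more than you want (it also drops the zero width non-joiner, for example), a narrower variant is to strip only the four isolate controls, since RLI is the class the traceback complains about. This is a minimal sketch under that assumption; the name strip_isolate_controls is hypothetical, and other characters in your data could still trip get_display().

```python
# Map each isolate control (LRI, RLI, FSI, PDI) to None so translate() drops it.
ISOLATE_CONTROLS = dict.fromkeys(range(0x2066, 0x206A))


def strip_isolate_controls(text):
    """Remove only U+2066..U+2069, leaving all other characters untouched."""
    return text.translate(ISOLATE_CONTROLS)


# Hypothetical sample: an Arabic word wrapped in RLI ... PDI, followed by Latin text.
cleaned = strip_isolate_controls("\u2067مرحبا\u2069 world")
print(cleaned)
```

You would call it on the raw text before arabic_reshaper.reshape(), in the same place as removeWeirdChars.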
0
votes

There is a weird character in your text that get_display() is unable to deal with. You could find this character and add it to a list of stopwords, but that might be very painful. One shortcut is to build a dictionary of the most frequent words and their frequencies and feed it to the generate_from_frequencies function:

wordcloud = WordCloud(font_path='arial',background_color='white', mode='RGB',width=1500,height=800).generate_from_frequencies(YOURDICT)

For more information check my response to this post.
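As a rough illustration of how such a dictionary could be built, the sketch below counts whitespace-separated tokens with collections.Counter and keeps only the most common ones. The function name top_word_frequencies, its limit and stopwords parameters, and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def top_word_frequencies(text, limit=200, stopwords=()):
    """Count whitespace-separated tokens and keep the `limit` most common ones."""
    excluded = set(stopwords)
    counts = Counter(token for token in text.split() if token not in excluded)
    return dict(counts.most_common(limit))


# Hypothetical sample text; in practice this would be the contents of the input file.
frequencies = top_word_frequencies("كتاب قلم كتاب ورقة كتاب قلم", limit=2)
print(frequencies)
```

The resulting dictionary can stand in for YOURDICT above. The keys will presumably still need to go through arabic_reshaper.reshape() and get_display() one word at a time to render correctly, which also makes it easier to skip any single word that raises an error.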

0
votes

Here is a simple way to generate an Arabic word cloud:

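import arabic_reshaper
from bidi.algorithm import get_display
from wordcloud import WordCloud

# text is assumed to already hold the Arabic input string
reshaped_text = arabic_reshaper.reshape(text)  # join letters into their contextual forms
bidi_text = get_display(reshaped_text)  # reorder for right-to-left display
wordcloud = WordCloud(font_path='NotoNaskhArabic-Regular.ttf').generate(bidi_text)
wordcloud.to_file("wordCloud.png")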

And here is a link to a Google Colab example: Colab notebook