Is there a better way to get just 'important words' from a list in python?

Question

I wrote some code to find the most popular words in submission titles on reddit, using PRAW (the Python Reddit API Wrapper).

import nltk
import praw

picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')

print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)

hey = []

for x in submissions:
    hey.extend(str(x).split(' '))   

fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()

common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who', 'I\'m','says', 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about', 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1

print '-----------------------'
for word in top_words:  
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter +=1
    number +=1
    already.append(word.lower())
if counter == many:
    break
print '-----------------------\n'

So entering the subreddit 'python' and asking for the top 10 words returns:

'Python'
'PyPy'
'code'
'use'
'136'
'181'
'd...'
'IPython'
'133'
10. '158'

How can I make this script not return numbers and error words like 'd...'? The first 4 results are acceptable, but I would like to replace the rest with words that make sense. Making a common_words list is unreasonable, and it doesn't filter out these errors. I'm relatively new to writing code, and I appreciate the help.

mr2ert · Accepted Answer · 2013-08-14T22:08:07

I disagree. Making a list of common words is the right approach; there is no easier way to filter out the, for, I, am, etc. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.

Some suggestions:
1) common_words should be a set(). Since your list is long, this should speed things up: the in operation is O(1) on average for sets, while for lists it is O(n).
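
As a rough illustration of that first suggestion, the list can be converted to a set once and then queried with in. The shortened word list and the helper name is_common below are made up for the example:

```python
# Hypothetical, shortened stop-word list; the real one would hold every
# common word you want to skip.
common_words = set(['the', 'of', 'a', 'is', 'in', 'and', 'for'])

def is_common(word):
    # Set membership is a hash lookup, O(1) on average.
    return word.lower() in common_words

titles = ['The', 'Zen', 'of', 'Python']
kept = [w for w in titles if not is_common(w)]
print(kept)  # ['Zen', 'Python']
```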

2) Getting rid of all number strings is trivial. One way you could do it is:

all([w.isdigit() for w in word])

If this returns True, the word consists only of digits.
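
A minimal sketch of that check wrapped in a function. It calls isdigit() on the whole word, which gives the same answer as the per-character version for any non-empty string; the one difference is that the all(...) form returns True for an empty string, while ''.isdigit() returns False. The name is_number is made up for the example:

```python
def is_number(word):
    # True only when the word is non-empty and every character is a digit.
    return word.isdigit()

words = ['Python', '136', 'PyPy', '181', 'd...']
not_numbers = [w for w in words if not is_number(w)]
print(not_numbers)  # ['Python', 'PyPy', 'd...']
```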

3) Getting rid of the d... is a little trickier. It depends on how you define a non-word. This:

tf = [ c.isalpha() for c in word ]

Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:

t = tf.count(True)
f = tf.count(False)

You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:

def check_wordiness(word):
    # This returns true only if a word is all letters
    return all([ c.isalpha() for c in word ])
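
The other definition mentioned above, treating a word as a non-word when its non-letter characters outnumber its letters, could be sketched like this. The function name and the choice to keep ties (equal counts) are assumptions, not part of the original suggestion:

```python
def mostly_letters(word):
    # Count letters and non-letters, as with the tf list above.
    tf = [c.isalpha() for c in word]
    letters = tf.count(True)
    others = tf.count(False)
    # Reject words with no letters at all, and words where
    # non-letters outnumber letters.
    return letters > 0 and letters >= others

candidates = ['Python', 'd...', "don't", '136']
kept = [w for w in candidates if mostly_letters(w)]
print(kept)  # ['Python', "don't"]
```

Unlike check_wordiness, this keeps words such as don't, at the cost of letting through anything that is at least half letters.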

4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant. You could rewrite the last bit as:

for word in top_words:
    # Since you are calling .lower() so much,
    # you probably want to define it up here
    w = word.lower()
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        print "%i. '%s'" % (number, word)
        number += 1
    # This could go under the if statement. You only want to add
    # words that could be added again.  Why add words that are being
    # filtered out anyways?
    already.append(w)

    # this wasn't indented correctly before
    # number starts at 1, so it exceeds many once many words are printed
    if number > many:
        break
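
Putting the suggestions together, one possible end-to-end version is sketched below. To stay self-contained it counts words with collections.Counter from the standard library instead of nltk.FreqDist, and it works on a plain list of words instead of querying reddit. The function name, the sample data, and the short stop-word set are all made up for illustration:

```python
from collections import Counter

def top_words(words, common_words, many):
    # Count case-insensitively so 'Python' and 'python' share one entry.
    counts = Counter(w.lower() for w in words)
    result = []
    # most_common() yields (word, count) pairs, highest count first.
    for word, _ in counts.most_common():
        if word in common_words or not word.isalpha():
            continue  # skip stop words, numbers, and fragments like 'd...'
        result.append(word)
        if len(result) == many:
            break
    return result

sample = ['Python', 'the', 'python', '136', 'd...', 'code',
          'the', 'PyPy', 'code', 'python']
common = set(['the', 'of', 'a'])
for rank, word in enumerate(top_words(sample, common, 2), 1):
    print("%i. '%s'" % (rank, word))
```

Because the words are lowercased before counting, the already list is no longer needed, but the output loses the original capitalization.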

Hope that helps.
