0
votes

I am trying to explore the OnlineNewsPopularity dataset with pandas 0.20.3 in Python 3.6.2.

%pylab inline
import pandas as pd
df = pd.read_csv('OnlineNewsPopularity.csv')
df['n_tokens_content'][:9]

The last line produces this error:

KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'n_tokens_content'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 df['n_tokens_content'][:9]

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965
   1966     def _getitem_column(self, key):

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972
   1973         # duplicate columns & possible reduce dimensionality

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588
   3589         if not isnull(item):
-> 3590             loc = self.items.get_loc(item)
   3591         else:
   3592             indexer = np.arange(len(self.items))[isnull(self.items)]

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'n_tokens_content'

I think this is caused by some bad rows in the CSV file, since the same code works fine with other CSV files.

If so, how can I locate the bad rows efficiently?

What is your goal? What are you trying to achieve? – Erfan

This error means there is no column called n_tokens_content in the dataframe you created. You'll have to examine the dataframe (e.g., run df.columns or df.head()) to see what your column names are. – AlexK
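For reference, a minimal inspection along the lines of the comments above (assuming the same df as in the question):

import pandas as pd

df = pd.read_csv('OnlineNewsPopularity.csv')
print(df.shape)             # how many rows and columns were actually parsed
print(df.columns.tolist())  # exact column names, including any stray whitespace
print(df.head())            # first few rows as a sanity check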

2 Answers

0
votes

If you print the columns with df.columns, you can see that 'n_tokens_content' actually has a leading space.

Input: df.columns

Output:

Index(['url', ' timedelta', ' n_tokens_title', ' n_tokens_content',
   ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
   ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
   ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
   ' data_channel_is_entertainment', ' data_channel_is_bus',
   ' data_channel_is_socmed', ' data_channel_is_tech',
   ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
   ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
   ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
   ' self_reference_max_shares', ' self_reference_avg_sharess',
   ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
   ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
   ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
   ' LDA_03', ' LDA_04', ' global_subjectivity',
   ' global_sentiment_polarity', ' global_rate_positive_words',
   ' global_rate_negative_words', ' rate_positive_words',
   ' rate_negative_words', ' avg_positive_polarity',
   ' min_positive_polarity', ' max_positive_polarity',
   ' avg_negative_polarity', ' min_negative_polarity',
   ' max_negative_polarity', ' title_subjectivity',
   ' title_sentiment_polarity', ' abs_title_subjectivity',
   ' abs_title_sentiment_polarity', ' shares'],
  dtype='object')

So give the input with the leading space: df[' n_tokens_content'][:9]

Output:

0     219
1     255
2     211
3     531
4    1072
5     370
6     960
7     989
8      97
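Alternatively, rather than typing the leading space every time, you can strip whitespace from all column names right after loading; a minimal sketch, assuming the same file as in the question:

import pandas as pd

df = pd.read_csv('OnlineNewsPopularity.csv')
# remove leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()
df['n_tokens_content'][:9]  # now works without the leading space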

0
votes

I encountered the same problem and solved it as follows.

Input: df.columns

Output:

 Index(['url', ' timedelta', ' n_tokens_title', ' n_tokens_content',
       ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
       ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
       ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
       ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
       ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
       ' self_reference_max_shares', ' self_reference_avg_sharess',
       ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
       ' LDA_03', ' LDA_04', ' global_subjectivity',
       ' global_sentiment_polarity', ' global_rate_positive_words',
       ' global_rate_negative_words', ' rate_positive_words',
       ' rate_negative_words', ' avg_positive_polarity',
       ' min_positive_polarity', ' max_positive_polarity',
       ' avg_negative_polarity', ' min_negative_polarity',
       ' max_negative_polarity', ' title_subjectivity',
       ' title_sentiment_polarity', ' abs_title_subjectivity',
       ' abs_title_sentiment_polarity', ' shares'],
      dtype='object')

You can see that the column names have a leading space: for example, "n_tokens_title" is actually " n_tokens_title". Include that space in your code, e.g. df[' n_tokens_content'][:9].
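Another option is to fix this at read time by telling read_csv to skip the space that follows each delimiter; a minimal sketch, assuming the same file as in the question:

import pandas as pd

# skipinitialspace=True drops the space after each comma, so the header
# parses as 'n_tokens_content' instead of ' n_tokens_content'
df = pd.read_csv('OnlineNewsPopularity.csv', skipinitialspace=True)
df['n_tokens_content'][:9]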