
I am trying to explore this dataset with pandas 0.20.3 in Python 3.6.2.

%pylab inline
import pandas as pd
df = pd.read_csv('OnlineNewsPopularity.csv')

Accessing a column afterwards with df['n_tokens_content'][:9] produces this error:

KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'n_tokens_content'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in <module>()
----> 1 df['n_tokens_content'][:9]

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965
   1966     def _getitem_column(self, key):

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972
   1973         # duplicate columns & possible reduce dimensionality

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'n_tokens_content'

I think this is caused by some rows in the CSV file, because the same code works fine with other CSV files.

If so, how can I locate the bad rows efficiently?

What is your goal? What are you trying to achieve? – Erfan
This error means there is no column called n_tokens_content in the dataframe you created. You'll have to examine the dataframe (e.g., run df.columns or df.head()) to see what your column names are. – AlexK

2 Answers


If you print the columns with df.columns, you will see that the column is actually named ' n_tokens_content', with a leading space.

Input: df.columns


Index(['url', ' timedelta', ' n_tokens_title', ' n_tokens_content',
   ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
   ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
   ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
   ' data_channel_is_entertainment', ' data_channel_is_bus',
   ' data_channel_is_socmed', ' data_channel_is_tech',
   ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
   ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
   ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
   ' self_reference_max_shares', ' self_reference_avg_sharess',
   ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
   ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
   ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
   ' LDA_03', ' LDA_04', ' global_subjectivity',
   ' global_sentiment_polarity', ' global_rate_positive_words',
   ' global_rate_negative_words', ' rate_positive_words',
   ' rate_negative_words', ' avg_positive_polarity',
   ' min_positive_polarity', ' max_positive_polarity',
   ' avg_negative_polarity', ' min_negative_polarity',
   ' max_negative_polarity', ' title_subjectivity',
   ' title_sentiment_polarity', ' abs_title_subjectivity',
   ' abs_title_sentiment_polarity', ' shares'],

Include the leading space in the column name. Input: df[' n_tokens_content'][:9]

Output:

0     219
1     255
2     211
3     531
4    1072
5     370
6     960
7     989
8      97
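
Instead of typing the space each time, the column names can also be cleaned once after loading. This is a minimal sketch, assuming stray whitespace is the only problem in the header; the small inline CSV is a made-up stand-in for OnlineNewsPopularity.csv, not the real data:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: header names carry a leading space.
sample = io.StringIO(
    "url, timedelta, n_tokens_content\n"
    "a, 731, 219\n"
    "b, 731, 255\n"
)
df = pd.read_csv(sample)

# repr() makes the otherwise invisible whitespace easy to spot.
print([repr(c) for c in df.columns])

# Strip leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()

# The lookup from the question now works without the extra space.
first_rows = df['n_tokens_content'][:9]
print(first_rows)
```

After this, every column can be addressed by its clean name, at the cost of the names no longer matching the raw file header exactly.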


I ran into the same problem and solved it as follows.
Input: df.columns
Output:

 Index(['url', ' timedelta', ' n_tokens_title', ' n_tokens_content',
       ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
       ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
       ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
       ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
       ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
       ' self_reference_max_shares', ' self_reference_avg_sharess',
       ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
       ' LDA_03', ' LDA_04', ' global_subjectivity',
       ' global_sentiment_polarity', ' global_rate_positive_words',
       ' global_rate_negative_words', ' rate_positive_words',
       ' rate_negative_words', ' avg_positive_polarity',
       ' min_positive_polarity', ' max_positive_polarity',
       ' avg_negative_polarity', ' min_negative_polarity',
       ' max_negative_polarity', ' title_subjectivity',
       ' title_sentiment_polarity', ' abs_title_subjectivity',
       ' abs_title_sentiment_polarity', ' shares'],

You can see that the column "n_tokens_title" is actually named " n_tokens_title". Notice the space in front of the name, and include that space in your code.
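
Another option is to drop the spaces while reading the file. The sketch below assumes the spaces only ever appear right after the comma delimiter, which is what the column listing suggests; the inline CSV is again a made-up stand-in for the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file, with ", " between fields.
sample = io.StringIO(
    "url, n_tokens_title, n_tokens_content\n"
    "a, 12, 219\n"
    "b, 9, 255\n"
)

# skipinitialspace=True makes the parser skip spaces that follow the delimiter,
# so the header names come out without the leading space.
df = pd.read_csv(sample, skipinitialspace=True)

print(list(df.columns))
titles = df['n_tokens_title']
print(titles)
```

With the real file this would be pd.read_csv('OnlineNewsPopularity.csv', skipinitialspace=True), after which df['n_tokens_title'] should work without the added space.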