Python pandas read data of the form Label <index_1>:<val_1> <index_2>:<val_2>

Question

For example, a row in the data looks like this:

-1 0:183.3575741549828 1:3.11164735151736 2:2.171277907851733 3:26.68849990272964 4:24.76677388937082 5:0.02710337995527495

The index is given explicitly because any attribute whose index is not listed is assumed to be zero.

I'm trying to use the statement:

train = pd.read_csv('train.csv', header=None, delim_whitespace=True).values

It is showing the following error:

    train = pd.read_csv('train.csv', header=None, delim_whitespace=True).values
  File "/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 132 fields in line 5, saw 143

I can't seem to figure out the problem here. Any help would be great!

Can you please edit your error into a readable format? Also, I cannot understand the row in your data. Is that a row of dictionaries? — splinter
@splinter This is a line in a CSV file. The number of attributes is fixed, say 4125 (0-4124). A row specifies the attribute values for one training example; 2:1231 says attribute 2 is 1231 — Shardul Tripathi

Nyps · Accepted Answer · 2017-04-24T12:09:24

Based on your data description and the error message, my guess is that the rows in your CSV file do not all have the same number of fields. Try specifying the column names explicitly:

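import pandas as pd
# Naming every column up front lets pandas pad shorter rows with NaN instead of failing.
my_cols = range(0, 4125)
train = pd.read_csv('train.csv', header=None, delim_whitespace=True, names=my_cols).values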
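To check that guess before changing the read_csv call, you could count the whitespace-separated fields on each line. The following is a minimal sketch using only the standard library; the helper name `field_counts` and the sample lines are made up for illustration and are not taken from your file.

```python
from collections import Counter


def field_counts(lines):
    """Map 'number of whitespace-separated fields' -> 'how many lines have it'."""
    # Blank lines are skipped so they do not show up as zero-field rows.
    return Counter(len(line.split()) for line in lines if line.strip())


# Hypothetical sample in the "label index:value ..." layout.
sample = [
    "-1 0:183.3 1:3.1 2:2.2",
    "1 0:5.0 3:26.7",
    "-1 1:0.5 2:1.5 4:24.8",
]

counts = field_counts(sample)
print(counts)
```

An open file can be passed the same way, for example `field_counts(open('train.csv'))`. If the result has more than one key, the rows differ in length, which is what the tokenizer is complaining about.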

For more help, see "Import csv with different number of columns per row using Pandas" and "Handling Variable Number of Columns with Pandas - Python".
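Note that reading the file with whitespace as the delimiter still leaves each cell as an "index:value" string, so the indices are not interpreted. If you want a dense table where unlisted attributes are zero, one option is to parse the lines yourself. This is only a sketch under assumptions: the function names are hypothetical, the first token is assumed to be the label, and indices are assumed to be zero-based and smaller than `n_features`.

```python
def parse_sparse_line(line, n_features):
    """Parse 'label idx:val idx:val ...' into (label, dense list of floats)."""
    tokens = line.split()
    label = float(tokens[0])
    row = [0.0] * n_features  # attributes that are not listed stay at zero
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        row[int(index)] = float(value)
    return label, row


def parse_sparse_lines(lines, n_features):
    """Parse many lines, skipping blank ones; returns (labels, rows)."""
    labels, rows = [], []
    for line in lines:
        if not line.strip():
            continue
        label, row = parse_sparse_line(line, n_features)
        labels.append(label)
        rows.append(row)
    return labels, rows


# Hypothetical sample with 6 attributes (indices 0-5).
sample = ["-1 0:183.5 2:2.25 5:0.5", "1 1:3.0 4:24.75"]
labels, rows = parse_sparse_lines(sample, n_features=6)
```

For the real file you would pass the open file object and `n_features=4125`, then wrap `rows` in `pd.DataFrame(rows)` if you need a DataFrame. The layout also resembles the svmlight/libsvm text format, so if scikit-learn is available, `sklearn.datasets.load_svmlight_file` may be able to read it directly into a sparse matrix.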
