I have two columns (serverTs, FTs) in a DataFrame that hold Unix timestamps. In my code I need to subtract one from the other. When I did so, I got an error saying that strings can't be subtracted, so I set the types of serverTs and FTs to integers.

file = r'S:\Работа с клиентами\Клиенты\BigTV Rating\fts_check.csv'
col_names = ["Day", "vcId", "FTs", "serverTs", "locHost", "tnsTmsec", "Hits", "Uniqs"]
df_empty = pd.DataFrame()
with open(file) as fl:
    chunk_iter = pd.read_csv(fl, sep='\t', names=col_names, dtype={'serverTs': np.int32, 'FTs': np.int32}, chunksize = 100000)
    for chunk in chunk_iter:
        chunk['diff'] = np.array(chunk['serverTs'])-np.array(chunk['FTs'])
        chunk = chunk[chunk['diff'] > 180]
        df_empty = pd.concat([df_empty,chunk])  

But the program gives me an error:

TypeError                                 Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
in ()
      6 #dtype={'serverTs': np.int32, 'FTs': np.int32},
      7 #chunk_iter = chunk_iter.astype({'serverTs': np.int32, 'FTs': np.int32})
----> 8 for chunk in chunk_iter:
      9 #print(chunk[chunk['FTs'] == 'NaN'])
     10 #chunk[['serverTs','FTs']] = chunk[['serverTs','FTs']].astype('int32')

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in next(self)
   1040     def next(self):
   1041         try:
-> 1042             return self.get_chunk()
   1043         except StopIteration:
   1044             self.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in get_chunk(self, size)
   1104                 raise StopIteration
   1105             size = min(size, self.nrows - self._currow)
-> 1106         return self.read(nrows=size)
   1107
   1108

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068
-> 1069         ret = self._engine.read(nrows)
   1070
   1071         if self.options.get('as_recarray'):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

ValueError: invalid literal for int() with base 10: 'FTs'

I'm pulling the data from Hadoop with SQL queries, so I checked both columns for values containing letters, but they contain only digits. Moreover, an FTs value with non-numeric characters could not be stored in the database in the first place. What could be the problem?

Does your CSV have a header? If it does, then do not pass names and let read_csv read the column names. It looks like you are trying to read the string 'FTs' from the file as a number. - jdehesa
@jdehesa, yes, that was the problem. Thank you for your comment! - julliet

1 Answer

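The problem here is that you are passing names along with a dtype argument. When names is given explicitly, header behaves as if it were None, so the first line of the file (the header row) is parsed as data, and the string 'FTs' cannot be converted to int32. So consider: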

In [1]: import pandas as pd, numpy as np

In [2]: dt={'serverTs': np.int32, 'FTs': np.int32}

In [3]: import io

In [4]: s = """FTs,serverTs
   ...: 0,1
   ...: 1,2
   ...: """

In [5]: pd.read_csv(io.StringIO(s))
Out[5]:
   FTs  serverTs
0    0         1
1    1         2

In [6]: pd.read_csv(io.StringIO(s), dtype=dt)
Out[6]:
   FTs  serverTs
0    0         1
1    1         2

Works fine. However, if I pass names:

In [8]: names = 'FTs','serverTs'

In [9]: pd.read_csv(io.StringIO(s), dtype=dt, names=names)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-9-18dcd5477b7e> in <module>()
----> 1 pd.read_csv(io.StringIO(s), dtype=dt, names=names)

/Users/juan/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708
--> 709         return _read(filepath_or_buffer, kwds)
    710
    711     parser_f.__name__ = name

/Users/juan/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    453
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

/Users/juan/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068
-> 1069         ret = self._engine.read(nrows)
   1070
   1071         if self.options.get('as_recarray'):

/Users/juan/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

ValueError: invalid literal for int() with base 10: 'FTs'
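
To see what the parser is doing here, it can help to drop dtype and inspect the result. A minimal sketch, reusing the same toy data as above:

```python
import io
import pandas as pd

s = """FTs,serverTs
0,1
1,2
"""
names = ["FTs", "serverTs"]

# With names given and no header argument, pandas treats the file as
# header-less, so the header line comes back as the first data row.
df = pd.read_csv(io.StringIO(s), names=names)
print(df)
print(df.dtypes)  # both columns are read as strings rather than integers
```

The frame has three rows instead of two, and the first one holds the literal strings 'FTs' and 'serverTs', which is exactly the value the int32 conversion chokes on.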

So one solution is to pass the correct header index:

In [10]: pd.read_csv(io.StringIO(s), dtype=dt, names=names, header=0)
Out[10]:
   FTs  serverTs
0    0         1
1    1         2

Or better yet, don't pass names at all; pandas will read the column names from the header row for you anyway:

In [11]: pd.read_csv(io.StringIO(s), dtype=dt)
Out[11]:
   FTs  serverTs
0    0         1
1    1         2
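
Applied to the chunked read from the question, this might look roughly like the sketch below. It assumes the file really does start with a header line (as confirmed in the comments), and it uses a small in-memory, tab-separated stand-in with made-up values instead of the real CSV. It also collects the filtered chunks in a list and concatenates once at the end, which is a swapped-in alternative to growing a DataFrame inside the loop.

```python
import io
import numpy as np
import pandas as pd

# Made-up, tab-separated stand-in for the real file; the first line is the header.
sample = (
    "Day\tvcId\tFTs\tserverTs\tlocHost\ttnsTmsec\tHits\tUniqs\n"
    "2018-01-01\t1\t1514764800\t1514764900\thost\t0\t1\t1\n"
    "2018-01-01\t2\t1514764800\t1514765100\thost\t0\t1\t1\n"
)
dt = {"serverTs": np.int32, "FTs": np.int32}

kept = []
# No names= here: the first line of the file supplies the column names.
for chunk in pd.read_csv(io.StringIO(sample), sep="\t", dtype=dt, chunksize=1):
    chunk["diff"] = chunk["serverTs"] - chunk["FTs"]
    kept.append(chunk[chunk["diff"] > 180])

# Concatenate the filtered chunks once, after the loop.
result = pd.concat(kept, ignore_index=True)
print(result)
```

With the real file, `io.StringIO(sample)` would be replaced by the path to the CSV and `chunksize` by something like the 100000 used in the question.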