I have a "CSV" data file with the following format (well, it's rather a TSV):
event	pdg	x	y	z	t	px	py	pz	ekin
3383	11	-161.515	5.01938e-05	-0.000187112	0.195413	0.664065	0.126078	-0.736968	0.00723234
1694	11	-161.515	-0.000355633	0.000263174	0.195413	0.511853	-0.523429	0.681196	0.00472714
4228	11	-161.535	6.59631e-06	-3.32796e-05	0.194947	-0.713983	-0.0265468	-0.69966	0.0108681
4233	11	-161.515	-0.000524488	6.5069e-05	0.195413	0.942642	0.331324	0.0406377	0.017594
This file can be read as-is in pandas:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False) # Works
data = read_table("test.csv", index_col=False) # Works
However, when I try to read it in blaze (which claims to accept pandas keyword arguments), an exception is thrown:
from blaze import Data
Data("test.csv") # Attempt 1
Data("test.csv", sep="\t") # Attempt 2
Data("test.csv", sep="\t", index_col=False) # Attempt 3
None of these work, and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff() from the standard library, which fails.
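To check whether the failure really originates in the standard library rather than in blaze or odo, the sniffing step can be reproduced in isolation. The following is a minimal sketch, not code from blaze or odo: the helper name `sniff_file`, the 10000-byte default sample size, and the UTF-8 default encoding are assumptions made for illustration. It mirrors what the traceback shows (read the first bytes in binary mode, decode, hand the text to csv.Sniffer) and optionally restricts the candidate delimiters.

```python
import csv


def sniff_file(path, nbytes=10000, encoding="utf-8", candidates=None):
    """Return the delimiter csv.Sniffer detects in the first nbytes, or None.

    nbytes and encoding are illustrative defaults (assumptions), not values
    taken from odo. candidates is an optional string of allowed delimiters.
    """
    with open(path, "rb") as f:
        # Decode leniently: the byte sample may end in the middle of a character.
        sample = f.read(nbytes).decode(encoding, errors="replace")
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # csv.Sniffer raises csv.Error("Could not determine delimiter").
        return None
```

Calling `sniff_file("test.csv")` and then `sniff_file("test.csv", candidates="\t")` on the real file should show whether the sniffer fails on its own and whether restricting the candidates to a tab changes the outcome. A result of None corresponds to the "Could not determine delimiter" error.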
Is there a way to read this file properly in blaze? Its "little brother" is a few hundred MB, so I want to use blaze's sequential processing capabilities.
Thanks for any ideas.
Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327
Edit2: Complete error:
Error                                     Traceback (most recent call last)
 in ()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
     54     if isinstance(data, _strtypes):
     55         data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56                         **kwargs)
     57     if (isinstance(data, Iterator) and
     58             not isinstance(data, tuple(not_an_iterator))):

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
     62
     63     def __call__(self, s, *args, **kwargs):
---> 64         return self.dispatch(s)(s, *args, **kwargs)
     65
     66     @property

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
    276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
    277 def resource_csv(uri, **kwargs):
--> 278     return CSV(uri, **kwargs)
    279
    280

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
    102         if has_header is None:
    103             self.has_header = (not os.path.exists(path) or
--> 104                                infer_header(path, sniff_nbytes))
    105         else:
    106             self.has_header = has_header

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
     58     with open_file(path, 'rb') as f:
     59         raw = f.read(nbytes)
---> 60     return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
     61
     62

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
    392         # subtracting from the likelihood of the first row being a header.
    393
--> 394         rdr = reader(StringIO(sample), self.sniff(sample))
    395
    396         header = next(rdr) # assume first row is header

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
    187
    188         if not delimiter:
--> 189             raise Error("Could not determine delimiter")
    190
    191         class dialect(Dialect):

Error: Could not determine delimiter