
I have a "CSV" data file with the following format (well, it's rather a TSV):

event  pdg x   y   z   t   px  py  pz  ekin
3383    11  -161.515    5.01938e-05 -0.000187112    0.195413    0.664065    0.126078    -0.736968   0.00723234  
1694    11  -161.515    -0.000355633    0.000263174 0.195413    0.511853    -0.523429   0.681196    0.00472714  
4228    11  -161.535    6.59631e-06 -3.32796e-05    0.194947    -0.713983   -0.0265468  -0.69966    0.0108681   
4233    11  -161.515    -0.000524488    6.5069e-05  0.195413    0.942642    0.331324    0.0406377   0.017594

This file is interpretable as-is in pandas:

from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False)     # Works
data = read_table("test.csv", index_col=False)             # Works

However, when I try to read it in blaze (which claims to accept pandas keyword arguments), an exception is thrown:

from blaze import Data 
Data("test.csv")                             # Attempt 1
Data("test.csv", sep="\t")                   # Attempt 2
Data("test.csv", sep="\t", index_col=False)  # Attempt 3

None of these works and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff() from standard library (which fails).
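To illustrate the point without blaze, here is a minimal standard-library sketch. The sample text is an abridged copy of the rows above, and the helper names `sniff_delimiter` and `read_rows` are made up for this example. It only shows that `csv.Sniffer` may give up with `csv.Error`, whereas a reader that is told the delimiter never needs to sniff at all.

```python
import csv
import io

# Abridged sample taken from the file shown above (tab-separated).
SAMPLE = (
    "event\tpdg\tx\tekin\n"
    "3383\t11\t-161.515\t0.00723234\n"
    "1694\t11\t-161.515\t0.00472714\n"
)


def sniff_delimiter(sample):
    """Return the delimiter guessed by csv.Sniffer, or None if it gives up."""
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        # This is the "Could not determine delimiter" case.
        return None


def read_rows(sample, delimiter="\t"):
    """Parse the text with an explicit delimiter, so no sniffing is involved."""
    return list(csv.reader(io.StringIO(sample), delimiter=delimiter))


guess = sniff_delimiter(SAMPLE)  # may be None for awkward files
rows = read_rows(SAMPLE)
print(guess, rows[0])
```

Whether the sniffer succeeds depends on the sample it is given, so the sketch deliberately makes no claim about the value of `guess`.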

Is there a way to read this file properly in blaze? Its "little brother" is a few hundred MB, so I want to use blaze's sequential processing capabilities.

Thanks for any ideas.

Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327

Edit2: Complete error:

Error                                     Traceback (most recent call last)
 in ()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
     54     if isinstance(data, _strtypes):
     55         data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56                         **kwargs)
     57     if (isinstance(data, Iterator) and
     58             not isinstance(data, tuple(not_an_iterator))):

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
     62 
     63     def __call__(self, s, *args, **kwargs):
---> 64         return self.dispatch(s)(s, *args, **kwargs)
     65 
     66     @property

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
    276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
    277 def resource_csv(uri, **kwargs):
--> 278     return CSV(uri, **kwargs)
    279 
    280 

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
    102         if has_header is None:
    103             self.has_header = (not os.path.exists(path) or
--> 104                                infer_header(path, sniff_nbytes))
    105         else:
    106             self.has_header = has_header

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
     58     with open_file(path, 'rb') as f:
     59         raw = f.read(nbytes)
---> 60     return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
     61 
     62 

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
    392         # subtracting from the likelihood of the first row being a header.
    393 
--> 394         rdr = reader(StringIO(sample), self.sniff(sample))
    395 
    396         header = next(rdr) # assume first row is header

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
    187 
    188         if not delimiter:
--> 189             raise Error("Could not determine delimiter")
    190 
    191         class dialect(Dialect):

Error: Could not determine delimiter
Hi, there. I believe that this worked on my system. What are the errors you are getting? - wgwz
Seems like installing dask changed the error being reported to something else. I am a bit confused. I'll leave it for now and parse the data with just pandas - I need the analysis to be done in a few hours. I'll come back to this later. Thanks. - honza_p
See the edits to my post. Your columns are not parsed correctly in the purely pandas analysis. - wgwz
I edited the original code a few minutes after posting (see ;-)). So yes, you are right about pandas parsing. But this alas is not the problem :-( - honza_p
My point is sep='\t' does not work - wgwz

1 Answer


I am working with Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.

conda install dask
conda install blaze

Here is one way to import the data for use with blaze: parse it with pandas first, then wrap the resulting DataFrame in a blaze Data object. Perhaps this defeats the purpose, but it runs without errors.

As a side note, to parse the data file correctly, the pandas call should be:

from blaze import Data
from pandas import read_csv
# r"\s+" splits on any run of whitespace (tabs or spaces)
data = read_csv("csvdata.dat", sep=r"\s+", index_col=False)
bdata = Data(data)   # wrap the DataFrame for use with blaze

Now the data is parsed correctly with no errors. Displaying bdata gives:

   event  pdg        x         y         z         t        px        py  \
0   3383   11 -161.515  0.000050 -0.000187  0.195413  0.664065  0.126078   
1   1694   11 -161.515 -0.000356  0.000263  0.195413  0.511853 -0.523429   
2   4228   11 -161.535  0.000007 -0.000033  0.194947 -0.713983 -0.026547   
3   4233   11 -161.515 -0.000524  0.000065  0.195413  0.942642  0.331324   

         pz      ekin  
0 -0.736968  0.007232  
1  0.681196  0.004727  
2 -0.699660  0.010868  
3  0.040638  0.017594  
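If the larger file does not fit comfortably in memory, the pandas route can still be made sequential by reading in chunks. The following is only a sketch: the in-memory sample is an abridged copy of the question's rows, the helper name `total_ekin` is made up, and the chunk size of 2 is chosen just to show more than one chunk (a real file would use something much larger).

```python
import io

import pandas as pd

# Abridged, tab-separated sample based on the rows in the question.
TSV = (
    "event\tpdg\tekin\n"
    "3383\t11\t0.00723234\n"
    "1694\t11\t0.00472714\n"
    "4228\t11\t0.0108681\n"
    "4233\t11\t0.017594\n"
)


def total_ekin(source, chunksize=2):
    """Sum the 'ekin' column while holding only one chunk in memory."""
    total = 0.0
    n_rows = 0
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(source, sep="\t", index_col=False,
                             chunksize=chunksize):
        total += chunk["ekin"].sum()
        n_rows += len(chunk)
    return n_rows, total


n_rows, total = total_ekin(io.StringIO(TSV))
print(n_rows, total)
```

For a file on disk, pass the path (for example "test.csv") instead of the StringIO object. Each chunk is an ordinary DataFrame, so it could also be wrapped with Data(chunk) if blaze expressions are needed.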

An alternative is to use dask, which can probably do the chunked, larger-than-memory processing you are looking for. Dask loads the TSV format correctly right away:

In [17]: import dask.dataframe as dd

In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)

In [19]: df.head()
Out[19]: 
   event  pdg        x         y         z         t        px        py  \
0   3383   11 -161.515  0.000050 -0.000187  0.195413  0.664065  0.126078   
1   1694   11 -161.515 -0.000356  0.000263  0.195413  0.511853 -0.523429   
2   4228   11 -161.535  0.000007 -0.000033  0.194947 -0.713983 -0.026547   
3   4233   11 -161.515 -0.000524  0.000065  0.195413  0.942642  0.331324   
4    854   11 -161.515  0.000032  0.000418  0.195414  0.675752  0.315671   

         pz      ekin  
0 -0.736968  0.007232  
1  0.681196  0.004727  
2 -0.699660  0.010868  
3  0.040638  0.017594  
4 -0.666116  0.012641  


See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask