6
votes

2nd UPDATE: Confirmed as a bug by user @Matt B. See his answer below for more detail.

UPDATE: @waTeim has demonstrated that one can write and read a DataFrame that contains a column of type date (confirmed on my setup). This is important, as it means Julia can write and read some composite types that are in the column of a data-frame. However, the case of a type datetime (which is different to type date) still throws an error, so at this point the question remains unanswered.

In Julia, using the HDF5 and JLD package, it is possible to save and load DataFrames in a .jld file using, for example:

#Preamble
using HDF5, JLD, DataFrames
FP = "/home/colin/Test.jld";

#Save the data-frame
fid1 = jldopen(FP, "w");
write(fid1, "MyDataFrame", MyDataFrame);
close(fid1);

#Come back later and load the data-frame
fid1 = jldopen(FP, "r");
X = read(fid1, "MyDataFrame");
close(fid1);

This works nicely, as long as the columns of the data-frame are all vectors of a base Julia type like Float64 or Int64. However, in practice, we will often want the first column of a data-frame to be a datetime, which is not a base type (although might become one in future releases). In this situation, the code above fails for me on the read operation, with a long error message (I'll add it to the bottom if anyone asks in the comments). Following the documentation for the JLD package, I tried the following when saving:

#Save the data-frame
fid1 = jldopen(FP, "w");
addrequire(fid1, "/home/colin/.julia/v0.2/DataFrames/src/dataframe.jl")
addrequire(fid1, "/home/colin/.julia/v0.2/Datetime/src/Datetime.jl")
write(fid1, "MyDataFrame", MyDataFrame);
close(fid1);

but this did not help.

Am I doing something stupid, or is this functionality simply not available?

Note: HDF5 tag included because the JLD package uses it.

Any chance serialize() and deserialize() do what you want? You wouldn't get a .jld but you should be able to do the i/o. - Mageek
@Mageek serialize() and deserialize() can probably be made to work, but the solution is not feasible in the long term since a different version of Julia or even an instance of Julia running on a different system may not read back the same data that was written. Thanks for the idea though, and sorry it has taken me so long to respond. - Colin T Bowers
This is a bug. Reported at: github.com/timholy/HDF5.jl/issues/106 - mbauman
It should now be fixed! Let me know if you're still having trouble (until a new version of HDF5 is tagged, you can use Pkg.checkout("HDF5") to get this patch). - mbauman
@MattB. Thanks Matt, I've confirmed the fix on my setup. All working now. If you want to write a very brief answer indicating it is a bug and has been fixed, I'll upvote and give the answer tick. - Colin T Bowers

2 Answers

6
votes

When HDF5 support for a particular Julia datatype is lacking, one can expect this error. In this case the problem was not specifically DataFrames using Datetime, but the lack of support for the type Datetime itself. Apparently the same error appears whenever the library is unable to load the type, for whatever reason (see here and here too for other examples). The exact reason and fix were different for each type, but reporting the bug led to prompt fixes (see below).

The error

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: H5Dio.c line 182 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 438 in H5D__read(): unable to set up type info
    major: Dataset
    minor: Unable to initialize object
  #002: H5Dio.c line 939 in H5D__typeinfo_init(): unable to convert between src and dest datatype
    major: Dataset
    minor: Feature is unsupported
  #003: H5T.c line 4525 in H5T_path_find(): no appropriate function for conversion path
    major: Datatype
    minor: Unable to initialize object

Historical

Version 0.2.25

I would suggest migrating to Julia version 0.3, as it is at release-candidate status now, and updating your package repository. My setup is different: I am using different versions of HDF5, JLD, DataFrames, and Datetime. That said, the two significant changes I made were to give the module name instead of the filename in the call to addrequire, and to use the @read and @write macros rather than the corresponding functions, as the latter seem to be buggy.

Version 0.3.0-rc1+4263 (2014-07-19 02:59 UTC)

Pkg.status()
- DataFrames                    0.5.7
- HDF5                          0.2.25
- Datetime                      0.1.6

Create the datafile

using HDF5,JLD,DataFrames,Datetime

testFile = jldopen("test.jld","w")
addrequire(testFile,"DataFrames")
addrequire(testFile,"Datetime")
df = DataFrame()
df[:column1] = today() 
@write testFile df
close(testFile)

Restarting Julia and reading....

julia> using HDF5,JLD,DataFrames,Datetime

julia> testFile = jldopen("test.jld","r")
Julia data file version 0.0.2: test.jld

julia> @read testFile df
1x1 DataFrame
|-------|------------|
| Row # | column1    |
| 1     | 2014-07-19 |

julia> df[:column1]
 1-element DataArray{Date{ISOCalendar},1}:
 2014-07-19

Version 0.2.25+ (prerelease)

Indeed I can confirm that trying to store Datetime was failing and using the latest from the repo fixes the problem.

 HDF5                          0.2.25+            master

If the above is modified only by changing today() to now():

df[:column1] = now()

then reading the file back works:

julia> using HDF5,JLD,DataFrames,Datetime

julia> testFile = jldopen("test.jld","r")
Julia data file version 0.0.2: test.jld

julia> @read testFile df
1x1 DataFrame
|-------|-------------------------|
| Row # | column1                 |
| 1     | 2014-07-26T03:38:45 UTC |

But it appears that the same general-looking error message that occurred for Datetime also occurs for the complex type, despite this fix.

c = 1 + im;
@write testFile c

Version 0.2.26

By this version complex was also supported. Originally it appeared that the problem was a general lack of support for type complex, but it was more likely a special problem of the complex value being initialized from 1 + im rather than 1.0 + im.

- HDF5                          0.2.26

julia> using HDF5, JLD

julia> testFile = jldopen("test.jld","r")
Julia data file version 0.0.2: test.jld

julia> @read testFile c
1 + 1im
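
The difference between 1 + im and 1.0 + im mentioned above is a difference in type parameter, which is presumably why one could be supported while the other was not. A minimal sketch, assuming a current Julia release rather than the 0.2/0.3 versions used in this answer, and using only Base:

```julia
# Assumption: current Julia (1.x); nothing here touches HDF5 or JLD.
a = 1 + im      # integer real part, so the result has integer components
b = 1.0 + im    # floating-point real part, so the result has float components

# The two values compare equal but are stored as different concrete types.
println(typeof(a))
println(typeof(b))
println(a == b)
```

A storage layer that maps Julia types to on-disk types has to handle each of these parameterizations separately.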

2
votes

As I noted in my comment above, this behavior is a bug which has now been fixed. Until version 0.2.26 gets tagged, you can use Pkg.checkout("HDF5") to get this bugfix.


But to make this a bit more of an answer, I'll describe the issue a bit more and give a potential workaround. Both the Date and DateTime types are bitstypes with very similar definitions. Saving and loading bitstypes in the HDF5.jl package is a relatively new feature; it's only been supported for the past month (tagged as versions 0.2.24 and 0.2.25).

These versions have a bug where the type names of bitstypes don't get saved with their module name (as the fully-qualified typename). You can see this very clearly in the distinction between import and using:

julia> using HDF5, JLD # version 0.2.25

julia> import Datetime

julia> save("today.jld","t",Datetime.today()) # today() returns a `Datetime.Date`

julia> load("today.jld") # But it was saved as just a `Date`, not a `Datetime.Date`
                         # so HDF5 cannot find the definition
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: H5Dio.c line 182 in H5Dread(): can't read data … # backtrace truncated

julia> using Datetime # Bring `Date` into the `Main` namespace

julia> load("today.jld") # now it works!
Dict{Union(UTF8String,ASCIIString),Any} with 1 entry:
  "t" => 2014-07-25

So, when you go to save a DateTime object, it is parameterized by both a Calendar and a timezone Offset. But the Offset types aren't exported from the Datetime package… there are a lot of them! Most DateTimes, however, just use Zone0: UTC. So if you have DateTime data saved with HDF5.jl versions 0.2.24-25, you could recover it by manually "exporting" those types into your main namespace.

julia> save("now.jld","n",now())

julia> load("now.jld")
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: H5Dio.c line 182 in H5Dread(): can't read data … # truncated

julia> const Zone0 = Datetime.Zone0;

julia> load("now.jld")
Dict{Union(UTF8String,ASCIIString),Any} with 1 entry:
  "n" => 2014-07-25T13:45:45 UTC
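
The name-resolution problem described in this answer can be sketched without HDF5 at all. The example below is only an illustration, not the code in HDF5.jl: it assumes a current Julia release and uses the standard-library Dates module in place of the old Datetime package, and qualified_name and resolve are hypothetical helpers written for this sketch.

```julia
import Dates  # `import` binds only `Dates`; `Date` is not brought into this namespace

# Hypothetical helper: build a module-qualified name for a type.
qualified_name(T::DataType) = string(parentmodule(T), ".", nameof(T))

# Hypothetical helper: resolve a dotted name by walking bindings from `root`.
function resolve(name::AbstractString, root::Module = @__MODULE__)
    obj = root
    for part in split(name, '.')
        obj = getfield(obj, Symbol(part))
    end
    return obj
end

bare = string(nameof(Dates.Date))   # the unqualified name, as the buggy versions stored it
full = qualified_name(Dates.Date)   # the module-qualified name

# The bare name cannot be looked up after a plain `import`...
println(isdefined(@__MODULE__, Symbol(bare)))

# ...while the qualified name leads back to the original type.
println(resolve(full) === Dates.Date)
```

Storing the qualified name is what lets a reader find the type after a plain import; the `const Zone0 = Datetime.Zone0` workaround above instead makes the bare name resolvable.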