Pickle incompatibility of numpy arrays between Python 2 and 3

Question

I am trying to load the MNIST dataset linked here in Python 3.2 using this program:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

Unfortunately, it gives me the error:

Traceback (most recent call last):
   File "mnist.py", line 7, in <module>
     train_set, valid_set, test_set = pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)

I then tried to decode the pickled file in Python 2.7, and re-encode it. So, I ran this program in Python 2.7:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

    # Printing out the three objects reveals that they are
    # all pairs containing numpy arrays.

    with gzip.open('mnistx.pkl.gz', 'wb') as g:
        pickle.dump(
            (train_set, valid_set, test_set),
            g,
            protocol=2)  # I also tried protocol 0.

It ran without error, so I reran this program in Python 3.2:

import pickle
import gzip
import numpy

# note the filename change
with gzip.open('mnistx.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

However, it gave me the same error as before. How do I get this to work?

This is a better approach for loading the MNIST dataset.

there are compatibility breaks between 2.7 and 3.x. especially string vs unicode. And picking a numpy object requires that both systems load the numpy module but those modules are different. Sorry I don't have an answer but this might not be do-able and is probably not advisable. If this are big things (gzip), maybe hdf5 with pytables?? — Phil Cooper
@PhilCooper: Thanks, your comment (post this as an answer?) clued me in to the right answer. I could have used hdf5, but it seemed complicated to learn, so I went with numpy.save/load and this worked. — Neil G
h5py is very simple to use, almost certainly much easier then solving nebulous compatibility problems with pickling numpy arrays. — DaveP
You say you "ran this program under Python 2.7". OK but what did you run under 3.2? :-) The same? — Lennart Regebro
@LennartRegebro: After running the second program that pickles the arrays, I ran the first program (substituting the filename mnistx.pkl.gz) in Python 3.2. It didn't work, which I think illustrates some kind of incompatibility. — Neil G

Lennart Regebro Lennart Regebro · Accepted Answer · 2012-07-03T15:48:53

This seems like some sort of incompatibility. It's trying to load a "binstring" object, which is assumed to be ASCII, while in this case it is binary data. If this is a bug in the Python 3 unpickler, or a "misuse" of the pickler by numpy, I don't know.

Here is something of a workaround, but I don't know how meaningful the data is at this point:

import pickle
import gzip
import numpy

with open('mnist.pkl', 'rb') as f:
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    p = u.load()
    print(p)

Unpickling it in Python 2 and then repickling it is only going to create the same problem again, so you need to save it in another format.

Pickle incompatibility of numpy arrays between Python 2 and 3

7 Answers