I am saving a NumPy array using the export_vectors function defined below. In this function, I read string values separated by spaces and store them as floats in a NumPy array.

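import numpy as np

def export_vectors(vocab, input_filename, output_filename, dim):
    # One row per vocabulary entry; rows for words not found in the file stay zero.
    embeddings = np.zeros([len(vocab), dim])
    with open(input_filename) as f:
        for line in f:
            # Each line holds a word followed by its space-separated vector values.
            line = line.strip().split(' ')
            word = line[0]
            embedding = line[1:]
            if word in vocab:
                word_idx = vocab[word]
                embeddings[word_idx] = np.asarray(embedding).astype(float)

    # Writes a compressed .npz archive, which is a binary file.
    np.savez_compressed(output_filename, embeddings=embeddings)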

Here embeddings is an ndarray of float64 type.

I then try to load the file with:

def get_vectors(filename):
    with open(filename) as f:
        return np.load(f)["embeddings"]

Loading fails with this error:

File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 10: invalid start byte

Why is this?

Especially since the file is a compressed npz archive, use np.load(filename). In other words, let load take care of opening the compressed archive in the right way. - hpaulj

1 Answer

You are probably using open incorrectly. I suspect you need to pass it a mode flag so that the file is opened in binary mode (docs):

open(filename, 'rb')  # r: read-only; b: binary

The docs explain the default behaviour: "Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding."
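To see the difference between the two modes in isolation, here is a minimal sketch. The file name and the byte string are made up for illustration (they only imitate the start of a compressed file), and the encoding is passed explicitly because the default depends on the platform.

```python
import os
import tempfile

# Hypothetical stand-in for a compressed file: a zip-style header followed by
# 0x99, a byte that cannot start a UTF-8 sequence.
payload = b"PK\x03\x04\x99\x00"

demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(payload)

# Text mode: Python tries to decode the bytes into a str and fails.
try:
    with open(demo_path, encoding="utf-8") as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: the bytes come back untouched, with no decoding step.
with open(demo_path, "rb") as f:
    raw = f.read()
```

The text-mode read raises the same kind of UnicodeDecodeError as in the question, while the binary-mode read returns exactly the bytes that were written.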

But you can keep it simple and just pass the file path itself (np.load accepts a file-like object, a string, or a pathlib.Path):

np.load(filename)  # This would be more natural
                   # as it's kind of the direct inverse of your save-code;
                   # -> no manual file-handling

(A simplified rule: anything that uses general-purpose compression always works with binary files, not text files!)
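Putting both options together, a round trip might look like the sketch below. The helper names, the temporary path, and the small test array are hypothetical and not taken from the question. One thing to keep in mind: np.savez_compressed appends ".npz" to the file name if it is not already there, so the name you load must include that extension.

```python
import os
import tempfile

import numpy as np

def save_vectors(path, embeddings):
    # Writes a compressed .npz archive (NumPy appends ".npz" if it is missing).
    np.savez_compressed(path, embeddings=embeddings)

def load_vectors_by_path(path):
    # Option 1: hand np.load the path and let it open the archive itself.
    with np.load(path) as archive:
        return archive["embeddings"]

def load_vectors_by_handle(path):
    # Option 2: open the file yourself, but in binary mode ("rb").
    with open(path, "rb") as f:
        with np.load(f) as archive:
            return archive["embeddings"]

# Hypothetical demo data standing in for the real embeddings.
demo_path = os.path.join(tempfile.mkdtemp(), "vectors.npz")
original = np.arange(6, dtype=float).reshape(3, 2)

save_vectors(demo_path, original)
by_path = load_vectors_by_path(demo_path)
by_handle = load_vectors_by_handle(demo_path)
```

Both loaders return the same float64 array that was saved; the path-based version is shorter and leaves no room for choosing the wrong mode.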