2
votes

I would like to parse a data file with the format col_index:value in pandas/numpy. For example:

0:23 3:41
1:31 2:65

would correspond to this matrix:

[[23  0  0 41]
 [ 0 31 65  0]]

It seems like a pretty common way to represent sparse data in a file, but I can't find an easy way to parse this without having to do some sort of iteration after calling read_csv.

2 Answers

2
votes

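I found out recently that this is in fact the svm-light format (except that svm-light lines normally start with a target label before the index:value pairs), so you may be able to read a dataset like this using an svm-light loader such as scikit-learn's load_svmlight_file: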

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html
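
As a rough sketch of how that loader could be applied to the sample from the question: the data below is held in memory rather than read from a file, and a dummy label "0" is prepended to every line because the loader treats the first token of each line as the target. The names raw_lines and labelled are only illustrative, and n_features=4 is assumed from the example matrix.

```python
import io

from sklearn.datasets import load_svmlight_file

# The two sample lines from the question (no labels, zero-based column indices).
raw_lines = ["0:23 3:41", "1:31 2:65"]

# svm-light lines start with a target value, so prepend a dummy label "0".
labelled = "\n".join("0 " + line for line in raw_lines) + "\n"

# The loader accepts a file-like object opened in binary mode.
# zero_based=True keeps the column indices exactly as written in the data.
X, y = load_svmlight_file(io.BytesIO(labelled.encode("ascii")),
                          n_features=4, zero_based=True)

dense = X.toarray()  # X is a scipy sparse matrix; y holds the dummy labels
print(dense)
```

The returned y can simply be ignored here. Leaving zero_based at its default of 'auto' lets the loader guess the indexing from the data, so passing it explicitly seems safer for files like this one.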

1
votes

So, is parsing the file line by line an option, like:

from scipy.sparse import coo_matrix

rows, cols, values = [], [], []

with open('sparse.txt') as f:
    for i, line in enumerate(f):
        # split() with no argument tolerates repeated spaces and the trailing newline
        for cell in line.split():
            col, value = cell.split(':')
            rows.append(i)
            cols.append(int(col))
            values.append(int(value))

# The shape is inferred from the largest row and column index
matrix = coo_matrix((values, (rows, cols)))

print(matrix.todense())

Or do you need a faster one-step implementation? Not sure if this is possible.

Edit #1: You can avoid the inner loop by splitting each line in a single step with a regular expression, which leads to the following alternative implementation:

import re
from scipy.sparse import coo_matrix

rows, cols, values = [], [], []

with open('sparse.txt') as f:
    for i, line in enumerate(f):
        # Split on ':' and whitespace at once, dropping empty tokens
        numbers = list(map(int, filter(None, re.split(r'[:\s]+', line))))
        # extend() keeps the lists flat, so lines may hold different numbers of pairs
        rows.extend([i] * (len(numbers) // 2))
        cols.extend(numbers[0::2])
        values.extend(numbers[1::2])

matrix = coo_matrix((values, (rows, cols)))

print(matrix.todense())

Edit #2: I found an even shorter solution without an explicit loop:

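from scipy.sparse import coo_matrix, vstack

def parseLine(line, n_cols):
    # Tokens alternate between column index and value
    nums = list(map(int, line.split()))
    col_idx, vals = nums[0::2], nums[1::2]
    return coo_matrix((vals, ([0] * len(col_idx), col_idx)), shape=(1, n_cols))

with open('sparse.txt') as f:
    # Drop blank lines (e.g. after a trailing newline) before parsing
    lines = list(filter(str.strip, f.read().replace(':', ' ').splitlines()))

# The matrix width is the largest column index plus one
n_cols = max(map(int, ' '.join(lines).split()[0::2])) + 1
M = vstack(list(map(lambda line: parseLine(line, n_cols), lines)))

print(M.todense())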

The loop is hidden within the map calls that act on lines. I think there is no solution without loops at all, since most built-in functions use them internally and many string-parsing methods like re.finditer only yield iterators.
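
To illustrate how far the looping can be pushed out of sight, here is a sketch that builds all three index arrays with NumPy and calls coo_matrix only once. The iteration still happens inside split, map and np.repeat. The text variable is a stand-in for the contents of 'sparse.txt', and the names counts and flat are made up for this example.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Stand-in for f.read() on 'sparse.txt' (sample data from the question).
text = "0:23 3:41\n1:31 2:65\n"

# Keep only non-blank lines.
lines = list(filter(str.strip, text.splitlines()))

# Number of col:value pairs on each line, used to derive the row indices.
counts = np.array([line.count(":") for line in lines])

# Flatten every token into one integer array: col, value, col, value, ...
tokens = " ".join(lines).replace(":", " ").split()
flat = np.fromiter(map(int, tokens), dtype=int)
cols, values = flat[0::2], flat[1::2]

# Row index i is repeated once for each pair found on line i.
rows = np.repeat(np.arange(len(lines)), counts)

M = coo_matrix((values, (rows, cols)),
               shape=(len(lines), int(cols.max()) + 1))
print(M.toarray())
```

Compared with stacking one small matrix per line, this builds the index arrays once, which may matter for large files, though I have not timed it.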