Here is another variation using standard modules:
import csv
import re
from collections import defaultdict
from itertools import chain

d = defaultdict(list)
with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])

k = sorted(d.keys())                                         # sorted labels (row headers)
v = sorted(map(int, set(chain.from_iterable(d.values()))))   # sorted unique values (column headers)
e = []
for i in k:                      # iterate in sorted-label order so rows line up with k below
    e.append([0] * len(v))
    for j in d[i]:
        e[-1][int(j) - 1] += 1   # assumes the values are the contiguous integers 1..len(v)
print ' ', re.sub(r'[\[\],]', '', str(v))
for i, j in enumerate(k):
    print j, re.sub(r'[\[\],]', '', str(e[i]))
Given that data.csv holds the input shown in the question, this script prints the following output:
  1 2 3
A 2 1 1
B 3 2 0
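As an aside, the e[-1][int(j) - 1] += 1 line assumes the values are exactly the integers 1..len(v). If that assumption ever stopped holding, a collections.Counter per label would sidestep the positional indexing. Here is a minimal sketch reusing d, k and v from above (the counts name is my own, not part of the script):

from collections import Counter

# one Counter of int values per label; a Counter returns 0 for missing keys,
# so this works even if the values are not contiguous
counts = {label: Counter(int(x) for x in values)
          for label, values in d.items()}
print ' ', ' '.join(map(str, v))
for label in k:
    print label, ' '.join(str(counts[label][x]) for x in v)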
Thanks to @zyxue for a pure pandas solution. It takes a lot less code up front; the harder part is knowing which pandas operations to select. However, the extra coding is not necessarily in vain where run time is concerned. Using timeit in IPython to measure the run time difference between my code and @zyxue's pure-pandas version, I found that my method ran 36 times faster excluding imports and input IO, and 121 times faster when also excluding output IO (the print statements). These tests were done with functions encapsulating each code block. Here are the functions that were tested, using Python 2.7.10 and pandas 0.16.2:
def p():  # 1st pandas function
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)
    print m

def p1():  # 2nd pandas function - omitting the print statement
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)
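For comparison, the same contingency table can also be built in one step with pd.crosstab, which fills missing label/value combinations with 0 on its own, so the fillna step drops out. A minimal sketch, with p2 being my own name rather than one of the functions timed below:

def p2():  # hypothetical variant - pd.crosstab builds the count table directly
    m = pd.crosstab(df['label'], df['value'])
    m.columns.name = None
    m.index.name = None
    print m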
def q():  # 1st standard-modules function
    k = sorted(d.keys())
    v = sorted(map(int, set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0] * len(v))
        for j in d[i]:
            e[-1][int(j) - 1] += 1
    print ' ', re.sub(r'[\[\],]', '', str(v))
    for i, j in enumerate(k):
        print j, re.sub(r'[\[\],]', '', str(e[i]))

def q1():  # 2nd standard-modules function - omitting the print statements
    k = sorted(d.keys())
    v = sorted(map(int, set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0] * len(v))
        for j in d[i]:
            e[-1][int(j) - 1] += 1
Prior to testing, the following code was run to import modules, perform the input IO, and initialize variables for all functions:
import pandas as pd
df = pd.read_csv('data.csv', names=['label', 'value'])

import csv
from collections import defaultdict
from itertools import chain
import re

d = defaultdict(list)
with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])
The contents of the data.csv input file were:
B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2
The test command line for each function was of the form:
%timeit fun()
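For anyone not using IPython: the standard library's timeit module gives a roughly equivalent measurement, since %timeit is a wrapper around it. A minimal sketch (the number and repeat values here are my own illustrative choices, not necessarily what %timeit picked):

import timeit

# time q1 roughly the way %timeit does: best of 3 repeats, many loops each
best = min(timeit.repeat(q1, number=10000, repeat=3)) / 10000
print '%g us per loop' % (best * 1e6)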
Here are the test results:
p(): 100 loops, best of 3: 4.47 ms per loop
p1(): 1000 loops, best of 3: 1.88 ms per loop
q(): 10000 loops, best of 3: 123 µs per loop
q1(): 100000 loops, best of 3: 15.5 µs per loop
These results are only suggestive, and come from one small dataset. In particular, I would expect pandas to perform comparatively better on larger datasets, up to a point.