1 vote

I am trying to read some numbers from a .csv file and store them into a matrix using Python. The input file looks like this

Input File

B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2

The input is to be manipulated into a matrix like this:

Output File

  1 2 3 
A 2 1 1
B 3 2 0

Here, the first column of the input file becomes the row label, the second column becomes the column label, and the value is the count of occurrences of that pair. How should I implement this? My input file is huge (about 1,000,000 rows), so the resulting matrix can have a large number of rows (anywhere between 50 and 10,000) and columns (from 1 to 50).

What have you tried so far? Please add some code showing where you are stuck currently. – albert
How do you read a file? How do you read an individual line? How do you parse a comma-separated line into tokens? How do you associate letters A=1, B=2, etc.? Break complex problems into simpler ones. Your input file format doesn't make much sense. I'd expect three values per line: row #, col #, and value. – duffymo
Possibly a duplicate of this post: stackoverflow.com/questions/30165819/… – Moritz
After re-reading the question, I think the OP wants to count the occurrences of the numbers as well as combine them with the letters, since the desired output seems to be a kind of summary of the occurrences given in the input file. – albert

5 Answers

2 votes

With pandas it becomes easy; the core of it is just three lines:

import pandas as pd

df = pd.read_csv('example.csv', names=['label', 'value'])
# >>> df
#   label  value
# 0     B      1
# 1     A      1
# 2     A      1
# 3     B      1
# 4     A      3
# 5     A      2
# 6     B      1
# 7     B      2
# 8     B      2

s = df.groupby(['label', 'value']).size()
# >>> s
# label  value
# A      1        2
#        2        1
#        3        1
# B      1        3
#        2        2
# dtype: int64

# ref1: http://stackoverflow.com/questions/15751283/converting-a-pandas-multiindex-dataframe-from-rows-wise-to-column-wise
# ref2: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html
m = s.unstack()
# >>> m
# value  1  2   3
# label          
# A      2  1   1
# B      3  2 NaN

# The steps below are optional: just to make it look more like what you want
m.columns.name = None
m.index.name = None
m = m.fillna(0)
print m
#    1  2  3
# A  2  1  1
# B  3  2  0
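
As a side note (not part of the answer's three-line approach), pd.crosstab builds the same count matrix in one call, assuming the same df as above:

m = pd.crosstab(df['label'], df['value'])  # frequency table; missing pairs become 0
m.columns.name = None
m.index.name = None
print m
#    1  2  3
# A  2  1  1
# B  3  2  0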

0 votes

My solution does not seem very efficient for a huge amount of input data, since I am doing a lot of things manually that could probably be done by some of pandas' DataFrame methods.

However, this does the job:

#!/usr/bin/env python3
# coding: utf-8

import pandas as pd
from collections import Counter

with open('foo.txt') as f:
    l = f.read().splitlines()

numbers_list = []
letters_list = []

for element in l:
    letter = element.split(',')[0]
    number = element.split(',')[1]
    if number not in numbers_list:
        numbers_list.append(number)
    if letter not in letters_list:
        letters_list.append(letter)

c = Counter(l)
d = dict(c)

output = pd.DataFrame(columns=sorted(numbers_list), index=sorted(letters_list))

for col in numbers_list:
    for row in letters_list:
        key = '{},{}'.format(row, col)
        if key in d:
            output[col][row] = d[key]
        else:
            output[col][row] = 0

The output is as desired:

   1  2  3
A  2  1  1
B  3  2  0
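
As a side note (not in the original answer), the fill-in loop can also be written with label-based indexing, which avoids the chained output[col][row] assignment; a minimal sketch assuming the same output, letters_list, numbers_list and d built above:

for row in sorted(letters_list):
    for col in sorted(numbers_list):
        key = '{},{}'.format(row, col)
        # .loc sets the cell in place; d.get defaults missing keys to 0
        output.loc[row, col] = d.get(key, 0)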

0 votes

The following solution uses just standard Python modules:

import csv, collections, itertools

with open('my.csv', 'r') as f_input:
    counts = collections.Counter()
    for cols in csv.reader(f_input):
        counts[(cols[0], cols[1])] += 1

keys = set(key[0] for key in counts.keys())
# the distinct numbers from the second column become the matrix columns
values = set(int(key[1]) for key in counts.keys())

d = {}
for k in itertools.product(keys, values):
    d[(k[0], str(k[1]))] = 0

d.update(dict(counts))

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)

    # Write the header, 'X' is whatever you want the first column called
    csv_output.writerow(['X'] + sorted(values))

    # Write the rows
    for k, g in itertools.groupby(sorted(d.items()), key=lambda x: x[0][0]):
        csv_output.writerow([k] + [col[1] for col in g])

This gives you an output CSV file looking like:

X,1,2,3
A,2,1,1
B,3,2,0
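
Note that the 'wb' file mode is a Python 2 idiom; under Python 3 the csv module expects the output file to be opened as open('output.csv', 'w', newline='') instead.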

0 votes

Here is another variation using standard modules:

import csv
import re
from collections import defaultdict
from itertools import chain

d = defaultdict(list)

with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])

k = sorted(d.keys())
v = sorted(map(int,set(chain.from_iterable(d.values()))))

e = []
for i in k:  # iterate in sorted label order so each row lines up with its printed label
    e.append([0]*len(v))
    for j in d[i]:
        e[-1][int(j)-1] += 1

print ' ', re.sub(r'[\[\],]','',str(v))
for i, j in enumerate(k):
    print j, re.sub(r'[\[\],]','',str(e[i]))

Given data.csv has the contents of the input file shown in the question, this script prints the following as output:

  1 2 3
A 2 1 1
B 3 2 0

Thanks to @zyxue for a pure pandas solution. It takes a lot less code up front; the question then is which approach to choose. However, the extra coding is not necessarily in vain where run-time performance is concerned. Using timeit in IPython to measure the run-time difference between my code and @zyxue's pure pandas code, I found that my method ran 36 times faster (excluding imports and input I/O) and 121 times faster when also excluding output I/O (print statements). These tests were done with functions that encapsulate each code block. Here are the functions that were tested, using Python 2.7.10 and Pandas 0.16.2:

def p(): # 1st pandas function
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)
    print m

def p1(): # 2nd pandas function - omitting print statement
    s = df.groupby(['label', 'value']).size()
    m = s.unstack()
    m.columns.name = None
    m.index.name = None
    m = m.fillna(0)

def q(): # first std mods function
    k = sorted(d.keys())
    v = sorted(map(int,set(chain.from_iterable(d.values()))))
    e = []
    for i in k:  # sorted label order, matching the printed labels
        e.append([0]*len(v))
        for j in d[i]:
            e[-1][int(j)-1] += 1       
    print ' ', re.sub(r'[\[\],]','',str(v))
    for i, j in enumerate(k):
        print j, re.sub(r'[\[\],]','',str(e[i]))

def q1(): # 2nd std mods function - omitting print statements
    k = sorted(d.keys())
    v = sorted(map(int,set(chain.from_iterable(d.values()))))
    e = []
    for i in k:
        e.append([0]*len(v))
        for j in d[i]:
            e[-1][int(j)-1] += 1

Prior to testing, the following code was run to import modules, perform input I/O and initialize variables for all of the functions:

import pandas as pd
df = pd.read_csv('data.csv', names=['label', 'value'])

import csv
from collections import defaultdict
from itertools import chain
import re

d = defaultdict(list)

with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        d[row[0]].append(row[1])

The contents of the data.csv input file were:

B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2

The test command line for each function was of the form:

%timeit fun()

Here are the test results:

p(): 100 loops, best of 3: 4.47 ms per loop

p1(): 1000 loops, best of 3: 1.88 ms per loop

q(): 10000 loops, best of 3: 123 µs per loop

q1(): 100000 loops, best of 3: 15.5 µs per loop

These results are only suggestive, and come from one small dataset. In particular, I would expect pandas to perform comparatively better for larger datasets, up to a point.
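
For completeness (this was not part of the original test), roughly the same measurements can be repeated outside IPython with the standard timeit module, assuming p1 and q1 are defined at module level:

import timeit
# total time in seconds for 1000 calls of each function
print timeit.timeit('p1()', setup='from __main__ import p1', number=1000)
print timeit.timeit('q1()', setup='from __main__ import q1', number=1000)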

0 votes

Here is a way to do it with MapReduce using Hadoop streaming, where the mapper and reducer scripts both read from stdin.

The mapper script is mostly an input mechanism; it filters the input to remove improper data. The advantages are that the input can be split over multiple mapper processes, with the total output automatically sorted and forwarded to a reducer, plus the possibility of running combiners locally on the mapper nodes. Combiners are essentially intermediate reducers, useful for speeding up reduction through parallelism over a cluster.

# mapper script
import sys
import re

# mapper
for line in sys.stdin:
    line = line.strip()
    word = line.split()[0]
    if word and re.match(r'\A[a-zA-Z]+,[0-9]+', word):
        # emit the validated 'letter,number' pair unchanged
        print word

The reducer script gets the sorted output from all of the mappers, builds an intermediate dict for each input key such as A or B, which is called 'prefix' in the code, and outputs the results in CSV format.

# reducer script
from collections import defaultdict 
import sys

def output(s,d):
    """
    this function takes a string s and a dictionary d with int keys and values,
    sorts the keys, and builds a string of comma-separated values ordered
    by the keys, inserting runs of comma-separated zeros equal in
    number to the difference between successive keys minus one
    """
    v = sorted(d.keys())
    o = str(s) + ','
    lastk = 0
    for k in v:
        o += '0,'*(k-lastk-1) + str(d[k]) + ','
        lastk = k
    return o

prefix = ''
current_prefix = ''
d = defaultdict(int)
maxkey = 0

for line in sys.stdin:
    line = line.strip()
    prefix,value = line.split(',')

    try:
        value = int(value)
    except ValueError:
        continue

    if current_prefix == prefix:
        d[value] += 1
    else:
        if current_prefix:
            if len(d) > 0:
                print output(current_prefix,d)
                t = max(d.keys())
                if t > maxkey:
                    maxkey = t
            d = defaultdict(int)
        current_prefix = prefix                   
        d[value] += 1       

# output info for last prefix if needed
if current_prefix == prefix:
    print output(prefix,d)
    t = max(d.keys())
    if t > maxkey:
        maxkey = t    

# output csv list of keys from 1 through maxkey
h = ' ,'
for i in range(1,maxkey+1):
    h += str(i) + ','
print h
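
Outside Hadoop, the whole job can be checked locally with an ordinary shell pipeline such as cat data.csv | python mapper.py | sort | python reducer.py, since the streaming framework essentially just sorts the mapper output before handing it to the reducer (the script names are simply whatever the two scripts above are saved as).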

To trace the data through the streaming process, suppose the mapper gets:

B,1
A,1
A,1
B,1
A,3
A,2
B,1
B,2
B,2

It outputs the same content directly, which then all gets sorted (shuffled) and sent to a reducer. In this example, this is what the reducer gets:

A,1
A,1
A,2
A,3
B,1
B,1
B,1
B,2
B,2

Finally the output of the reducer is:

A,2,1,1,
B,3,2,
 ,1,2,3,    

For larger data sets, the input file would be split with portions containing all data for some sets of keys going to separate mappers. Using a combiner on each mapper node would save overall sorting time. There would still be a need for a single reducer so that the output is totally sorted by key. If that's not a requirement, multiple reducers could be used.

For practical reasons I made a couple of choices. First, each line of output only goes up to the highest integer seen for a key, and trailing zeros are not printed, because there is no way to know how many to write until all of the input has been processed; for large input that would mean storing a large amount of intermediate data in memory, or slowing down processing by writing it out to disk and reading it back in to complete the job. Second, and for the same reason, the header line cannot be written until just before the end of the reduce job, so that's when it's written. It may be possible to prepend it to the output file (or to the first one, if the output has been split), and that can be investigated in due course. However, given the large speedup from parallel processing on massive input, these are minor issues.

This method will work with relatively minor but crucial modifications on a Spark cluster and can be converted to Java or Scala to improve performance if necessary.
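
For illustration only (not a tested port), the counting step translates to PySpark roughly like this, assuming an existing SparkContext sc and a hypothetical input path; formatting the final matrix is left out:

# count (letter, number) pairs in parallel; collect() pulls the small result back
counts = (sc.textFile('hdfs:///path/to/data.csv')
            .map(lambda line: (tuple(line.strip().split(',')), 1))
            .reduceByKey(lambda a, b: a + b)
            .collect())
# counts is a list like [(('A', '1'), 2), (('B', '2'), 2), ...]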