1
votes

I have a numpy array of dtype=object whose elements are actually lists of mixed data types, so it is effectively a 2D array (an array of lists). I want to copy every row, but only certain columns, of this array to another array. The data comes from a CSV file that contains several fields (columns) and a large number of rows. Here's the code I used to load the data into the array:

data = np.zeros((401125,), dtype = object)
for i, row in enumerate(csv_file_object):
    data[i] = row

The data looks roughly like this:

column1  column2  column3  column4  column5 ....
1         none     2       'gona'    5.3
2         34       2       'gina'    5.5
3         none     2       'gana'    5.1
4         43       2       'gena'    5.0
5         none     2       'guna'    5.7
.....     ....   .....      .....    ....
.....     ....   .....      .....    ....
.....     ....   .....      .....    ....

There are unwanted fields in the middle that I want to remove. Suppose I don't want column3. How do I remove only that column from my array, or copy only the relevant columns to another array?

3
Are you looking to process the CSV input before it gets into the numpy array, or to remove columns from the array after it's been created? (Or just "whichever is easier" or "whichever is faster"?) - abarnert
@maheshakyha: Then I think root's answer is the easiest. If you can't/don't want to replace your reading with pandas.read_csv, then probably my numpy.delete is easiest, but I think you're better off with his answer. - abarnert

3 Answers

4
votes

Use pandas. For mixed-type data like yours, a pandas.DataFrame is probably a better fit than an object array anyway.

from io import StringIO

import pandas as pd

data = """column1  column2  column3  column4  column5
1         none     2       'gona'    5.3
2         34       2       'gina'    5.5
3         none     2       'gana'    5.1
4         43       2       'gena'    5.0
5         none     2       'guna'    5.7"""

# Parse the whitespace-delimited text, then drop the unwanted column.
df = pd.read_csv(StringIO(data), sep=r"\s+").drop('column3', axis=1)
print(df)

out:

   column1 column2 column4  column5
0        1    none  'gona'      5.3
1        2      34  'gina'      5.5
2        3    none  'gana'      5.1
3        4      43  'gena'      5.0
4        5    none  'guna'      5.7
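
If the unwanted column is known before loading, it may be simpler to never read it at all. The following is a minimal sketch using the usecols parameter of read_csv; the names raw, wanted and df_subset are made up for illustration, and the sample is a shortened copy of the data above.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the sample data, inlined so the snippet runs on its own.
raw = """column1  column2  column3  column4  column5
1         none     2       'gona'    5.3
2         34       2       'gina'    5.5
3         none     2       'gana'    5.1"""

# usecols keeps only the listed columns, so column3 is never loaded.
wanted = ["column1", "column2", "column4", "column5"]
df_subset = pd.read_csv(StringIO(raw), sep=r"\s+", usecols=wanted)

print(list(df_subset.columns))
```

For a file with many rows this also avoids building the dropped column in memory first.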

If you need an array instead of a DataFrame, use the to_records() method:

df.to_records(index = False)
#output:
rec.array([(1L, 'none', "'gona'", 5.3),
           (2L, '34', "'gina'", 5.5),
           (3L, 'none', "'gana'", 5.1),
           (4L, '43', "'gena'", 5.0),
           (5L, 'none', "'guna'", 5.7)], 
            dtype=[('column1', '<i8'), ('column2', '|O4'),
                   ('column4', '|O4'), ('column5', '<f8')])

3
votes

Assuming you're reading the CSV rows and sticking them into a numpy array, the easiest and best solution is almost definitely preprocessing the data before it gets to the array, as Maciek D.'s answer shows. (If you want to do something more complicated than "remove column 3" you might want something like [value for i, value in enumerate(row) if i not in (1, 3, 5)], but the idea is still the same.)
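
As a rough sketch of that preprocessing idea, assuming the input is comma-separated and read with the built-in csv module (csv_text and unwanted are hypothetical names, and the inline string stands in for the real file):

```python
import csv
from io import StringIO

import numpy as np

# Hypothetical comma-separated input; a real file would be opened with open(...).
csv_text = "1,none,2,gona,5.3\n2,34,2,gina,5.5\n3,none,2,gana,5.1\n"
unwanted = {2}  # zero-based index of column3

# Drop the unwanted positions from each row while reading.
rows = [
    [value for i, value in enumerate(row) if i not in unwanted]
    for row in csv.reader(StringIO(csv_text))
]

# Fill a 1-D object array row by row, as in the question.
data = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    data[i] = row

print(data[0])
```

Note that csv.reader yields every field as a string, so any numeric conversion would still have to be done separately.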

However, if you've already imported the array and you want to filter it after the fact, you probably want take or delete:

>>> import numpy as np
>>> d = np.array([[1, None, 2, 'gona', 5.3],
...               [2, 34, 2, 'gina', 5.5],
...               [3, None, 2, 'gana', 5.1],
...               [4, 43, 2, 'gena', 5.0],
...               [5, None, 2, 'guna', 5.7]], dtype=object)
>>> np.delete(d, 2, axis=1)  # drop the column at index 2 (column3)
array([[1, None, 'gona', 5.3],
       [2, 34, 'gina', 5.5],
       [3, None, 'gana', 5.1],
       [4, 43, 'gena', 5.0],
       [5, None, 'guna', 5.7]], dtype=object)
>>> np.take(d, [0, 1, 3, 4], axis=1)  # keep only the listed columns
array([[1, None, 'gona', 5.3],
       [2, 34, 'gina', 5.5],
       [3, None, 'gana', 5.1],
       [4, 43, 'gena', 5.0],
       [5, None, 'guna', 5.7]], dtype=object)

For the simple case of "remove column 3", delete makes more sense; for a more complicated case, take probably makes more sense.
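
One caveat: the array built in the question is a 1-D object array whose elements are lists, so it has no column axis for delete or take to work on. A minimal sketch of converting it first, assuming every row has the same length (rows, table, without_col3 and kept are illustrative names):

```python
import numpy as np

# Rebuild the question's layout: a 1-D object array whose elements are lists.
rows = [
    [1, None, 2, "gona", 5.3],
    [2, 34, 2, "gina", 5.5],
    [3, None, 2, "gana", 5.1],
]
data = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    data[i] = row

# data.shape is (3,), so convert to a real 2-D object array first.
table = np.array(data.tolist(), dtype=object)

without_col3 = np.delete(table, 2, axis=1)  # drop a single column
kept = table[:, [0, 3, 4]]                  # keep an arbitrary subset of columns

print(table.shape, without_col3.shape, kept.shape)
```

Indexing with a list of column positions, as in the last line, does the same job as take along axis 1.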

If you haven't yet worked out how to import the data in the first place, you could either use the built-in csv module and something like Maciek D.'s code and process as you go, or use something like pandas.read_csv and post-process the result, as root's answer shows.

But it might be better to use a native numpy data format in the first place instead of CSV.
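
For instance, once the data has been cleaned up, it could be stored with np.save and reloaded with np.load. This is only a sketch: the path is a hypothetical temporary location, and because the array has dtype=object it is stored via pickle, which is why allow_pickle=True is needed on load.

```python
import os
import tempfile

import numpy as np

table = np.array([[1, None, "gona", 5.3],
                  [2, 34, "gina", 5.5]], dtype=object)

# Hypothetical file location; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "table.npy")

# Object arrays are pickled on save, so loading requires allow_pickle=True.
np.save(path, table)
restored = np.load(path, allow_pickle=True)

print(restored.shape)
```

Pickled files should only be loaded from sources you trust.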

1
votes

You can use slicing (range selection). For example, to remove column3:

data = np.zeros((401125,), dtype = object)
for i, row in enumerate(csv_file_object):
    data[i] = row[:2] + row[3:]

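This will work, assuming that csv_file_object yields lists. If it is, e.g., a plain file object created with csv_file_object = open("file.csv"), add a split call to your loop (split() with no arguments splits on whitespace; use row.split(',') for comma-separated data):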

data = np.zeros((401125,), dtype = object)
for i, row in enumerate(csv_file_object):
    row = row.split()
    data[i] = row[:2] + row[3:]