
I have a CSV file. The first column is 'label', and the remaining 784 columns hold a flattened 28*28 image.

I build a tuple of NumPy arrays from it using the load() function shown below.

Next, I am trying to split this dataset 80%/20% into training and validation sets, using the loadData() function shown below. When I run it, I get the error: could not broadcast input array from shape (5851,784) into shape (5851).

My question: I just want to split the tuple returned by load(filename) into two datasets. Any help?

import csv
import numpy as np

filename = dir_path + 'train1.csv'

def load(filename):
    # read file into a list of rows
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)

    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0]) - 1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)

    # fill the arrays with data from the csv rows:
    # column 0 is the label, columns 1..784 are the pixels
    for i, row in enumerate(rows):
        data[i, :] = row[1:]
        expected[i] = row[0]

    training_data = data, expected
    return training_data

print(load(filename))

Result

 (array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ..., 
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.]]), array([1, 1, 1, ..., 1, 1, 1]))

Run this function to split:

def loadData():
    train_data = load(filename)
    #test_data = load(test_name)

    training_data, validation_data = np.split(train_data, [int(.8 * len(train_data))])

    return train_data

print(loadData())

Result: could not broadcast input array from shape (5851,784) into shape (5851)

SOLUTION:

import csv
import numpy as np
from sklearn.model_selection import train_test_split

train_name = dir_path + 'train8.csv'
test_name = dir_path + 'test8.csv'

def load(filename):
    # read file into a list of rows
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)

    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0]) - 1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)

    # fill the arrays with data from the csv rows
    for i, row in enumerate(rows):
        data[i, :] = row[1:]
        expected[i] = row[0]

    result_data = data, expected
    return result_data

def loadData():
    # load the training file once and unpack samples and labels
    train_data, labels = load(train_name)
    test_data = load(test_name)

    # hold out 20% of the training rows for validation (80/20 split)
    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.2)

    training_data = (x_train, y_train)
    validation_data = (x_test, y_test)

    return (training_data, validation_data, test_data)

This solution returns (training_data, validation_data, test_data), which matches the layout of the MNIST data set.

Why not just slice the data: train_data[:int(.8 * len(train_data))]? Also, you might want to check out pandas.read_csv for loading the CSV file into an array. - JoeCondron
So if I run that function, it splits off the first array only; I do not know where the other array from the tuple goes: (array([[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]]),) - lpt
The problem is that you are returning two arrays from load. You need to unpack these and slice them individually so that you have four arrays: your input variables and output variable for both the training and validation set. You should probably get rid of load entirely, use pandas.read_csv, slice the result 80/20, and then split each of those into your input and output variables. - JoeCondron
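
The pandas route suggested in the last comment could look roughly like the sketch below. It is only an illustration under some assumptions: the CSV text is a made-up stand-in with 4 pixel columns instead of 784, the file is assumed to have no header row (the original load() treats every row as data), and the variable names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: label first, then pixel values.
# Only 4 pixel columns are used here instead of 784 to keep it short.
csv_text = "\n".join(
    "{0},{1},{2},{3},{4}".format(i % 10, i, i + 1, i + 2, i + 3)
    for i in range(10)
)

# header=None because the file is assumed to have no header row.
frame = pd.read_csv(io.StringIO(csv_text), header=None)

# Slice the rows 80/20 first ...
cut = int(0.8 * len(frame))
train_frame, valid_frame = frame.iloc[:cut], frame.iloc[cut:]

# ... then separate each part into inputs (pixels) and output (label).
x_train = train_frame.iloc[:, 1:].to_numpy(dtype="float64")
y_train = train_frame.iloc[:, 0].to_numpy(dtype="int64")
x_valid = valid_frame.iloc[:, 1:].to_numpy(dtype="float64")
y_valid = valid_frame.iloc[:, 0].to_numpy(dtype="int64")

print(x_train.shape, y_train.shape, x_valid.shape, y_valid.shape)
```

With a real file, the io.StringIO object would be replaced by the file path. Note that this keeps the rows in file order; if the file is sorted by label, shuffling before the cut is probably needed.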

1 Answer


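From what I understand, you are passing a tuple consisting of one matrix and one array of a different shape to np.split. np.split expects a single array, and as far as I can tell NumPy cannot combine a (5851, 784) matrix and a (5851,) vector into one array, which is why you get the broadcast error. It works fine if you give np.split a single matrix: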

import numpy as np

train_data = np.zeros((5000, 784))
labels = np.zeros(5000)

# split the rows at the 80% mark
train, test = np.split(train_data, [int(0.8 * len(train_data))])
print("Train: {0}, Test: {1}".format(train.shape, test.shape))

This gives the following output:

Train: (4000, 784), Test: (1000, 784)

While if you pass it a tuple of a matrix and an array:

train_data = np.zeros((5000, 784))
labels = np.zeros(5000)

train, test = np.split((train_data, labels), [int(0.8 * len(train_data))])

You get the broadcast error (the exact wording may differ between NumPy versions):

ValueError: could not broadcast input array from shape (5000,784) into shape (5000)
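
If you would rather stay with np.split, one option is to unpack the tuple and split each array separately at the same row index. The sketch below also shows a second pitfall in the original loadData(): len() of the tuple is 2 (its number of members), not the number of rows, so the split index has to be computed from one of the arrays. The names dataset, samples, targets and cut are made up for this illustration.

```python
import numpy as np

train_data = np.zeros((5000, 784))
labels = np.zeros(5000)
dataset = (train_data, labels)

# len() of the tuple counts its members, not the rows.
print(len(dataset))     # 2
print(len(dataset[0]))  # 5000

# Unpack first, then cut both arrays at the same row index.
samples, targets = dataset
cut = int(0.8 * len(samples))
x_train, x_valid = np.split(samples, [cut])
y_train, y_valid = np.split(targets, [cut])

print(x_train.shape, y_train.shape)  # (4000, 784) (4000,)
print(x_valid.shape, y_valid.shape)  # (1000, 784) (1000,)
```

Because both arrays are cut at the same index, row i of x_train still belongs to label i of y_train.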

If you want to split a dataset together with its labels, I would suggest using something like train_test_split from scikit-learn (available through pip install scikit-learn), which handles both observations and labels in a single call and keeps them aligned:

import numpy as np
from sklearn.model_selection import train_test_split

def loadData():

    train_data = np.zeros((5000, 784))
    labels = np.zeros(5000)
    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.22)

    print("Training samples: {0}, training labels: {1}".format(x_train.shape, y_train.shape))
    print("Validation samples: {0}, validation labels: {1}".format(x_test.shape, y_test.shape))

if __name__ == "__main__":
    loadData()

Output:

Training samples: (3900, 784), training labels: (3900,)
Validation samples: (1100, 784), validation labels: (1100,)
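
If scikit-learn is not available, a shuffled split can also be sketched with NumPy alone. This is not how train_test_split is implemented, just a minimal stand-in under stated assumptions: shuffled_split, validation_fraction and seed are hypothetical names, and the toy arrays replace the real image data.

```python
import numpy as np


def shuffled_split(samples, targets, validation_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle the rows, then cut both arrays at the same index."""
    if len(samples) != len(targets):
        raise ValueError("samples and targets must have the same number of rows")

    # One shared permutation keeps every sample paired with its label.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))

    n_valid = int(round(validation_fraction * len(samples)))
    cut = len(samples) - n_valid
    train_idx, valid_idx = order[:cut], order[cut:]

    training = (samples[train_idx], targets[train_idx])
    validation = (samples[valid_idx], targets[valid_idx])
    return training, validation


# Toy data: row i holds [2*i, 2*i + 1] and has label i.
samples = np.arange(20, dtype=np.float64).reshape(10, 2)
targets = np.arange(10)

(x_train, y_train), (x_valid, y_valid) = shuffled_split(samples, targets)
print(x_train.shape, y_train.shape, x_valid.shape, y_valid.shape)
```

Fixing the seed makes the split reproducible between runs, which helps when comparing models on the same validation set.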