I have CSV data in which the first column is 'label' and the remaining 784 columns hold a flattened 28*28 image.
I created a tuple of NumPy arrays using the load(filename) function below.
Next, I am trying to split this dataset 80%/20% into training and validation sets. For that, I use the loadData() function below. When I run it, I get the error: could not broadcast input array from shape (5851,784) into shape (5851).
I just want to split the tuple returned by load(filename) into two datasets. Any help?
import csv
import numpy as np

filename = dir_path + 'train1.csv'

def load(filename):
    # read file into a list of rows
    # 'rU' is Python 2 universal-newline mode; use 'r' on Python 3
    with open(filename, 'rU') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)
    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0]) - 1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)
    # fill array with data from the csv-rows
    for i, row in enumerate(rows):
        data[i, :] = row[1:]
        expected[i] = row[0]
    training_data = data, expected
    return training_data

print(load(filename))
Result:
(array([[ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        ...,
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.]]), array([1, 1, 1, ..., 1, 1, 1]))
Run this function to split:
def loadData():
    train_data = load(train_name)
    #test_data = load(test_name)
    training_data, validation_data = np.split(train_data, [int(.8 * len(train_data))])
    return train_data

print(loadData())
Result: could not broadcast input array from shape (5851,784) into shape (5851)
SOLUTION:
import csv
import numpy as np
from sklearn.model_selection import train_test_split

train_name = dir_path + 'train8.csv'
test_name = dir_path + 'test8.csv'

def load(filename):
    # read file into a list of rows
    # 'rU' is Python 2 universal-newline mode; use 'r' on Python 3
    with open(filename, 'rU') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)
    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0]) - 1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)
    # fill array with data from the csv-rows
    for i, row in enumerate(rows):
        data[i, :] = row[1:]
        expected[i] = row[0]
    result_data = data, expected
    return result_data

def loadData():
    # load the training file once and unpack the (data, labels) tuple
    train_data, labels = load(train_name)
    test_data = load(test_name)
    # hold out 20% of the rows for validation (80%/20% split)
    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.2)
    training_data = (x_train, y_train)
    validation_data = (x_test, y_test)
    return (training_data, validation_data, test_data)
This solution returns (training_data, validation_data, test_data), which will match the MNIST data set layout.
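As far as I can tell, the original error comes from passing the whole tuple to np.split: NumPy tries to turn (data, expected) into one array, and the shapes (5851,784) and (5851,) cannot be combined. Note also that len(train_data) is 2 there (the tuple length), not the number of rows. If no shuffling is needed, a minimal sketch of a plain 80/20 slice looks like this. The helper name split_arrays and the synthetic arrays are made up for illustration and stand in for the output of load(filename).

```python
import numpy as np

def split_arrays(data, expected, train_fraction=0.8):
    """Slice features and labels at the same row index (hypothetical helper)."""
    n_train = int(train_fraction * len(data))
    training_data = (data[:n_train], expected[:n_train])
    validation_data = (data[n_train:], expected[n_train:])
    return training_data, validation_data

# Synthetic stand-in for the tuple returned by load(filename)
data = np.zeros((10, 784), dtype=np.float64)
expected = np.arange(10, dtype=np.int64)

training_data, validation_data = split_arrays(data, expected)
print(training_data[0].shape, validation_data[0].shape)
```

This keeps the rows in file order, so if the CSV is sorted by label, shuffle first (or use train_test_split as above, which shuffles by default).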
train_data[:int(.8 * len(train_data))])? Also, you might want to check out pandas.read_csv for loading the CSV file into an array. - JoeCondron
load. You need to unpack these and slice them individually so that you have four arrays: your input variables and output variable for both the training and validation set. You should probably get rid of load entirely, use pandas.read_csv, slice the result 80/20 and then split each of those into your input and output variables. - JoeCondron
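The pandas.read_csv route suggested in the comments could be sketched roughly as follows. This is only an illustration under some assumptions: the file has no header row (load() above also treats the first row as data), the function name load_and_split is hypothetical, and a tiny in-memory CSV with 3 "pixel" columns stands in for the real 784-column file.

```python
import io
import numpy as np
import pandas as pd

def load_and_split(source, train_fraction=0.8):
    """Read the CSV, slice rows 80/20, then split each part into (x, y)."""
    # header=None: assumes the CSV has no header row
    frame = pd.read_csv(source, header=None)
    n_train = int(train_fraction * len(frame))
    train, valid = frame.iloc[:n_train], frame.iloc[n_train:]

    def to_xy(part):
        # column 0 is the label, the remaining columns are pixel values
        x = part.iloc[:, 1:].to_numpy(dtype=np.float64)
        y = part.iloc[:, 0].to_numpy(dtype=np.int64)
        return x, y

    return to_xy(train), to_xy(valid)

# Tiny in-memory CSV standing in for the real file: label, then 3 "pixels"
csv_text = "1,0,0,0\n0,255,0,0\n1,0,128,0\n0,0,0,64\n1,9,9,9\n"
training_data, validation_data = load_and_split(io.StringIO(csv_text))
print(training_data[0].shape, validation_data[0].shape)
```

With a real file, pass the path (for example train_name) instead of the StringIO object. As with any positional slice, the rows are not shuffled.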