
I have many CSV files, each with multiple rows and columns of mostly floating-point numbers (some columns are categorical but one-hot encoded). Each CSV file represents one training example and contains both the dependent and the independent variables. (This is unlike a typical machine learning problem, where each row contains all the information needed to predict y1, y2, y3 for that row. Here, all rows of x1 to x8 combined predict all rows of y1 to y3 combined.) Hence each CSV becomes one training example.

[Image: representation of one such CSV file]

Please note that the length/size of each csv varies.

I want to build a simple ANN or any other neural network model. I am having trouble processing the input data. Since each CSV is a single training example, in which format should I store the data so that I can pass it to a neural network?

Thanks in advance, skw

What is the data about? Are the rows or columns related in any way? Please provide additional information. Do you get 3 outputs (y1, y2, y3) for 8 input attributes (x1-x8)? - Suraj Subramanian

1 Answer


Let's say you have some .csv files, all with the same data format, stored in a folder named data.

First, you can use glob to collect the filenames, then use pandas to read each CSV and convert it to a NumPy array.

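import glob
import pandas as pd

# read every CSV file in the folder into its own NumPy array
csv = []
for f in glob.glob('path/*.csv'):
    csv.append(pd.read_csv(f).to_numpy())  # to_numpy is a method, so it must be called

print(csv[0].shape)

# it should print (num_rows_csv, 11), since there are 11 columns

# now, the first 8 columns are features, and the last 3 columns are the response

X = []
y = []
for arr in csv:
    X.append(arr[:, 0:8])  # all rows, columns 0-7 (x1 to x8)
    y.append(arr[:, 8:])   # all rows, columns 8-10 (y1 to y3)

# The row count differs between files, so X and y stay as lists of 2-D arrays.
# np.array(X) would only give a regular 3-D array if every file had the same number of rows.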

Now you can train on this data with a CNN, an LSTM, or any other model that accepts variable-length input (or after padding the examples to a common length).
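
Since the question says the number of rows varies from file to file, the per-file arrays cannot be stacked directly into one batch. One common workaround is to zero-pad every example to the longest length and keep a mask of the real rows. Here is a minimal sketch of that idea, assuming NumPy only; pad_examples and the synthetic arrays are hypothetical names made up for illustration, not part of any library:

```python
import numpy as np


def pad_examples(arrays, pad_value=0.0):
    """Stack 2-D arrays with differing row counts into one 3-D array.

    Returns (padded, mask):
      padded has shape (n_examples, max_rows, n_cols)
      mask   has shape (n_examples, max_rows) and is True on real rows
    """
    max_rows = max(a.shape[0] for a in arrays)
    n_cols = arrays[0].shape[1]  # assumes every array has the same column count
    padded = np.full((len(arrays), max_rows, n_cols), pad_value, dtype=np.float32)
    mask = np.zeros((len(arrays), max_rows), dtype=bool)
    for i, a in enumerate(arrays):
        padded[i, :a.shape[0], :] = a   # copy the real rows
        mask[i, :a.shape[0]] = True     # mark them as valid
    return padded, mask


# Synthetic stand-ins for two CSV files with 3 and 5 rows, 8 feature columns each
X_list = [np.ones((3, 8)), np.ones((5, 8))]
X_padded, X_mask = pad_examples(X_list)
print(X_padded.shape)       # (2, 5, 8)
print(X_mask.sum(axis=1))   # [3 5]
```

The same function can be applied to the list of response arrays. Whether the model should ignore the padded rows (using the mask) depends on the architecture you pick, so treat this as a starting point rather than the definitive preprocessing step.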