I am new to Python. I was reading this Kaggle kernel.
In that kernel, the author reads the training data with a chunksize of 150_000:
train = pd.read_csv('../input/train.csv', iterator=True, chunksize=150_000, dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
I visualized X_train (statistical features) against y_train (the given time_to_failure) in Python, and it gave me good visualizations:
import numpy as np
import pandas as pd

train = pd.read_csv('../input/train.csv', iterator=True, chunksize=150_000, dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
X_train = pd.DataFrame()
y_train = pd.Series(dtype=np.float64)
for df in train:
    # one row of statistical features per 150_000-row chunk
    ch = gen_features(df['acoustic_data'])
    X_train = X_train.append(ch, ignore_index=True)
    # target = last time_to_failure value in the chunk
    y_train = y_train.append(pd.Series(df['time_to_failure'].values[-1]))
# Visualization function
plotstatfeature(X_train, y_train.to_numpy(dtype='float32'))
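Here gen_features turns one chunk of acoustic_data into a single row of statistical features. A minimal sketch of such a function (the particular statistics below are my own assumption, not necessarily the kernel's):

import pandas as pd

def gen_features(x):
    # x: one 150_000-sample 'acoustic_data' Series;
    # the chosen statistics are only an assumed example
    return pd.DataFrame([{
        'mean': x.mean(),
        'std': x.std(),
        'min': x.min(),
        'max': x.max(),
        'skew': x.skew(),
        'kurtosis': x.kurtosis(),
    }])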
For the test data, I plotted the same visualizations between X_test (statistical features) and y_hat (predicted time_to_failure) using the same function:
submission = pd.read_csv('../input/sample_submission.csv', index_col='seg_id')
X_test = pd.DataFrame()
# prepare test data: one row of features per test segment file
for seg_id in submission.index:
    seg = pd.read_csv('../input/test/' + seg_id + '.csv')
    ch = gen_features(seg['acoustic_data'])
    X_test = X_test.append(ch, ignore_index=True)
X_test = scaler.transform(X_test)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
y_hat = model.predict(X_test)
submission['time_to_failure'] = y_hat
submission.to_csv('submission.csv')
# Visualization function (y_hat from model.predict is already a NumPy
# array, and the features must be 2-D again for plotting)
plotstatfeature(X_test.reshape(X_test.shape[0], -1), y_hat.astype('float32'))
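plotstatfeature scatters each statistical feature against the target. Roughly like this (a sketch of my own, not the actual function):

import matplotlib.pyplot as plt
import pandas as pd

def plotstatfeature(X, y):
    # scatter every feature column against the target values;
    # assumes X is 2-D (DataFrame or array) and y is 1-D
    X = pd.DataFrame(X)
    for col in X.columns:
        plt.figure()
        plt.scatter(X[col], y, s=2)
        plt.xlabel(str(col))
        plt.ylabel('time_to_failure')
        plt.show()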
Question 1:
Is it meaningful to visualize X_test (statistical features) against y_hat (predicted time_to_failure)?
Question 2 (main question):
The visualizations of the test data are not as clear as those of the train data: the train data is read in chunks of 150,000 rows, which gives a clear visualization, while the test data is read whole, which gives a denser, less clear visualization. How can I read the test data with the same chunksize of 150,000, so its visualization is uniform with the train data's?
To convert the test data to the same chunksize of 150,000, I tried to modify the reading code by introducing iterator and chunksize.
First case:
submission = pd.read_csv('../input/sample_submission.csv', index_col='seg_id' , iterator=True, chunksize=150_000)
But it gave me this error:
Traceback (most recent call last):
  File "", line 1, in runfile('D:/code.py', wdir='D:/')
  File "C:\Users\abc\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\abc\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "D:/code.py", line 299, in main()
  File "D:/code.py", line 239, in main
    test(X_train, y_train)
  File "D:/code.py", line 168, in test
    for seg_id in submission.index:
AttributeError: 'TextFileReader' object has no attribute 'index'
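From what I understand, read_csv with iterator=True/chunksize returns a TextFileReader instead of a DataFrame, so there is no .index attribute until a chunk is actually materialized. Something like this is what the error seems to point at:

reader = pd.read_csv('../input/sample_submission.csv', index_col='seg_id', iterator=True, chunksize=150_000)
print(type(reader))          # a TextFileReader, not a DataFrame
chunk = reader.get_chunk()   # only a materialized chunk has an .index
print(chunk.index[:3])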
Second case:
seg = pd.read_csv('test/' + seg_id + '.csv', iterator=True, chunksize=150000)
It gave me this error:
Traceback (most recent call last):
  File "", line 1, in runfile('D:/code.py', wdir='D:/')
  File "C:\Users\abc\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\abc\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "D:/code.py", line 299, in main()
  File "D:/code.py", line 239, in main
    test(X_train, y_train)
  File "D:/code.py", line 170, in test
    ch = gen_features(seg['acoustic_data'])
TypeError: 'TextFileReader' object is not subscriptable
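Again the reader itself is not subscriptable; apparently a chunk has to be pulled out of it first before column access works, e.g.:

seg_reader = pd.read_csv('test/' + seg_id + '.csv', iterator=True, chunksize=150_000)
seg = seg_reader.get_chunk()             # first 150_000 rows as a DataFrame
ch = gen_features(seg['acoustic_data'])  # the DataFrame is subscriptable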
How can I introduce the chunksize into the test data?