
I'm currently applying Tensorflow to the Titanic machine learning problem on Kaggle: https://www.kaggle.com/c/titanic

My training data is 891 by 8 (891 data points and 8 features). The goal is to predict whether a passenger on the Titanic survived or not. So it's a binary classification problem.

I'm using a single layer neural network. This is my cost function:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))

This is my optimizer:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=momentum).minimize(cost)

Here is my question/problem:

I tried submitting some predictions made by the neural network to Kaggle, and so far all my attempts have 0% accuracy. However, when I replaced the predictions for the first 10 passengers with the predictions made by RandomForestClassifier() from scikit-learn, the accuracy jumped to 50%.

My guess is that the neural network's poor performance is caused by inadequate training data. So I was thinking about adding noise to the input data, but I don't really know how.
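Something like this is what I had in mind, in case it helps anyone answer (a sketch: Gaussian jitter on the continuous columns only, with an arbitrarily chosen noise scale and made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

# X: continuous feature columns only (e.g. 'Age' and 'Fare'); values made up.
X = np.array([[22.0, 7.25],
              [38.0, 71.28],
              [26.0, 7.92]])

# Zero-mean Gaussian noise scaled to 5% of each column's std dev,
# stacked under the originals to double the number of samples.
noise = rng.normal(0.0, 0.05 * X.std(axis=0), size=X.shape)
X_augmented = np.vstack([X, X + noise])
print(X_augmented.shape)
```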

My 8 features of the training data are: ['Pclass', 'Sex', 'Age', 'Fare', 'Child', 'Fam_size', 'Title', 'Mother']. Some are categorical and some are continuous.
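For reference, here is a minimal sketch of how the categorical columns can be one-hot encoded with pandas (the values below are made up; my actual preprocessing may differ):

```python
import pandas as pd

# Toy frame mixing categorical and continuous columns from my feature list.
df = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# One-hot encode the categorical columns; the continuous ones pass through.
encoded = pd.get_dummies(df, columns=["Pclass", "Sex"])
print(sorted(encoded.columns))
```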

Any ideas/links are much appreciated! Thanks a lot in advance.

EDIT:

I found what was wrong with my submissions. For some reason my predictions were all floats instead of ints, so I converted them:

result_df = result_df.astype(int)
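Note for anyone who hits the same thing: `astype` returns a new DataFrame rather than converting in place, so the result has to be assigned back (toy values below):

```python
import pandas as pd

result_df = pd.DataFrame({"PassengerId": [892, 893], "Survived": [1.0, 0.0]})

result_df.astype(int)               # no effect: the converted copy is discarded
assert result_df["Survived"].dtype.kind == "f"

result_df = result_df.astype(int)   # assign the copy back
assert result_df["Survived"].dtype.kind == "i"
```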

Thank you everyone for pointing out that my submission format is wrong.

Are you sure your output matches the expected format? It's very hard, even for a very bad model, to get 0% accuracy. – polku
Yes, I checked the .csv files and they matched the exact format... at first I thought the format was wrong too. I even checked the English spelling and whether the output is limited to 0s and 1s. – Clement
I also checked my code, and I'm fairly sure it's correct. If you're interested I can post the entire code. – Clement

1 Answer


Try cross-validating the training data locally and see what accuracy you get. The sklearn package has a simple k-fold cross-validation utility that splits the samples into training and test folds. What accuracy do you obtain?

Remember that 50% accuracy is the baseline for binary classification. If the k-fold CV accuracy is clearly above 50% but your Kaggle score is 0%, the problem is likely with the submission rather than the model.
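A minimal sketch of what I mean, using synthetic data standing in for your 891 x 8 training matrix and the RandomForestClassifier from your question (names and sizes here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for the Titanic training matrix (samples x features) and binary labels.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # accuracy on each of 5 folds
print(scores.mean())
```

If the mean fold accuracy here is well above 0.5 but Kaggle reports 0%, the model is fine and the submission file is the suspect.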