
I have a folder with differently sized jpg images from which I'd like to generate a train and a test set via sklearn.model_selection.train_test_split().
This is my code so far:

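import os

import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

helper = list()
y = list()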

for path, subdirs, files in os.walk(inputDir):
    for s in subdirs:
        y.append(s)
    for f in files:
        img_path = os.path.join(path,f)
        pixels = Image.open(img_path).getdata()
        helper.append(pixels)

x = np.asarray(helper)

x_train, x_test, y_train, y_test = train_test_split(x, y)  # error occurs here

I get the following error message:

  File "getTrainTestSet.py", line 57, in getTrainTestSet
    x_train, x_test, y_train, y_test = train_test_split(x,y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_split.py", line 1689, in train_test_split
    arrays = indexable(*arrays)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 206, in indexable
    check_consistent_length(*result)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [120, 0]

Please help me fix this.

Thanks in advance!


EDIT: I figured out how to do it in a way that doesn't mess with the train_test_split() function:

y = list()
helpers = list()

for path, subdirs, files in os.walk(inputDir):
    for s in subdirs:
        # collect every jpg directly inside this subdirectory
        jpgs = glob.glob(os.path.join(path, s, '*.jpg'))
        helpers.append(np.array([np.array(Image.open(f)) for f in jpgs]))
        y.append(s)

x = np.array([np.array(h) for h in helpers])

x_train, x_test, y_train, y_test = train_test_split(x,y)

I believe the issue was that len(y) and x.shape[0] must be equal. My final x has the shape (4,) as I have 4 subdirectories with image files in total.
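
With x of shape (4,), the split operates on whole subdirectories rather than on individual images. If one label per image is what is wanted, a minimal sketch could look like the following. It assumes a layout of inputDir/<class_name>/<image>.jpg; the function name collect_labeled_paths and its extension parameter are made up for illustration and are not part of sklearn or PIL.

```python
import os


def collect_labeled_paths(input_dir, extension=".jpg"):
    """Return (paths, labels) with exactly one label per image file.

    Assumes the layout input_dir/<class_name>/<image>.jpg. Files lying
    directly in input_dir are skipped because they have no class directory.
    """
    paths, labels = [], []
    for class_name in sorted(os.listdir(input_dir)):
        class_dir = os.path.join(input_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        for file_name in sorted(os.listdir(class_dir)):
            if file_name.lower().endswith(extension):
                paths.append(os.path.join(class_dir, file_name))
                # the parent directory name serves as the label
                labels.append(class_name)
    return paths, labels
```

Because both lists grow together, len(paths) == len(labels) always holds, so they can be handed to train_test_split(paths, labels) and the images loaded afterwards.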

Thank you to everyone for your input!

What's the shape and dtype of x? I suspect it is a 1d object array. Study sklearn to see if there is any way of handling different size test and training images. I'm sure the normal processing expects a consistent size (and multidimensional arrays). - hpaulj
x.shape == (120,) and x.dtype == object. If I use np.atleast_2d(x) as @Def_Os suggested, the shape is (1,120) and the dtype remains object. But even with the two-dimensional array I still get the ValueError (see below). I'm searching the web for a solution, but unfortunately have not found any way of handling different sized images yet. - hsvar
Test this code on a set of images that all have the same size. - hpaulj
You may need to scale, pad or crop the images to match. - hpaulj
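
One possible way to follow the pad-or-crop suggestion is sketched below with NumPy only. The helper names pad_or_crop and to_feature_matrix are hypothetical, the target size is whatever you choose, and the sketch assumes every image has the same number of channels. It pads with zeros at the bottom and right edges; scaling with PIL's Image.resize would be an alternative.

```python
import numpy as np


def pad_or_crop(img, height, width):
    """Return a (height, width, channels) copy of img.

    Larger images are cropped to the top-left corner; smaller ones are
    zero-padded at the bottom and right.
    """
    img = np.asarray(img)
    if img.ndim == 2:  # grayscale: add a channel axis
        img = img[:, :, np.newaxis]
    out = np.zeros((height, width, img.shape[2]), dtype=img.dtype)
    h = min(height, img.shape[0])
    w = min(width, img.shape[1])
    out[:h, :w, :] = img[:h, :w, :]
    return out


def to_feature_matrix(images, height, width):
    """Stack same-sized copies into an array of shape (n_samples, n_features)."""
    fixed = [pad_or_crop(im, height, width) for im in images]
    return np.stack(fixed).reshape(len(fixed), -1)
```

The result is a regular numeric 2-D array instead of a 1-D object array, with one row per image.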

1 Answer


x should be a 2-dimensional array of shape [no_of_samples, no_of_features]. Do this:

x = np.atleast_2d(x).T
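
To see what the transpose changes, here is a small check using a placeholder array in place of the real image data (the label list is made up for illustration). Note that the error in the question reports sample counts of [120, 0], so the label list y was empty; reshaping x alone does not change len(y).

```python
import numpy as np

x = np.empty(120, dtype=object)  # stand-in for the 1-D object array from the question
y = ["some_label"] * 120         # hypothetical labels, one per sample

row = np.atleast_2d(x)           # one sample with 120 features
column = np.atleast_2d(x).T      # 120 samples with 1 feature each

print(row.shape, column.shape)
print(len(column) == len(y))     # the first dimensions must agree for the split
```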