
I have a very large corpus in which each element consists of a large amount of high-dimensional data. Elements are constantly being added to the corpus, and potentially only a portion of the corpus needs to be considered for each interaction. Elements are labeled, potentially with multiple labels and with weights indicating the strength of those labels. As far as I understand, the data is not sparse.

The input data is a set of roughly 10–1000 parameters, each in the range −1 to 1. The exact number is somewhat flexible, depending on which machine learning method is most appropriate.

I am targeting high-end smartphones. Ideally the processing could be done on the device itself, but I'm open to the possibility of transmitting the data to a modest server.

What would be an appropriate machine learning approach for this kind of situation?

I've been reading about random forests of decision trees, restricted Boltzmann machines, deep Boltzmann machines, and so on, but I could really use the advice of an experienced hand to direct me toward a few approaches worth researching that would work well given these conditions.

If my description seems wonky, please let me know, as I am still getting to grips with these ideas and may be fundamentally misunderstanding some aspect.

What is your output data? Is this supervised or unsupervised learning? — Atilla Ozgur

It's primarily supervised: assigning labels and weights for those labels. I'm working with music and statistical properties of its partitioning, so the data is dense, hierarchical, and high-dimensional (though it can be cropped). If unsupervised learning benefits those circumstances, could it be integrated? The output data is a ranked list of elements from the corpus, or from a portion of it. I am hoping to let whatever outputs I can get shape the way I use them, though I am having trouble telling what is redundant or unavailable for my type of data. Any views would be greatly appreciated. — eigen_enthused

1 Answer


Try the simplest thing first: the k-nearest neighbors algorithm. You can use the Manhattan distance as a quick distance function, and then take a weighted average or a majority vote over the nearest points.
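To make that concrete, here is a minimal sketch using scikit-learn (my choice of library, not something specified in the question). The array names, sizes, and k=15 are placeholder assumptions, and it covers the single-label case rather than the multi-label, weighted setup described above:

```python
# Minimal k-NN sketch: Manhattan (L1) distance with distance-weighted voting.
# X_corpus / y_labels are hypothetical stand-ins for the real corpus.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_corpus = rng.uniform(-1, 1, size=(5000, 100))  # features in [-1, 1], as in the question
y_labels = rng.integers(0, 10, size=5000)        # one label per element (assumption)

# metric="manhattan" is the L1 distance suggested above;
# weights="distance" weights each neighbor's vote by 1/distance.
knn = KNeighborsClassifier(n_neighbors=15, metric="manhattan", weights="distance")
knn.fit(X_corpus, y_labels)

query = rng.uniform(-1, 1, size=(1, 100))
print(knn.predict(query))        # weighted majority class
print(knn.predict_proba(query))  # per-class scores, usable for ranking
```

Since new elements are constantly being added, note that k-NN has no real training step to redo: adding an element is just appending a row and refitting, which is cheap compared to retraining a parametric model.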

This is also similar to kernel regression. I would suggest storing your points in a data structure such as a k-d tree so that nearest-neighbor queries are efficient.
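Here is a sketch of that lookup using SciPy's cKDTree (again my assumption; any k-d tree implementation would do). One caveat worth knowing: k-d trees lose their speed advantage as dimensionality grows, so at the upper end of your 10–1000 input range a brute-force scan may be just as fast:

```python
# Minimal k-d tree sketch using scipy.spatial.cKDTree.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
points = rng.uniform(-1, 1, size=(5000, 100))  # hypothetical corpus features
tree = cKDTree(points)

# Query the 15 nearest neighbors; p=1 selects the Manhattan (L1) metric.
query = rng.uniform(-1, 1, size=100)
dists, idx = tree.query(query, k=15, p=1)
print(idx)    # indices of the nearest corpus elements
print(dists)  # their L1 distances to the query

# cKDTree is immutable once built, so a constantly growing corpus would
# need periodic rebuilds (or batching of new elements before insertion).
```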