We are at the beginning of an F# project involving real-time and historical analysis of streaming data. The data is contained in a C# object (see below) and is delivered as part of a standard .NET event. In real time, event rates vary greatly, from less than 1 per second to roughly 800 events per second per instrument, so the stream can be very bursty. A typical day might accumulate 5 million rows/elements per instrument.
A generic version of the C# event's data structure looks like this:
using System;

public enum MyType { type0 = 0, type1 = 1 }

public class dataObj
{
    public int myInt = 0;
    public double myDouble;
    public string myString;
    public DateTime myDateTime;
    public MyType type;
    public object myObj = null;
}
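For context, here is roughly how we consume that event from F#. The Feed publisher and its DataReceived event are hypothetical stand-ins for our actual C# source (the names are illustrative only, and the snippet assumes the C# type above is referenced):

open System

// Hypothetical publisher standing in for the real C# feed; it raises a
// standard .NET event carrying the dataObj instances described above.
type Feed() =
    let dataReceived = Event<dataObj>()
    // The event that consumers subscribe to.
    member _.DataReceived = dataReceived.Publish
    // Called by the feed internals whenever a new event arrives.
    member _.Push(d: dataObj) = dataReceived.Trigger d

// Subscribing from F#: each incoming event lands in this handler, which
// is where the buffering data structure under discussion will live.
let feed = Feed()
feed.DataReceived.Add(fun d ->
    printfn "received %s at %O" d.myString d.myDateTime)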
We plan to use this data structure in F# in two ways:
- Historical analysis using supervised and unsupervised machine learning (CRFs, clustering models, etc.)
- Real-time classification of data streams using the above models
The data structure needs to be able to grow as we add more events. This rules out a plain array ('T[]) because it cannot be resized, though arrays could still be used for the historical analysis. The structure also needs fast access to recent data, and ideally the ability to jump to the element x points back. This rules out the F# list ('T list, an immutable singly linked list) because lookup is linear and there is no random access to elements, only "forward-only" traversal from the head.
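For comparison, a plain growable array (ResizeArray<'T>, F#'s alias for System.Collections.Generic.List<'T>) would cover both requirements, since it grows with amortized O(1) appends and supports O(1) random access by index. A minimal sketch of the access pattern we need, not a recommendation:

// Growable buffer: amortized O(1) append, O(1) lookup by index.
let buffer = ResizeArray<dataObj>()

// Append a newly received event.
let append (d: dataObj) = buffer.Add d

// Jump to the element x points back from the most recent one,
// without traversing the intermediate elements.
let nthBack (x: int) =
    if x < buffer.Count then Some buffer.[buffer.Count - 1 - x]
    else None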
According to this post, Set<T> may be a good choice...
EDIT: Yin Zhu's response gave me some additional clarity into exactly what I was asking, and I have edited the remainder of the post to reflect this. The previous version of this question was also muddied by requirements for historical analysis, which I have now omitted.
Here is a breakdown of the steps of the real-time process:
- A real-time event is received.
- The event is placed in a data structure. This is the data structure we are trying to determine: should it be a Set<T> or some other structure?
- A subset of the elements is extracted or iterated over for feature generation. This would be either the last n rows/elements of the data structure (e.g. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (e.g. all the events in the last 10 minutes). Ideally, we want a structure that supports this efficiently; in particular, random access to the nth element without iterating through all the other elements is of value. See the sketch after this list.
- Features are generated and sent to the model for evaluation.
- We may prune the data structure of older data to improve performance.
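As an illustration of the windowing and pruning steps above, here is a sketch over the same kind of ResizeArray buffer, assuming events arrive in myDateTime order; the function names are ours and nothing here is meant as the final design:

open System

// Last n events (e.g. the last 1,000 or 10,000 rows) for feature generation.
let lastN (buffer: ResizeArray<dataObj>) (n: int) =
    let start = max 0 (buffer.Count - n)
    buffer.GetRange(start, buffer.Count - start)

// All events in the trailing time window (e.g. the last 10 minutes),
// scanning backwards and stopping early because timestamps are ordered.
let lastWindow (buffer: ResizeArray<dataObj>) (span: TimeSpan) =
    let cutoff = DateTime.UtcNow - span
    let result = ResizeArray<dataObj>()
    let mutable i = buffer.Count - 1
    while i >= 0 && buffer.[i].myDateTime >= cutoff do
        result.Add buffer.[i]
        i <- i - 1
    result.Reverse()   // restore chronological order
    result

// Prune rows older than the cutoff. RemoveRange shifts every surviving
// element, so pruning is O(count) on this structure, which is part of
// why we are asking the question.
let prune (buffer: ResizeArray<dataObj>) (cutoff: DateTime) =
    let mutable keep = 0
    while keep < buffer.Count && buffer.[keep].myDateTime < cutoff do
        keep <- keep + 1
    if keep > 0 then buffer.RemoveRange(0, keep)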
So the question is: what is the best data structure for storing the real-time streaming events from which we will generate features?