We are at the beginning of an F# project involving real-time and historical analysis of streaming data. The data is contained in a C# object (see below) and is delivered as part of a standard .NET event. In real time, event rates vary greatly, from less than 1 per second to roughly 800 events per second per instrument, so the stream can be very bursty. A typical day might accumulate 5 million rows/elements per instrument.
A generic version of the C# event's data structure looks like this:
using System;

public enum MyType { type0 = 0, type1 = 1 }

public class dataObj
{
    public int myInt = 0;
    public double myDouble;
    public string myString;
    public DateTime myDateTime;
    public MyType type;
    public object myObj = null;
}
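For context, here is roughly how we consume that event from F#. The Feed publisher and its DataReceived event are hypothetical stand-ins for our actual C# source (the names are illustrative only, and the snippet assumes the C# type above is referenced):

open System

// Hypothetical publisher standing in for the real C# feed; it raises a
// standard .NET event carrying the dataObj instances described above.
type Feed() =
    let dataReceived = Event<dataObj>()
    // The event that consumers subscribe to.
    member _.DataReceived = dataReceived.Publish
    // Called by the feed internals whenever a new event arrives.
    member _.Push(d: dataObj) = dataReceived.Trigger d

// Subscribing from F#: each incoming event lands in this handler, which
// is where the buffering data structure under discussion will live.
let feed = Feed()
feed.DataReceived.Add(fun d ->
    printfn "received %s at %O" d.myString d.myDateTime)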
We plan to use this data structure in F# in two ways:
- Historical analysis using supervised and unsupervised machine learning (CRFs, clustering models, etc.)
- Real-time classification of data streams using the above models
The data structure needs to be able to grow as we add more events. This rules out a plain array ('T[]) because it cannot be resized, though arrays could still be used for the historical analysis. The structure also needs fast access to recent data, and ideally the ability to jump to the element x points back. This rules out the F# list ('T list, an immutable singly linked list) because lookup is linear and there is no random access to elements, only "forward-only" traversal from the head.
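For comparison, a plain growable array (ResizeArray<'T>, F#'s alias for System.Collections.Generic.List<'T>) would cover both requirements, since it grows with amortized O(1) appends and supports O(1) random access by index. A minimal sketch of the access pattern we need, not a recommendation:

// Growable buffer: amortized O(1) append, O(1) lookup by index.
let buffer = ResizeArray<dataObj>()

// Append a newly received event.
let append (d: dataObj) = buffer.Add d

// Jump to the element x points back from the most recent one,
// without traversing the intermediate elements.
let nthBack (x: int) =
    if x < buffer.Count then Some buffer.[buffer.Count - 1 - x]
    else None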
According to this post, Set<T> may be a good choice...
EDIT: Yin Zhu's response gave me some additional clarity into exactly what I was asking, and I have edited the remainder of the post to reflect this. The previous version of this question was also muddied by requirements for historical analysis, which I have now omitted.
Here is a breakdown of the steps of the real-time process:
- A real-time event is received.
- The event is placed in a data structure. This is the data structure we are trying to determine: should it be a Set<T> or some other structure?
- A subset of the elements is extracted or iterated over for feature generation. This would be either the last n rows/elements of the data structure (e.g. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (e.g. all the events in the last 10 minutes). Ideally, we want a structure that supports this efficiently; in particular, random access to the nth element without iterating through all the other elements is of value. See the sketch after this list.
- Features are generated and sent to the model for evaluation.
- We may prune the data structure of older data to improve performance.
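As an illustration of the windowing and pruning steps above, here is a sketch over the same kind of ResizeArray buffer, assuming events arrive in myDateTime order; the function names are ours and nothing here is meant as the final design:

open System

// Last n events (e.g. the last 1,000 or 10,000 rows) for feature generation.
let lastN (buffer: ResizeArray<dataObj>) (n: int) =
    let start = max 0 (buffer.Count - n)
    buffer.GetRange(start, buffer.Count - start)

// All events in the trailing time window (e.g. the last 10 minutes),
// scanning backwards and stopping early because timestamps are ordered.
let lastWindow (buffer: ResizeArray<dataObj>) (span: TimeSpan) =
    let cutoff = DateTime.UtcNow - span
    let result = ResizeArray<dataObj>()
    let mutable i = buffer.Count - 1
    while i >= 0 && buffer.[i].myDateTime >= cutoff do
        result.Add buffer.[i]
        i <- i - 1
    result.Reverse()   // restore chronological order
    result

// Prune rows older than the cutoff. RemoveRange shifts every surviving
// element, so pruning is O(count) on this structure, which is part of
// why we are asking the question.
let prune (buffer: ResizeArray<dataObj>) (cutoff: DateTime) =
    let mutable keep = 0
    while keep < buffer.Count && buffer.[keep].myDateTime < cutoff do
        keep <- keep + 1
    if keep > 0 then buffer.RemoveRange(0, keep)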
So the question is: what is the best data structure for storing the real-time streaming events from which we will generate features?