1
votes

I am processing a high frequency stream of timestamped events with no ordering guarantee (ordered 90% of the time). I need to store these events (for caching purpose) for some times in my program. In order to optimize the performances of my computations (that require mainly iteration over the events collection) it would be much more easier for me if I could guarantee the order by caching an ordered list. So what I am looking for is an ordered data structure that is fast in insertion and iteration and allow duplicates.

Among all propositions I have found on internet, I have tried :
- TreeSet -> Does not work because I might have duplicate timestamp
- PriorityQueue -> Does not work because the iterator does not guarantee priority order

Since 9/10 events are well ordered I thought I could use a basic ArrayList with a modified version of the add method :

public class TimeOrderedArrayList<E> extends ArrayList<E>{

private long lastTs;
private Comparator<E> comparator;
private TimeGetter<E> tsgetter;

public TimeOrderedArrayList (Comparator<E> comparator, TimeGetter<E> tsgetter) {
    super();
    this.comparator = comparator;
    this.tsgetter = tsgetter;
    this.lastTs = Long.MIN_VALUE;
}


@Override
public boolean add(E e) {
    if (tsgetter.getTime(e) >= lastTs) {
        lastTs = tsgetter.getTime(e);
        return super.add(e);
    } else {

        // VERSION 1
        int index = super.size()-1;
        while (tsgetter.getTime(super.get(index))>tsgetter.getTime(e) && index > 0) {
            index--;
        }
        super.add(index, e);

        // VERSION 2
        int index = Collections.binarySearch(this, e, comparator);
        super.add(index>-1 ? index : -index-1,e);
        return true;
    }
}

@Override
public boolean addAll(Collection<? extends E> c) {
    boolean result = super.addAll(c);
    super.sort(comparator);
    return result;
}
}

But for both version I get really bad performances.

Any suggestions ?

2
do you need random access to the elements in the data structure or are you always iterating across all?matias elgart
Why not use heap data structure of your own?SMA
Would it be possible to write a custom Comparator that simply makes sure that events with the same timestamp do not show up as "the same" thing?GhostCat
@matiaselgart No, i only need iteration ! Ideally range would be great too :) Any ideas ?nbchn
If you only need iteration, not random access, then a LinkedList based implementation will perform better than ArrayList, as random inserts are far less expensive.Brett Okken

2 Answers

2
votes

From the problem description, it appears to me that a strict order is not mandatory for the problem as long as you can have iteration over the events collection over a period of time. Also, the kind of data you mention seems to be one where multiple client nodes are sending data to one centralized server (could be logs/events accumulation from multiple services).

If that is the case, you could explore using a simple array of buckets, where an event corresponding to a timestamp goes into a specific bucket only. You will ensure that all events which have very near timestamps are classified into the same buckets, so that you could achieve partial order between the events.

For Example: If you need data for last 1 minute (60 seconds), you could define 60 buckets, one for each second, and keep rotating them. An event timestamped 2016-12-08 19:59:29.538331 goes to the 29th bucket (assuming indices start from 0, and you take a floor of the seconds of every event). When a minute lapses, simply clear away past data for the ith bucket, and start building it afresh. So, at 2016-12-08 20:00:00.129845, the 0th bucket is reset to an empty array.

Since you have a high frequency stream of timestamped events, chances of empty buckets etc will be minimal. You can tweak around the number of buckets required based on your exact requirements.

1
votes

I know you have dismissed it already but I would still suggest TreeSet.

There is actually no problem in having duplicate timestamps. The only condition is that the comparator is consistent with equals. Nothing more.

So yes, in the first approach if you just compare timestamps of events then that would probably not consistent with equals. But if you also compare other fields of your events then this will be consistent with equals.

This assumes, of course, that the timestamp is a part of E, the event class.