
I am currently evaluating Apache Storm to process heterogeneous data from multiple data sources. While there may be some common properties shared by all data (e.g., a "type" property), I would like to be able to support many different "classes" of tuples and also handle new data types with minimal changes to the topology. To give an example of what these data types might look like:

{type=LogTransaction,timestamp=...,user=...,duration=...}
{type=LogEvent,timestamp=...,user=...,message=...}

The examples on the Storm page primarily deal with simple tuples that are well-defined in advance, so that the spouts/bolts can statically declare their output fields.

My initial idea was to declare the type and store all other properties in a Map<String,Object>, which could then be declared as output fields:

public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declare(new Fields("type", "attributes"));
}

However, I believe that at that point many of the more advanced features of Storm will no longer work correctly. For example, it is my understanding that I could no longer use Trident to execute a groupBy on any of the attributes.

Is there a better way to handle this type of data that I have missed in Apache Storm? I did find this post describing a similar issue, however I would like to avoid having to create a Java class for each data type.
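To make the intent concrete, here is a minimal, Storm-agnostic sketch of the kind of grouping I have in mind, using plain Java and hypothetical sample data: each tuple is just a "type" plus a free-form attribute map, and tuples are grouped by the "user" attribute without a dedicated Java class per type.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GenericTupleDemo {
    public static void main(String[] args) {
        // Hypothetical generic tuples: a "type" plus a free-form attribute map.
        List<Map<String, Object>> tuples = List.of(
            Map.of("type", "LogTransaction", "user", "alice", "duration", 42),
            Map.of("type", "LogEvent", "user", "alice", "message", "login"),
            Map.of("type", "LogEvent", "user", "bob", "message", "logout")
        );

        // Group by the "user" attribute without a per-type Java class.
        Map<Object, List<Map<String, Object>>> byUser = tuples.stream()
            .collect(Collectors.groupingBy(t -> t.get("user")));

        System.out.println(byUser.get("alice").size()); // 2
        System.out.println(byUser.get("bob").size());   // 1
    }
}
```

In Storm terms, the equivalent would be a fields grouping on one of the attribute values rather than on a declared output field, which is exactly the part I am unsure is supported.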

Putting aside the Storm-specific code for a second, how would you process multiple classes of tuples without having a specific class per tuple? – Chris Gerken
One of my requirements is to be able to group data based on one or more attributes and stream it to different destinations. An example of this would be collecting all tuples which are associated with a certain user (via the "user" property). I am aware that whatever component ultimately consumes that data would need to be aware of each "class" to process it, but ideally the streaming pipeline should be able to handle any kind of tuple. – Ben Damer
It seems as though you're saying your requirements are basically, "whatever we decide to send you, you have to handle correctly and dynamically." Which seems about as vague as possible. Is anything in these tuples related, so that you could abstract something into an interface or an abstract class? – Morgan Kenyon
Unfortunately, that does match my requirements quite closely... Aside from the type property, which will be present in each tuple, there is no single attribute that is guaranteed to be present. However, even if there were shared attributes, it is my understanding that I would still have to declare each subclass's properties as output fields to be able to use them in Apache Storm - or am I missing something? – Ben Damer
Do you know all possible attributes for each type? – Matthias J. Sax

1 Answer


You can use your own custom fields as long as the field is serializable; it will work fine in Storm with more than one supervisor.

Storm is a distributed data processing tool, and when there is more than one supervisor, certain bolts will, depending on the grouping, emit fields to bolts running on a different supervisor. In such situations, the output fields are serialized and sent over the network. This serialization can be regular Java serialization or Kryo serialization (to reduce network overhead).

Hence, you might see exceptions if your JVM is not able to serialize your output fields.
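As a quick sanity check, you can verify that a custom field survives a serialization round-trip before putting it into a topology. The sketch below uses a hypothetical `Attributes` class and plain Java serialization (standing in for what Storm does when a tuple crosses the network to another worker):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableFieldDemo {
    // Hypothetical custom field class; it must implement Serializable
    // so it can be shipped between workers on different supervisors.
    static class Attributes implements Serializable {
        private static final long serialVersionUID = 1L;
        String user;
        long timestamp;
        Attributes(String user, long timestamp) {
            this.user = user;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        Attributes original = new Attributes("alice", 1234L);

        // Round-trip through Java serialization, as happens when
        // a tuple is sent over the network to another worker.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new ObjectOutputStream(bytes).writeObject(original);
        Attributes copy = (Attributes) new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray())).readObject();

        System.out.println(copy.user);      // alice
        System.out.println(copy.timestamp); // 1234
    }
}
```

If this round-trip throws `NotSerializableException` (for example because a field references a non-serializable member), the same failure will show up inside the topology.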