2 votes

Our Cygnus version is 0.8.2, and I'm using the public Cosmos instance from our FIWARE setup inside FIWARE Lab.

I have 8 sensor devices that push updates to IDAS. Some updates come once per second, some once every 5 seconds; on average that's around 8.35 updates per second. I created subscriptions in Orion (version 0.22) to send ONCHANGE notifications to Cygnus.
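
Each subscription follows the usual NGSIv1 pattern, roughly like this (a sketch; the entity and attribute names are placeholders, and the reference URL points at Cygnus' notification port):

    curl http://orion-host:1026/v1/subscribeContext -X POST \
        -H "Content-Type: application/json" -H "Accept: application/json" \
        -d '{
          "entities": [{"type": "Sensor", "isPattern": "false", "id": "sensor-1"}],
          "attributes": ["temperature"],
          "reference": "http://cygnus-host:5050/notify",
          "duration": "P1M",
          "notifyConditions": [{"type": "ONCHANGE", "condValues": ["temperature"]}]
        }'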

Cygnus is configured to persist data to Cosmos, MongoDB and MySQL. I used the standard configuration with one source (http-source), three channels (hdfs-channel, mysql-channel, mongo-channel) and three sinks (hdfs-sink, mysql-sink, mongo-sink).
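
The relevant wiring in the agent configuration file looks roughly like this (a sketch: the cygnusagent name and port 5050 are the usual defaults, and the handler and per-sink endpoint settings are omitted):

    cygnusagent.sources = http-source
    cygnusagent.channels = hdfs-channel mysql-channel mongo-channel
    cygnusagent.sinks = hdfs-sink mysql-sink mongo-sink

    # the single HTTP source feeds all three channels
    cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
    cygnusagent.sources.http-source.channels = hdfs-channel mysql-channel mongo-channel
    cygnusagent.sources.http-source.port = 5050

    # each sink drains its own channel
    cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
    cygnusagent.sinks.mysql-sink.channel = mysql-channel
    cygnusagent.sinks.mongo-sink.channel = mongo-channel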

mysql-sink and mongo-sink persist data in near real-time. However, the hdfs-sink is really slow, handling only about 1.65 events per second. Since the http-source receives around 8.35 events per second, the hdfs-channel soon fills up and this warning appears in the log file:

time=2015-07-30T13:39:02.168CEST | lvl=WARN | trans=1438256043-345-0000002417 | function=doPost | comp=Cygnus | msg=org.apache.flume.source.http.HTTPSource$FlumeHTTPServlet[203] : Error appending event to channel. Channel might be full. Consider increasing the channel capacity or make sure the sinks perform faster.
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: hdfs-channel}
        at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
        at org.apache.flume.source.http.HTTPSource$FlumeHTTPServlet.doPost(HTTPSource.java:201)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:814)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:401)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.flume.ChannelException: Space for commit to queue couldn't be acquired Sinks are likely not keeping up with sources, or the buffer size is too tight
        at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:128)
        at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
        at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:192)
        ... 16 more
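
The warning itself suggests increasing the channel capacity. As far as I can tell that only buys more buffering, not more throughput, but for reference these are the MemoryChannel properties involved (values illustrative, not tuned):

    cygnusagent.channels.hdfs-channel.type = memory
    # capacity: max number of events buffered in the channel;
    # transactionCapacity: max events per put/take transaction
    cygnusagent.channels.hdfs-channel.capacity = 100000
    cygnusagent.channels.hdfs-channel.transactionCapacity = 100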

A side effect is that if the http-source cannot inject a notification into the hdfs-channel, it doesn't inject it into the mysql-channel or mongo-channel either, so that notification is lost entirely; it is not persisted anywhere.

You can partly circumvent the problem by launching 3 separate Cygnus instances (one for Cosmos, one for MySQL and one for MongoDB), each with a different http-source port and a different Management Interface port, and adding subscriptions for each Cygnus. MySQL and MongoDB persistence is then not affected by the hdfs-channel becoming full, but Cosmos persistence still has the problem. Adding more hdfs-sinks might do the trick with our 8 sensor devices, but if you add more sensor devices, or they send more updates, you are just postponing the problem.
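
In that setup, each instance gets its own configuration file differing at least in the notification port, plus its own Management Interface port at launch time (a sketch; ports and file names are placeholders, and the -p flag is how I recall the cygnus-flume-ng script takes the Management Interface port, so treat that as an assumption):

    # agent_mysql.conf: only the MySQL pipeline, on its own notification port
    cygnusagent.sources = http-source
    cygnusagent.sources.http-source.port = 5051
    cygnusagent.sources.http-source.channels = mysql-channel
    cygnusagent.channels = mysql-channel
    cygnusagent.sinks = mysql-sink
    cygnusagent.sinks.mysql-sink.channel = mysql-channel

    # launched with its own Management Interface port
    bin/cygnus-flume-ng agent --conf conf/ -f conf/agent_mysql.conf \
        -n cygnusagent -p 8082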

The following two questions are a bit unrelated, but I'm asking them anyway...

Question 1: Is it really the case that persisting to Cosmos is that slow?

I know there is a lot going on behind the scenes compared to persisting to the local databases, and that we are using the public, resource-limited instance of Cosmos, but still: is it even meant to be used this way with real-time data (our 8-sensor-device test is actually quite modest)? Of course, it's possible to create a sink that pushes the data to a file and then do a simple file upload to Cosmos, but that's a bit more hassle. I guess there isn't such a file sink available?
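
(The closest thing I've found is stock Flume's file_roll sink, which simply writes events to local files; a sketch of how it could spool notifications for a later bulk upload, with a placeholder directory, and keeping in mind the files would be in Flume's raw event format rather than Cygnus' HDFS layout:

    cygnusagent.sinks.file-sink.type = file_roll
    cygnusagent.sinks.file-sink.channel = hdfs-channel
    cygnusagent.sinks.file-sink.sink.directory = /var/spool/cygnus
    # start a new file every 300 seconds
    cygnusagent.sinks.file-sink.sink.rollInterval = 300

but this loses all the Cosmos-specific formatting that Cygnus does.)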

Question 2: Is it really the case that if a notification cannot be injected into the hdfs-channel (or, I guess, any channel), it is not added to the other channels either and is discarded entirely?

1 Answer

0 votes

The design of all the sinks is quite similar; nevertheless, there are some differences between the HDFS sink and the MySQL/MongoDB sinks:

  • The HDFS endpoint (the HttpFS server running at cosmos.lab.fiware.org:14000) is shared among many FIWARE users, while I guess your MySQL and MongoDB deployments are private ones and thus used only by you.
  • The HDFS sink is based on WebHDFS, a REST API, while the MySQL and MongoDB sinks are based on "binary protocols" (JDBC and the MongoDB driver, respectively). There is an old issue on GitHub about moving to a "binary" implementation of the HDFS sink.
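
To illustrate the REST overhead: each persisted event turns into at least one HTTP round trip against the shared HttpFS server, roughly like the following (a sketch; the path, the user and the data=true parameter are my assumptions about the HttpFS flavour of the API):

    curl -X POST \
        "http://cosmos.lab.fiware.org:14000/webhdfs/v1/user/myuser/mypath/data.txt?op=APPEND&user.name=myuser&data=true" \
        -H "Content-Type: application/octet-stream" \
        -d "<serialized event>"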

That being said, and trying to fix the problem with the current implementation, these are my recommendations:

  • Try changing the logging level to ERROR; writing log traces consumes a lot of resources.
  • Try to send "batches" of notifications to Cygnus (an Orion notification may contain several context entity elements); each batch is stored as a single Flume event in the channel.
  • As you already figured out, try configuring more than one HDFS sink, as sketched below; this is explained here (reading the full document is also a good idea).
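
As a sketch of that last point (the sink names are illustrative), several HDFS sinks can drain the same channel in parallel:

    cygnusagent.sinks = hdfs-sink1 hdfs-sink2 hdfs-sink3 mysql-sink mongo-sink

    # all three HDFS sinks take events from the same channel, so the
    # channel is emptied roughly three times faster than with one sink;
    # each sink also needs its usual type/endpoint settings, omitted here
    cygnusagent.sinks.hdfs-sink1.channel = hdfs-channel
    cygnusagent.sinks.hdfs-sink2.channel = hdfs-channel
    cygnusagent.sinks.hdfs-sink3.channel = hdfs-channel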

Nevertheless, if the bottleneck is the HDFS endpoint itself, I expect this won't fix anything.

Regarding Cygnus not putting an event into the other, non-HDFS channels when it cannot be put into the hdfs-channel: I'll have a look at that. Cygnus relies on Apache Flume, and event delivery is part of Flume's core, so it seems to be a bug/problem in Flume itself.
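
In the meantime, one thing that might be worth testing (an assumption on my side, not verified with Cygnus) is stock Flume's replicating channel selector, which can mark channels as optional; a failed put into an optional channel does not abort the whole transaction:

    # untested with Cygnus: with the default replicating selector,
    # a full "optional" channel no longer makes the whole put fail,
    # so MySQL/MongoDB events would survive an overflowing hdfs-channel
    cygnusagent.sources.http-source.selector.type = replicating
    cygnusagent.sources.http-source.selector.optional = hdfs-channel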