9
votes

Apache Apex looks similar to Apache Storm.

  • Users build an application/topology as a directed acyclic graph (DAG) on both platforms. Apex uses operators/streams and Storm uses spouts/streams/bolts.
  • They both process data in real time as opposed to batch processing.
  • Both seem to have high throughput and low latency.

So, at a glance, both look similar and I'm not quite getting the difference. Can someone please explain the key differences? In other words, when should I use one instead of the other?

2
Add Apache Flink and Apache Beam, all DAG processors. - user3613754
Also, please add use cases; I would like to know what kind of use case is suitable for each. - ChikuMiku

2 Answers

3
votes

There are fundamental differences in architecture that make the two platforms very different in terms of latency, scaling and state management.

At the very basic level,

  1. Apache Storm uses record acknowledgement to guarantee message delivery.
  2. Apache Apex uses checkpointing to guarantee message delivery.
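
To make the contrast concrete, here is a toy sketch of the two recovery styles. It uses neither Storm's nor Apex's API; the function names (`run_with_acks`, `run_with_checkpoints`) and the single simulated failure are assumptions made purely for illustration, and both real platforms are considerably more involved.

```python
from typing import Callable, List, Tuple


def run_with_acks(records: List[int], process: Callable[[int], bool]) -> List[int]:
    """Per-record acknowledgement: re-send each record until it is acked."""
    delivered = []
    for record in records:
        # The source keeps the record pending and replays it until acked.
        while not process(record):
            pass  # no ack received, so the same record is sent again
        delivered.append(record)
    return delivered


def run_with_checkpoints(
    records: List[int], interval: int, fail_at: int
) -> Tuple[int, int]:
    """Checkpointing: snapshot (position, state) every `interval` records.

    One failure is simulated just before the record at index `fail_at`.
    Returns the final running sum and how many records were reprocessed.
    """
    checkpoint = (0, 0)  # (next position to read, running sum)
    position, total = checkpoint
    failed = False
    reprocessed = 0
    while position < len(records):
        if position == fail_at and not failed:
            failed = True
            reprocessed = position - checkpoint[0]
            position, total = checkpoint  # roll back to the last snapshot
            continue
        total += records[position]
        position += 1
        if position % interval == 0:
            checkpoint = (position, total)  # periodic snapshot
    return total, reprocessed
```

In the first style the bookkeeping cost is paid on every record; in the second it is paid once per interval, at the price of reprocessing everything since the last snapshot after a failure.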

You can learn about more differences in the following blog post, which also covers the other mainstream stream processing platforms.

https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/

3
votes

Architecture and Features

+-------------------+---------------------------+---------------------+
|                   |           Storm           |         Apex        |
+-------------------+---------------------------+---------------------+
| Model             | Native Streaming          | Native Streaming    |
|                   | Micro batch (Trident)     |                     |
+-------------------+---------------------------+---------------------+
| Language          | Java                      | Java (Scala)        |
|                   | Supports non-JVM          |                     |
|                   | languages                 |                     |
+-------------------+---------------------------+---------------------+
| API               | Compositional             | Compositional (DAG) |
|                   | Declarative (Trident)     | Declarative         |
|                   | Limited SQL               |                     |
|                   | support (Trident)         |                     |
+-------------------+---------------------------+---------------------+
| Locality          | Data Locality             | Advanced Processing |
+-------------------+---------------------------+---------------------+
| Latency           | Low                       | Very Low            |
|                   | High (Trident)            |                     |
+-------------------+---------------------------+---------------------+
| Throughput        | Limited in Ack mode       | Very high           |
+-------------------+---------------------------+---------------------+
| Scalability       | Limited due to Ack        | Horizontal          |
+-------------------+---------------------------+---------------------+
| Partitioning      | Standard                  | Advanced            |
|                   | Parallelism at worker,    | Parallel pipes,     |
|                   | executor and task level   | unifiers            |
+-------------------+---------------------------+---------------------+
| Connector Library | Limited (certification)   | Rich library of     |
|                   |                           | connectors in       |
|                   |                           | Apex Malhar         |
+-------------------+---------------------------+---------------------+
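
As a rough illustration of the "parallel pipes, unifiers" entry in the Partitioning row, the sketch below splits a stream across partitions, processes each one independently, and merges the partial results in a final step. It is plain Python, not the Apex API; `partition`, `count_words` and `unify` are hypothetical names, and round-robin splitting is an assumption chosen for simplicity.

```python
from collections import Counter
from typing import Iterable, List


def partition(records: List[str], n: int) -> List[List[str]]:
    """Split the input round-robin across n parallel partitions."""
    parts: List[List[str]] = [[] for _ in range(n)]
    for i, record in enumerate(records):
        parts[i % n].append(record)
    return parts


def count_words(part: Iterable[str]) -> Counter:
    """Work done independently inside each partition."""
    return Counter(part)


def unify(partials: Iterable[Counter]) -> Counter:
    """Merge the per-partition results into a single result."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts key by key
    return merged
```

For example, `unify(count_words(p) for p in partition(words, 2))` gives the same counts as `Counter(words)`, while each partition only sees part of the stream.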

Operability

+------------+--------------------------+---------------------+
|            |           Storm          |         Apex        |
+------------+--------------------------+---------------------+
| State      | External store           | Checkpointing       |
| Management | Limited checkpointing    | Local checkpointing |
|            | Difficult to exploit     |                     |
|            | local state              |                     |
+------------+--------------------------+---------------------+
| Recovery   | Cumbersome API to        | Incremental         |
|            | store and retrieve state | (buffer server)     |
|            | Requires user code       |                     |
+------------+--------------------------+---------------------+
| Processing | At least once            | At least once       |
| Semantics  | Exactly once requires    | End-to-end          |
|            | user code and affects    | exactly once        |
|            | latency                  |                     |
+------------+--------------------------+---------------------+
| Back       | Watermark on queue       | Automatic           |
| Pressure   | size for spout and bolt  | Buffer server       |
|            | Does not scale           | memory and disk     |
+------------+--------------------------+---------------------+
| Elasticity | Through CLI only         | Yes w/ full user    |
|            |                          | control             |
+------------+--------------------------+---------------------+
| Dynamic    | No                       | Yes                 |
| topology   |                          |                     |
+------------+--------------------------+---------------------+
| Security   | Kerberos                 | Kerberos, RBAC,     |
|            |                          | LDAP                |
+------------+--------------------------+---------------------+
| Multi      | Mesos, RAS - memory,     | YARN                |
| Tenancy    | CPU, YARN                | full isolation      |
+------------+--------------------------+---------------------+
| DevOps     | REST API                 | REST API            |
| Tools      | Basic UI                 | DataTorrent RTS     |
+------------+--------------------------+---------------------+

Source: Webinar: Apache Apex (Next Gen Hadoop) vs. Storm - Comparison and Migration Outline https://www.youtube.com/watch?v=sPjyo2HfD_I