We are using Bitnami Kafka 0.8.2 + Spark 1.5.2 on Google Cloud Platform. Our Spark Streaming job (the consumer) is not receiving all the messages sent to a specific topic: it receives roughly 1 out of every 50 messages (we added logging inside the job's stream to confirm this). We see no errors in the Kafka logs and are unable to debug further from the Kafka layer. The console consumer shows that messages on the INPUT topic are arriving, but they are not reaching the Spark-Kafka integration stream. Any thoughts on how to debug this issue? Another topic works fine in the same setup. We also tried Spark 1.3.0 with Kafka 0.8.1.1, which has the same issue. All of these jobs work fine on our local lab servers.
Did you use KafkaUtils.createDirectStream() to read messages from Kafka? And was the Spark Streaming job already running before you published any messages to Kafka?
- JuliaLi
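For reference, a minimal sketch of the direct-stream read path in Spark 1.3+ with Kafka 0.8 (the broker address, batch interval, and app name here are placeholders, not from the original post; "INPUT" is the topic named in the question):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("direct-stream-sketch"), Seconds(10))

        // The direct stream talks to the brokers without a receiver;
        // Spark tracks offsets itself, one RDD partition per Kafka partition.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("INPUT"))

        // Log every record so dropped messages become visible.
        stream.foreachRDD { rdd =>
          rdd.foreach { case (_, value) => println(s"received: $value") }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Comparing the count logged here against what kafka-console-consumer shows on the same topic would narrow the problem to the Spark side of the pipeline.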
Yes, we identified the problem: it is related to a difference in thread behavior on the Google CPUs. The job has a map/reduce transformation that uses groupByKey([numTasks]). numTasks was set to 10, which works fine on the local server; once we removed numTasks and let it use the default, the job started working on the Google platform. We still have a performance issue, though, so we plan to change the groupByKey to reduceByKey (see the sketch after this comment).
- vimeghan
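To make that change concrete, here is a minimal sketch of the two aggregation styles, with a batch RDD and illustrative (key, count) data standing in for the real stream transformation:

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupVsReduce {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("group-vs-reduce"))
        // Illustrative pairs; in the real job these come from the Kafka stream.
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

        // Before: groupByKey ships every value across the shuffle, and the
        // explicit numTasks = 10 fixed parallelism regardless of cluster size.
        val grouped = pairs.groupByKey(10).mapValues(_.sum)

        // After: reduceByKey combines values map-side before the shuffle,
        // so far less data moves between executors; the default partitioner
        // chooses the task count.
        val reduced = pairs.reduceByKey(_ + _)

        reduced.collect().foreach(println)
        sc.stop()
      }
    }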
1 Answer
The actual root cause was an incompatibility between Apache Cassandra and the spark-cassandra-connector. Even though we used a connector version aligned with our Apache Cassandra version, some Cassandra communication kept getting stuck, and CPU usage on the Cassandra nodes was above 98% most of the time. We switched to the DataStax Cassandra distribution and... it just worked perfectly! No code changes were required.
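For anyone debugging a similar mismatch: the connector release line has to track the Spark release line. A sketch of the relevant sbt dependencies for the Spark 1.5 series (the exact version numbers are assumptions for illustration; check the connector's published compatibility table):

    // build.sbt (sketch) -- versions shown are illustrative, not from the post.
    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-streaming"           % "1.5.2" % "provided",
      "org.apache.spark"   %% "spark-streaming-kafka"     % "1.5.2",
      // The spark-cassandra-connector 1.5.x line is the one paired with Spark 1.5.x.
      "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.2"
    )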