I have an application that triggers the job to the spark master. But when I check the IP address executing the job, its displaying my application IP and not the spark worker IP. So, from what I understand, the call on RDD generates a spark worker to work.
But my question is this.
CassandraSQLContext c = new CassandraSQLContext(sc);
QueryExecution q=c.executeSql(cqlCommand); //-----1
q.toRDD().count(); //----2
I saw the worker doing something for 2 but nothing for 1.
So does this mean fetch from Cassandra and RDD creation out of it in 1 is all done in the application?
If so, 2 does trigger a job to two workers. In that case, does it fetch again from Cassandra and process the count?
Can someone clarify this??
EDIT
- Going by the answer provided, if the count call triggers the workers to function, then what is the use of executeSQL creating a RDD in local? Does that create a Cassandra dataset of the data by querying ? If that's the case, querying from Cassandra happens twice?
2.. If spark automatically distributes the computations of 10 partitions of Cassandra among 4 workers, who will aggregate the results? Master is just doing the distribution. So does it aggregate too?
If I don't cache the RDD and do another count operation, what will happen? Will spark try to to use the same worker that was used previously for a particular partition and append to the result RDD in that node. I think it has to query Cassandra to get this partition data again? Can you provide some clarity in this?
If I cache my RDD, what happens? RDD is stored in the worker and it will be used for all operations? In that case, how this is different from we storing a dataset in memory and processing it? Let me know if an right about this too.