How to use Tachyon to share data between Spark jobs

Question

I'm a beginner with Tachyon. I want to share some data or RDDs between Spark jobs. The Tachyon overview says:

Tachyon is an open source memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster jobs.

But I can't figure out how to enable this. I only know that Tachyon can act as an off-heap cache layer in Spark. Thanks.

Eugene Eugene · Accepted Answer · 2019-11-21T07:43:24

I don't think you need to do anything explicitly. Alluxio (the current name of Tachyon) will manage the data sharing for you.

Assume you have two Spark jobs, A and B, both configured to read their data from Alluxio.

Assume there is no data in Alluxio yet, and jobs A and B are executed in a batch. When job A runs, Alluxio first fetches the data from the under file system (UFS), serves it to the job, and caches it in its own storage, such as memory. When job B then asks for the same data, Alluxio checks its own storage first and goes back to the UFS only on a cache miss. The data is thereby shared across the jobs.
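To make the read path concrete, here is a minimal sketch in plain Python. It is a toy model of the behaviour described above, not Alluxio's actual implementation or API: the classes `UnderStore` and `ReadThroughCache`, the function `run_job`, and the path `/data/input.txt` are all hypothetical names made up for illustration.

```python
class UnderStore:
    """Stand-in for the UFS (for example HDFS or S3); counts how often it is read."""

    def __init__(self, files):
        self._files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self._files[path]


class ReadThroughCache:
    """Toy caching layer: serve from memory if present, else fetch from the UFS."""

    def __init__(self, ufs):
        self._ufs = ufs
        self._memory = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._memory:
            self.hits += 1
            return self._memory[path]
        # Cache miss: fetch from the UFS and keep a copy for later readers.
        self.misses += 1
        data = self._ufs.read(path)
        self._memory[path] = data
        return data


def run_job(cache, path):
    """Hypothetical job: count the words in the file at `path`."""
    return len(cache.read(path).split())


ufs = UnderStore({"/data/input.txt": "a b c d"})
cache = ReadThroughCache(ufs)

count_a = run_job(cache, "/data/input.txt")  # job A: cache miss, reads the UFS
count_b = run_job(cache, "/data/input.txt")  # job B: cache hit, UFS untouched

print(count_a, count_b, ufs.reads, cache.hits, cache.misses)
```

In a real deployment neither job implements any of this. Assuming the Alluxio client is on Spark's classpath, both jobs would simply read the same `alluxio://` URI (for example via `sc.textFile`), and the caching happens inside Alluxio.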

So in a nutshell, I think the data sharing here is essentially the cache you mentioned.
