4
votes

I'm a beginner with Tachyon. I want to share some data or an RDD between Spark jobs. The Tachyon overview says:

Tachyon is an open source memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster jobs.

But I can't figure out how to enable this. I only know that Tachyon can act as an off-heap cache layer in Spark. Thanks.

1
Save to FS with Tachyon tier, read back in another job? - zero323
@zero323 I'll try. Thanks. - starrynight92

1 Answer

0
votes

I don't think you need to do anything explicit; Alluxio (the project formerly named Tachyon) manages the data sharing for you.

Assume you have two Spark jobs, A and B, and both are configured to read their data from Alluxio.
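As a rough sketch of what two such jobs could look like in PySpark: job A writes an RDD to an `alluxio://` path and job B, a separate application, reads it back. The host, the path `/shared/squares`, and the helper names `alluxio_uri`, `job_a`, `job_b` and `main` are placeholders I made up for illustration; 19998 is the default Alluxio master port, and this assumes the Alluxio client jar is already on Spark's classpath.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI. Host and path are placeholder assumptions."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def job_a(sc, uri):
    """Job A: compute an RDD and save it to Alluxio as text files."""
    squares = sc.parallelize(range(10)).map(lambda x: x * x)
    squares.saveAsTextFile(uri)


def job_b(sc, uri):
    """Job B (a separate Spark application): read back what job A wrote."""
    return sc.textFile(uri).map(int).sum()


def main(role):
    """Run as job 'a' or job 'b'; pyspark is imported only when this is called."""
    from pyspark import SparkContext

    sc = SparkContext(appName="share-" + role)
    uri = alluxio_uri("/shared/squares")
    try:
        if role == "a":
            job_a(sc, uri)
        else:
            print(job_b(sc, uri))
    finally:
        sc.stop()
```

You would submit it once with `main("a")` and then again, as a second application, with `main("b")`. The only thing the two jobs have in common is the URI.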

Assume there is no data in Alluxio yet, and jobs A and B are executed one after the other in a batch. When job A runs, Alluxio first fetches the data from the under file system (UFS), serves it to the job, and caches it in its own storage, such as memory. When job B requests the same data, Alluxio checks its own storage first and goes back to the UFS only on a cache miss. The data is now shared across the two jobs.
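To make that flow concrete, here is a toy model in plain Python. It is not Alluxio code: `ReadThroughCache`, the dict standing in for the UFS, and the file path are all invented for illustration. It only shows the read-through behaviour described above, where the first reader pays for the UFS fetch and the second is served from the cache.

```python
class ReadThroughCache:
    """Toy stand-in for a caching layer sitting in front of a slower store."""

    def __init__(self, ufs):
        self.ufs = ufs        # dict playing the role of the under file system
        self.cache = {}       # in-memory copies of data already fetched
        self.ufs_reads = 0    # how many times we had to go to the UFS

    def read(self, path):
        if path in self.cache:        # cache hit: no UFS access needed
            return self.cache[path]
        self.ufs_reads += 1           # cache miss: fetch from the UFS
        data = self.ufs[path]
        self.cache[path] = data       # keep a copy for later readers
        return data


ufs = {"/data/events.csv": "a,b,c"}
layer = ReadThroughCache(ufs)

job_a_result = layer.read("/data/events.csv")  # miss, fetched from the UFS
job_b_result = layer.read("/data/events.csv")  # hit, served from the cache
```

After both reads, `layer.ufs_reads` is 1: job B got the same data without touching the UFS.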

So, in a nutshell, I think the data sharing described in the overview is really the cache you mentioned.