I've set up spark-jobserver (see https://github.com/spark-jobserver/spark-jobserver/tree/jobserver-0.6.2-spark-1.6.1) in standalone mode.
I've created a default context to use. Currently I have 2 kinds of jobs on this context:
- Synchronization with another server (a rough sketch of this flow follows the list):
  - Dump the data from the other server's DB;
  - Perform some joins and reduce the data, generating a new DataFrame (DF);
  - Save the obtained DF to a Parquet file;
  - Load this Parquet file as a temp table and cache it;
- Queries: perform SQL queries on the cached table.
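To make the two job types concrete, here is a minimal sketch of the synchronization logic as plain Spark 1.6 code (the JDBC URL, table names, and the Parquet path are made up; in the real deployment this logic runs inside the shared job-server context rather than its own `main`):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object SyncJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sync-sketch"))
    val sqlContext = new SQLContext(sc)

    // 1. Dump the data from the other server's DB (hypothetical JDBC source).
    val orders = sqlContext.read.format("jdbc")
      .options(Map("url" -> "jdbc:postgresql://other-server/db", "dbtable" -> "orders"))
      .load()
    val customers = sqlContext.read.format("jdbc")
      .options(Map("url" -> "jdbc:postgresql://other-server/db", "dbtable" -> "customers"))
      .load()

    // 2. Perform some joins and reduce the data, generating a new DF.
    val result = orders.join(customers, Seq("customer_id"))
      .groupBy("customer_id")
      .count()

    // 3. Save the obtained DF to a Parquet file.
    result.write.mode(SaveMode.Overwrite).parquet("/data/synced_table.parquet")

    // 4. Load the Parquet file as a temp table and cache it.
    val cached = sqlContext.read.parquet("/data/synced_table.parquet")
    cached.registerTempTable("synced_table")
    sqlContext.cacheTable("synced_table")

    // Queries job: run SQL against the cached table.
    sqlContext.sql("SELECT * FROM synced_table LIMIT 10").show()
  }
}
```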
The only object I persist is the final table that gets cached.
What I don't get is why, when I run the synchronization, all of the assigned memory is used and never released, whereas if I load the Parquet file directly (after a fresh start of the server, using the Parquet file generated by a previous run), only a fraction of the memory is used. A sketch of that second path is shown below.
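For comparison, this is roughly the "fresh start" path, reusing the hypothetical Parquet path and table name from the sketch above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DirectLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("direct-load-sketch"))
    val sqlContext = new SQLContext(sc)

    // Load the Parquet file produced by a previous synchronization run,
    // register it as a temp table and cache it -- no JDBC dump, no joins.
    val df = sqlContext.read.parquet("/data/synced_table.parquet")
    df.registerTempTable("synced_table")
    sqlContext.cacheTable("synced_table")

    // Force the cache to materialize; in this path only the cached table
    // shows up under storage memory.
    sqlContext.sql("SELECT COUNT(*) FROM synced_table").collect()
  }
}
```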
Am I missing something? Is there a way to free up the unused memory?
Thank you