0
votes

I am quite puzzled by BigQuery connector on Spotfire. It is taking !extremely! long time to import my dataset in-memory.

my configuration: spotfire on AWS windows instance (8vCPU - 32Go RAM). dataset 50Go >100M rows on BigQuery.

Yes - I should use in-database for such large dataset and push the queries to BigQuery and use Spotfire only for display, but that is not my question today ????

Today i am trying to understand how the import works and why it is taking so long. this import job started 21hrs ago and it is still not finished. The resources of the server are barely used (CPU, Disk, Network).

Testing done:

  • I tried importing data from Redshift and it was much faster (14min for 22Go)
  • I checked resources used during import: network speed (Redshift ~ 370Mbs, BQ ~ 8Mbs for 30min), CPU (Redshift ~ 25%, BQ < 5%), RAM (Redshift & BQ ~ 27Go), Disk write (Redshift 30Mbs, BQ 5MBs)

I really don't understand what is Spotfire actually doing for all this time while importing dataset from BQ in memory. There seems to be no use of server resources and there is no indication of status apart from time running.

Any Spotfire experts have any insights on what's happening? Is the connector to BigQuery actually not to be used for In-memory analysis - what is the actual implementation limiting factor?

Thanks! ????

3

3 Answers

0
votes

We had an issue which is fixed in the Spotfire versions below:

TS 10.10.3 LTS HF-014 TS 11.2.0 HF-002

Please also vote and comment on the idea of using the Storage API when extracting data from BigQuery:

https://ideas.tibco.com/ideas/TS-I-7890

Thanks,

Thomas Blomberg Senior Product Manager TIBCO Spotfire

0
votes
0
votes

An update after more testing. Thanks @Thomas and @Manoj's very helpful support. Here are the results:

  1. I updated spotfire version to 11.2.0 HF002 and it fixed the issue with bringing data in-memory with BigQuery 👌 - Using (Data > Add Data...), the data throughput was very low though ~ 13min/Go. The network throughput was doing burst of 8Mbs.
  2. As suggested in tibco ideas by Thomas, I installed Simba JDBC driver and the data throughput improved drammatically to ~ 50s/Go using (Data > Information Designer). The issue off course is that you need access to the server to install it. the network throughput was roughly 200Mbs. I am not sure what is the limiting factor (Spotfire config, Samba Driver or BigQuery).
  3. Using Redshift connector to a Redshift cluster with the same data and connecting using (Data > Information Designer), I get to a data import throughput of ~ 30s/Go with a network throughput of 380Mbs.

So my recommandation is to use the latest simba driver along with the Information Designer to get the best "in-memory" data import throughput when connecting to medium size dataset in BigQuery (10-30Go). This leads to a data import throughput of 1min/Go.

It's not clear what makes Redshift connection faster though and if there is faster method to import data from GCP/BigQuery to Spotfire 🤷‍♂️

Any comments or suggestions are welcome! Tarik