0
votes

I'm querying tables in an Azure SQL DB from Azure Data Lake Analytics and am experiencing inefficient queries. The queries are simple, all of the form SELECT * FROM EXTERNAL Datasource EXECUTE @"SELECT * FROM externalTable";. This table consists of more than 60 million rows. The challenge is that when the U-SQL script executes and retrieves all 60 million rows, U-SQL splits the operation into only one vertex, making it impossible to scale the job. If I split the query into X "part queries", each retrieving a slice of the total rows, and then combine all the part queries at the end, I obviously get X vertices. To demonstrate "part queries":

    SELECT * FROM EXTERNAL Datasource EXECUTE @"SELECT * FROM externalTable
        WHERE registered >= GETDATE()-10000 AND registered !> GETDATE()-8000"
    UNION ALL
    SELECT * FROM EXTERNAL Datasource EXECUTE @"SELECT * FROM externalTable
        WHERE registered >= GETDATE()-7999 AND registered !> GETDATE()-6000";

My question is, is this the preferred way of querying external data sources efficiently, or am I missing something?


1 Answer

0
votes

Try placing the Azure SQL Database server and the Data Lake resources that participate in cross-database queries in the same region for better performance.

You can also retrieve all those millions of records in batches by parameterizing your elastic queries on the Azure SQL Database side. Parameterized operations can be pushed down to the remote database and evaluated there rather than on the Azure Data Lake side. Read more about it here.
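As a rough sketch of the batching idea (the rowset names and the `registered` column come from the question; the date boundaries and output path are illustrative assumptions, not a tested recipe), each batch can be issued as its own remote EXECUTE so the extractions become separate units of work that U-SQL can schedule independently:

```
// Hypothetical sketch: split the remote extraction into fixed date ranges.
// Each EXECUTE sends its own parameterized T-SQL to the remote database,
// where the WHERE clause is evaluated before rows cross the wire.
@batch1 =
    SELECT * FROM EXTERNAL Datasource EXECUTE @"
        SELECT * FROM externalTable
        WHERE registered >= DATEADD(day, -10000, GETDATE())
          AND registered <  DATEADD(day, -8000,  GETDATE())";

@batch2 =
    SELECT * FROM EXTERNAL Datasource EXECUTE @"
        SELECT * FROM externalTable
        WHERE registered >= DATEADD(day, -8000, GETDATE())
          AND registered <  DATEADD(day, -6000, GETDATE())";

// Recombine the batches into one rowset.
@all =
    SELECT * FROM @batch1
    UNION ALL
    SELECT * FROM @batch2;

OUTPUT @all
TO "/output/externalTable.csv"
USING Outputters.Csv();
```

Non-overlapping half-open ranges (`>=` lower bound, `<` upper bound) avoid both gaps and duplicate rows at the boundaries, which the `GETDATE()-10000` / `GETDATE()-7999` style in the question only approximates.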