1
votes

TL;DR

I have a Spring Boot application that makes use of parquet files stored on the file system. To access them we are using Apache Drill.

Since I have multiple users that might access them, I've set up a connection pool in Spring.

When I'm using the connection pool, Drill somehow executes a "limit 0" query before executing my actual query, and this affect performances. The same "limit 0" query is NOT executed when I run my queries through a simple Statement obtained from direct Connection.

This seems to be related to the fact that Spring JdbcTemplate makes use of PreparedStatements instead of simple Statements.

Is there a way to get rid of those "limit 0" queries?

-- Details --

The connection pool in the Spring configuration class looks like this:

@Bean
@ConfigurationProperties(prefix = "datasource.parquet")
@Qualifier("parquetDataSource")
public DataSource parquetDataSource() {
    return DataSourceBuilder.create().build();
}

And the corresponding properties in the development profile YML file are:

datasource:
  parquet:
    url: jdbc:drill:drillbit=localhost:31010
    jdbcUrl: jdbc:drill:drillbit=localhost:31010
    jndiName: jdbc/app_parquet
    driverClassName: org.apache.drill.jdbc.Driver
    maximumPoolSize: 5
    initialSize: 1
    maxIdle: 10
    maxActive: 20
    validation-query: SELECT 1 FROM sys.version
    test-on-borrow: true

When I execute a query using the JdbcTemplate created with the mentioned Drill DataSource, 3 different queries might be executed:

  1. the validation query SELECT 1 FROM sys.version;
  2. a "limit 0" query that looks like SELECT * FROM (<my actual query>) LIMIT 0;
  3. my actual query.

Here's the execution code (parquetJdbcTemplate is an instance of a class that extends org.springframework.jdbc.core.JdbcTemplate):

parquetJdbcTemplate.query(sqlQuery, namedParameters,
            resultSet -> {
                MyResultSet result = new MyResultSet();
                while (resultSet.next()) {
                    // populate the "result" object
                }
                return result;
            });

Here's a screenshot from the Profile page of my Drill monitor: monitor screenshot

The bottom query is the "limit 0" one, then in the middle you have the validation query and on top (even if the query is not shown) the actual query that returns the data I want.

As you can see, the "limit 0" query takes more than 1/3 of the entire execution time to run. The validation query is fine, since the execution time is negligible and it's needed to check the connection.

The fact is, when I execute the same query using a Connection through the Drill driver (thus, with no pool), I only see my actual query in the UI monitor:

public void executeQuery(String myQuery) {
    Class.forName("org.apache.drill.jdbc.Driver");
    Driver.load();
    Connection connection = DriverManager.getConnection("jdbc:drill:drillbit=localhost:31010");
    Statement st = connection.createStatement();
    ResultSet resultSet = st.executeQuery(myQuery);
    while (resultSet.next()) {
        // do stuff
    }
}

monitor screenshot As you can see, the total execution time improves by a lot (~14 seconds instead of ~26), just because the "limit 0" query is not executed.

As far as I know, those "limit 0" queries are executed to validate and get information about the underlying schema of the parquet files. Is there a way to disable them while using the connection pool? I ideally would like to still use PreparedStatements over simple Statements, but I could switch to simple Statements if needed, because I have full control over those queries (so, no SQL injection should be possible unless someone hacks the deployed artifacts).

1

1 Answers

1
votes

You are right Drill executes limit 0 prior prepared statements to get information about schema. I don't think there is a way to disable such behavior. Though I can recommend to enable planner.enable_limit0_optimization option which is false by default, this may speed limit 0 query execution. Another way to speed limit 0 queries is to indicate schema explicitly using casts through the view usage or directly in queries.

Regarding not showing query, I think this was fixed in the latest Drill version.