I ran across a database query written in R that runs against a MapR data store using the Apache Drill ODBC driver. Because my program hits a performance ceiling at about 700,000 rows, I'm looking into using a different database setup than a plain SQL database.

This question is about using R to run a SQL query and store the result in the working environment. I generalized the query to SELECT * FROM ... for the sake of this question.

Say you're running a three-node MapR cluster and execute a SQL query against it from R. Will the query return results faster because it's running on MapR, or would a single RDBMS perform the same?

library(RODBC)

# initialize the connection
ch <- odbcConnect("drill64")

# run the query
df <- sqlQuery(ch, "SELECT * FROM state")

# code to write output to file

# close the connection so we don't get a warning at the end
odbcClose(ch)

Performance-wise, is this the same as using odbcConnect("RMySQL") or some similar MySQL library?

1 Answer

The answer depends on what the underlying data is. Drill is a distributed query engine that can run on a large cluster, so for large data sets it will be advantageous. For very small data sets, a large distributed query engine won't help much. Also keep in mind that Drill can query various data sources, which can give your program a lot more flexibility, depending on the use case(s).

However, if the data is already in MySQL and you are planning to use Drill's JDBC storage plugin to reach MySQL, going through Drill will likely not be beneficial.
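
Since the answer depends on the data, the most reliable way to settle it is to time the same query over both connections. Below is a minimal sketch of such a comparison, not a definitive benchmark. The helper time_query and the stand-in fake_query are hypothetical names introduced for illustration, and the "mysql64" DSN in the comments is an assumption; only "drill64" comes from the question.

```r
# Time a query function several times and return elapsed seconds per run.
# `run_query` is any zero-argument function that fetches a data frame.
time_query <- function(run_query, runs = 3) {
  vapply(seq_len(runs), function(i) {
    system.time(run_query())[["elapsed"]]
  }, numeric(1))
}

# Stand-in for a real query so the sketch runs without a database.
fake_query <- function() data.frame(state = state.name)

timings <- time_query(fake_query, runs = 3)
print(summary(timings))

# With real connections this might look like the following
# (the "mysql64" DSN is an assumed name for a direct MySQL ODBC source):
# library(RODBC)
# drill <- odbcConnect("drill64")
# mysql <- odbcConnect("mysql64")
# time_query(function() sqlQuery(drill, "SELECT * FROM state"))
# time_query(function() sqlQuery(mysql, "SELECT * FROM state"))
# odbcClose(drill); odbcClose(mysql)
```

Running each query more than once matters because the first run often includes connection and caching overhead. Note also that a SELECT * pulling hundreds of thousands of rows into R may be limited by transfer and data frame construction rather than by the engine, so it is worth timing the query you actually need.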