9
votes

I am trying to query a variable from a Microsoft SQL Server database using R/RODBC. RODBC is truncating the character string at 8000 characters.

Original code: truncates at 255 characters (as per RODBC documentation) library(RODBC) con_string <- odbcConnect("DSN") query_string <- "SELECT text_var FROM table_name" dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)

Partial solution: modifying query string truncate text after 7999 characters. library(RODBC) con_string <- odbcConnect("DSN") query_string <- "SELECT [text_var]=CAST(text_var AS VARCHAR(8000)) FROM table_name" dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)

The table/variable contains text strings at long as 250,000 characters. I really want to work with all the text in R. Is this possible?

@BrianRipley discusses the problem (but no solution) on page 18 of following document: https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf

@nutterb dicusses similar issues with RODBCext package on GitHub: https://github.com/zozlak/RODBCext/issues/6

Have seen similar discussion on SO, but no solution using RODBC with VARCHAR>8000.

RODBC sqlQuery() returns varchar(255) when it should return varchar(MAX)

RODBC string getting truncated

Note:

  • R 3.3.2
  • Microsoft SQL Server 2012
  • Linux RHEL 7.1
  • Microsoft ODBC Driver for SQL Server
2
Regarding the partial solution - you may be wondering why this truncates at 7999 instead of 8000. This is due to a bug in RODBC which has been fixed in the RODBCext package - see in particular github.com/zozlak/RODBCext/issues/8. Additionally, in RODBCext zozlak has made the design choice to allow 65,535 characters by default if pulling from a column with the VARCHAR(max) type. So to future readers, I would recommend something similar to @Benjamin's approach below - but looping in increments of 65,535 instead of 8,000.logworthy

2 Answers

5
votes

Since this is a limitation of the ODBC driver provided by Microsoft, there's little to be done until they make a change to the driver. @zozlak explains why in the GitHub issue you linked to.

I tend to use stored procedures to get around this when I need to, but that usually requires writing a stored procedure for each particular instance. At some point, I may come up with a way to do this in a stored procedure more generically, but I find the process of constructing queries in a stored procedure to be tedious and frustrating.

For that reason, I just spent some time building a function that will do limited queries that involve VARCHAR(MAX) variables. It's a brute force approach that for a variable of 17000 characters will export it in three variables and paste them together in R. It's crude, probably not very efficient, but the best solution I've come up with yet.

The other limitation is that it doesn't allow you to rename variables within the query. You'll be stuck with the variables as they are named in the database. If you're only involving a couple of tables, that may not be a problem. In very complex database, this can be problematic. But, at least with this you could query just the VARCHAR(MAX) variables with a handful of necessary IDs on which to perform a merge in R.

As discussed in the GitHub issue, it's probably best to avoid VARCHAR(MAX) as much as possible. If an unknown length is truly necessary, VARBINARY(MAX) is a little easier to query.

Example

source("https://gist.githubusercontent.com/nutterb/d2e050dada608bb6213e61d0f8471b65/raw/be8717f318b3e3087e7c26c9a8f9d0a582a5daef/query_varchar_max"

channel <- odbcDriverConnect(...)

query_varchar_max(channel = channel,
                  id = c("idvar"),
                  varchar_max = c("varchar_max_var", "varchar_max_var2"),
                  from = "FROM dbo.table_name WHERE group = ?",
                  data = list(group = "A"))

Function Code

#' @name query_varchar_max
#' @title Query a VARCHAR(MAX) Variable from SQL Server
#' 
#' @description The RODBC driver to SQL Server (SQL Server Native Client 11.0)
#'   reports the lenght of a VARCHAR(MAX) variable to be zero.  This presents 
#'   difficulties in extracting long text values from the database. Often, the
#'   ODBC will assume a length of 255 characters and truncate the text to that
#'   many characters.  The approach taken here searches the VARCHAR(MAX) variables 
#'   for the longest length, and extracts the data in segments to be pasted 
#'   together in R.  
#'   
#' @param channel A valid ODBC channel to a SQL Server database.
#' @param id A character vector of ID variables that may be used to merge the 
#'   data from this query into another dataset.
#' @param varchar_max a character vector of variable names that are to be 
#'   treated as if they are VARCHAR(MAX) variables. 
#' @param from A single character string providing the remainder of the query 
#'   to be run, beginning with the \code{FROM} statement.
#' @param stringsAsFactors \code{logical(1)}. Should character strings returned 
#'   from the database be converted to factors?
#' @param ... Additional arguments to \code{sqlExecute} when running the full 
#'   query.
#'   
#' @details \code{query_varchar_max} operates by determining how many columns of up to
#'   8000 characters each are required to export a complete VARCHAR(MAX) variable.
#'   It then creates the necessary number of intermediate variables and queries the 
#'   data using the SQL Server \code{SUBSTRING} command, extracting the VARCHAR(MAX)
#'   variable in increments of 8000 characters. After completing the query, 
#'   the intemediary variables are concatenated and removed from the data.
#'   
#'   The function makes accommodation for multi-part queries as far as [TABLE].[VARIABLE]
#'   formats are concerned. It is not intended for use in [SCHEMA].[TABLE].[VARIABLE]
#'   formats. This at least allows \code{from} to include joins for more complex 
#'   queries.  Parameterized queries are also supported through \code{sqlExecute}.
#'
#' @export

query_varchar_max <- function(channel, id, varchar_max, from, 
                              stringsAsFactors = FALSE, ...)
{
  coll <- checkmate::makeAssertCollection()

  checkmate::assert_class(x = channel,
                          classes = "RODBC",
                          add = coll)

  checkmate::assert_character(x = id,
                              add = coll)

  checkmate::assert_character(x = varchar_max,
                              add = coll)

  checkmate::assert_character(x = from,
                              len = 1,
                              add = coll)

  checkmate::assert_logical(x = stringsAsFactors,
                            len = 1,
                            add = coll)

  checkmate::reportAssertions(coll)

  varchar_max_len <-
    paste0(
      sprintf("MAX(LEN(%s)) AS len_%s", 
              varchar_max,
              sub("[.]", "_", varchar_max)),
      collapse = ", "
    )

  varchar_len <- 
    unlist(
      RODBCext::sqlExecute(
        channel = channel,
        query = sprintf("SELECT %s %s",
                        varchar_max_len,
                        from),
        fetch = TRUE
      )
    )

  varchar_max_cols <- 
    unlist(
      mapply(expand_varchar_max,
             varchar_max,
             varchar_len,
             SIMPLIFY = FALSE)
    )

  Prelim <- 
    RODBCext::sqlExecute(
      channel = channel,
      query = sprintf("SELECT %s, %s %s",
                      paste0(id, collapse = ", "), 
                      paste0(varchar_max_cols, collapse = ", "),
                      from),
      fetch = TRUE,
      stringsAsFactors = stringsAsFactors,
      ...
    )

  var_stub_to_combine <-
    unique(
      sub(
        "(part)(\\d{1,3})", 
        "\\1",
        sub(".+AS ", "", varchar_max_cols)
      )
    )

  col_to_combine <- 
    lapply(var_stub_to_combine,
           grep,
           names(Prelim))

  Prelim[sub(".+[.]", "", varchar_max)] <-
    lapply(col_to_combine,
           function(col) apply(Prelim[col], 1, paste0, collapse = ""))

  Prelim[-unlist(col_to_combine)]

}


expand_varchar_max <- function(varchar_max, varchar_len)
{
  nvar <- varchar_len %/% 8000 + 1

  var_list <- vector("character", length = nvar)

  for (i in seq_along(var_list))
  {
    var_list[i] <- 
      sprintf("SUBSTRING(%s, %s, %s) AS %s_part%s",
              varchar_max,
              1 + (i - 1) * 8000,
              8000,
              paste0(sub("[.]", "_", varchar_max)),
              i)
  }
  var_list
}
4
votes

Maybe this has changed recently, but with the following query format getting varchar(MAX) fields longer then 8000 character can be queried from SQL Server into R

query_string <- "SELECT CAST(text_var AS text) AS text_var FROM table_name"