1
votes

I want to create multiple functions in R to run embedded in MonetDB.

Function 1 (calling in the select part of a query)

dbGetQuery(conn, "DROP FUNCTION normalize;")
functionDef <- paste(
    "CREATE FUNCTION normalize(welltype STRING, data_column DOUBLE) RETURNS TABLE (i DOUBLE) LANGUAGE R {", 
    "idx <- which(welltype == 'LC')",
    "100 * data_column / median(data_column[idx])",
    "};", sep = "\n")
dbGetQuery(conn, functionDef)

example use

stmt <- "SELECT barcode, normalize(welltype_code, data_column1) FROM hcs;"
dbGetQuery(conn, stmt)

This works just fine.

Function 2 (calling in the select part of a query)

So I wanted to create a more advanced version:

dbGetQuery(conn, "DROP FUNCTION normalize2;")
functionDef <- paste(
    "CREATE FUNCTION normalize2(barcode DOUBLE, welltype STRING, data_column DOUBLE) RETURNS TABLE (i DOUBLE) LANGUAGE R {", 
    "idx <- which(welltype == 'LC')",
    "100 * data_column / median(data_column[idx])",
    "};", sep = "\n")
dbGetQuery(conn, functionDef)

As you see, the only thing that I have changed so far, is the addition of an extra argument to the function.

However, this fails:

example use:

stmt <- "SELECT normalize2(barcode, welltype_code, data_column1) FROM hcs;"
dbGetQuery(conn, stmt)

results in:

 !SELECT: no such operator 'normalize2'

I checked, and the types are all fine:

barcode = DOUBLE
welltype_code = CHARACTER LARGE OBJECT
data_column1 = DOUBLE

I then tried another way that I have seen functions used in MonetDB.

Calling function with result from sub query

In this case, we feed the function the result of a SELECT query, instead of using it in the column selection part of the query.

dbGetQuery(conn, "SELECT * FROM normalize2( (SELECT barcode, welltype_code, data_column1 FROM hcs) );")

It works!

However,I need to perform this operation on many columns of the table. With the first syntax I could do the following:

stmt <- "SELECT barcode, normalize(welltype_code, data_column1), normalize(welltype_code, data_column2) FROM hcs;"
dbGetQuery(conn, stmt)

I am not sure how to accomplish a similar result with the new query structure, apart from joining up the intermediate tables. (And since this operation has to be done on thousands of columns, that will probably not be the most efficient strategy)

So my question boils down to: Why does:

SELECT normalize(welltype_code, data_column1) FROM hcs;

work? And not:

SELECT normalize2(barcode, welltype_code, data_column1) FROM hcs;

And secondly, what is the difference with calling it using a sub-query. If that is the only way to use it, pointer on how to write a function that could be applied to a varying number of columns in an efficient way, would be welcomed.

Test data:

The dput output:

structure(list(barcode = c(110000184638, 110000184638, 110000184638, 
110000184638, 110000184638, 110000184638, 110000184638, 110000184638, 
110000184638, 110000184638), welltype_code = c("LC", "LC", "LC", 
"LC", "LC", "LC", "LC", "LC", "LC", "LC"), data_column1 = c(0.344772189855576, 
0.334164410829544, 0.315271258354187, 0.320378184318542, 0.322041183710098, 
0.32072114944458, 0.29565417766571, 0.321962893009186, 0.298929244279861, 
0.323741465806961)), .Names = c("barcode", "welltype_code", "data_column1"
), row.names = c(NA, 10L), class = "data.frame")
1

1 Answers

2
votes

Projection functions (the ones used after the SELECT) can only return single columns. Hence, the definition of normalize2 should be

CREATE FUNCTION normalize2(barcode DOUBLE, welltype STRING, data_column DOUBLE)
RETURNS DOUBLE LANGUAGE R { ...

instead of

CREATE FUNCTION normalize2(barcode DOUBLE, welltype STRING, data_column DOUBLE) 
RETURNS TABLE (i DOUBLE) LANGUAGE R { ...

Hope this helps