
I am using Python's BigQuery client to create and keep up to date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country, etc.). Keeping them up to date requires deleting and replacing data for past days, because the daily tables for Firebase events can change after they are created (see here and here). I keep them up to date this way to avoid querying the entire dataset, which is very expensive, both financially and computationally.

This deletion-and-replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in a table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I apply them to. If I want to run tests, I would also have to duplicate the aforementioned queries for the test tables.

I basically need something like psycopg2.sql for BigQuery, so that I can safely parameterize table and column names while avoiding SQLi. I actually tried to repurpose this module by calling the as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match and I'd need to open a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text, to no avail. So I concluded I'd have to implement some way of parameterizing the table name myself, or find a workaround using the Python client library. Any ideas on how I should go about doing this in a safe way that won't lead to SQLi? I can't go into detail, but unfortunately I cannot store the tables in Postgres or any other database.
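
For concreteness, here is a minimal sketch of what the delete step looks like for me today, using the google-cloud-bigquery client (the table name and date are placeholders). The date can be passed as a real query parameter, but the table name has to be spliced into the SQL as a string, and that string splicing is the part I want to make safe:

import datetime

from google.cloud import bigquery

client = bigquery.Client()

# The table name cannot be a query parameter, so it is formatted into the SQL.
table = "my-project.my_dataset.daily_counts"  # placeholder

query = f"DELETE FROM `{table}` WHERE event_date >= @cutoff"

# The date, by contrast, can be passed as a proper query parameter.
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("cutoff", "DATE", datetime.date(2019, 11, 1))
    ]
)

client.query(query, job_config=job_config).result()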

Why are you worried about SQLi? Will your code receive external parameters? - rmesteves
No, it isn't supposed to receive external parameters. I'm not an expert in these things, so I don't know how risky it is to have a Python function for this process stored on a server, which accepts a table name as an argument and just uses string operations to build the query. I mean, you'd have to hack SSH to gain access to the server. What do you think? - Jonathan
Being honest, if someone can hack your server, the smallest problem you'll have is SQLi through this script. Someone hacking your server could simply alter your code to do anything. I'd just ensure that the server's security is good enough. If you really feel that you should handle the SQLi concern, one approach you could try is checking your parameters with a regex before proceeding. - rmesteves
Yeah, I thought that would be the case for hacking the server, but I was uncertain whether someone could somehow call the script without gaining full access to the server. I think I'm going to go for the regex approach and post the answer once finished. Even if it isn't strictly necessary for security, it's good for reporting formatting errors to others who will use my code. - Jonathan
I'll post an answer to summarize the discussion. Feel free to add your own complete answer later if you feel it explains the solution better. - rmesteves

1 Answer


As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.

If you nevertheless need or want to validate the input parameter before building your query, I recommend checking the input strings with a regex. In Python you can use the re library.

Since I don't know how your code works, how your datasets/tables are organized, or exactly how you plan to check whether a string is a valid source, I created the basic example below that shows how you could check a string with this library:

import re

# Example sources: the second one is a classic injection attempt.
tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>, where both parts
# may contain only alphanumeric characters and dashes.
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    # fullmatch() ensures the entire string matches, not just a prefix.
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")

In this example, only strings that match the pattern dataset.table (where both the dataset and the table contain only alphanumeric characters and dashes) will be considered valid.

When running the code, the first and third elements of the list are considered valid, while the second (which could potentially change your whole query) is rejected.
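
If you want to see how the check fits into the query-building step, here is a rough sketch (the function, table name, and date are placeholders, and I'm assuming the google-cloud-bigquery client): only the validated identifier is formatted into the SQL, while the date value is still passed as a real query parameter.

import datetime
import re

from google.cloud import bigquery

regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

def delete_from_date(client, source, cutoff):
    # Refuse anything that doesn't look like <dataset>.<table>.
    if not regex.fullmatch(source):
        raise ValueError("This source is not ok: " + source)
    # Only the validated identifier is formatted in; the date stays a parameter.
    query = f"DELETE FROM `{source}` WHERE event_date >= @cutoff"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff)]
    )
    client.query(query, job_config=job_config).result()

delete_from_date(bigquery.Client(), "your-dataset.your-table", datetime.date(2019, 11, 21))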