5 votes

To find the country for a specific IP address, I'm using the MaxMind IP address database. I've downloaded the database and imported it into Google BigQuery so I can query it. In a separate table, I'm capturing IP addresses from certain events in our systems. I would now like to join those two data sources.

The columns in the MaxMind database are as follows:

  • start_ip_range STRING NULLABLE
  • end_ip_range STRING NULLABLE
  • start_ip_num STRING NULLABLE
  • end_ip_num STRING NULLABLE
  • country_code STRING NULLABLE
  • country_name

The columns in my event table are:

  • request_id STRING NULLABLE
  • ip_address STRING NULLABLE

As documented here (https://dev.maxmind.com/geoip/legacy/csv/), there is a way to compute an integer representation of an IP address, which I can use to look up the country_code or country_name for that address.
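For reference, the integer form described in the MaxMind docs is 16777216*a + 65536*b + 256*c + d for an address a.b.c.d. A minimal Python sketch of that conversion (the function name is my own, not from MaxMind):

```python
def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to the MaxMind-style integer:
    16777216*a + 65536*b + 256*c + d for an address a.b.c.d."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return a * 16_777_216 + b * 65_536 + c * 256 + d

print(ip_to_int("1.2.3.4"))  # 16909060
```

This is the same arithmetic the SAFE_CAST/SPLIT expression in the query below performs per octet.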

I have now constructed the following query:

SELECT
  p.*,
  g.country_code AS country_code
FROM
  `dev.event_v1` p
INNER JOIN
  `dev.geo_ip_countries` g
ON
  SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(0)] AS NUMERIC)*16777216 + 
  SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(1)] AS NUMERIC)*65536 + 
  SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(2)] AS NUMERIC)*256 + 
  SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(3)] AS NUMERIC)
BETWEEN 
  SAFE_CAST(g.start_ip_num AS INT64)
AND 
  SAFE_CAST(g.end_ip_num AS INT64)
LIMIT 100

However, while this works with a LIMIT clause, it fails when I try to build a view from it.

Two questions: 1. Is there a way to simplify this query? 2. Why does Google BigQuery throw an error when I try to return a large result set?

Error: Query exceeded resource limits. 28099.974050246612 CPU seconds were used, and this query must use less than 5600.0 CPU seconds.

Any help is appreciated!

Solution: Rewriting the query as follows worked and solved the resource-limit issue too:

SELECT
  p.*,
  g.country_code
FROM
  `dev.event_v1` p
INNER JOIN
  `dev.geo_ip_countries` g
ON 
  NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(p.ip_address),16) = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(g.start_ip_range),16)
WHERE 
  NET.SAFE_IP_FROM_STRING(p.ip_address) 
  BETWEEN 
    NET.SAFE_IP_FROM_STRING(g.start_ip_range) 
  AND 
    NET.SAFE_IP_FROM_STRING(g.end_ip_range)
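The key change is the equality condition in the ON clause: a BETWEEN-only join forces BigQuery to compare every event row against every range row, whereas truncating both IPs to their first 16 bits gives an equality key that can be hash-joined, with the BETWEEN acting only as a filter afterwards. This assumes ranges rarely cross a /16 boundary. A hedged Python sketch of what NET.IP_TRUNC(..., 16) does for IPv4 (the helper name is mine):

```python
import ipaddress

def ip_trunc_16(ip: str) -> bytes:
    """Mimic NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(ip), 16) for an IPv4
    address: keep the first 16 bits and zero the remaining 16."""
    packed = ipaddress.ip_address(ip).packed  # 4 bytes, network byte order
    return packed[:2] + b"\x00\x00"

# Two addresses in the same /16 produce the same join key:
print(ip_trunc_16("81.2.69.160") == ip_trunc_16("81.2.0.1"))  # True
```

Any range whose start and end share a /16 prefix will match its events via this key; the BETWEEN in the WHERE clause then narrows the match to the exact range.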

2 Answers

2 votes

Try below (BigQuery Standard SQL)

#standardSQL
SELECT
  p.* EXCEPT(ip_address_num),
  g.country_code AS country_code
FROM (
  SELECT *, 
    SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(0)] AS NUMERIC)*16777216 + 
    SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(1)] AS NUMERIC)*65536 + 
    SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(2)] AS NUMERIC)*256 + 
    SAFE_CAST(SPLIT(p.ip_address, ".")[OFFSET(3)] AS NUMERIC) ip_address_num
  FROM `dev.event_v1` 
) p
INNER JOIN (
  SELECT 
    SAFE_CAST(g.start_ip_num AS INT64) start_ip_num, 
    SAFE_CAST(g.end_ip_num AS INT64) end_ip_num,
    country_code
  FROM `dev.geo_ip_countries`
) g
ON ip_address_num BETWEEN g.start_ip_num AND g.end_ip_num
1 vote

So you're joining everything in dev.event_v1 with dev.geo_ip_countries to get dev.geo_ip_countries.country_code for each row in dev.event_v1. I think you'd be interested in a left join.

You might be interested in testing if the conversions in the net functions can replace your math portion.

I don't know much about the contents of event_v1, or which requests or ip_addresses could partition it, but I'm willing to bet there are far more rows there than in geo_ip_countries. That's likely the bulk of your query time. Let's say you need to reduce that time about six-fold: you could select a sixth of it to join and insert into a staging table, then repeat for the remaining sixths in sequence.

I think NTILE(6) would help you here, perhaps as NTILE(6) OVER (ROWS UNBOUNDED PRECEDING) AS nt, though I'm not sure, really. Then filter with nt = 1 in the WHERE or JOIN ON clause.