0
votes

I have a table containing lines with an id ( representing a person ) linked to a url ( fields names are id and url ). An id can be linked to many urls. An id can be linked to a url multiple times. As a result, I would like to have for each domain we can find in the table, the number of ids in common with each other domain in the table. If 2 domains have no id intersection, I don't need the information: I would like to get only the information when the intersection is not null. The output could be:

  • 1 line per unique domain found, contaning the domain and an array of tuple (domain, volume) representing an intersection between 2 domain and the volume intersecting for each domain where intersection is not empty
  • Just lines with domain1, domain2, intersection volume

Is it possible to do such thing in BigQuery ?

Thank you for your help

2
Are you able to share some sample data and/or the query yo have tried so far? - Ben P

2 Answers

0
votes

If rtenha solution is not what you are looking for, perhaps you may want to consider an INTERSECT function. I hope I understood your need correctly, given you say that if 2 domains have no id intersection you don't need the information then...

Hope my effort helps you to achieve what you want somehow...

WITH tmp as (
(SELECT *
from `project:dataset.table`)
INTERSECT DISTINCT

(SELECT * FROM `project:dataset.table`)
)

SELECT count(*), field1, field2 FROM  tmp GROUP BY field1, field2 
0
votes

You will want to inner join your table against itself, also known as a self-join.

select
  d1.url as url_1 as primary_url,
  d2.url as url_2 as linked_url,
  count(distinct id) as shared_count -- note this gets unique matches, not the total matches
from <dataset.table> d1
inner join <dataset.table> d2 using(id)
where d2.url is not null
group by 1,2

To put the linked urls in an array, put into a CTE and use array_agg.

with temp as(
  ... query from above
)
select primary_url, array_agg(struct(linked_url,shared_count)) as linked_urls
from temp
group by 1

Depending on potential # of linked urls, this might create arrays that are pretty big, which might not be a great idea.