0
votes

I'm working with Big Query's Hacker News dataset, and was looking at which urls have the most news stories. I'd also like to strip the domain names out, and see which of those have the most news stories. I'm working in R, and am having a bit of trouble getting the follow query to work.

 sql_domain <- "SELECT url,
                       REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//') 
                           THEN url ELSE 'http://' + url END, '&', '?') AS domain_name,
                       COUNT(domain_name) as story_number
                FROM `bigquery-public-data.hacker_news.full`
                WHERE type = 'story'
                GROUP BY domain_name
                ORDER BY story_number DESC
                LIMIT 10"

I've been getting the following error: "Error: No matching signature for operator + for argument types: STRING, STRING. Supported signatures: INT64 + INT64; FLOAT64 + FLOAT64; NUMERIC + NUMERIC"

Can't for the life of me figure out a replacement for the "+" operator. Your help is much appreciated!

2
Do you also need to strip top level domain? i.e. do you need stackoverflow.com or stackoverflow? If you need the latter, then you have stumbled upon a huge problem since everything can be a top level domain nowadays - Check library(urltools) in R for URL parsing. - Sotos
@Sotos stackoverflow.com would be fine! I'll check out urltools though, thank you. - Luke Steer

2 Answers

0
votes

Can't for the life of me figure out a replacement for the "+" operator

In BigQuery - instead of 'http://' + url you should use CONCAT('http://', url)

0
votes

For your goals (top domains submitting to Hacker News):

#standardSQL
SELECT NET.REG_DOMAIN(url) domain, COUNT(*) c
  , ROUND(EXP(AVG(LOG(IF(score<=0,0.1,score)))),2) avg_score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

Note how much easier is to call NET.REG_DOMAIN() to get the domain.

enter image description here