2
votes

I have a column which has only ".edu .gov .org .com" domains stored in it. There is another column in the same table which stores CompanyID. I want to know the best procedure to categorize the company based on their domain name. i.e.: Educational for .edu, government for .gov etc.

My table looks like this:

CompaniID  Website
--------------------------------------------
1          www.comp1.gov/index.php
2          www.great-nice.comp.edu/#contact

I tried the following approach to handle the TLD but it fails for some cases. Once I figure the TLD comparison, I can handle the rest of the query.

Snippet:

CASE
   WHEN
     REGEXP_Instr(Website, '.com') > 0
     AND
     NOT REGEXP_LIKE (SUBSTR (Website, REGEXP_Instr(Website, '.com') + 4,1), '^[a-z]|^[0-9]', 'i')
   THEN  'For Profit'   
   ELSE 'Others'
END 
2

2 Answers

2
votes

You can USE \.[a-z]+(/|$) pattern to extract characters before the '/' characters.

WITH x (y)
     AS (SELECT 'www.comp1.gov/index.php' FROM DUAL UNION ALL
         SELECT 'www.great-nice.comp.edu/#contact' FROM DUAL UNION ALL
         SELECT 'www.comp1.edu' FROM DUAL
        )
SELECT y,
       RTRIM (REGEXP_SUBSTR (y, '\.[a-z]+(/|$)'), '/') tld,
       REGEXP_SUBSTR (y, '\.([a-z]+)(/|$)', 1, 1, NULL, 1) tld2 --works only in 11gR2 and higher versions 
  FROM x;

|    y                              | tld | tld2 |
|-----------------------------------|-----|------|
|  www.comp1.gov/index.php          |.gov | gov  |
|  www.great-nice.comp.edu/#contact |.edu | edu  |
|  www.comp1.edu                    |.edu | edu  |

Use this in CASE statement to categorise as per your needs.

0
votes

That would match things like www.comed.com. You probably want

.com(?:[/]|$)