python - Regular expression to extract URL from an HTML link

Question

I’m new to Python and I’m learning regexes, but I need help with this one.

Here is the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I’m trying to write a tool that prints only the URL, http://www.ptop.se. Can you help me, please?

Duplicate: stackoverflow.com/questions/430966/regex-for-links-in-html-text — S.Lott
I've been away from SO for a while, it's good to see I've missed nothing, and people are STILL asking how to parse HTML with regex every damn day. — bobince
@bobince Multiple times a day, it is so bad I created two questions that I can redirect people to and a form answer that points them there. — Chas. Owens

David David · Accepted Answer · 2009-01-31T19:17:06

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Here, s is the string you're searching for matches.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

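To see those pieces working together, here is a minimal sketch that runs the same pattern over the link from the question. The single-quoted and unquoted example.com links and the name HREF_RE are made up for illustration; they only show why the opening quote is optional.

```python
import re

# Sample input: the link from the question, plus two hypothetical variants
# (single-quoted and unquoted href) added purely for illustration.
s = (
    '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>\n'
    "<a href='http://example.com/a'>single quotes</a>\n"
    '<a href=http://example.com/b>no quotes</a>'
)

# Same pattern as above: an optional opening quote, then a group capturing
# everything up to the next quote, space, or '>'.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

urls = HREF_RE.findall(s)
print(urls)
# ['http://www.ptop.se', 'http://example.com/a', 'http://example.com/b']

# In a raw string the backslash is kept as-is, so these spell the same pattern.
assert r'\d+' == '\\d+'
```

Because the pattern has exactly one group, findall returns just the captured URLs rather than the whole "href=..." match.
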
The suggestion in another answer to use BeautifulSoup isn't bad, but it does add an external dependency. It also doesn't help with your stated goal of learning regexps, which I'd assume this specific HTML-parsing project is just one part of.

It's pretty easy to do:

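from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

soup = BeautifulSoup(html_to_parse, 'html.parser')
# href=True keeps only the <a> tags that actually have an href attribute
for tag in soup.find_all('a', href=True):
    print(tag['href'])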

Once you've installed BeautifulSoup, anyway.
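
If the extra install is the sticking point, a rough alternative is the standard library's html.parser module. This is only a sketch of that different approach, not something proposed in the answers here; the class name HrefCollector is hypothetical, and html_to_parse is assumed to hold the HTML from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


# Assumed input: the link from the question.
html_to_parse = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'

collector = HrefCollector()
collector.feed(html_to_parse)
print(collector.hrefs)
# ['http://www.ptop.se']
```

It is more code than the regexp, but it only looks at real start tags, so text that merely contains "href=" is not picked up.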
