I want to extract shortened URLs from tweets, if there are any. These URLs follow a standard form: http://t.co (details here)
For this, I used the following regular expression, which works fine when I test it on tweet text stored as a plain string.
NOTE: I am using https://shortnedurl/string instead of the real shortened URL because Stack Overflow does not allow posting such URLs.
Sample code:
import re
tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"
# Find every http(s) URL in the tweet text
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  tweet)
for url in urls:
    print("printing urls " + url)
The output of this code:
printing urls https://shortnedurl.com
However, when I read the tweet from Twitter using its API and run the same regex on it, I get the following undesirable output:
printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead
It seems to be picking up the URLs for the Twitter username and the hashtag as well.
How can I deal with this problem? I only want to extract the http://t.co URLs.
UPDATE1: I tried https?://t.co/\S*, but I am still getting the following noisy URLs:
printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>
I do not know why the same URL is found again with </a></span> appended.
With https?://t.co/\S+, I get invalid URLs because it combines both of the URLs above into one:
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>
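A likely reason for the merged match is that \S accepts any non-whitespace character, including ", < and >, so the match runs through the markup until the next space. The sketch below reproduces that and tries a narrower pattern. It assumes that the path of a short link contains only ASCII letters and digits; the sample string, the path AbC123 and the helper find_short_urls are made up for illustration.

```python
import re

# Assumption: short-link paths contain only ASCII letters and digits.
# The dot in "t.co" is escaped so it matches a literal dot only.
SHORT_URL = re.compile(r'https?://t\.co/[A-Za-z0-9]+')


def find_short_urls(text):
    """Return the t.co URLs found in text, in order, without duplicates."""
    # dict.fromkeys keeps first-seen order (Python 3.7+) and drops repeats,
    # since each link appears twice: once in href and once as anchor text.
    return list(dict.fromkeys(SHORT_URL.findall(text)))


# Made-up sample shaped like the API output; the t.co path is a placeholder.
sample = ('probe <span class="link"><a href="https://t.co/AbC123">'
          'https://t.co/AbC123</a></span> <span class="username">'
          '<a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>')

# The \S+ pattern swallows the markup up to the next space.
print(re.findall(r'https?://t.co/\S+', sample))

# The narrower pattern stops at the first character that is not a letter or digit.
print(find_short_urls(sample))
```

The narrower character class stops at the closing quote of the href and at the < of the closing tag, so the Twitter profile and search links are not matched at all.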
UPDATE2: The tweet text looks a bit different from what I expected:
Grim discovery in the USS McCain collision probe
<span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span> <span class="username"><a
href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
reports <span class="tag"><a href="https://twitter.com/search?
q=%23TheLead">#TheLead</a></span>
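Since the text is HTML rather than plain text, one option might be to parse it and read the href attributes instead of running a regex over the markup. The following is only a sketch using the standard library's html.parser and urllib.parse. The names LinkCollector and short_links are hypothetical, and the host defaults to t.co but is set to the placeholder shortenedurl here so that it works on the sample above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose host equals `host`."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Keep the link only when its host is the shortener's host.
        if href and urlparse(href).hostname == self.host:
            self.links.append(href)


def short_links(html, host='t.co'):
    """Return the hrefs in html that point at the given host."""
    parser = LinkCollector(host)
    parser.feed(html)
    parser.close()
    return parser.links


# The tweet text from above, joined onto fewer lines.
tweet_html = (
    'Grim discovery in the USS McCain collision probe '
    '<span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span> '
    '<span class="username"><a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span> '
    'reports <span class="tag"><a href="https://twitter.com/search?q=%23TheLead">#TheLead</a></span>'
)

print(short_links(tweet_html, host='shortenedurl'))
```

Comparing the parsed hostname rather than matching text means the profile and hashtag links on twitter.com are skipped, and the anchor text is never scanned, so each short link is reported once.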