
I want to extract shortened URLs from tweets, if there are any. These URLs follow a standard form: http://t.co (details here)

For this, I used the following regex, which works fine when I test it on tweet text stored as a plain string.

NOTE: I am using https://shortnedurl/string instead of the real shortened URL because StackOverflow does not allow posting such URLs here.

Sample code:

import re

tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"

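# Generic URL pattern: the scheme, then one or more URL characters or %XX escapes.
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
urls = re.findall(url_pattern, tweet)
for url in urls:
    print("printing urls", url)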

The output of this code:

printing urls https://shortnedurl.com

However, when I read the tweet from Twitter using its API and run the same regex on it, I get the following undesirable output:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead

It seems to be picking up the URLs for the Twitter username and the hashtag as well.

How can I deal with this problem? I only want to extract these http://t.co URLs.

UPDATE1: I tried https?://t.co/\S*, but I still get the following noisy URLs:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

I do not know why the same URL is found again with the trailing </a></span>.

With https?://t.co/\S+, I get an invalid URL because it combines both of the above URLs into one:

printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

UPDATE2: The tweet text looks a bit different from what I expected:

    Grim discovery in the USS McCain collision probe
    <span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span>
    <span class="username"><a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
    reports
    <span class="tag"><a href="https://twitter.com/search?q=%23TheLead">#TheLead</a></span>

2 Answers

1
votes

If I understand you correctly, just put the literal host you want to match into the regex, like so:

https?://shortnedurl\.com/\S*
# look for http:// or https://
# shortnedurl.com/ literally
# followed by zero or more non-whitespace characters

See a demo on regex101.com.
For your special case:

https?://t\.co/\S*
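
Since the text returned by the API appears to be HTML (see the second update in the question), \S* can keep matching past the closing quote of the href and into the tags, because there is no whitespace there. A minimal sketch of one way around that, assuming Python 3: stop the match at quotes and angle brackets as well as whitespace, then drop duplicates. The sample tweet, the placeholder link https://t.co/AbCd123, and the names TCO_PATTERN and extract_tco_urls are all made up for illustration.

```python
import re

# Hypothetical tweet HTML modeled on the markup in the question.
# "https://t.co/AbCd123" is a made-up placeholder, not a real short link.
tweet_html = (
    'Grim discovery in the USS McCain collision probe '
    '<span class="link"><a href="https://t.co/AbCd123">https://t.co/AbCd123</a></span> '
    '<span class="username"><a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span> '
    'reports <span class="tag"><a href="https://twitter.com/search?q=%23TheLead">#TheLead</a></span>'
)

# Stop at whitespace, quotes, and angle brackets so a match cannot run into the markup.
TCO_PATTERN = re.compile(r'https?://t\.co/[^\s"\'<>]+')


def extract_tco_urls(text):
    """Return the unique t.co URLs in text, in order of first appearance."""
    unique_urls = []
    for url in TCO_PATTERN.findall(text):
        if url not in unique_urls:
            unique_urls.append(url)
    return unique_urls


for url in extract_tco_urls(tweet_html):
    print("printing urls", url)
```

The link shows up twice in the markup (once in the href, once as the link text), which is why the sketch deduplicates. The twitter.com links are skipped because the pattern requires the host to be exactly t.co.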
1
votes

You can use the regex:

https?://t\.co/\S+
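
If the tweet text really is HTML, as the second update in the question suggests, another option is to skip the regex and read the href attributes directly. This is only a sketch, assuming Python 3 and its standard html.parser and urllib.parse modules. The class TcoLinkCollector, the function tco_links, and the placeholder link https://t.co/AbCd123 are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TcoLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose host is t.co."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and urlparse(value).netloc == "t.co":
                self.urls.append(value)


def tco_links(html_text):
    """Return the t.co links found in the href attributes of html_text."""
    collector = TcoLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls


# Hypothetical sample modeled on the question's markup; the short link is a placeholder.
sample = (
    '<span class="link"><a href="https://t.co/AbCd123">https://t.co/AbCd123</a></span> '
    '<span class="username"><a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>'
)
print(tco_links(sample))
```

Because only the href is read, the link text and the closing </a></span> never end up in the result, and the username and hashtag links are filtered out by the host check.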