0
votes

I have a dataset containing thousands of tweets. Some of them contain URLs, but most are in the shortened forms typically used on Twitter. I need something that gets the full URLs so that I can check for the presence of some particular websites. I have solved the problem in Python like this:

import urllib2

# raw strings (r'...') keep the backslashes in the Windows paths
# from being read as escape sequences
url_filename = r'C:\Users\Monica\Documents\Pythonfiles\urlstrial.txt'
url_filename2 = r'C:\Users\Monica\Documents\Pythonfiles\output_file.txt'
url_file = open(url_filename, 'r')
out = open(url_filename2, 'w')
for line in url_file:
    tco_url = line.strip('\n')
    # urlopen follows the redirect; req.url holds the final, full URL
    req = urllib2.urlopen(tco_url)
    print >>out, req.url
url_file.close()
out.close()

This works, but it requires exporting my URLs from Stata to a .txt file and then reimporting the full URLs. Is there some version of my Python script that would allow me to integrate the task into Stata using the shell command? I have quite a lot of different .dta files and would ideally like to avoid appending them all just to execute this task.

Thanks in advance for any answer!

What does the Python script do precisely? I think I can guess except for req = urllib2.urlopen(tco_url) - Nick Cox
It transforms the tiny URLs used on Twitter, for example bit.ly/162VWRZ, into full-blown URLs with the complete website, so that I can then, say, match the linking patterns of users to their political affiliation. - jre11
That's not to my knowledge a built-in Stata command or function. - Nick Cox

1 Answer

1
votes

Sure, this is possible without leaving Stata. I am using a Mac running OS X. The details might differ on your operating system, which I am guessing is Windows.

Python and Stata Method

Say we have the following trivial Python program, called hello.py:

#!/usr/bin/env python

import csv

data = [['name', 'message'], ['Monica', 'Hello World!']]
with open('data.csv', 'w') as wsock:
    wtr = csv.writer(wsock)
    for row in data:
        wtr.writerow(row)
# no explicit close() is needed: the with block closes the file on exit

This "program" just writes some fake data to a file called data.csv in the script's working directory. Now make sure the script is executable: chmod 755 hello.py.

From within Stata, you can do the following:

! ./hello.py
* The above line called the Python program, which created a data.csv file.
insheet using data.csv, comma clear names case
list

     +-----------------------+
     |   name        message |
     |-----------------------|
  1. | Monica   Hello World! |
     +-----------------------+

This is a simple example. The full process for your case will be:

  1. Write file to disk with the URLs, using outsheet or some other command
  2. Use ! to call the Python script
  3. Read the output into Stata using insheet or infile or some other command
  4. Clean up by deleting the intermediate files with capture erase my_file_on_disk.csv
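
Assuming a script named expand_urls.py (a made-up name here) that reads one short URL per line from an input file and writes the expanded URLs to an output file, the four steps might be sketched in a do-file like this (the file and variable names are only examples):

```stata
* hypothetical sketch of the four steps; all names are illustrative
outsheet tco_url using short_urls.txt, nonames noquote replace
! python expand_urls.py short_urls.txt long_urls.txt
insheet using long_urls.txt, clear
capture erase short_urls.txt
capture erase long_urls.txt
```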

Let me know if that is not clear. It works fine on *nix; as I said, Windows might be a little different. If I had a Windows box I would test it.

Pure Stata Solution (kind of a hack)

Also, I think what you want to accomplish can be done completely in Stata, but it's a hack. Here are two programs. The first simply opens a log file and requests the URL (passed as the first argument). The second reads that log file and uses a regular expression to find the URL that Stata was redirected to.

capture program drop geturl
program define geturl
    * pass short url as first argument (e.g. http://bit.ly/162VWRZ)
    capture erase temp_log.txt
    log using temp_log.txt
    copy `1' temp_web_file
end

The above program will not finish because the copy command will fail (intentionally). It also doesn't clean up after itself (intentionally). So I created the next program to read what happened (and get the URL redirect).

capture program drop longurl
program define longurl, rclass
    * find the url in the log file created by geturl
    capture log close
    loc long_url = ""
    file open urlfile using temp_log.txt , read
    file read urlfile line
    while r(eof) == 0 {
        if regexm("`line'", "server says file permanently redirected to (.+)") == 1 {
            loc long_url = regexs(1)
        }
        file read urlfile line
    }
    file close urlfile
    return local url "`long_url'"
end

You can use it like this:

geturl  http://bit.ly/162VWRZ
longurl
di "The long url is:  `r(url)'"
* The long url is:  http://www.ciwati.it/2013/06/10/wdays/?utm_source=twitterfeed&
* > utm_medium=twitter

Run them one after the other. Things might get ugly with this solution, but it does find the URL you are looking for. Another approach would be to contact the shortening service and ask nicely for some data.

If someone at Stata is reading this, it would be nice to have copy return the HTTP response header information. Doing this entirely in Stata is a little out there. Personally, I would use Python for this sort of thing and save Stata for the analysis of the data once I had everything I needed.