I want to collect all Twitter card headlines and URLs from my tweets for a project. For example, for this tweet: https://twitter.com/WSJ/status/1021517076069056514, I would want to retrieve the following information:

Right now, I'm getting this information by going to the tweet and inspecting the card, but I'd like to do this in code and iterate through my tweets. Does anyone know how to get this information programmatically? Would really appreciate it!

Possible duplicate, or at least helpful: Get Twitter card from API – chickity china chinese chicken

1 Answer

TL;DR: The real, best answer may be a duplicate of Get Twitter card from API.

That answer suggests inspecting a request to the URL and examining the HTML elements. This works for your example tweet, but unfortunately it is likely not general enough to work for all others.

For example, I used hard-coded tags found in the example tweet that may not be present in others. Still, this can serve as a starting point and be adapted to work for all tweets.

Most importantly, it proves it can be done.

import requests
import tweepy
from bs4 import BeautifulSoup
from tweepy import OAuthHandler

# Fill in your Twitter API credentials
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

tweet_id = 1021517076069056514

status = api.get_status(id=tweet_id)

# Take the first URL entity attached to the tweet and fetch that page
tweet_url = status.entities['urls'][0]['expanded_url']

r = requests.get(tweet_url)

soup = BeautifulSoup(r.content, 'html.parser')

# Hard-coded selectors taken from the example tweet's HTML
media_container = soup.select('div.card2.js-media-container')

tweet_card = media_container[0].select('div.js-macaw-cards-iframe-container')

# Path of the iframe that renders the card, appended to the base URL below
tweet_card_url = tweet_card[0]['data-full-card-iframe-url']

twitter_base_url = 'http://www.twitter.com'

r2 = requests.get(''.join([twitter_base_url, tweet_card_url]))

final_page = r2.content

soup2 = BeautifulSoup(final_page, 'html.parser')

# The card image carries the headline in its alt attribute
final_data = soup2.find('img', {'class': 'u-block'})

headline = final_data['alt']
image_link = final_data['data-src']

print('Headline: {}'.format(headline))
print('Image Link: {}'.format(image_link))

This prints:

Headline: Global central banks have rattled bond markets
Image Link: https://pbs.twimg.com/card_img/1021513789722841093/LQWGa8uL?format=jpg&name=600x314
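
Since the hard-coded lookup above raises an error as soon as a tag is missing, one way to adapt it for many tweets is to make the final extraction step return None instead of failing. The following is only a rough sketch under a few assumptions: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies, the names CardImageParser, extract_card and collect_cards are hypothetical, and the sample markup is made up rather than real Twitter output. It assumes you have already downloaded each card page's HTML, for example with requests as above.

```python
from html.parser import HTMLParser


class CardImageParser(HTMLParser):
    """Record alt and data-src of the first <img> whose class list has 'u-block'."""

    def __init__(self):
        super().__init__()
        self.found = False
        self.headline = None
        self.image_link = None

    def handle_starttag(self, tag, attrs):
        if self.found or tag != 'img':
            return
        attr_map = dict(attrs)
        # A valueless class attribute is reported as None, so fall back to ''
        classes = (attr_map.get('class') or '').split()
        if 'u-block' in classes:
            self.found = True
            self.headline = attr_map.get('alt')
            self.image_link = attr_map.get('data-src')


def extract_card(html_text):
    """Return (headline, image_link), or None if no matching <img> is present."""
    parser = CardImageParser()
    parser.feed(html_text)
    if not parser.found:
        return None
    return parser.headline, parser.image_link


def collect_cards(pages):
    """Map tweet id -> card fields, skipping pages without a recognizable card."""
    results = {}
    for tweet_id, html_text in pages.items():
        card = extract_card(html_text)
        if card is not None:
            results[tweet_id] = card
    return results


# Made-up sample markup, for illustration only
sample_pages = {
    1: '<div><img class="u-block" alt="Example headline" '
       'data-src="https://example.com/a.jpg"></div>',
    2: '<div><p>No card here</p></div>',
}

cards = collect_cards(sample_pages)
print(cards)
```

Tweets without a card (or with different markup) are simply left out of the result, so you can log or inspect those IDs separately and extend the selectors as you find new card layouts.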