0
votes

I scraped Twitter data (not with tweepy), and I want to get the number of images and videos used in each tweet, for every user. What I have so far: the tweet URL ("https://twitter.com/user_screen_name/status/tweet_id"), the user_id, and the tweets (text + links + media).

What I want to do is check whether a tweet contains a video and, if so, count it, and do the same for images. I noticed that the links used in tweets start with "../t.co..", so they are redirect links. As far as I understand, the images and videos shown in the tweet are the ones contained in the redirected link.

I tried this code for counting images, but I didn't get any useful results:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_image_count(url):
    # parse the HTML returned for the URL
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    images = soup.find_all('img')
    file_types = ('jpg', 'jpeg', 'png')
    # loop through all img elements found and keep those with matching extensions
    urls = [x for x in images if x.get('src', '').split('.')[-1] in file_types]
    print(urls)
    return len(urls)

When I run this code with link = 'https://twitter.com/fritzlabs/status/1369661296162054145', this is the output I get:

[<img alt="Twitter" height="38" src="https://abs.twimg.com/errors/logo46x38.png" srcset="https://abs.twimg.com/errors/logo46x38.png 1x, https://abs.twimg.com/errors/logo46x38@2x.png 2x" width="46"/>]

1

Any help here, please? I tried other code but got the same output. Thank you.


2 Answers

1
votes

This happens because the HTML returned by the request is not the tweet but a warning saying that JavaScript is disabled. This is not a fault in your script: it also happens when you make the request in a browser, regardless of whether JavaScript is enabled.
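One rough way to tell from Python that you received this fallback page rather than the tweet is to look at where its images come from. The sketch below is only a heuristic: it assumes, based on the output shown in the question, that the fallback page's images are served from `abs.twimg.com/errors/`. The names `ImgSrcCollector` and `looks_like_fallback_page` are made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def looks_like_fallback_page(html):
    """Heuristic: True if every image found comes from the error-asset path."""
    collector = ImgSrcCollector()
    collector.feed(html)
    # Assumption: error-page assets live under abs.twimg.com/errors/,
    # as in the output posted in the question.
    return bool(collector.sources) and all(
        "abs.twimg.com/errors/" in src for src in collector.sources
    )


# The <img> tag from the question's output, shortened.
sample = '<img alt="Twitter" height="38" src="https://abs.twimg.com/errors/logo46x38.png" width="46"/>'
print(looks_like_fallback_page(sample))
```

If this returns True for the HTML you downloaded, counting `img` tags in it tells you nothing about the tweet's media.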

When a browser requests your example tweet, the JavaScript-disabled HTML is returned first; JavaScript then runs and loads the actual tweet.

To see this in action, open Chrome or Firefox, press F12, and go to the Network tab. Visit your page. The first request is the same as the one you make in Python, for tweet 1369661296162054145. If you look at the preview of that request's response, you will see the JavaScript warning.

Further down the Network tab, you will see a request for 1369661296162054145.json. This is the request that returns the actual tweet, and the one you will need to replicate.
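Once you have that JSON, counting media should be a matter of walking the parsed dictionary. The following is a minimal sketch, not the exact layout of that response (check it in the Network tab). It assumes the payload resembles a standard Twitter API tweet object, where attached media are listed under `extended_entities.media` and each item has a `type` of `photo`, `video`, or `animated_gif`. The function name `count_media` and the sample payload are hypothetical.

```python
import json
from collections import Counter


def count_media(tweet):
    """Count photos and videos attached to one tweet dictionary."""
    # Assumption: media items sit under extended_entities.media,
    # each with a "type" field.
    media = tweet.get("extended_entities", {}).get("media", [])
    counts = Counter(item.get("type") for item in media)
    return {
        "images": counts["photo"],
        # Treating animated GIFs as videos is a choice made for this sketch.
        "videos": counts["video"] + counts["animated_gif"],
    }


# Hypothetical payload, not the real response for the tweet in the question.
sample_json = (
    '{"id_str": "1", "extended_entities": {"media": '
    '[{"type": "photo"}, {"type": "photo"}, {"type": "video"}]}}'
)
print(count_media(json.loads(sample_json)))
```

A tweet with no media simply lacks the `extended_entities` key under this assumption, so the function returns zero for both counts.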

0
votes

So I tried to use Selenium with the PhantomJS driver, as recommended in some posts that I checked. This is the code I tried:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotVisibleException
import requests

link = 'https://twitter.com/fritzlabs/status/1369661296162054145'
driver = webdriver.PhantomJS()
driver.get(link)
image_src = driver.find_element_by_tag_name('img').get_attribute('src')
print(image_src)
response = requests.get(image_src).content
print(response)

I printed 'image_src' to get an idea of what it contains. When I run the code, this is what I get:

NoSuchElementException: Message: {"errorMessage":"Unable to find element with tag name 'img'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"90","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63767","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"tag name\", \"value\": \"img\", \"sessionId\": \"5bba45c0-8279-11eb-b30c-d7ded72a9eb3\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/5bba45c0-8279-11eb-b30c-d7ded72a9eb3/element"}}
Screenshot: available via screen

I am not really familiar with Selenium, nor much with BeautifulSoup, so if anyone could help I'd appreciate it. Thank you.