45
votes
<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I want to extract the src attribute from an img tag using BeautifulSoup (bs4). a.attrs['src'] does not give me the src, although I can get the href the same way. What should I do?

4
Hi, your post is kinda hard to read -- add some punctuation and line-breaks. It would also be helpful to report the exact error message you receive and what you'd expect / want to happen. - patrick
@patrick I have edited the question - iDelusion
Why would you expect a.attrs['src'] to work? There's no <a> tag with a src attribute in the snippet you've shown. - jwodder
this is also a completely different question than before & the headline makes no sense now. - patrick
@patrick I used regex to get the src. What are the other questions? - iDelusion

4 Answers

63
votes

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags directly, but the same approach works for a URL together with urllib2. Note that the code below is Python 2 and uses the old BeautifulSoup 3 package.

For URLs

from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    #print image source
    print image['src']
    #print alternate text
    print image['alt']

For Texts with img tag

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']
15
votes

A link (the a tag) doesn't have a src attribute; you have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# this returns the src attribute of the img tag inside the first 'a' tag
print(soup.a.img['src'])
# prints: some

# if you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# prints: some
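
As a possible alternative to walking the tags by hand, a CSS selector can pick out the img tags directly. The following is a minimal sketch, not part of the original answer; the second a tag without an image is a made-up addition to show that links without images are simply skipped.

```python
from bs4 import BeautifulSoup

# Same markup as above, plus a hypothetical second link with no image in it.
html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
    <a href="other">no image here</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# "a img[src]" matches img tags that have a src attribute and sit inside an a tag.
sources = [img["src"] for img in soup.select("a img[src]")]
print(sources)
```

Because the selector already requires the src attribute, indexing with img["src"] cannot raise a KeyError here.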
4
votes

Here is a solution that will not raise a KeyError if an img tag does not have a src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
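
A related sketch, under stated assumptions: Tag.get returns None for a missing attribute, which gives the same safety as has_attr, and urljoin from the standard library can turn relative src values into absolute URLs. The base_url and the sample markup below are hypothetical and only there to keep the example self-contained.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup, used here instead of a live request.
base_url = "https://example.com/gallery/"
html = """
<img src="/static/logo.png" alt="logo">
<img alt="no source here">
<img src="photo.jpg">
"""

soup = BeautifulSoup(html, "html.parser")

absolute_sources = []
for img in soup.find_all("img"):
    src = img.get("src")  # None when the attribute is missing, no KeyError
    if src is None:
        continue
    # Resolve relative paths against the page URL.
    absolute_sources.append(urljoin(base_url, src))

print(absolute_sources)
```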
2
votes

The code in the top-rated answer no longer works with Python 3. Below is a Python 3 version of it using bs4 and urllib3. As in that answer, htmlText contains the img tags directly, and the same approach works for a URL.

For URLs

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # print image source (None if the tag has no src attribute)
    print(image.get('src'))
    # print alternate text (None if the tag has no alt attribute)
    print(image.get('alt'))

For Texts with img tag

from bs4 import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText, "html.parser")
images = soup.findAll('img')
for image in images:
    print(image['src'])