
I am trying to scrape a website that has multiple 'p' tags with BeautifulSoup, and I am finding it very difficult.

I want to get all the post text contained in the p tags, but find_all in BeautifulSoup does not seem to do this. The image is also not saving: I get an error that the file cannot be saved. How can I retrieve all the text in the p tags, and the image, from the HTML code below?

My code:

kompas = requests.get('https://url_on_html.com/')
beautify = BeautifulSoup(kompas.content,'html5lib')

news = beautify.find_all('div', {'class','jeg_block_container'})
arti = []

for each in news:
    title = each.find('h3', {'class','jeg_post_title'}).text
    lnk = each.a.get('href')
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text,'html5lib')
    content = soup.find('p').text.strip()
    images = soup.find_all('img')

    arti.append({
        'Headline': title,
        'Link': lnk,
        'image': 'images'
        })

Let's take this HTML code as a scraping sample:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<p>the emergency of our matter is Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<p> we will not once in Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well. 
</p>
<script></script>
<br></br>
<script></script>
<p>king of our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<img src="image.png">
<p>he is our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<p>some weas Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>

I want to scrape all the 'p' tags and add their text to my content.

The issue is that find_all in BeautifulSoup does not seem to retrieve them: I only get the text of the first p element.

FYI, ‘to scrap’ and ‘scrapping’ mean to throw away like rubbish. You should use ‘scrape’ and ‘scraping’ for what you’re doing. (comment by DisappointedByUnaccountableMod)

1 Answer


Your code calls soup.find, not soup.find_all. find returns only the first matching tag, while find_all returns a list of all p tags. You cannot call .text.strip() on that list directly, so use a list comprehension to apply it to each item:

soup = BeautifulSoup(r.text,'html5lib')
content = [i.text.strip() for i in soup.find_all('p')]

Now content is a list of strings. To turn this list into a single string, run:

content = ' '.join(content)
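
To see the difference between find and find_all without hitting the network, here is a minimal sketch that runs on a shortened copy of the sample HTML from the question. The paragraph text is abbreviated for illustration, and it uses Python's built-in 'html.parser' instead of 'html5lib' only so that no extra parser needs to be installed; the calls behave the same way for this input.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the sample HTML from the question.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters.</p>
<script></script>
<p> we will not once in a well.
</p>
<img src="image.png">
<p>he is our king.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag.
first = soup.find("p").text.strip()

# find_all() returns every matching tag, so process each one in turn.
paragraphs = [p.text.strip() for p in soup.find_all("p")]
content = " ".join(paragraphs)

# The same pattern works for the image links.
images = [img["src"] for img in soup.find_all("img")]

print(first)
print(len(paragraphs))
print(images)
```

Note that script tags are skipped automatically, because only p tags are requested.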

As for the images, soup.find_all('img') also returns a list, so you need to extract the link from each element in it: images = [i['src'] for i in soup.find_all('img')]. Note that i['src'] raises a KeyError for an img tag that has no src attribute; passing src=True to find_all skips such tags.

Putting it all together:

import requests
from bs4 import BeautifulSoup

kompas = requests.get('https://url_on_html.com/')
beautify = BeautifulSoup(kompas.content, 'html5lib')

# attrs is written as a dict ({'class': ...}); the original used a set literal
news = beautify.find_all('div', {'class': 'jeg_block_container'})
arti = []

for each in news:
    title = each.find('h3', {'class': 'jeg_post_title'}).text
    lnk = each.a.get('href')
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text, 'html5lib')
    # Collect the text of every <p>, then join it into one string
    content = [i.text.strip() for i in soup.find_all('p')]
    content = ' '.join(content)
    # Collect the src of every <img> that has one
    images = [i['src'] for i in soup.find_all('img', src=True)]

    arti.append({
        'Headline': title,
        'Link': lnk,
        'image': images,
        'content': content
        })
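
The question also mentions an error when saving the image, but neither the saving code nor the error message is shown, so the following is only a sketch under assumptions. Two common causes are a relative src (such as image.png) that has to be resolved against the page URL before it can be downloaded, and writing the response as text instead of bytes. The helper names absolute_image_urls, filename_for and save_image, the images folder and the /post/1 path are all made up for illustration. The download uses the standard library's urllib so the sketch has no extra dependencies; requests.get(url).content would give you the same bytes.

```python
import posixpath
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


def absolute_image_urls(page_url, srcs):
    """Resolve each src against the page URL, skipping empty values."""
    return [urljoin(page_url, src) for src in srcs if src]


def filename_for(url, default="image"):
    """Use the last path segment of the URL as the local file name."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def save_image(url, folder="images"):
    """Download one image and write its raw bytes to folder/<name>."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    path = target / filename_for(url)
    with urlopen(url) as response:
        path.write_bytes(response.read())  # bytes, not text
    return path


# Hypothetical page URL and src values, for illustration only.
urls = absolute_image_urls(
    "https://url_on_html.com/post/1",
    ["image.png", "/img/a.jpg", None],
)
print(urls)
```

Inside the loop you would then call something like save_image(u) for each u in absolute_image_urls(lnk, images).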