Long time lurker, first time poster. I spent some time looking over related questions but I still couldn't seem to figure this out. I think it's easy enough but please forgive me, I'm still a bit of a BeautifulSoup/python n00b.

I have a text file of URLs from a previous web-scraping exercise. For each URL, I'd like to fetch the page and extract the text contents of a list item (<li>) that matches a given keyword. I want to save a CSV file with the URL in one column and the corresponding list item contents in the second column.

Given some html:

...

<li>
<span class = "spClass">Breakfast</span> " — "
<a href="/examplepage/Pancakes" class="linkClass">Pancakes</a>
</li>

<li>
<span class = "spClass">Lunch</span> " — "
<a href="/examplepage/Sandwiches" class="linkClass">Sandwiches</a>
</li>

<li>
<span class = "spClass">Dinner</span> " — "
<a href="/examplepage/Stew" class="linkClass">Stew</a>
</li>

etc etc etc

...

My code so far is something like:

import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
kw = "Dinner"

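with open("urls.txt") as file:
    for line in file:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        # string=kw only matches a tag whose sole child is that text, so an <li>
        # with child tags never matches; find the <span> and read its parent <li>.
        span = soup.find("span", string=kw)
        li = span.find_parent("li") if span else None
        item = li.get_text(" ", strip=True) if li else None
        # Store a (url, item) tuple; url + list raises a TypeError.
        results.append((url, item))

print(results)
df = pd.DataFrame(results, columns=["url", "item"])
df.to_csv("mylist1.csv", index=False)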

I'm hoping for output where column 1 is the URL from the text file and column 2 is "Dinner — Stew" (or just "Stew"), with one row per page in the list, since these items vary from page to page. I'll change the keyword to extract different list items as needed. In the end I'd like one big spreadsheet with each URL and its corresponding item side by side, something like below:

example:
url | Stew
url | Hamburgers
url | Hamburgers
url | Chicken
url | Hamburgers
url | Chicken
url | Hamburgers
url | Curry
etc etc etc
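
To make the "just Stew" variant concrete, here is a minimal sketch of the lookup run against the sample markup above instead of a live page. The helper name `extract_item` and the `SAMPLE_HTML` constant are made up for illustration, the separator is written as an `&mdash;` entity, and it assumes every page uses the same `spClass`/`linkClass` structure as the sample.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; wrapped in <ul> so it is a complete fragment.
SAMPLE_HTML = """
<ul>
<li><span class="spClass">Breakfast</span> &mdash; <a href="/examplepage/Pancakes" class="linkClass">Pancakes</a></li>
<li><span class="spClass">Lunch</span> &mdash; <a href="/examplepage/Sandwiches" class="linkClass">Sandwiches</a></li>
<li><span class="spClass">Dinner</span> &mdash; <a href="/examplepage/Stew" class="linkClass">Stew</a></li>
</ul>
"""


def extract_item(html, kw):
    """Return the link text of the <li> whose <span> text equals kw, or None.

    Hypothetical helper: assumes the keyword sits in a <span class="spClass">
    and the item name in an <a class="linkClass"> inside the same <li>.
    """
    soup = BeautifulSoup(html, "html.parser")
    # The <span> has a single text child, so string=kw can match it directly.
    span = soup.find("span", class_="spClass", string=kw)
    if span is None:
        return None
    li = span.find_parent("li")
    link = li.find("a", class_="linkClass") if li else None
    return link.get_text(strip=True) if link else None


print(extract_item(SAMPLE_HTML, "Dinner"))
```

With the sample above this prints Stew, and a keyword that is not on the page gives None, which would show up as an empty cell in the spreadsheet.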

Thanks for your help! Cheers.