0 votes

I'm trying to get the URL, or href, from a webpage using web scraping, specifically using Scrapy. However, it returns an empty list when I run response.xpath('XPATH').extract() on the href link. Inspecting the webpage shows the HTML structure; the specific element whose href I'm trying to get is:

<a href="#2020-38970" class="redNoticeItem__labelLink" data-singleurl="https://ws-public.interpol.int/notices/v1/red/2020-38970">MAGOMEDOVA<br>MADINA</a>

The result of the xpath command is an empty list.

For context, I'm trying to get to each person's URL and extract the information there, but I'm unable to retrieve the href from the web page.

I copied the full XPath of the HTML element, and it's: /html/body/div[1]/div[1]/div[6]/div/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div[2]/div[1]/a.

But this still returns [] when I run the response.xpath command.
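For reference, a stripped-down version of the spider I'm running looks roughly like this (the spider name and start URL here are placeholders, not the real ones):

import scrapy

class RedNoticeSpider(scrapy.Spider):
    # placeholder name and start URL
    name = "rednotice_example"
    start_urls = ["https://example.com/red-notices"]

    def parse(self, response):
        # the copied full XPath -- this call comes back as an empty list
        hrefs = response.xpath("/html/body/div[1]/div[1]/div[6]/div/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div[2]/div[1]/a").extract()
        self.logger.info(hrefs)  # logs []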

2 Comments
When you have text output, don't take a picture; copy and paste the output into your post. The HTML can be copied as well with right click -> Copy as outerHTML. – Gilles Quenot
With Google Chrome you can right click on the page, choose Inspect, and get the XPath value for the focused element from the context menu. – boly38

2 Answers

2 votes

In this situation I personally wouldn't use XPath; I wouldn't even use Scrapy. I believe the simplest solution would be to use BeautifulSoup and requests together.

from bs4 import BeautifulSoup  # BeautifulSoup is imported from the bs4 package
import requests

url = YOUR_URL_HERE
# download the page and parse the HTML
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# grab every <a> tag, then pull out the href attribute where one exists
links = soup.find_all('a')
urls = [x['href'] for x in links if x.has_attr('href')]

This code will give you the href of every link on the page in a list, and you can filter the list further by class or anything else you need.
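For example, to keep only the notice links from the question's markup, you could filter by the class attribute (a sketch, assuming the same class name and the placeholder URL as above):

from bs4 import BeautifulSoup
import requests

url = YOUR_URL_HERE  # placeholder, as above
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# restrict the search to anchors carrying the class shown in the question
notice_links = soup.find_all('a', class_='redNoticeItem__labelLink')
# data-singleurl holds the full notice URL in the question's markup; fall back to href
notice_urls = [a.get('data-singleurl') or a.get('href') for a in notice_links]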

0 votes

You can simply use response.xpath("//a[@class='redNoticeItem__labelLink']").extract()
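If you want the href value itself rather than the whole element, you can select the attribute in the same expression (a sketch to try in the Scrapy shell; the attribute names come from the markup in the question):

response.xpath("//a[@class='redNoticeItem__labelLink']/@href").extract()
# or the data-singleurl attribute, which holds the full notice URL
response.xpath("//a[@class='redNoticeItem__labelLink']/@data-singleurl").extract()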