0
votes

I am using Scrapy to extract data from the web. I am trying to extract the text of the anchor tags under a span tag, as shown below:

<span>.....</span>
<span id = "size_selection_list">
    <a>....</a>
    <a>....</a>
    .
    .
    .
    <a>....</a>
</span>

I am using the following xpath logic:

t = sel.xpath('//div[starts-with(@id,"size_selection_container")]/span[2]')
for x in t.xpath('.//a'):
....

The problem is that the span element is reached, but the <a> tags are not iterated. What is the mistake here? Also, each <a> has an href that contains JavaScript. Could that be the reason for the problem?

2
Your logic works with the sample HTML you provided: pastebin.com/hxSZ041j . So either you're not sharing your code as it is, or the sample HTML is not what you are working with. - paul trmbrth
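
The comment's point can be checked locally. The following is only a minimal sketch: it uses lxml directly instead of a Scrapy selector, and it assumes the spans sit inside a div whose id starts with size_selection_container (the question's XPath implies this, but the wrapper markup, link text, and href values here are made up for illustration).

```python
import lxml.html

# Hypothetical markup: the wrapper div and the anchor contents are assumptions,
# not taken from the real page.
SAMPLE = '''
<div id="size_selection_container_1">
  <span>.....</span>
  <span id="size_selection_list">
    <a href="javascript:void(0)">S</a>
    <a href="javascript:void(0)">M</a>
    <a href="javascript:void(0)">L</a>
  </span>
</div>
'''

doc = lxml.html.fromstring(SAMPLE)

# Same two-step logic as in the question: find the second span, then its anchors.
spans = doc.xpath('//div[starts-with(@id,"size_selection_container")]/span[2]')
texts = [a.text_content().strip() for span in spans for a in span.xpath('.//a')]
print(texts)
```

In this sketch the anchors are found even though their href values are javascript: URLs, because XPath matches on the parsed tree and not on what an attribute contains. If the same expression returns nothing against the live response, one possibility worth checking is that the anchors are added by JavaScript in the browser and are not present in the HTML that Scrapy downloads.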

2 Answers

0
votes

If I were you, I would use requests and BeautifulSoup4.

Please note, this code is untested, but it should work.

import requests
from bs4 import BeautifulSoup

url = 'yoururlhere'  # placeholder: replace with the URL of the page
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # You can use lxml or another parser; the standard parser is used here for compatibility
# Look up the span by the id shown in the question
span = soup.find('span', {'id': 'size_selection_list'})
tags = span.find_all('a', href=True)  # only anchors that have an href attribute
for i in tags:
    print(i.get_text())  # the link text
    print(i['href'])  # the link target

I hope this works; please let me know.

0
votes

This is all I can do, because your HTML is not complete.

import lxml.html
string = '''<span>.....</span>
<span id = "size_selection_list">
    <a>....</a>
    <a>....</a>
    .
    .
    .
    <a>....</a>
</span>'''

html = lxml.html.fromstring(string)
for a in html.xpath('//span[@id="size_selection_list"]//a'):
    print(a.tag)

Output:

a
a
a
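
To get the link text and href instead of the tag name, the same id-based XPath can be reused. This is only a sketch: the anchor text and the javascript: href values are invented for illustration, since the question does not show them.

```python
import lxml.html

# Hypothetical anchors: the link text and href values are assumptions.
html = lxml.html.fromstring('''<span>.....</span>
<span id="size_selection_list">
    <a href="javascript:pick('S')">Small</a>
    <a href="javascript:pick('M')">Medium</a>
</span>''')

# Collect (text, href) pairs for every anchor under the span with the known id.
links = [(a.text_content().strip(), a.get('href'))
         for a in html.xpath('//span[@id="size_selection_list"]//a')]
for text, href in links:
    print(text, href)
```

Selecting the span by its id avoids depending on its position among sibling spans, which span[2] does.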