
I am trying to scrape this website for academic purposes with Scrapy, using CSS/XPath selectors.

I need to select the details in the td elements of the table with id DataTables_Table_0. However, I am unable to select even the div element that contains the table, let alone the table data.

HTML block that I want to parse is

<div id="fund-selector-data">
    <div class=" ">
        <div id="DataTables_Table_0_wrapper" class="dataTables_wrapper no-footer">
            <div class="dataTables_scroll">
                <div class="dataTables_scrollHead"></div>
                <div class="dataTables_scrollBody" style="position: relative; overflow: auto; width: 100%;">
                    <table class="row-border dataTable table-snapshot no-footer" data-order="[]" cellspacing="0"
                        width="100%" id="DataTables_Table_0" role="grid" style="width: 100%;">
                        <thead>
                        </thead>
                        <tbody>
                            <tr role="row" class="odd">
                                <td><a href="/downloads/fund-card/38821" class="orange">PDF</a></td>
                                <td class=" text-left"><a
                                        href="/funds/38821/aditya-birla-sun-life-bal-bhavishya-yojna-direct-plan">ABSL Bal
                                        Bhavishya Yojna Dir</a>&nbsp;|&nbsp;<a class="invest-online-blink invest-online "
                                        target="_blank" href="/funds/invest-online-tracking/420/"
                                        data-amc="aditya-birla-sun-life-mutual-fund"
                                        data-fund="aditya-birla-sun-life-bal-bhavishya-yojna-direct-plan">Invest Online</a></td>
                                <td data-order="" class=" text-left">
                                    <div class="raterater-layer text-left test-fund-rating-star "><small>Unrated</small></div>
                                </td>
                                <td class=" text-left"><a
                                        href="/premium/?utm_medium=vro&amp;utm_campaign=premium-unlock&amp;utm_source=fund-selector">
                                        <div class="unlock-premium"></div>
                                    </a></td>
                            </tr>
                        </tbody>
                    </table>
                </div>
            </div>
        </div>
    </div>
</div>
The Scrapy CSS selectors are as follows:

# Selecting the table
response.css("#DataTables_Table_0")         # returns an empty list
# Selecting the div by class
response.css(".dataTables_scrollBody")      # returns an empty list
# Selecting the td elements
response.css("#DataTables_Table_0 tbody tr td a::text").getall()        # returns an empty list

I have also tried XPath to select the elements but got the same result. I have found that I am unable to select any element below the div with the empty class. I cannot comprehend why it does not work in this case. Am I missing anything? Any help will be appreciated.


1 Answer


The problem

It looks as though the elements you're trying to select are loaded via JavaScript through a separate API call. If you visit the page, the table shows the message:

Please wait while we are fetching data...

The Scrapy docs have a section about this, with their recommendation being to find the source of the dynamically loaded content and simulate these requests from your crawling code.

The solution

The data source can be found by looking at the XHR network tab in Chrome dev tools.

Locating the data table source.

In this case, it looks as though the data source for the table you're trying to parse is

https://www.valueresearchonline.com/funds/selector-data/primary-category/1/equity/?plan-type=direct&tab=snapshot&output=html-data

This seems to be a replica of the original URL, but with selector replaced by selector-data and an output=html-data query parameter appended.
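That rewrite is purely mechanical, so you can check it with nothing but the standard library. The snippet below applies the transformation to the page URL from the question (this is just a sketch of the rewrite, separate from the spider code further down):

```python
from urllib.parse import parse_qsl, urlencode, urlparse

page_url = ("https://www.valueresearchonline.com/funds/selector/"
            "primary-category/1/equity/?plan-type=direct&tab=snapshot")

parsed = urlparse(page_url)
# Append output=html-data to the existing query parameters
query = urlencode({**dict(parse_qsl(parsed.query)), "output": "html-data"})
# Swap "selector" for "selector-data" in the path
data_url = parsed._replace(
    path=parsed.path.replace("selector", "selector-data"),
    query=query,
).geturl()

print(data_url)
# https://www.valueresearchonline.com/funds/selector-data/primary-category/1/equity/?plan-type=direct&tab=snapshot&output=html-data
```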

This returns a JSON object with the following format:

{
    "title": ...,
    "tracking_url": ...,
    "tools_title": ...,
    "html_data": ...,
    "recordsTotal": ...
}

It looks as though html_data is the field you want, since it contains the dynamic table HTML you were originally after. You can now load this html_data and parse it as before.

In order to simulate all of this in your scraping code, simply add a parse_table method to your spider to handle the JSON response above. You might also want to generate the table data source URL dynamically based on the page you're currently scraping, so it's worth adding a method that edits the original URL as detailed above.

Example code

I'm not sure how you've set up your spider, so I've written a couple of methods that can be easily ported into whatever spider setup you're currently using.

import json
import scrapy
from scrapy.http import Request
from urllib.parse import urlparse, urlencode, parse_qsl

class TableSpider(scrapy.Spider):
    name = 'tablespider'
    start_urls = ['https://www.valueresearchonline.com/funds/selector/primary-category/1/equity/?plan-type=direct&tab=snapshot']

    def _generate_table_endpoint(self, base_url):
        """Dynamically generate the table data endpoint."""
        # Parse the base url
        parsed = urlparse(base_url)
        
        # Add output=html-data query param
        current_params = dict(parse_qsl(parsed.query))
        new_params = {'output': 'html-data'}
        merged_params = urlencode({**current_params, **new_params})
        
        # Update path to get selector data
        data_path = parsed.path.replace('selector', 'selector-data')
        
        # Update the URL with the new path and query params
        parsed = parsed._replace(path=data_path, query=merged_params)
        
        return parsed.geturl()

    def parse(self, response):
        # Any pre-request logic goes here
        # ...
        
        # Request and parse the table data source
        yield Request(
            self._generate_table_endpoint(response.url),
            callback=self.parse_table
        )
        
    def parse_table(self, response):
        try:
            # Load the JSON response into a dict
            res = json.loads(response.text)
            # Get the html_data value (containing the dynamic table HTML)
            table_html = res['html_data']
        except (json.JSONDecodeError, KeyError) as err:
            raise ValueError('No table data present.') from err

        # Your table data extraction goes here...
        # ===========================================================

        yield {'table_data': 'your response data'}