Just two quick questions:

1- I want my final JSON file to replace the extracted text. For example, the extracted text is ADD TO CART, but I want it to appear as IN STOCK in my final JSON. Is that possible?

2- I would also like to add some custom data to my final JSON file that is not on the website, for example a "Store name" field, so that every product I scrape has the store name attached. Is that possible?

I am using both Portia and Scrapy, so suggestions for either platform are welcome.

My Scrapy spider code is below:

# __future__ imports must be the first statement in the module.
from __future__ import absolute_import

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule
from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
# Parentheses allow the import list to span two lines.
from ..utils.processors import (Item, Field, Text, Number, Price, Date, Url,
                                Image, Regex)
from ..items import PortiaItem


class Advent(BasePortiaSpider):
    name = "advent"
    allowed_domains = [u'www.adventgames.com.au']
    start_urls = [u'http://www.adventgames.com.au/c/4504822/1/all-games-a---k.html',
                  {u'url': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page=[1-5]',
                   u'fragments': [{u'valid': True,
                                   u'type': u'fixed',
                                   u'value': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page='},
                                  {u'valid': True,
                                   u'type': u'range',
                                   u'value': u'1-5'}],
                   u'type': u'generated'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=('.*'),
                deny=()
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                PortiaItem,
                None,
                u'.DataViewCell > form > table',
                [
                    Field(
                        u'Title',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text',
                        []),
                    Field(
                        u'Price',
                        'tr:nth-child(1) > td > .DataViewItemOurPrice *::text',
                        []),
                    Field(
                        u'Img_src',
                        'tr:nth-child(1) > td > .DataViewItemThumbnailImage > div > a > img::attr(src)',
                        []),
                    Field(
                        u'URL',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a::attr(href)',
                        []),
                    Field(
                        u'Stock',
                        'tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)',
                        [])])]]

1 Answer

I have never used the items class variable; it looks hard to read and understand.

I would suggest defining a callback method and parsing the response in it, like this:

def my_callback_func(self, response):
    # Each matching table holds one product.
    for row in response.css(".DataViewCell > form > table"):
        # Create a fresh item per product; the selector (row) and the item are separate objects.
        item = PortiaItem()

        item['Title'] = row.css('tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text').extract_first()
        item['Stock'] = row.css('tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)').extract_first()

        # Derive a custom value from the extracted button text.
        if item['Stock'] == "ADD TO CART":
            item['is_available'] = "YES"

        # ...... and so on

        yield item
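
Both points from the question (rewriting an extracted value and attaching a constant that is not on the page) can also be handled after extraction. Below is a minimal sketch, assuming the scraped item behaves like a dict. `STOCK_LABELS`, `STORE_NAME`, the `Store` key, `normalize_product` and `NormalizeProductPipeline` are hypothetical names that do not appear in the original code, and the store name value is only a placeholder.

```python
# Hypothetical mapping from the button text on the page to the label wanted in the JSON.
STOCK_LABELS = {"ADD TO CART": "IN STOCK"}

# Placeholder constant; this value does not come from the website.
STORE_NAME = "Example Store"


def normalize_product(item):
    """Rewrite the extracted stock text and attach the store name."""
    stock = item.get("Stock")
    # Unknown or missing values are left unchanged.
    item["Stock"] = STOCK_LABELS.get(stock, stock)
    item["Store"] = STORE_NAME
    return item


class NormalizeProductPipeline(object):
    """Scrapy item pipelines are plain classes with a process_item method."""

    def process_item(self, item, spider):
        return normalize_product(item)
```

To apply this to every item, the pipeline class would be listed under `ITEM_PIPELINES` in the project settings; alternatively, `normalize_product(item)` could be called just before `yield item` in the callback. If the item class only accepts a fixed set of declared fields, the `Store` field has to be declared on it first.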