Web Scraping Needs

I want to scrape the titles of the events on the first page of an Eventbrite search (link here).

Approach

Since the page does not have much JavaScript and the pagination is simple, grabbing the titles for every event on the page is quite easy, and I have no problems doing this.

However, I see there's an API, and I want to re-engineer its HTTP requests for efficiency and more structured data.

Problem

I'm able to mimic the HTTP request using the requests Python package, with the correct headers, cookies and parameters. Unfortunately, when I use the same cookies with Scrapy, it complains about three keys in the cookie dictionary that are blank: 'mgrefby': '', 'ebEventToTrack': '', 'AN': ''. Yet the same keys are blank in the HTTP request sent with the requests package, and it works there.

Requests Package Code Example

import requests

cookies = {
    'mgrefby': '',
    'G': 'v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09',
    'ebEventToTrack': '',
    'eblang': 'lo%3Den_US%26la%3Den-us',
    'AN': '',
    'AS': '50c57c08-1f5b-4e62-8626-ea32b680fe5b',
    'mgref': 'typeins',
    'client_timezone': '%22Europe/London%22',
    'csrftoken': '85d167cac78111ea983bcbb527f01d2f',
    'SERVERID': 'djc9',
    'SS': 'AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g',
    'SP': 'AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA',
}

headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'X-CSRFToken': '85d167cac78111ea983bcbb527f01d2f',
    'Content-Type': 'application/json',
    'Accept': '*/*',
    'Origin': 'https://www.eventbrite.com',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.eventbrite.com/d/ny--new-york/human-resources/?page=2',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = '{"event_search":{"q":"human resources","dates":"current_future","places":\n["85977539"],"page":1,"page_size":20,"online_events_only":false,"client_timezone":"Europe/London"},"expand.destination_event":["primary_venue","image","ticket_availability","saves","my_collections","event_sales_status"]}'

response = requests.post('https://www.eventbrite.com/api/v3/destination/search/', headers=headers, cookies=cookies, data=data)

Scrapy Code Example

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['eventbrite.com']
    start_urls = []

    cookies = {
        'mgrefby': '',
        'G': 'v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09',
        'ebEventToTrack': '',
        'eblang': 'lo%3Den_US%26la%3Den-us',
        'AN': '',
        'AS': '50c57c08-1f5b-4e62-8626-ea32b680fe5b',
        'mgref': 'typeins',
        'client_timezone': '%22Europe/London%22',
        'csrftoken': '85d167cac78111ea983bcbb527f01d2f',
        'SERVERID': 'djc9',
        'SS': 'AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g',
        'SP': 'AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA',
    }

    headers = {
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'X-CSRFToken': '85d167cac78111ea983bcbb527f01d2f',
        'Content-Type': 'application/json',
        'Accept': '*/*',
        'Origin': 'https://www.eventbrite.com',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    data = '{"event_search":{"q":"human resources","dates":"current_future","places":\n["85977539"],"page":1,"page_size":20,"online_events_only":false,"client_timezone":"Europe/London"},"expand.destination_event":["primary_venue","image","ticket_availability","saves","my_collections","event_sales_status"]}'


    def start_requests(self):
        url = 'https://www.eventbrite.com/api/v3/destination/search/'
        yield scrapy.Request(
            url=url,
            method='POST',
            headers=self.headers,
            cookies=self.cookies,
            callback=self.parse,
        )

    def parse(self, response):
        print('request')

Output

2020-08-01 11:55:33 [scrapy.core.engine] INFO: Spider opened
2020-08-01 11:55:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 11:55:33 [test] INFO: Spider opened: test
2020-08-01 11:55:33 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Aaron\projects\scrapy\eventbrite\.scrapy\httpcache
2020-08-01 11:55:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'mgrefby', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'ebEventToTrack', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'AN', 'value': ''} ('value' is missing)   
2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.com/api/v3/destination/search/> (referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) ['cached']   
2020-08-01 11:55:33 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.eventbrite.com/api/v3/destination/search/>: HTTP status code is not handled or not allowed
2020-08-01 11:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-01 11:55:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1540,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 32163,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/401': 1,
 'elapsed_time_seconds': 0.187986,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 1, 10, 55, 33, 202931),
 'httpcache/hit': 2,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/401': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 12,
 'log_count/WARNING': 3,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 8, 1, 10, 55, 33, 14945)}
2020-08-01 11:55:33 [scrapy.core.engine] INFO: Spider closed (finished)

Attempts to solve issue

The 401 status relates to authorisation, so I can only presume the server doesn't like the cookies I'm sending.

  1. I've set COOKIES_ENABLED = True with the same output as before
  2. I've set COOKIES_DEBUG = True and see output below

Output with COOKIES_DEBUG = True

2020-08-01 12:05:15 [scrapy.core.engine] INFO: Spider opened
2020-08-01 12:05:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 12:05:15 [test] INFO: Spider opened: test
2020-08-01 12:05:15 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Aaron\projects\scrapy\eventbrite\.scrapy\httpcache
2020-08-01 12:05:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 12:05:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'mgrefby', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'ebEventToTrack', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'AN', 'value': ''} ('value' is missing)   
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST https://www.eventbrite.com/api/v3/destination/search/>
Cookie: G=v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09; eblang=lo%3Den_US%26la%3Den-us; AS=50c57c08-1f5b-4e62-8626-ea32b680fe5b; mgref=typeins; client_timezone=%22Europe/London%22; csrftoken=85d167cac78111ea983bcbb527f01d2f; SERVERID=djc9; SS=AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g; SP=AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA

2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <401 https://www.eventbrite.com/api/v3/destination/search/>
Set-Cookie: SP=AGQgbbno_KHLNiLzDpLHcdI4kotUbRiTxMMY5N0t7VudPU_QGCm2Q0nH7-J99aoRZvGLxXfREH5YfPAtK52iiiLcEpnjh1G43ZBxKuo9qvJHykLV23ZIjaFK0iIr6ptOaczMoQhkaqE-7nJ8t2Ykt18CN196pKZ5QhFuXy6SnspZ0toEGChZcQgmrAAAVPfuoiiUmbTG_wJC8_KikL2sYl2s6-KWUOOpjRFJCko5RGgiyC2Osu9vxZ8; Domain=.eventbrite.com; httponly; Path=/; secure

Set-Cookie: G=v%3D2%26i%3D5cebebd2-2a7f-4638-9912-0abf19111a0c%26a%3Dd33%26s%3Df967e32d15dda2f06b392f22451af935d93f88d1; Domain=.eventbrite.com; expires=Sat, 31-Jul-2021 22:46:28 GMT; httponly; Path=/; secure     

Set-Cookie: ebEventToTrack=; Domain=.eventbrite.com; expires=Sun, 30-Aug-2020 22:46:28 GMT; httponly; Path=/; secure

Set-Cookie: SS=AE3DLHRgTIL46n9XiOZiJRSkccGnNXSMkA; Domain=.eventbrite.com; httponly; Path=/; secure

Set-Cookie: eblang=lo%3Den_US%26la%3Den-us; Domain=.eventbrite.com; expires=Sat, 31-Jul-2021 22:46:28 GMT; httponly; Path=/; secure

Set-Cookie: AN=; Domain=.eventbrite.com; expires=Sun, 30-Aug-2020 22:46:28 GMT; httponly; Path=/; secure

Set-Cookie: AS=350def0c-ed27-45ab-b12c-02e9fb68a8ae; Domain=.eventbrite.com; httponly; Path=/; secure

Set-Cookie: SERVERID=djc44; path=/; HttpOnly; Secure

2020-08-01 12:05:15 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.com/api/v3/destination/search/> (referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) ['cached']
2020-08-01 12:05:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.eventbrite.com/api/v3/destination/search/>: HTTP status code is not handled or not allowed
2020-08-01 12:05:15 [scrapy.core.engine] INFO: Closing spider (finished)

  3. I've tried a custom Scrapy cookies downloader middleware for cookie persistence, and again got the same error as before
  4. I've considered using browser automation to grab a cookie, but I'm wary of this because in future scrapes I don't want to have to continuously grab a cookie.

What I don't understand is that with the same cookies, headers and parameters, the requests Python package returns the JSON object, whereas Scrapy complains about blank dictionary values.

I would be grateful if anyone could look at the code to see whether I've made a glaring mistake, or explain why cookies that the API endpoint accepts via requests do not seem to work in Scrapy.

2 Answers

1
votes

It looks like they're testing "not value" instead of the more accurate "value is not None", so an empty string is treated as a missing value. Opening an issue is your only long-term recourse, but subclassing the cookie middleware is the short-term, non-hacky fix.
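
The difference between the two checks can be sketched in plain Python. This is only an illustration of the logic, not Scrapy's actual code: the function names has_value_truthy, has_value_not_none and build_cookie_header are hypothetical, and the header is built with a simple join rather than through a cookie jar.

```python
def has_value_truthy(cookie):
    """Truthiness check: an empty string counts as a missing value."""
    return bool(cookie.get("value"))


def has_value_not_none(cookie):
    """Identity check: only an absent (None) value counts as missing."""
    return cookie.get("value") is not None


def build_cookie_header(cookies, is_valid):
    """Join name=value pairs for every cookie that passes is_valid."""
    pairs = [
        f"{name}={value}"
        for name, value in cookies.items()
        if is_valid({"name": name, "value": value})
    ]
    return "; ".join(pairs)


# Two of the cookies from the question, one blank and one not.
cookies = {"AN": "", "mgref": "typeins"}

print(build_cookie_header(cookies, has_value_truthy))    # mgref=typeins
print(build_cookie_header(cookies, has_value_not_none))  # AN=; mgref=typeins
```

With the truthiness check the blank AN cookie is dropped from the header, which matches the Cookie line in the question's debug output; with the identity check it is kept.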

A hacky fix is to take advantage of the fact that they're not escaping the cookie value correctly when doing the '; '.join(). You can set the cookie's value to a legal cookie directive (I chose HttpOnly, since you're not concerned about JS); cookiejar appears to discard it, yielding the actual value you care about:

>>> from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
>>> from scrapy.http import Request
>>> cm = CookiesMiddleware(debug=True)
>>> req = Request(url='https://www.example.com', cookies={'AN': '; HttpOnly', 'alpha': 'beta'})
>>> cm.process_request(req, spider=None)
2020-08-01 15:08:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.example.com>
Cookie: AN=; alpha=beta
>>> req.headers
{b'Cookie': [b'AN=; alpha=beta']}
0
votes

Adding on to Mdaniel's response: I've opened an issue, since we've run into the same problem, and I've referenced your Stack Overflow thread.

Our current solution is to use an older version of Scrapy (2.2.0 or earlier), since 2.3.0, the latest release, is where this cookie check was added: https://github.com/scrapy/scrapy/commit/f6ed5edc31e7cc66225c0860e1534a6230511954 (scrapy/downloadermiddlewares/cookies.py, line 78).

Here's the issue if you wanna add anything I've left out. https://github.com/scrapy/scrapy/issues/4766
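
If it helps, here is a minimal sketch for checking whether a given version string falls before that boundary. It relies only on the claim above that the check first appeared in 2.3.0, and it says nothing about later releases. The helper names parse_version and predates_cookie_check are hypothetical, and the parser assumes plain dotted integers (no suffixes such as "rc1").

```python
def parse_version(version):
    """Turn '2.3.0' into (2, 3, 0); assumes plain dotted integers."""
    return tuple(int(part) for part in version.split("."))


# Assumption taken from the answer above: the check was added in 2.3.0.
FIRST_VERSION_WITH_CHECK = (2, 3, 0)


def predates_cookie_check(version):
    """Return True if this version is older than the one that added the check."""
    return parse_version(version) < FIRST_VERSION_WITH_CHECK


for candidate in ("2.2.0", "2.3.0"):
    print(candidate, predates_cookie_check(candidate))
```

In a project, the string passed in would typically be scrapy.__version__.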