3
votes

I want to scrape a few websites and see if the word "katalog" is present there. If yes, I want to retrieve the links of all the tabs/sub-pages where that word is present. Is it possible to do so?

I tried following this tutorial, but the wordlist.csv I get at the end is empty even though the word "catalog" does exist on the website.

https://www.phooky.com/blog/find-specific-words-on-web-pages-with-scrapy/

wordlist = [
    "katalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    allowed_domains = ["www.reichelt.com/"]
    start_urls = ["https://www.reichelt.com/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0                                 

    def check_buzzwords(self, response):

        self.__class__.crawl_count += 1

        crawl_count = self.__class__.crawl_count

        url = response.url
        contenttype = response.headers.get("content-type", "").decode('utf-8').lower()
        data = response.body.decode('utf-8')

        for word in wordlist:
            substrings = find_all_substrings(data, word)
            print("substrings", substrings)
            for pos in substrings:
                ok = False
                if not ok:
                    self.__class__.words_found += 1
                    print(word + ";" + url + ";")
        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []

How can I find all instances of a word on a website and obtain the link of the page where the word is found?

2
You send an empty item (return Item()), so you get an empty file. You should at least yield a dictionary with the data inside the for loop - like yield {"word": word, "url": url}. - furas
I don't understand why you use __class__. And you could create wordlist once at the beginning, even outside the class - there is no need to create the same list again and again. Likewise, you could put import re at the beginning - there is no need to import it again and again, and when all imports are at the top, other people can see what modules are needed to run this code. - furas
But first you should turn off JavaScript in your web browser and load the page - you will see what scrapy can actually get from it, because scrapy can't run JavaScript. If the page uses JavaScript to add items, then you will need Selenium or Splash to control a web browser that can run JavaScript. See Scrapy-Selenium and Scrapy-Splash. - furas
This page shows me the text in English and it doesn't have katalog but catalog. I have to use https://www.reichelt.com/?LANGUAGE=PL to get the page in Polish with katalog. - furas
I went through scrapy-selenium but I don't really get how to use it in my case. Can I add a step to my existing code such that JavaScript is handled first and then we look for the words? Also, I tried using "https://www.reichelt.com/?LANGUAGE=PL" in my existing code, but I don't see any print statements for the substrings. @furas - x89
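
A minimal sketch of how scrapy-selenium is typically wired in, assuming Firefox and geckodriver are installed (the spider name, URL and word list below are only placeholders):

from shutil import which

import scrapy
from scrapy_selenium import SeleniumRequest


class JsWebsiteSpider(scrapy.Spider):
    name = "js_webcrawler"

    # scrapy-selenium is configured through settings
    custom_settings = {
        "SELENIUM_DRIVER_NAME": "firefox",
        "SELENIUM_DRIVER_EXECUTABLE_PATH": which("geckodriver"),
        "SELENIUM_DRIVER_ARGUMENTS": ["-headless"],
        "DOWNLOADER_MIDDLEWARES": {"scrapy_selenium.SeleniumMiddleware": 800},
    }

    def start_requests(self):
        # SeleniumRequest loads the page in a real browser, so JavaScript runs
        yield SeleniumRequest(url="https://www.reichelt.com/?LANGUAGE=PL",
                              callback=self.parse)

    def parse(self, response):
        # response.text is the page HTML after JavaScript has executed
        for word in ["katalog", "download"]:
            if word in response.text:
                yield {"word": word, "url": response.url}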

2 Answers

2
votes

The main problem is a wrong allowed_domains - it has to be without the path /:

    allowed_domains = ["www.reichelt.com"]

Another problem can be that this tutorial is 3 years old (it links to the documentation for Scrapy 1.5, while the newest version is 2.5.0).

It also uses some useless lines of code.

It gets contenttype but never uses it to decode response.body. Your URL uses iso8859-1 for the original language and utf-8 for ?LANGUAGE=PL, but you can simply use response.text and it will be decoded automatically.

It also sets ok = False and checks it later, which is totally useless.


Minimal working code - you can copy it to a single file and run it as python script.py without creating a project.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    return [match.start() for match in re.finditer(re.escape(sub), string)]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    #start_urls = ["https://www.reichelt.com/?LANGUAGE=PL"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    #crawl_count = 0
    #words_found = 0                                 

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        #self.crawl_count += 1

        #content_type = response.headers.get("content-type", "").decode('utf-8').lower()
        #print('content_type:', content_type)
        #data = response.body.decode('utf-8')
        
        data = response.text

        for word in wordlist:
            print('[check_buzzwords] check word:', word)
            substrings = find_all_substrings(data, word)
            print('[check_buzzwords] substrings:', substrings)
            
            for pos in substrings:
                #self.words_found += 1
                # only display
                print('[check_buzzwords] word: {} | pos: {} | sub: {} | url: {}'.format(word, pos, data[pos-20:pos+20], response.url))
                # send to file
                yield {'word': word, 'pos': pos, 'sub': data[pos-20:pos+20], 'url': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start() 

EDIT:

I added data[pos-20:pos+20] to the yielded data to see where the substring is. Sometimes it is in a URL like .../elements/adw_2018/catalog/... or somewhere else like <img alt="catalog">, so using a regex doesn't have to be a good idea. It may be better to use an xpath or css selector to search for the text only in certain places, or only in links.
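
For example, restricting the search to visible text nodes instead of the raw HTML could look roughly like this (a sketch of a possible replacement for the loop in check_buzzwords; the exact XPath is only an example):

    def check_buzzwords(self, response):
        # look only at text nodes inside <body>, not at URLs or attributes
        for word in wordlist:
            texts = response.xpath('//body//text()[contains(., $word)]', word=word).getall()
            for text in texts:
                yield {'word': word, 'text': text.strip(), 'url': response.url}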


EDIT:

A version which searches for links containing the words from the list. It uses response.xpath to find all links and then checks whether the word is in the href, so it doesn't need a regex.

One problem can be that it treats a link with -downloads- (with s) as a link with both the word download and downloads, so it would need a more precise check (e.g. using a regex; see the sketch after the code below) to treat it only as a link with the word downloads.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        links = response.xpath('//a[@href]')
        
        for word in wordlist:
            
            for link in links:
                url = link.attrib.get('href')
                if word in url:
                    print('[check_buzzwords] word: {} | url: {} | page: {}'.format(word, url, response.url))
                    # send to file
                    yield {'word': word, 'url': url, 'page': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start() 
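
A possible way to avoid the download/downloads overlap mentioned above is a word-boundary check instead of the plain in test - a small sketch (the \b boundaries are an assumption and would need adjusting if words can be glued to _ in URLs):

import re

def url_has_word(url, word):
    # \b needs a non-word character (or start/end of string) on both sides,
    # so "download" no longer matches inside "-downloads-"
    return re.search(r'\b' + re.escape(word) + r'\b', url) is not None

# inside check_buzzwords the test `if word in url:` would become:
#     if url_has_word(url, word):

print(url_has_word('/en/elements/-downloads-/page.html', 'download'))   # False
print(url_has_word('/en/elements/-downloads-/page.html', 'downloads'))  # True
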
1
votes

You can do it with requests-html and rendering the page:

from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.reichelt.com/"

r = session.get(url)
r.html.render(sleep=2)

if "your_word" in r.html.text: #or r.html.html if you want it in raw html
    print([link for link in r.html.absolute_links if "your_word" in link])
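
If you also need the links of the sub-pages where the word appears in the page text, not only in the link itself, one possible extension is to visit each internal link and check its rendered text - a rough sketch (slow, because every page is rendered; the word and start URL are the ones from the question):

from requests_html import HTMLSession

session = HTMLSession()
start_url = "https://www.reichelt.com/"
word = "katalog"

r = session.get(start_url)
r.html.render(sleep=2)

found = []
for link in r.html.absolute_links:
    if not link.startswith(start_url):
        continue  # stay on the same site
    try:
        sub = session.get(link)
        sub.html.render(sleep=2)
    except Exception:
        continue  # skip pages that fail to load or render
    if word in sub.html.text:
        found.append(link)

print(found)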