Extracting information from nested div and span tags using Scrapy

Question

I am trying to build a web crawler with Scrapy (Python) that extracts the information Google shows on the right side of the results page when you run a search, for example:

I want to extract the information in the box on the right side

The link is: search in google

The source code: source code

Part of the HTML code is:

<div class="g rhsvw kno-kp mnr-c g-blk" lang="es-419" data-hveid="CAoQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQjh8oAHoECAoQAA">
    <div class="kp-blk knowledge-panel Wnoohf OJXvsb" data-hveid="CAoQAQ" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQww0oAHoECAoQAQ">
        <div class="xpdopen">
            <div class="ifM9O">
                <div>
                    <div></div>
                </div>
                <div data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ_xd6BAgKEAI">
                    <div class="kp-header" lang="es-419" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ3z56BAgEEAA">
                        <div lang="es-419">
                            <h2 class="bNg8Rb">Resultado del Gráfico de conocimiento
                            </h2>
                        </div>
                        <div class="kp-hc">
                            <div class="NFQFxe Hhmu2e viOShc LKPcQc mod" data-md="16" lang="es-419" style="clear:none" data-hveid="CAQQAQ" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQhygoADAbegQIBBAB">
                                <!--m-->
                                <div class="Ftghae iirjIb">
                                    <div class="rsir2d">
                                        <kno-share-button>
                                            <div jsaction="r._HouY4r6utk" data-rtid="iHUQypqXTr0Q" jsl="$t t-dhmk9MkDbvI;$x 0;" class="r-iHUQypqXTr0Q" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ-YABKAAwG3oECAQQAg"><span class="JP8rKe r8U5xb z1asCe Fp7My" aria-label="Compartir" role="button" tabindex="0"><svg focusable="false" xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M18 16.08c-.76 0-1.44.3-1.96.77L8.91 12.7c.05-.23.09-.46.09-.7s-.04-.47-.09-.7l7.05-4.11c.54.5 1.25.81 2.04.81 1.66 0 3-1.34 3-3s-1.34-3-3-3-3 1.34-3 3c0 .24.04.47.09.7L8.04 9.81C7.5 9.31 6.79 9 6 9c-1.66 0-3 1.34-3 3s1.34 3 3 3c.79 0 1.5-.31 2.04-.81l7.12 4.16c-.05.21-.08.43-.08.65 0 1.61 1.31 2.92 2.92 2.92 1.61 0 2.92-1.31 2.92-2.92s-1.31-2.92-2.92-2.92z"></path></svg></span>
                                                <div style="display:none" class="iHUQypqXTr0Q-YbcQq9Khf_8 r-im11Tgib5Xfc" jsaction="dg_dismissed:r.-FPnppROon0;kno_shr_close_button_clicked:r.giXQqEBMb3E" data-rtid="im11Tgib5Xfc" jsl="$t t-7hzFN84w9_k;$x 0;" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ2poBMBt6BAgEEAM">
                                                    <g-dialog class="im11Tgib5Xfc-0078sLar460 r-iuKAMqdareQ0" data-id="_RWTdXKfnLs_EswXNnaCQDw4" jsaction="dg_reg_content:r.J_j78ao4uyM" data-rtid="iuKAMqdareQ0" jsl="$t t-cuCqGEujB5w;$x 0;">
                                                        <div class="iuKAMqdareQ0-oPwtUFSp9U8" id="_RWTdXKfnLs_EswXNnaCQDw4" jsaction="dg_close:r.99yxp2ZuQP0;r.nUlQmbHCUts" data-rtid="iuKAMqdareQ0" jsl="$x 4;"></div>
                                                    </g-dialog>
                                                </div>
                                                <div style="display:none" class="iHUQypqXTr0Q--9_AnHJXi80" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQhc0CMBt6BAgEEAk"></div>
                                            </div>
                                        </kno-share-button>
                                    </div>
                                    <div class="SPZz6b">
                                        <div class="kno-ecr-pt kno-fb-ctx gsmt" data-local-attribute="d3bn" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ3B0oATAbegQIBBAK"><span>La Cuarta</span></div>
                                        <div class="wwUB2c kno-fb-ctx"><span data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ2kooAjAbegQIBBAL">Periódico</span></div>
                                    </div>
                                </div>
                                <!--n-->
                            </div><i class="GdltXd r-i5fJ88MOldfA" style="display:none" jsl="$t t-izLg50Mkmp4;$x 0;"></i></div>
                    </div>
                    <div class="SALvLe farUxc mJ2Mod">
                        <div class="i4J0ge">
                            <div class="mod" data-md="50" lang="es-419" style="clear:none" data-hveid="CAUQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQkCkwHHoECAUQAA">
                                <!--m-->
                                <div class="PZPZlf hb8SAc kno-fb-ctx" data-attrid="description" data-hveid="CAUQAQ" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQziAoADAcegQIBRAB">
                                    <div jsl="$t t-oF0h478wPRI;$x 0;" class="r-igZyUtaLvb3g">
                                        <div class="kno-rdesc r-iNUajC5fIXTY" jsaction="sngtp:r.Eddvt4h-GI8;tp_btn:r.Eddvt4h-GI8" data-rtid="iNUajC5fIXTY" jsl="$t t-JgTEvN6zUII;$x 0;">
                                            <div>
                                                <h3 class="bNg8Rb">Descripción</h3><span>La Cuarta es un periódico chileno de circulación nacional diaria, editado por el consorcio Copesa. Su primer número fue publicado el 13 de noviembre de 1984. Su eslogan hasta 2017 fue El diario popular.</span><span><span> </span><a class="q ruhjFe NJLBac fl" href="https://es.wikipedia.org/wiki/La_Cuarta" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQmhMwHHoECAUQAg" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://es.wikipedia.org/wiki/La_Cuarta&amp;ved=2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQmhMwHHoECAUQAg">Wikipedia</a></span>
                                            </div>
                                        </div>
                                    </div>
                                </div>

I saw that the information I want is nested inside many div tags and ends up as the text of a span tag, so I tried the following:

response.xpath("//div[@class='kno-rdesc']")
response.xpath("//div[@class='mod']")
response.xpath("//div[@class='i4J0ge']")

I just get empty results. I even tried following each of the tags, like this:

response.xpath("//div//div//div//div//div//div//div//div//div//span")

But I still can't get to the information I want.

Have you tried printing the response you get (response.body as bytes) to a file? Is it possible that Google is sending an empty or incomplete response? — Gallaecio
I tried response.body in the Scrapy shell and got a very long result, but I have failed to write it to a file @Gallaecio — Joe
You can do that with regular Python code to write files, writing the contents of response.body — Gallaecio
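A minimal sketch of what that suggestion could look like, assuming you are in the Scrapy shell with a `response` object available. The helper name `save_body` and the file name are made up for illustration; the stand-in bytes are only there so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def save_body(body: bytes, path) -> int:
    """Write the raw response bytes to a file and return the byte count."""
    # response.body is bytes, so write it in binary mode without decoding.
    return Path(path).write_bytes(body)


# In the Scrapy shell you would call something like:
#     save_body(response.body, "google_response.html")
# Stand-in bytes and a temp-dir path (both hypothetical) for a standalone run:
demo_path = Path(tempfile.gettempdir()) / "example_response.html"
written = save_body(b"<html><body>example</body></html>", demo_path)
print(written, "bytes written to", demo_path)
```

Opening the saved file in a browser or text editor shows whether the knowledge panel markup is actually present in what Google returned to Scrapy.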

Shubham gupta · Accepted Answer · 2020-05-21T18:48:45

XPath is not always a good way to get data. XPath expressions often break when the DOM changes, and on some sites the DOM changes on every page load.

Also, use these modules with Scrapy when crawling well-known websites:

scrapy-rotating-proxies
scrapy-user-agents

Otherwise Google detects your request as a robot request and blocks the page load.
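As a rough sketch, enabling both packages in a project's settings.py might look like the following. The middleware paths and priorities follow the two packages' READMEs as far as I recall them, so check them against the versions you install. The proxy addresses are placeholders, not real proxies.

```python
# settings.py (sketch; verify names against the installed package versions)

# Placeholder proxies: replace with proxies you actually have access to.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user-agent middleware...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ...and let scrapy-user-agents pick a random user agent per request.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    # scrapy-rotating-proxies: rotate proxies and detect banned ones.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

Even with these enabled, there is no guarantee that Google will serve the same HTML a browser sees.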

A better way to find something on a page is by class and id.

(Note: first check that the class and id do not change on every page load or with every query.)
