I am using BeautifulSoup to parse the page at https://irs.thsrc.com.tw/IMINT/. When the page is rendered, a confirmation box pops up in front of all other elements and asks for confirmation.
The tag for this box is inside the page's only form tag, at xpath /html/body/div[1]/form/div[2].
Both this tag and the preceding tag at /html/body/div[1]/form/div[1] have the style attribute "display:none".
The weird thing is that BeautifulSoup skips the second tag.
I thought it might be a bug in BeautifulSoup, so I rewrote the code with lxml.
But it turned out that lxml also skips the second tag.
In fact, the list returned by root.findall(".//div") does not include the skipped tag either. According to the HTML source file, there should be a tag with xpath /html/body/div[2]/form/div[2]/div[1]/div[1]/div[1]/div[1], but findall(".//div") in lxml skips it.

Below I have copied part of the HTML code and the whole method that recursively scans all tags. Some Unicode data that could not pass the Stack Overflow filter has been changed to ASCII.

I would appreciate it very much if anyone could tell me how to get the pop-up tag that is waiting for confirmation.
Thanks

<html>
    <head>
        <title> taiwan hsrc </title>
    </head>
    <body topmargin="0" rightmargin="0" bottommargin="0" bgcolor="#FFFFFF" leftmargin="0">
        <!----- error message ends ----->
        <form action="/IMINT/;jsessionid=4A74C40B8D68474DF0B6F49E953DD825?wicket:interface=:0:BookingS1Form::IFormSubmitListener" id="BookingS1Form" method="post">
            <div style="display:none">
                <input type="hidden" name="BookingS1Form:hf:0" id="BookingS1Form:hf:0" />
            </div>
            <div style="display:none; padding:3px 10px 5px;text-align:center;" id="dialogCookieInfo" title="Taiwan high-speed rail" wicket:message="title=bookingdialog_3">
                <div class="JCon">
                    <div class="TCon">
                        <div class="overDiffText">
                            <div style="text-align: left;">
                                <span>for better service
                                    <a target="_blank" class="c" style="color:#FF9900;" href="https://www.thsrc.com.tw/tw/Article/ArticleContent/d1fa3bcb-a016-47e2-88c6-7b7cbed00ed5?tabIndex=1">
                                        privacy
                                    </a>
                                   。
                                </span>
                            </div>
                        </div>
                        <div class="action">
                            <table border="0" cellpadding="0" cellspacing="0" align="center">
                                <tr>
                                    <td>
                                        <input hidefocus="true" name="confirm" id="btn-confirm" type="button" class="button_main" value="我同意"/>
                                    </td>
                                </tr>
                            </table>
                        </div>
                    </div>
                </div>
            </div>
            <div id="content" class="content">
                <!----- marquee starts ----->
                <marquee id="marqueeShow" behavior="scroll" scrollamount="1" direction="left" width="755">
                </marquee>  
                <!----- marquee ends ----->
                <div class="tit">
                    <span>一般訂票</span>     
                </div>
            </form>
        </div>
    </body>  
</html>  

My lxml code for scanning the HTML is the following.

@classmethod
def actionableLXML(cls, e):
    """Return True if e or a descendant is actionable; remove hidden subtrees that are not."""
    global count
    # xmlTree is the module-level ElementTree that e belongs to.
    print("rec[", count, "], xpath: ", xmlTree.getpath(e))
    count += 1
    flagActionableInside = e.tag in cls._clickable_tags or e.tag in ('input', 'select')
    # Iterate over a copy, because children may be removed during the recursion.
    for c in list(e):
        flagActionableInside |= cls.actionableLXML(c)
    style = e.get('style') or ''
    if 'display:' in style and 'none' in style:
        # Drop a hidden element only if nothing actionable lives inside it.
        if not flagActionableInside and e.getparent() is not None:
            e.getparent().remove(e)
    return flagActionableInside

The code using BeautifulSoup is the following.

@classmethod
def actionableBS(cls, e):
    """Return True if e or a descendant is actionable; decompose hidden subtrees that are not."""
    global countLabelActionableInside

    print("rec actionable inside[", countLabelActionableInside, "], xpath: ", DomAnalyzer._get_xpath(e))
    countLabelActionableInside += 1
    if e.name == 'form':
        print("caught form!")
    flagActionableInside = e.name in cls._clickable_tags or e.name in ('input', 'select')
    if hasattr(e, 'children'):
        # Iterate over a copy, because decompose() removes a child from e during the loop.
        for c in list(e.children):
            flagActionableInside |= cls.actionableBS(c)

    if e.attrs and e.has_attr('style') and 'display:' in e['style'] and 'none' in e['style']:
        # Drop a hidden element only if nothing actionable lives inside it.
        if not flagActionableInside:
            e.decompose()

    return flagActionableInside
What output are you trying to get? - Jack Fleeting
Thanks for your attention. Here are the two things I am trying to get. Suppose the skipped tag is s and its parent, the form tag, is t at xpath /html/body/div[1]/form. First, I want s to be in t.getchildren(), but at the moment it is not. Second, I also want to make sure that I can access the driver element of s via xpath. - Farn Wang
If you can, please post your BeautifulSoup code as well. - Jack Fleeting
Thanks for the quick response. I just added it to the end of the main post. As you can see, I am now using e.children to access the list of children, but it does not contain tag s. - Farn Wang
I don't see the xpath you mention. The relevant portion of the xpaths I see is: /html/body/form; /html/body/form/div[1]; /html/body/form/div[1]/input; /html/body/form/div[2]; /html/body/form/div[2]/div - Jack Fleeting

1 Answer

The anomaly occurs because the missing tag is inserted dynamically after the initial page load. As some people showed us, if the crawler waits a short time before reading the page, the tag is present in the page's HTML source.
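
The waiting idea can be sketched roughly as follows. This is only an illustration using the standard library, not the code we actually ran: `wait_for_id`, `has_id`, and `IdFinder` are hypothetical helper names, `get_html` is assumed to be any callable that returns the current page source, and the id `dialogCookieInfo` is taken from the HTML in the question. The list of snapshots merely simulates a page where the dialog shows up on the third read.

```python
import time
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether a start tag with the wanted id attribute was seen."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_id(html, wanted_id):
    """Return True if some tag in html carries id == wanted_id."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    finder.close()
    return finder.found


def wait_for_id(get_html, wanted_id, timeout=10.0, interval=0.5):
    """Poll get_html() until wanted_id appears; return that HTML, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if has_id(html, wanted_id):
            return html
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a live page (an assumption): the dialog only appears on the third read.
snapshots = iter([
    "<form><div style='display:none'></div></form>",
    "<form><div style='display:none'></div></form>",
    "<form><div style='display:none'></div>"
    "<div id='dialogCookieInfo' style='display:none'></div></form>",
])
html = wait_for_id(lambda: next(snapshots), "dialogCookieInfo", timeout=5.0, interval=0.01)
print(html is not None)
```

With a Selenium driver, `get_html` could be something like `lambda: driver.page_source`, and the returned HTML can then be handed to BeautifulSoup or lxml as before. Selenium's own `WebDriverWait` is probably a more natural fit when a driver is already in use.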