I am using BeautifulSoup to parse the page at:
https://irs.thsrc.com.tw/IMINT/
When the page is rendered, a confirmation box pops up in front of all other elements and asks for confirmation.
The tag is inside the sole form tag, at XPath /html/body/div[1]/form/div[2].
Both this tag and the preceding tag at /html/body/div[1]/form/div[1] have the style attribute "display:none".
The weird thing is that BeautifulSoup skips the second tag.
I thought it might be a bug in BeautifulSoup, so I rewrote the code with lxml.
But it turned out that lxml also skips the second tag.
In fact, root.findall(".//div") does not return the skipped tag either. According to the HTML source, there should be a tag at XPath /html/body/div[2]/form/div[2]/div[1]/div[1]/div[1]/div[1], but it is missing from the list returned by lxml's findall(".//div").
Below I have copied part of the HTML and the complete methods that recursively scan all tags. Some Unicode text that could not pass the Stack Overflow filter has been changed to ASCII.
I would appreciate it very much if anyone could tell me how to get the pop-up tag that is waiting for confirmation.
Thanks
<html>
<head>
<title> taiwan hsrc </title>
</head>
<body topmargin="0" rightmargin="0" bottommargin="0" bgcolor="#FFFFFF" leftmargin="0">
<!----- error message ends ----->
<form action="/IMINT/;jsessionid=4A74C40B8D68474DF0B6F49E953DD825?wicket:interface=:0:BookingS1Form::IFormSubmitListener" id="BookingS1Form" method="post">
<div style="display:none">
<input type="hidden" name="BookingS1Form:hf:0" id="BookingS1Form:hf:0" />
</div>
<div style="display:none; padding:3px 10px 5px;text-align:center;" id="dialogCookieInfo" title="Taiwan high-speed rail" wicket:message="title=bookingdialog_3">
<div class="JCon">
<div class="TCon">
<div class="overDiffText">
<div style="text-align: left;">
<span>for better service
<a target="_blank" class="c" style="color:#FF9900;" href="https://www.thsrc.com.tw/tw/Article/ArticleContent/d1fa3bcb-a016-47e2-88c6-7b7cbed00ed5?tabIndex=1">
privacy
</a>
。
</span>
</div>
</div>
<div class="action">
<table border="0" cellpadding="0" cellspacing="0" align="center">
<tr>
<td>
<input hidefocus="true" name="confirm" id="btn-confirm" type="button" class="button_main" value="我同意"/>
</td>
</tr>
</table>
</div>
</div>
</div>
</div>
<div id="content" class="content">
<!----- marquee starts ----->
<marquee id="marqueeShow" behavior="scroll" scrollamount="1" direction="left" width="755">
</marquee>
<!----- marquee ends ----->
<div class="tit">
<span>一般訂票</span>
</div>
</form>
</div>
</body>
</html>
My lxml code for scanning the HTML is the following.
@classmethod
def actionableLXML(cls, e):
    # Returns True if e or any of its descendants is actionable.
    # xmlTree is assumed to be the module-level tree that e belongs to.
    global count
    print("rec[", count, "], xpath: ", xmlTree.getpath(e))
    count += 1
    flagActionableInside = False
    if e.tag in cls._clickable_tags \
            or e.tag == 'input' or e.tag == 'select':
        flagActionableInside = True
    else:
        flagActionableInside = False
    for c in e.getchildren():
        flagActionableInside |= cls.actionableLXML(c)
    # Remove hidden elements that contain nothing actionable.
    if e.attrib and 'style' in e.attrib \
            and 'display:' in e.attrib['style'] \
            and 'none' in e.attrib['style']:
        if not flagActionableInside:
            e.getparent().remove(e)
    return flagActionableInside
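To check whether lxml itself drops the dialog tag, it may help to test the parser alone, without the recursive pruning. The following is a minimal sketch, assuming a heavily trimmed version of the HTML above. The `SAMPLE` string and the names `div_paths` and `dialog` are illustrative and not part of the original code.

```python
from lxml import html

# Trimmed, illustrative version of the page source (an assumption, not the real page).
SAMPLE = """
<html><body>
<form id="BookingS1Form" method="post">
  <div style="display:none"><input type="hidden" name="hf"/></div>
  <div style="display:none; padding:3px" id="dialogCookieInfo">
    <div class="JCon"><input type="button" id="btn-confirm" value="OK"/></div>
  </div>
</form>
</body></html>
"""

root = html.fromstring(SAMPLE)
tree = root.getroottree()

# List the XPath of every <div> the parser produced.
div_paths = [tree.getpath(div) for div in root.iter("div")]
for path in div_paths:
    print(path)

# Look the dialog up by id instead of by position.
matches = root.xpath('//div[@id="dialogCookieInfo"]')
dialog = matches[0] if matches else None
print(dialog is not None)
```

If the tag shows up here but not when the real page is parsed, the difference is more likely in the downloaded HTML than in lxml. Looking the tag up by id also avoids depending on an exact positional XPath.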
The code using BeautifulSoup is the following.
@classmethod
def actionableBS(cls, e):
    global countLabelActionableInside
    print("rec actionable inside[", countLabelActionableInside, "], xpath: ", DomAnalyzer._get_xpath(e))
    countLabelActionableInside += 1
    flagActionableInside = False
    if e.name == 'form':
        print("caught form!")
    if e.name in cls._clickable_tags or e.name == 'input' or e.name == 'select':
        flagActionableInside = True
    else:
        flagActionableInside = False
    if hasattr(e, 'children'):
        for c in e.children:
            flagActionableInside |= cls.actionableBS(c)
    if e.attrs and e.has_attr('style') and 'display:' in e['style'] and 'none' in e['style']:
        # if element.name in cls._clickable_tags or element.name == 'input' or element.name == 'select':
        if not flagActionableInside:
            e.decompose()
    return flagActionableInside
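One thing worth ruling out in the BeautifulSoup method above: calling decompose() on a node while looping over its parent's `.children` can cause the next sibling to be skipped, because the loop walks the parent's live list of contents. Whether that explains the skipped tag here is not established. The following is a minimal sketch that iterates over a snapshot of the children instead. `ACTIONABLE` is a hypothetical stand-in for `cls._clickable_tags`, and the `empty-hidden` div is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for cls._clickable_tags plus 'input' and 'select'.
ACTIONABLE = {"a", "button", "input", "select"}

def prune_hidden(node):
    """Return True if node or a descendant is actionable.

    Hidden subtrees (display:none) with nothing actionable inside are removed.
    """
    if not isinstance(node, Tag):
        return False  # text nodes and comments are never actionable
    actionable = node.name in ACTIONABLE
    # Snapshot the children so decompose() cannot disturb the loop.
    for child in list(node.children):
        actionable |= prune_hidden(child)
    style = node.get("style", "")
    if "display:" in style and "none" in style and not actionable:
        node.decompose()
    return actionable

# Trimmed, illustrative markup; the "empty-hidden" div is not on the real page.
SAMPLE = """
<form id="BookingS1Form">
  <div style="display:none" id="empty-hidden"><span>x</span></div>
  <div style="display:none"><input type="hidden" name="hf"/></div>
  <div style="display:none; padding:3px" id="dialogCookieInfo">
    <input type="button" id="btn-confirm"/>
  </div>
</form>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
prune_hidden(soup.form)
dialog = soup.find("div", id="dialogCookieInfo")
print(dialog is not None)
```

The snapshot costs one extra list per element. In exchange, every sibling is visited exactly once, even when an earlier sibling has been removed.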
/html/body/form; /html/body/form/div[1]; /html/body/form/div[1]/input; /html/body/form/div[2]; /html/body/form/div[2]/div
– Jack Fleeting