
Many questions about scraping dynamic content have been asked on Stack Overflow. I went through them, but none of the suggested solutions worked for the following problem:

Context:

Issue:

I have not been able to access any of the DOM elements on this page (see page to scrape). Hints on how to access the search bar and the search button would be a great start. In the end, I want to go through a list of addresses, launch a search for each one, and copy the information displayed on the right-hand side of the screen.

I have tried the following:

  • Changed the browser for webdriver (from Chrome to Firefox)
  • Added waiting time for the page to load

    try:
        WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.ID, "addressInput")))
    except TimeoutException:  # from selenium.common.exceptions
        print("address input not found")
    
  • Tried to access the element by ID, XPATH, NAME, TAG NAME, etc.; nothing worked.

Questions

  • What else could I try that I have not so far (using Selenium webdriver)?
  • Are some websites really impossible to scrape? (I don't think the city uses an algorithm to generate a random DOM every time I reload the page.)
Find the search field with one of the find_by_* methods, send Keys.ENTER. - Corey Goldberg
The problem was that it could not find the elements... not about how to send keys. - Audrey Bascoul
Your question had 2 parts: "if I could get some hints on how to access the search bar, and the search button"... I supplied the various methods to look for (find_by_*) to locate an element. (The accepted answer chose find_element_by_id.) Also note, hitting enter to bypass an element lookup and simulated click tends to be faster and more reliable in practice. - Corey Goldberg

1 Answer


You can load http://50.17.237.182/PIM/ directly to get the source:

In [73]: from selenium import webdriver


In [74]: dr = webdriver.PhantomJS()

In [75]: dr.get("http://50.17.237.182/PIM/")

In [76]: print(dr.find_element_by_id("addressInput"))
<selenium.webdriver.remote.webelement.WebElement object at 0x7f4d21c80950>

If you look at the source returned by propertymap.sfplanning.org, there is a frame element whose src attribute is that URL:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>

<head>
  <title>San Francisco Property Information Map </title>
  <META name="description" content="Public access to useful property information and resources at the click of a mouse"><META name="keywords" content="san francisco, property, information, map, public, zoning, preservation, projects, permits, complaints, appeals">
</head>
<frameset rows="100%,*" border="0">
  <frame src="http://50.17.237.182/PIM" frameborder="0" />
  <frame frameborder="0" noresize />
</frameset>

<!-- pageok -->
<!-- 02 -->
<!-- -->
</html>
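
To check which frame holds the real page without opening a browser, the frame src values can be pulled out of the wrapper HTML. This is a minimal sketch using only the standard library; `FrameSrcParser` and `frame_sources` are illustrative names, not part of Selenium, and the sample markup is the frameset shown above:

```python
from html.parser import HTMLParser


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags also end up here.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_sources(html):
    """Return the list of frame URLs found in an HTML string."""
    parser = FrameSrcParser()
    parser.feed(html)
    return parser.sources


wrapper = """
<frameset rows="100%,*" border="0">
  <frame src="http://50.17.237.182/PIM" frameborder="0" />
  <frame frameborder="0" noresize />
</frameset>
"""
print(frame_sources(wrapper))  # ['http://50.17.237.182/PIM']
```

Any URL this returns is a candidate to load directly, as in the first session above.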

Thanks to @Alecxe, the simplest method is to use dr.switch_to.frame(0):

In [77]: dr = webdriver.PhantomJS()

In [78]: dr.get("http://propertymap.sfplanning.org/")

In [79]: dr.switch_to.frame(0)

In [80]: print(dr.find_element_by_id("addressInput"))
<selenium.webdriver.remote.webelement.WebElement object at 0x7f4d21c80190>

If you visit http://50.17.237.182/PIM/ in your browser, you will see exactly the same page as propertymap.sfplanning.org/. The only difference is that the former gives you full access to the elements without switching frames.

If you want to input a value and click the search button, it is something like:

from selenium import webdriver


dr = webdriver.PhantomJS()
dr.get("http://propertymap.sfplanning.org/")

dr.switch_to.frame(0)

dr.find_element_by_id("addressInput").send_keys("whatever")
dr.find_element_by_xpath("//input[@title='Search button']").click()
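
Since the goal is to run this for a whole list of addresses, the same steps can be wrapped in a loop. The following is only a sketch: `search_addresses` and the `read_results` callback are hypothetical names, and the callback is left to the caller because the selectors for the results panel are not known here. The locator strings "id" and "xpath" are the values of Selenium's By.ID and By.XPATH constants.

```python
def search_addresses(driver, addresses, read_results):
    """Run one search per address and collect whatever read_results returns.

    driver       -- an already-created Selenium WebDriver
    addresses    -- iterable of address strings to search for
    read_results -- hypothetical callback: takes the driver and returns the
                    data to keep (results-panel selectors are not known here)
    """
    driver.get("http://propertymap.sfplanning.org/")
    driver.switch_to.frame(0)  # the real page lives in the first frame

    collected = {}
    for address in addresses:
        box = driver.find_element("id", "addressInput")
        box.clear()  # drop the previous address before typing the next one
        box.send_keys(address)
        driver.find_element("xpath", "//input[@title='Search button']").click()
        collected[address] = read_results(driver)
    return collected
```

A call might look like search_addresses(dr, ["whatever"], lambda d: d.page_source). In practice the callback would probably need to wait for the results to load before reading them.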

But if you want to pull data, you may find it easier to query the URL directly; the query returns some JSON.
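
As a rough illustration of that approach, the request can be built and decoded with the standard library alone. The endpoint and the parameter name `q` below are placeholders, not the site's real API; the actual query URL would have to be read from the browser's network tab while running a search. `build_query_url` and `fetch_json` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(endpoint, **params):
    """Append URL-encoded query parameters to an endpoint."""
    return endpoint + "?" + urlencode(params)


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder endpoint and parameter name, for illustration only.
url = build_query_url("http://example.invalid/search", q="some address")
print(url)  # http://example.invalid/search?q=some+address
```

Once the real query URL is known, fetch_json(url) would return the decoded response as Python dicts and lists.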
