I am trying to scrape the transaction value (取引値) from http://nextfunds.jp/lineup/1357/detail.html. With Inspect Element I can see the value 1,875 (Ctrl+F for 取引値 or 1,875 to find the value I need), but these values do not appear in the page source. I intend to scrape them with Python. I tried:

import requests
from bs4 import BeautifulSoup

url = 'http://nextfunds.jp/lineup/1357/detail.html'
response = requests.get(url)
html = response.content
print(html)
soup = BeautifulSoup(html, 'html.parser')

Since neither 1,875 nor 取引値 is in the HTML source, is there no way to scrape those values? Thanks.

Update 1: Tried lxml

from lxml import html
page = requests.get(url)
tree=html.fromstring(page.content)
#copied xpath using chrome inspect element
val= tree.xpath('//*[@id="include"]/div[1]/div[2]/table/tbody/tr[1]/td')
val
[]

Update 2: Tried WebKit (this comes very close to solving it), following this guide: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

# Take this class for granted; just use the result of the rendering.
class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://nextfunds.jp/lineup/1357/detail.html'  
r = Render(url)  
result = r.frame.toHtml()
# Now write result to a file and open it in a browser to copy the XPath of the desired table data.
# Somehow some table values are still missing (I thought it was a website issue, but it is not).

Update 3 (got the values, but stuck at selecting the table)

>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> session = dryscrape.Session()
>>> session.visit(url)
>>> response = session.body()
>>> soup = BeautifulSoup(response)
>>> html = soup.prettify("utf-8")
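>>> f1 = open('page.html', 'wb')  # f1 was never defined above; the file name is a placeholder
>>> f1.write(html)
>>> f1.close()
# Now I do see the table values I need, but BeautifulSoup does not support XPath. I just need to select the table and save it as CSV.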
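
BeautifulSoup has no XPath support, but its own `find_all` can select a table. The following is a minimal sketch, assuming the rendered page is already available as a string; the helper names `table_rows` and `rows_to_csv` and the sample markup are hypothetical and not taken from the site.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_rows(markup, index=0):
    """Return the cell texts of the index-th <table> as a list of rows."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find_all("table")[index]
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; values containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up markup standing in for the rendered page (labels are assumptions).
sample = """
<table>
  <tr><th>Label</th><td>1,875</td></tr>
  <tr><th>Volume</th><td>184,101</td></tr>
</table>
"""
rows = table_rows(sample)
print(rows_to_csv(rows))
```

With the dryscrape session above, `table_rows(session.body(), index=...)` would be the likely call, where the right `index` has to be found by inspecting the rendered HTML. Nested tables would need extra care, since `find_all("tr")` also descends into inner tables.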

Update 4: I found that the link to the HTML I am interested in is given in the page source of the URL. I only need to search the page source for the pattern src="http://nam.qri.jp/cgi-bin/nextfunds/json?SRC=nextfunds/lineup&code=1570&auth= to get the link, and then use the code given in the answer section. This is more of a regex problem now. I can do it with curl and grep, but I would like to do it in Python only.
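
The search described in Update 4 can be sketched with the standard `re` module. This is only an illustration under assumptions: the function name `find_json_url` is hypothetical, and the sample fragment (including its `auth` value) is a made-up placeholder, not real page content.

```python
import re
from html import unescape

# Captures the value of a src attribute whose URL contains "/json?".
SRC_PATTERN = re.compile(r'''src\s*=\s*["']([^"']*/json\?[^"']*)["']''')


def find_json_url(page_source):
    """Return the first src URL containing '/json?', or None if absent."""
    match = SRC_PATTERN.search(page_source)
    if match is None:
        return None
    # Attribute values in HTML may carry entities such as &amp;.
    return unescape(match.group(1))


# Hypothetical fragment; the auth value is a placeholder, not a real token.
sample = (
    '<script src="http://nam.qri.jp/cgi-bin/nextfunds/json'
    '?SRC=nextfunds/lineup&amp;code=1357&amp;auth=PLACEHOLDER"></script>'
)
print(find_json_url(sample))
```

On the live page the input would presumably be `requests.get(url).text`. A regular expression is brittle against markup changes, so an HTML parser (as in the answer's XPath) is the safer choice when one is available.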

This site uses JS to populate the table. To scrape that data, you can use Selenium or make a direct request to the site's API. - vold
@vold Can you share how and where you found the direct link? (This seems to be the JSON that some JS fetches and passes to the table, right?) Is this solution generic enough to run every day, or will we need to change the URL every day? - pythonRcpp
I wrote in my answer where you can find the URL. It seems the URL is still working and you can get the desired value, but it is hard to tell how long it will keep working. - vold
Yes, the last part of the string is the time since the epoch (it changes every minute). I also need the values from the table just below, the one containing 1,875円 (-1.16%). I am unable to find the URL for it. Is it not loaded over the network? - pythonRcpp

1 Answer

The site populates those values via JS. You can simulate those requests and fetch the data directly from the json endpoint. To get the values from the second table, you can use this code:

import requests
from lxml import html

def parse(link='http://nextfunds.jp/lineup/1357/detail.html'):
    source = requests.get(link)
    t = html.fromstring(source.content)
    # find the URL of the json endpoint in the start page source
    url = t.xpath('//@src[contains(.,"json?")]')[0]
    # request the json endpoint
    req = requests.get(url)
    tree = html.fromstring(req.content)
    # parse the response as HTML and take the cell text of the second table
    data = tree.xpath('(//table)[2]//td/text()')
    # print every other cell, dropping non-ASCII characters
    for item in data[::2]:
        print(item.encode('ascii', 'ignore').decode())

parse()
# 1,847
# 184,101
# 2,112.71
# 1,209
# The same function can work with a different fund URL
parse('http://nextfunds.jp/lineup/1627/detail.html')