1 vote

I am new to Python and am trying to scrape data from the following site. Although this code worked for a different site, I cannot get it to work for Next Gen Stats. Does anyone have any thoughts as to why? Below is my code and the error I am getting.

import pandas as pd
import numpy as np
import html5lib

urlwk1 = 'https://nextgenstats.nfl.com/stats/receiving/2020/1'
urlwk2 = 'https://nextgenstats.nfl.com/stats/receiving/2020/2'

df11 = pd.read_html(urlwk1)
df11[0].to_csv('NFL_Receiving_Page1.csv', index=False)  # index=False drops the index that would otherwise appear as the first column in the CSV

Below is the error I am getting:

>>> df11 = pd.read_html(urlwk1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1101, in read_html
    displayed_only=displayed_only,
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

>>> df11[0].to_csv('NFL_Receiving_Page1.csv', index=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'df11' is not defined

We can't quite tell from that error exactly what's going on, but if it works on one site, it's not guaranteed to work on another: the structure is quite likely different. Are you familiar with using the debugger library? Have you checked what df11[0] is in the context above? – chrymxbrwn
Thanks. I have updated the error I'm getting and provided exactly what it looks like. df11 is supposed to contain the scraped dataframe. – wolfblitza
I am not familiar with the debugger library. – wolfblitza
Can you show us the output of df11? – Ujjwal Agrawal
The error shared above is the output I get when I run the df11 line. – wolfblitza

2 Answers

1 vote

pandas.read_html cannot parse HTML tables that are loaded dynamically by JavaScript.

This page fetches its table data through an API call instead.

You can use the code below to fetch and parse the API response:

import requests
import pandas as pd

headers = {
    'accept': 'application/json, text/plain, */*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'referer': 'https://nextgenstats.nfl.com/',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}

response = requests.get('https://appapi.ngs.nfl.com/statboard/receiving?season=2020&seasonType=REG&week=2', headers=headers)

df = pd.read_json(response.text)  # the response body is JSON, not HTML
df.to_csv('NFL_Receiving_Page1.csv', index=False)
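If pd.read_json struggles with the nested payload, you can flatten it yourself with pd.json_normalize. Here is a minimal sketch on a hand-made payload — the "stats" key and the field names are assumptions, so inspect response.json() to confirm the real shape before relying on them:

```python
import pandas as pd

# Hand-made payload mimicking the ASSUMED shape of the API response;
# the "stats" key and field names are guesses -- inspect response.json()
# to confirm the real structure.
payload = {
    "season": 2020,
    "stats": [
        {"player": {"displayName": "Player A"}, "avgSeparation": 3.1},
        {"player": {"displayName": "Player B"}, "avgSeparation": 2.7},
    ],
}

# json_normalize flattens each record; nested dicts become dotted
# column names such as "player.displayName"
df = pd.json_normalize(payload["stats"])
```

This gives you one row per player with flat columns, which then exports cleanly with df.to_csv(..., index=False).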


0 votes

Read the HTML using a Selenium driver

I think the page address you mentioned loads its content dynamically. Please refer to the answer above, and then try the code below.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chromedriver_path = '/home/user/chromedriver'

d = webdriver.Chrome(chromedriver_path, chrome_options=chrome_options)
d.get('https://nextgenstats.nfl.com/stats/receiving/2020/1')
time.sleep(3)  # give the JavaScript time to render the table
html = d.page_source
df = pd.read_html(html)[0]  # read_html returns a list of DataFrames; take the first

This code will work once ChromeDriver is properly installed on your system. Adjust time.sleep() to suit your internet speed, and set chromedriver_path to match where the driver lives on your machine.
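A fixed time.sleep(3) is fragile: too short and the table hasn't rendered, too long and you wait needlessly. Selenium's WebDriverWait with expected_conditions is the idiomatic fix; the same idea can also be sketched as a small generic polling helper (the find_elements call in the comment is illustrative only, assuming the driver d from the code above):

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# With the driver above you would poll for the table instead of sleeping, e.g.:
# wait_for(lambda: d.find_elements_by_tag_name('table'))
```

This returns as soon as the condition holds, so a slow connection still succeeds and a fast one doesn't pay the full sleep.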