
I'm new to this, so sorry if I get anything confused. I'm writing a Selenium web scraper in Python to scrape all headlines and dates from the NYTimes article archives.

Here's the link: https://www.nytimes.com/search?dropmab=true&endDate=20120103&query=&sections=Business%7Cnyt%3A%2F%2Fsection%2F0415b2b0-513a-5e78-80da-21ab770cb753&sort=best&startDate=20070101

There's a "Show More" button at the bottom of the page that loads 10 more articles each time you click it. I want the script to keep clicking "Show More" until there are no more articles to load, and then scrape the whole page for the headlines and dates. Here is my attempt:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import pandas as pd


options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, 
executable_path=r"//usr/local/Caskroom/chromedriver/81.0.4044.69/chromedriver")
driver.get("https://www.nytimes.com/search?dropmab=true&endDate=20120103&query=&sections=Business%7Cnyt%3A%2F%2Fsection%2F0415b2b0-513a-5e78-80da-21ab770cb753&sort=best&startDate=20070101")

WebDriverWait(driver, 40).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='css-vsuiox']//button[@data-testid='search-show-more-button']")))
while True:
    try:
        WebDriverWait(driver, 40).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='css-vsuiox']//button[@data-testid='search-show-more-button']"))).click()
        print("MORE button clicked")
    except TimeoutException:
        break
driver.quit()


headlines_element = browser.find_elements_by_xpath('//h4[@class="css-2fgx4k"]')
headlines = [x.text for x in headlines_element]
print('headlines:')
print(headlines, '\n')

dates_element = browser.find_elements_by_xpath("//time[@class='css-17ubb9w']")
dates = [x.text for x in dates_element]
print("dates:")
print(dates, '\n')

for headlines, dates in zip(headlines, dates):
    print("Headlines : Dates")
    print(headlines + ": " + dates, '\n')

But when I run the script, it clicks the "Show More" button a few times, then randomly clicks on one of the articles and navigates away from the search page. I also tried nesting the headline and date scraping inside the while loop, but I kept getting "TabError: inconsistent use of tabs and spaces in indentation".

Please help! Thanks!


2 Answers

0
votes
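from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver can be found on this machine
wait = WebDriverWait(driver, 10)
driver.get("https://www.nytimes.com/search?dropmab=true&endDate=20120103&query=&sections=Business%7Cnyt%3A%2F%2Fsection%2F0415b2b0-513a-5e78-80da-21ab770cb753&sort=best&startDate=20070101")

# Wait until the date and headline elements of the loaded results are present.
times = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='css-46b038']//ol[*]//li//time")))
elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//h4")))

# Pair each headline with its date instead of nesting the loops.
for element, time_element in zip(elements, times):
    print(time_element.text)
    print(element.text)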

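The snippet above reads only the results that are present when the page first loads. A rough sketch of how the "Show More" loop from the question might be combined with it is given below. The function names `click_until_gone` and `scrape_archive` are made up for illustration, the XPaths are copied from the question and from this answer and may no longer match the live page, and scrolling the button into view is only a guess at what could stop the stray clicks on articles.

```python
def click_until_gone(find_button, max_clicks=10000):
    """Click the button returned by find_button() until it returns None.

    find_button is any callable that returns an object with a .click()
    method, or None once the button can no longer be found.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
    return clicks


def scrape_archive(driver, url, timeout=10):
    """Load url, expand all results, and return (headline, date) pairs."""
    # Imported lazily so click_until_gone works without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # XPaths taken from the question and the answer; they are assumptions
    # about the page markup, not something verified here.
    button_xpath = "//button[@data-testid='search-show-more-button']"
    time_xpath = "//div[@class='css-46b038']//ol[*]//li//time"

    def find_button():
        try:
            button = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.XPATH, button_xpath)))
        except TimeoutException:
            return None  # no clickable button within the timeout: stop
        # Assumption: centring the button may keep the click from landing
        # on a neighbouring article link while the page is still shifting.
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", button)
        return button

    driver.get(url)
    click_until_gone(find_button)

    # Scrape only after everything is loaded, and before driver.quit().
    headlines = [e.text for e in driver.find_elements(By.XPATH, "//h4")]
    dates = [e.text for e in driver.find_elements(By.XPATH, time_xpath)]
    return list(zip(headlines, dates))
```

Calling `scrape_archive(driver, url)` with an already created `driver` would return a list of `(headline, date)` tuples; `driver.quit()` should be called only after that list has been built.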

0
votes

It looks like you are mixing tabs and spaces for indentation. I recommend not using tabs at all.

You can use one of the following options.

Option 1:

Run autopep8 on your Python code with this command: autopep8 -i yourFileName.py

Here is the documentation for autopep8: https://pypi.org/project/autopep8/

Option 2:

 1. Set your IDE to indent with 4 spaces.
 2. Replace all the tabs in your existing code with 4 spaces.
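
If your editor has no "convert tabs to spaces" command, step 2 of Option 2 can be approximated with a few lines of Python. This is only a sketch: the helper names `leading_tabs_to_spaces` and `fix_file` are made up for illustration, and it touches only tabs in each line's leading whitespace, so tabs inside string literals are left alone.

```python
from pathlib import Path


def leading_tabs_to_spaces(source, width=4):
    """Replace each tab in a line's leading whitespace with `width` spaces."""
    fixed_lines = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(" \t")                  # line without its indentation
        indent = line[: len(line) - len(body)]     # the indentation itself
        fixed_lines.append(indent.replace("\t", " " * width) + body)
    return "".join(fixed_lines)


def fix_file(path, width=4):
    """Rewrite the file at `path` in place with indentation tabs expanded."""
    file_path = Path(path)
    file_path.write_text(leading_tabs_to_spaces(file_path.read_text(), width))
```

For example, `fix_file("yourFileName.py")` would rewrite that file in place, so keep a backup first. If lines were already indented with a mix of tabs and a different number of spaces, the result may still need manual cleanup.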