0
votes

I've been trying with no success to importxml using google sheets to scrape the Advanced Receiving table data from the url https://www.pro-football-reference.com/boxscores/201912290car.htm.

I've tried the XPath copied directly from the inspect chrome page of: //*[@id="div_receiving_advanced"] where I always get the "Imported content is empty" error message.

I'm stumped because it works with the Passing, Rushing, & Receiving table data using the XPath of: //*[@id="div_player_offense"]

When I use the XPath of: //*[@id="all_receiving_advanced"], I get the following results.

unparsed results

However, I'd like to parse the data from the 2nd column so it looks like this.

parsed results

Any help would be greatly appreciated.

1

1 Answers

1
votes

Since some players don't have value for specific columns (for eg : "Rec/Br"), transforming directly the data returned by IMPORTXML will produce a scrambled table.

2 solutions :

A) Use IMPORTFROMWEB addon (number of requests are limited in the free plan) with JS rendering activated and a base selector option to keep the data structure. XPath expressions needed for data :

/th/a
/td[@data-stat="team"]
/td[@data-stat="targets"]
/td[@data-stat="rec"]
/td[@data-stat="rec_yds"]
/td[@data-stat="rec_first_down"]
/td[@data-stat="rec_air_yds"]
/td[@data-stat="rec_air_yds_per_rec"]
/td[@data-stat="rec_yac"]   
/td[@data-stat="rec_yac_per_rec"]   
/td[@data-stat="rec_broken_tackles"]    
/td[@data-stat="rec_broken_tackles_per_rec"]    
/td[@data-stat="rec_drops"] 
/td[@data-stat="rec_drop_pct"]

for the headers :

//div[@id="div_receiving_advanced"]//th[contains(@class,"poptip")]

for the base selector :

//div[@id="div_defense_advanced"]//tr[@data-row][not(@class)]

Formula used in C6 :

IMPORTFROMWEB(B1;B2:O2;B3:C4)

Output :

IFWrev2f

Side note : IMPORTFROMWEB often output loading errors.

B) Use IMPORTDATA and formulas to generate the table. First we load the data of interest with a filter (QUERY). Then we fix the blank cells problem with SUBSTITUTE. After that we extract the data with REGEXEXTRACT. Finally we apply a last filter and SPLIT the data to populate the cells.

Formula :

=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'rec_broken_tackles_per_rec'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3}).+targets.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14"));"select * WHERE NOT Col1 contains '<'");";"))

Output :

IDATA

In both cases, blank cells are replaced with 0.

Blank

My working workbook is here.

EDIT :

For "Advanced Defense Table" with IMPORTDATA :

=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'def_tgt_yds_per_att'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3})<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?bli.+?>(.+?)<.+?qb_.+?>(.+?)<.+?qb_.+?>(.+?)<.+?sac.+?>(.+?)<.+?pre.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14;$15;$16;$17;$18;$19;$20;$21;$22"));"select * WHERE NOT Col1 contains '<'");";"))

Output :

ADT