1
votes

Looking to extract the author name from articles. Currrently using =IMPORTXML(G2,"//*[@class='author-details']")

When I do this, it creates 4 cells underneath which contain the word 'By', which I can't get rid of.

Very new to code - what am I doing wrong?

Attached example: https://docs.google.com/spreadsheets/d/1Mi1D5G1-_gNsQwVQ6I_ealDqcWixKA2p-hFqJpjlGt4/edit?usp=sharing

1

1 Answers

0
votes

You can use:

=index(IMPORTXML(G2,"//*[@class='author-details']"),1,2)

This displays only the first row of the second column of what is returned. The information You are after.

Edit:

Additionally, since you highlighted that you want author name. If all the names are in that format "By FIRST LAST @TwitterHandle Affiliation" then you can use this to get just the author's name:

=trim(split(right(index(IMPORTXML(G2,"//*[@class='author-details']"),1,2),len(index(IMPORTXML(G2,"//*[@class='author-details']"),1,2))-3),"@",true,true))

Will likely look like voodoo but paste it in, it works. It removes the first 3 characters ("By "), splits the text at the "@" symbol, and then keeps only the text on the left side of it, the name.