I am following this tutorial: https://docs.microsoft.com/en-us/academic-services/graph/tutorial-azure-databricks-hindex
I have obtained access to the Microsoft Academic Graph data set and want to issue some basic pySpark code against it, precisely per the tutorial.
For example, this code:
# Get affiliations
Affiliations = MAG.getDataframe('Affiliations')
Affiliations = Affiliations.select(Affiliations.AffiliationId, Affiliations.DisplayName)
Affiliations.show(3)
When I run the code with 'Shift + Enter', it goes into a state of 'Running command' - and never seems to finish, even after half an hour. I have inserted a screen shot of this and attached to my post.
I have run these commands individually, and it's the last one (Affiliations.show(3)
) that causes the slowness.
For example, when I run the command (Affiliations = MAG.getDataframe('Affiliations')
) by itself, I actually get a result:
AffiliationId:long
Rank:integer
NormalizedName:string
DisplayName:string
GridId:string
OfficialPage:string
WikiPage:string
PaperCount:long
CitationCount:long
Latitude:float
Longitude:float
CreatedDate:date
Question: how can I debug this to find out what's causing the slowness?