I am following this tutorial: https://docs.microsoft.com/en-us/academic-services/graph/tutorial-azure-databricks-hindex

I have obtained access to the Microsoft Academic Graph data set and want to run some basic PySpark code against it, exactly as in the tutorial.

For example, this code:

# Get affiliations
Affiliations = MAG.getDataframe('Affiliations')
Affiliations = Affiliations.select(Affiliations.AffiliationId, Affiliations.DisplayName)
Affiliations.show(3)

When I run the code with 'Shift + Enter', the cell goes into a 'Running command' state and never seems to finish, even after half an hour. I have attached a screenshot of this to my post.

I have run these commands individually, and it's the last one (Affiliations.show(3)) that causes the slowness.

For example, when I run the command (Affiliations = MAG.getDataframe('Affiliations')) by itself, I actually get a result:

AffiliationId:long
Rank:integer
NormalizedName:string
DisplayName:string
GridId:string
OfficialPage:string
WikiPage:string
PaperCount:long
CitationCount:long
Latitude:float
Longitude:float
CreatedDate:date

Question: how can I debug this to find out what's causing the slowness?

1 Answer

Debugging a distributed application is still challenging in the notebook environment. Even though the web UI has the necessary information, there is a gap between web UIs and the development environment: it’s usually difficult to locate information in the web UI that is relevant to the code you are investigating; and there is no easy way to find historical runtime information.

Understanding how to debug with the Databricks Spark UI:

The Spark UI contains a wealth of information you can use for debugging your Spark jobs. There are a bunch of great visualizations, and we have a blog post here about those features.

For more details, open the Jobs view and drill into its Stages.
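One way to make the relevant job easier to find in the Jobs view is to label it before triggering it. Note that `getDataframe` and `select` return quickly because they only build a plan; `show` is the action that actually starts a Spark job, so that is the one worth labeling and timing. Below is a minimal sketch, assuming the notebook's predefined `spark` session and the `Affiliations` DataFrame from the question. `run_labeled` is a hypothetical helper name introduced here for illustration; it is not part of the tutorial or of PySpark.

```python
import time


def run_labeled(spark, description, action):
    """Run `action` (a zero-argument callable that triggers a Spark job)
    after setting a job description, and report how long it took.

    The description should appear in the Description column of the
    Spark UI's Jobs view, which makes the job easier to locate.
    """
    # setJobDescription is a real SparkContext method; it applies to
    # jobs submitted from this thread until it is changed again.
    spark.sparkContext.setJobDescription(description)
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage in the notebook (assumes `spark` and `Affiliations` already exist):
#
#   Affiliations.explain()  # print the physical plan without running a job
#   _, seconds = run_labeled(spark, "Affiliations.show(3)",
#                            lambda: Affiliations.show(3))
#   print(f"show(3) took {seconds:.1f} s")
```

If the labeled job never shows up in the Jobs view at all, that may point to a problem before any task runs (for example the cluster or the storage access) rather than to the query itself.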

Reference: Tips to Debug Apache Spark UI with Databricks

Hope this helps.