9
votes

I need some advice on how to diagnose a hanging build. It’s only been happening in the last week or two and I have good reason to suspect it’s something that I’ve done recently and not just a coincidence

Setup

  • TFS 2013
  • 4 machine setup - 2 app tiers (in process of deprecating one of them), 1 sql server, 1 build server running 2 agents.
  • Build Controller is running on 2nd app tier along with the Job Agent
  • 1st App tier is serving the website (although that machine will soon be shutdown and everything will be passed to the 2nd app tier as the machine is getting old)

Symptoms

  • All executed builds (doesn’t appear to matter which build process template) never get marked as done, the last step always seems to be the same step “Copy Files to Drop Location”/“Workspace and Copy Files to Drop Location”/”Copy Binaries to drop, Reset the environment” (named differently in each build template)
  • The files appear to be getting dropped successfully in the build drop folder
  • Looking at the task manager it appears that all the build processes on the build server are exited (only TFSBuildServiceHost
  • Builds show their normal steps/logging while executing
  • Primary app tier has related warnings in the event logs (see warnings below)

Recent Changes

  • Installed Xamarin Android/iOS on the build server
  • Installed a few custom built plugins for Job Agent, Message Queue, and Web Services (been using them for years just had them disabled the last few weeks due to a app tier migration)
  • Installed Tiago’s Task Board Enhancer (again been using this for a long time, just had it disabled recently)
  • About a month ago we added the 2nd app tier and moved the sql off to another machine

What I’ve Tried

  • Rebooting both App tiers and build server
  • Uninstalling Xamarin (although I suspect some parts are still floating around as the Bonjour service appears to still be installed)
  • Removing the custom plugins
  • Turned logging diagnostics right up on one of the builds – nothing particularly of interest seems to turn up
  • Run the Best Practice Analyzer (nothing too unusual shows up)
  • Multiple build process templates (defaulttemplate, defaulttemplate.11.1, tfvctemplate.12.xaml)
  • Multiple build definitions
  • Checked the event logs of both AppTiers and Build server

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 3/02/2014 12:54:06 a.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130357641583538280 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process Name: w3wp Process Id: 2508 Thread Id: 2504

Detailed Message: A request for service host XXXX has been executing for 37 seconds, exceeding the warning threshold of 30. Request details: Request Context Details Url: /tfs/XXXX/XXXX/_api/_build/stop?__v=4 Method: ApiBuild.stop Parameters: uri = vstfs:///Build/Build/34064 User Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 Unique Id: 00000000-0000-0000-0000-000000000000

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 30/01/2014 11:10:01 p.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process Name: w3wp Process Id: 70320 Thread Id: 14540

Detailed Message: A request for service host XXXX has been executing for 37 seconds, exceeding the warning threshold of 30. Request details: Request Context Details Url: /tfs/XXXX/Build/v4.0/BuildService.asmx Method: StopBuilds Parameters: uris[0] = vstfs:///Build/Build/34051 uris = Count = 1 User Agent: Team Foundation (devenv.exe, 12.0.21005.1, Premium, SKU:16) Unique Id: 4d2d3213-fd41-4c4d-8ab0-b87619c96a42

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 31/01/2014 3:14:17 a.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: Process Details: Process Name: w3wp Process Id: 70320
Thread Id: 14540

Detailed Message: There are no active requests for service host XXXX that exceed the warning threshold of 30.

A quick google suggests upping the timeout in the tfs registry (http://xavierdilipkumar.com/post/2013/07/04/TFS-event-7005-and-7006-warning.aspx) I've tried that and it doesn't appear to change anything.

7
After rereading those warnings it appears that they are related to me stopping the build from the web front end and not necessarily directly related to the issue.Betty
kinda related but just give it a shot: geekswithblogs.net/kjones/Default.aspxIsaiah4110
Turns out the fix for us was to add an entry to the hosts file that pointed the SharePoint URL to 127.0.0.1 (the loopback address). We had already configured this for the other three SharePoint web application we had launched. We had neglected to do this with our new web application, which was only recently put into production.Isaiah4110
@Isaiah4110 no I'm pretty sure that's completely unrelated.Betty
I know they are completely unrelated, but what i was talking about was the nature of the fix. Anyways, where was the build controller running before you added the 2nd app tier? Did the user running the build controller change? did it start happening after the build controller was moved to the 2 app tierIsaiah4110

7 Answers

3
votes

can you look in the tfs bs logs at

Event Viewer -> Applications and Services Logs -> Microsoft -> Team Foundation Server -> Build-Services -> Operational

these timeouts generally relate to permissions. you should look for TF215106 access denied events. Although the files appear to be there, are they all the current date or are there some with different (older) dates? Also are they any alerts/steps happening when the file drop occurs?

Other than that it could be timing out because one of the dependencies is being used by another service.

2
votes

You might fire up Sysinternals Process Monitor to see when the processes actually exit and what they were doing (Process Monitor monitors "real-time file system, Registry and process/thread activity").

1
votes

The best course of action is to call Microsoft Support and open a Service Request. Make sure it gets priority A - your TFS production environment is not working - and be prepared to give them support and access.

The only hint from the log is that call to ApiBuild.stop. It suggests that the build workflow completed, so the code hosting it is calling back to the AT to mark the build completed. As you have no warnings from previous calls, it could be some problem at the database level. You may try activating SQL Tracing but it's not a trivial task, as you should be able to compare the trace with a working one.

Good luck

1
votes

I'm reluctant to mark this as an answer because I'm not entirely sure why it worked.

Suspecting something was wrong with the build machine I created a new Build Agent on a fresh install - the hanging issue still occurred.

I then added a Build Controller to that machine and noticed that new builds using that controller would complete. This suggested that there was a communication issue between either the BA and the BC, or the BA and the primary AT.

Given that our primary AT had other issues we decided to remove it from the picture, we switched the DNS to point at the second AT and disabled all services on the old primary. Instantly builds started to complete (including the ones that had been stuck for a number of days).

I still don't know which component was broken or why, especially since it worked fine in this configuration for a month prior. I can only assume there was either another change that I am not aware of, or the corruption of the primary AT was causing bigger issues.

1
votes

We were having the same problem here, the builds were kept open even after successfully passing all workflow stages.

I logged into the build machine and noticed the build controller was "running 6 builds" for some reason, even though there were no builds at all showing in the queue in Visual Studio.

After restarting the controller, the next build worked the first time.

Just wanted to let this one here as a possible answer. I'm not sure yet why the controller had those stuck builds though.

0
votes

I had this issue when an activity tried to log a huge message in the build log (namely the FxCopCmd activity from the CodePlex TFS Build Extensions project).

The build agent would successfully finish the build but the controller had to chew the huge message into the build log, and it was silently crashing/hanging.

I was able to track the issue down by navigating to C:\Users\[TfsServiceAccount]\AppData\Local\Temp\BuildAgent\[AgentNumber]\Logs\[BuildNumber]\ActivityLog.xml.

The last build message was truncated and by looking at the content, I recognized the FxCop output. In my case, I just set the LogToConsole parameter to False for the FxCop activity in the build process template, and the build completed successfully.

0
votes

Also appears to happen if the build agent cannot connect to the build controller server on port 9191.

Easily testable with a telnet client.

Appears that my server decided it was on an unknown network and kicked the firewall into overdrive. (The second time I got this issue, not sure if this was the reason I got it the first time but it seems reasonable).