Recently we have been encountering a problem as described in the title once a month. On the metastore node, we have installed and started the ntpd service to synch time with kerberos server. The krb5.conf on the node looks like this:
[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = true
dns_lookup_kdc = true
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
So it seems less likely that the time on metastore out of synch with kerberos server(>=5min) resulting to the problem or due to network block.
Seen from the metastore log, the "Clock skew too great" exception logging time is out of order, such as,
2016-01-16 18:18:48,071 ERROR [pool-3-thread-63735]
2016-01-16 19:07:03,699 ERROR [pool-3-thread-63798]
2016-01-16 19:06:55,998 ERROR [pool-3-thread-63796]
2016-01-16 19:06:41,653 ERROR [pool-3-thread-63812]
2016-01-16 19:04:28,659 ERROR [pool-3-thread-63806]
2016-01-16 19:04:13,937 ERROR [pool-3-thread-63804]
2016-01-16 19:02:19,312 ERROR [pool-3-thread-63809]
2016-01-16 19:02:13,115 ERROR [pool-3-thread-63794]
2016-01-16 19:02:06,028 ERROR [pool-3-thread-63800]
2016-01-16 19:01:50,767 ERROR [pool-3-thread-63795]
2016-01-16 18:59:36,926 ERROR [pool-3-thread-63810]
2016-01-16 18:59:36,394 ERROR [pool-3-thread-63797]
The exception stack:
2016-01-16 18:59:36,394 ERROR [pool-3-thread-63797]: transport.TSaslTransport (TSaslTransport.java:open(296)) - SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Clock skew too great (37))]
at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:177)
at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:509)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:264)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$HiveSaslServerTransportFactory.getTransport(HadoopThriftAuthBridge.java:172)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge20S.java:678)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge20S.java:675)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1536)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge20S.java:675)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Clock skew too great (37))
at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$HiveSaslServerTransportFactory.getTransport(HadoopThriftAuthBridge.java:172)
... 10 more
env:
java version "1.7.0_45" Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) hive-0.13.1.2.1.10.0-hdp
So what should I do if i want to figure out the root cause? Any suggestions? Many thanks.