0
votes

Recently we have been encountering a problem as described in the title once a month. On the metastore node, we have installed and started the ntpd service to synch time with kerberos server. The krb5.conf on the node looks like this:

[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = true
dns_lookup_kdc = true
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

So it seems less likely that the time on metastore out of synch with kerberos server(>=5min) resulting to the problem or due to network block.
Seen from the metastore log, the "Clock skew too great" exception logging time is out of order, such as,

2016-01-16 18:18:48,071 ERROR [pool-3-thread-63735]
2016-01-16 19:07:03,699 ERROR [pool-3-thread-63798]
2016-01-16 19:06:55,998 ERROR [pool-3-thread-63796]
2016-01-16 19:06:41,653 ERROR [pool-3-thread-63812]
2016-01-16 19:04:28,659 ERROR [pool-3-thread-63806]
2016-01-16 19:04:13,937 ERROR [pool-3-thread-63804]
2016-01-16 19:02:19,312 ERROR [pool-3-thread-63809]
2016-01-16 19:02:13,115 ERROR [pool-3-thread-63794]
2016-01-16 19:02:06,028 ERROR [pool-3-thread-63800]
2016-01-16 19:01:50,767 ERROR [pool-3-thread-63795]
2016-01-16 18:59:36,926 ERROR [pool-3-thread-63810]
2016-01-16 18:59:36,394 ERROR [pool-3-thread-63797]

The exception stack:

2016-01-16 18:59:36,394 ERROR [pool-3-thread-63797]: transport.TSaslTransport (TSaslTransport.java:open(296)) - SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Clock skew too great (37))]
        at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:177)
        at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:509)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:264)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$HiveSaslServerTransportFactory.getTransport(HadoopThriftAuthBridge.java:172)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge20S.java:678)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge20S.java:675)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1536)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge20S.java:675)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Clock skew too great (37))
        at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$HiveSaslServerTransportFactory.getTransport(HadoopThriftAuthBridge.java:172)  
        ... 10 more

env:

 java version "1.7.0_45"
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
 hive-0.13.1.2.1.10.0-hdp

So what should I do if i want to figure out the root cause? Any suggestions? Many thanks.

3
Have you tried to synchronize time everywhere with NTP - Konstantin V. Salikhov
Yes, we have tried before. When the exception occurs, it turns out that the metastore is too busy to response to the request. We have to restart it. - Dengzh
Many thanks. Checked from the metastore log, the "Clock skew too great" surprises me a lot, it implies that that the current authentication request carries a timestamp 5min afterwards or forwards compared to the kerberos server. Unlucky we haven't gotten the kerberos debug info so far until next time the same problem occurs. I try to figure out the potential causes now and recreate the problem. - Dengzh
Thanks @Konstantin V. Salikhov and Samson Scharfrichter, the problem may due to some other cluster users trying to connect the metastore. - Dengzh

3 Answers

3
votes

I've seen this error as well, and in my case, the root cause had nothing to do with Kerberos. If you are using a MySql database as your data store, there is a pretty severe memory leak, https://issues.apache.org/jira/browse/HIVE-15551, that was introduced in 0.13 and not fixed until Hive 1.3.0. Basically, whoever originally wrote the code either forgot or didn't realize that you have to explicitly close JDBC statements, and this leads to excessive garbage collection when your process reaches its memory limits. Once this happens, everything in the process gets progressively slower, to the point where you start to see these clock skew errors.

You can tell if this is your issue by running a jmap live histogram on the metastore process. If you see JDBC objects towards the top of the list (in my case it was com.mysql.jdbc.JDBC42ResultSet and com.mysql.jdbc.StatementImpl), you are likely hitting this issue. I recommend that you either apply the patch, upgrade to Hive 1.3.0, or use the workaround mentioned in the issue to see if this clears things up.

1
votes

Use the kdestroy command and then kinit.

The kdestroy command destroys the user’s active Kerberos authorization tickets and deletes the credentials cache that contains them.

kinit is used to obtain and cache Kerberos ticket-granting tickets

Removing cache and 'kinit'ing again might solve your problem. If there is no cache, kdestroy would return "kdestroy: No credentials cache found while destroying cache".

The kdestroy documentation can be found here.

0
votes

Run this command to sync your time clock with KDC :

/sbin/service ntpd stop; /usr/sbin/ntpdate IP_Address_of_KDC_server; /sbin/service ntpd start