After working with Cassandra and OpsCenter for some weeks now, there is still one problem I have been unable to solve. I managed to install and configure OpsCenter and the agents so that they can connect to each other. Then I had a look at some of the metrics that are available in OpsCenter. While doing so, I noticed that I don't get any data for OS: Memory Used, so I tried to track down the source of this problem.
In the agent log files, I saw the message "Short os-stats collector failed" over and over again. I searched for the reason for this message, but none of the answers I found were helpful in my case. I increased the log level of the agent to TRACE and found this:
TRACE [os-metrics-2] 2015-08-27 08:39:34,644 Output from iostat -x -d -m 60 2... :
TRACE [os-metrics-2] 2015-08-27 08:39:34,648 Starting process: iostat -x -d -m 60 2
TRACE [os-metrics-7] 2015-08-27 08:39:34,667 Starting process: free -m
TRACE [os-metrics-7] 2015-08-27 08:39:34,669 Output from free -m... :
ERROR [os-metrics-7] 2015-08-27 08:39:34,670 Short os-stats collector failed
TRACE [os-metrics-5] 2015-08-27 08:39:34,677 Output from iostat -c -m 60 2... :
TRACE [os-metrics-5] 2015-08-27 08:39:34,678 Starting process: iostat -c -m 60 2
TRACE [os-metrics-6] 2015-08-27 08:39:34,682 Starting process: df --print-type --no-sync --block-size=1G --local
TRACE [os-metrics-6] 2015-08-27 08:39:34,684 Output from df --print-type --no-sync --block-size=1G --local... :
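The trace lines above can be correlated by thread name to see which command ran immediately before the error. The following is a minimal sketch, assuming the log line layout shown above (level, thread in brackets, date, time, message); the function name commands_before_errors is made up for illustration and is not part of the agent.

```python
import re

# A few lines copied from the agent trace log above.
LOG = """\
TRACE [os-metrics-2] 2015-08-27 08:39:34,648 Starting process: iostat -x -d -m 60 2
TRACE [os-metrics-7] 2015-08-27 08:39:34,667 Starting process: free -m
TRACE [os-metrics-7] 2015-08-27 08:39:34,669 Output from free -m... :
ERROR [os-metrics-7] 2015-08-27 08:39:34,670 Short os-stats collector failed
TRACE [os-metrics-6] 2015-08-27 08:39:34,682 Starting process: df --print-type --no-sync --block-size=1G --local
"""

# Assumed layout: LEVEL [thread] date time message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+) \[(?P<thread>[^\]]+)\] \S+ \S+ (?P<msg>.*)$"
)
START_PREFIX = "Starting process: "


def commands_before_errors(log_text):
    """Return (thread, last started command, error message) per ERROR line."""
    last_command = {}  # thread name -> most recent command started on it
    failures = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        thread, msg = match.group("thread"), match.group("msg")
        if msg.startswith(START_PREFIX):
            last_command[thread] = msg[len(START_PREFIX):]
        elif match.group("level") == "ERROR":
            failures.append((thread, last_command.get(thread), msg))
    return failures


if __name__ == "__main__":
    for thread, command, msg in commands_before_errors(LOG):
        print(f"{thread}: {msg!r} after command {command!r}")
```

Run against the excerpt, this pairs the error on thread os-metrics-7 with the free -m command started on the same thread.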
I guess the command that causes the problem is free -m, which would make sense, since it shows how much memory is free and used. I researched some more with this new information and found a possible solution: people suggested checking whether the cassandra user on my system has the necessary permissions to run the command. I logged in as that user and there was no permission problem; free -m shows the usual output (the same as when logged in as root):
              total        used        free      shared  buff/cache   available
Mem:           7752        4768         145          27        2838        2705
Swap:          8191          64        8127
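One further thing that could be checked is the layout of this output rather than the permissions. Older versions of free printed separate buffers and cached columns plus a "-/+ buffers/cache" row, while the output above has a combined buff/cache column and an available column. Whether the agent's parser depends on the older layout is only an assumption here and is not confirmed by the logs. The sketch below parses the output and reports which layout is present; parse_free, layout_report and read_free are hypothetical helper names.

```python
import subprocess

# The `free -m` output shown above.
SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:           7752        4768         145          27        2838        2705
Swap:          8191          64        8127
"""


def parse_free(output):
    """Parse `free -m` text into {row label: {column name: int}}."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        label = label.strip()
        if label.startswith("-/+"):
            # Old layout only: this row carries just "used" and "free".
            rows[label] = dict(zip(("used", "free"), values))
        else:
            # zip stops at the shorter sequence, so Swap gets three columns.
            rows[label] = dict(zip(header, values))
    return rows


def layout_report(rows):
    """Report which layout markers are present in the parsed output."""
    mem = rows.get("Mem", {})
    return {
        "has_buffers_cache_row": "-/+ buffers/cache" in rows,
        "has_separate_buffers_column": "buffers" in mem,
        "has_combined_buff_cache_column": "buff/cache" in mem,
    }


def read_free():
    """Run `free -m` on the local machine and return its text output."""
    result = subprocess.run(
        ["free", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    rows = parse_free(SAMPLE)  # or parse_free(read_free()) on a live node
    print(rows["Mem"])
    print(layout_report(rows))
```

For the output above, the report shows the combined buff/cache column and no "-/+ buffers/cache" row, so a parser written for the older layout would not find the fields it expects.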
I am running out of ideas about what the problem might be, which is why I am hoping for some help here.
Some more information about my system and my cluster:
OS: CentOS 7.1.1503
Cassandra Version: DataStax Community Version 2.1.8
OpsCenter and Agent Version: 5.2.0
Cassandra and the agents run as the cassandra user; OpsCenter runs as root.
I hope someone has an idea what the problem might be. Thanks in advance, and if you need any further information I am happy to provide it.
Btw: OS: Memory Used is the only statistic I have found so far that does not work. The OS: Memory Free graph works, for whatever reason.
Edit: I just saw that in some cases I get a stack trace as well with the error message:
java.lang.NullPointerException
at clojure.lang.Numbers.ops(Numbers.java:942)
at clojure.lang.Numbers.lt(Numbers.java:219)
at clojure.lang.Numbers.min(Numbers.java:4007)
at opsagent.rollup$add_value.invoke(rollup.clj:173)
at opsagent.rollup$process_keypair$fn__1465.invoke(rollup.clj:250)
at opsagent.cache$update_cache_value_default$fn__1166$fn__1167.invoke(cache.clj:25)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.lang.Ref.alter(Ref.java:174)
at clojure.core$alter.doInvoke(core.clj:2244)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at opsagent.cache$update_cache_value_default$fn__1166.invoke(cache.clj:25)
at clojure.lang.AFn.call(AFn.java:18)
at clojure.lang.LockingTransaction.run(LockingTransaction.java:263)
at clojure.lang.LockingTransaction.runInTransaction(LockingTransaction.java:231)
at opsagent.cache$update_cache_value_default.invoke(cache.clj:24)
at opsagent.rollup$process_keypair.invoke(rollup.clj:250)
at opsagent.rollup$process_metric_map.invoke(rollup.clj:256)
at opsagent.os.collection$start_os_stat_collection$send_metric__16618.invoke(collection.clj:80)
at opsagent.os.linux_metrics$sendmap.invoke(linux_metrics.clj:12)
at opsagent.os.linux_metrics$report_mem_stats.invoke(linux_metrics.clj:134)
at opsagent.os.linux_metrics$collectors$wrap_short_collector__10821$fn__10822.invoke(linux_metrics.clj:270)
at opsagent.os.collection$start_pool$fn__16589.invoke(collection.clj:39)
at clojure.lang.AFn.run(AFn.java:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)