6
votes

I have an application that publishes a number of stats to graphite via statsd. One of the stats simply sends a stat increment to statsd every time a message is received by the service. I need to display a graph that shows the relative traffic over time for this stat. Generally speaking, I should be able to display a graph that refreshes every, say, 10 seconds, and shows how many messages were received in those 10 seconds, as well as the history for a given period of time. However, no matter how I format my API query I cannot seem to get accurate data. I've read a number of articles, including this one:

http://code.hootsuite.com/accurate-counting-with-graphite-and-statsd/

That seems to give some good insight but still doesn't quite give me what I need. This is the closest I have come:

integral(hitcount(stats.recieved, "10seconds"))

However, I don't like the cumulative result of this, and when I run it I get statistics that come nowhere near what I see in my logs for messages received. I am OK with accepting some packet loss, but we're talking about orders of magnitude. I know I am doing something wrong; I'm just hoping someone can give me some insight as to what.


1 Answer

13
votes

A couple of things to check/try:

Configure Graphite for Statsd

Check that the retention schema and aggregation settings in Graphite match how Statsd will be sending data (i.e., it sends one data point per 10-second flush interval).
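
For reference, the commonly recommended setup looks roughly like the following (paths and values are illustrative; check the current Statsd docs for your versions). In storage-schemas.conf:

    [stats]
    pattern = ^stats.*
    retentions = 10s:6h,1min:7d,10min:5y

And in the Statsd config (config.js), flushInterval is in milliseconds and should match the finest retention above:

    {
      graphiteHost: "127.0.0.1",
      graphitePort: 2003,
      flushInterval: 10000
    }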

Run a single Statsd aggregator

Be sure you are only running one instance of Statsd, as running multiple Statsd daemons will cause metrics to be dropped (Graphite will be configured to store only one data point for its highest precision of 10s:6h).
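
A quick sanity check on a typical Linux host (assuming Statsd is running under node and listening on its default UDP port 8125):

    # expect exactly one process
    ps aux | grep '[s]tatsd'

    # and exactly one listener on the Statsd UDP port
    netstat -ulnp | grep 8125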

Limit the time range in the UI or URL API to less than 6 hours

When displaying graphs with data that crosses the 6-hour threshold (e.g. from now to 7 hours ago), you will begin seeing 1 minute's worth of aggregated count data per point for the displayed graph (if you've configured Graphite for Statsd with retentions = 10s:6h,1min:7d,10min:5y). Rollups occur based on the oldest data point in the time range (e.g. a range reaching back 7+ days gives you 10-minute rollups).
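
As an illustration using the render API (the metric name stats.recieved is taken from the question; adjust to your own):

    # stays within the 10s:6h archive -- each point is a raw 10-second flush
    /render?target=hitcount(stats.recieved,"10s")&from=-6hours&format=json

    # reaches past 6 hours -- Graphite serves the coarser 1min:7d archive instead
    /render?target=hitcount(stats.recieved,"10s")&from=-24hours&format=json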

If sending sparse or "bursty" data AND displaying old time range (triggering aggregation)

Confirm that your xFilesFactor is low enough that aggregation produces non-null values even with a high rate of nulls. For example, 100 requests in the first 10 seconds and none for the remaining 50 seconds of a minute would store 100, null, null, null, null, null, which would be summed to null when the data ages if the xFilesFactor is higher than 1/6. Using the Statsd-recommended Graphite configuration handles this, but it is good to know about, as this can give the appearance of lost data.
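
The recommended settings amount to something like the following in storage-aggregation.conf (a sketch of the relevant entries; the key point for counts is aggregationMethod = sum and xFilesFactor = 0, so rollups sum rather than average and survive mostly-null intervals):

    [count]
    pattern = \.count$
    xFilesFactor = 0
    aggregationMethod = sum

    [count_legacy]
    pattern = ^stats_counts.*
    xFilesFactor = 0
    aggregationMethod = sum

    [default_average]
    pattern = .*
    xFilesFactor = 0.3
    aggregationMethod = average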

Saving schema or aggregation changes

If you changed the Graphite schema or aggregation settings after any metrics were stored (in Whisper, Graphite's storage), you'll need to either delete the .wsp files for the metric (Graphite will recreate them) or run whisper-resize.py.
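
For example, something along these lines (the file path is illustrative; it depends on where your install keeps Whisper data):

    # resize an existing file to the new retentions
    whisper-resize.py /opt/graphite/storage/whisper/stats/recieved.wsp 10s:6h 1min:7d 10min:5y

    # if only the aggregation method changed
    whisper-set-aggregation-method.py /opt/graphite/storage/whisper/stats/recieved.wsp sum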

Validating settings

You can verify the settings against some Whisper data by running whisper-info.py on a .wsp file. Find the .wsp file for one of your metrics in /graphite/storage/whisper/ and run: whisper-info.py my_metric_data.wsp. The output should tell you how the storage settings are actually being applied.
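
The output looks roughly like this (the values shown are what you would hope to see for a count metric under the 10s:6h,1min:7d,10min:5y schema; yours will differ):

    maxRetention: 157680000
    xFilesFactor: 0.0
    aggregationMethod: sum

    Archive 0
    retention: 21600
    secondsPerPoint: 10
    points: 2160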

TL;DR

You should ensure that Graphite is set to store one data point per 10-second interval for metrics coming from Statsd. You should make sure that Graphite is summing (not averaging) count data coming from Statsd. Both of these are handled by using the recommended Statsd configuration settings. Don't run more than one Statsd aggregator. When using the UI, limit the data returned to less than 6 hours, OR understand which rollup you are viewing when looking at data that crosses retention thresholds. Lastly, if you've already been sending metrics, make sure your schema or aggregation changes actually take effect (by deleting or resizing the existing .wsp files).