3 votes

I'm faced with the challenge of analysing different system logfiles based on the following requirements:

  • several hundred systems
  • millions of logs every day in different formats

Besides many other objectives, my biggest challenge is real-time correlation analysis of all incoming logs against all current system logs and, in part, against historical log events.

Currently we're focusing on MongoDB, Elasticsearch, Hadoop, ... to meet this challenge.

On the other hand, I've read some interesting things about Google Bigtable and BigQuery.

So my question is: are Bigtable and/or BigQuery solutions worth looking at in order to do this real-time analysis?

I have no experience with these two products, so I'm hoping for some tips on whether these Google solutions could be an alternative for my requirements.

THX & BR bdriven

EDIT:

In response to Zig Mandel's comment ("too broad. You need to show the actual analysis you need to make. BigQuery will be much, much cheaper than homemade with NoSQL"):

Our goal is to develop a system that can generate warnings based on current log events (or a combination of different log events) and on how such events have affected other systems' behavior in the past.

Therefore we have to be able to run fast correlation analyses of current events against huge amounts of unstructured historical data.

I know that this requirement description is probably not the most specific one, but we're right at the beginning of this project. So my goal with this question is to gather some arguments for our next team meeting on whether we should take a closer look at Bigtable / BigQuery or not.


1 Answer

3 votes

One of my favorite features of BigQuery is its ability to run correlations.

Here's a tutorial on correlations with BigQuery that I wrote a couple of years ago: http://nbviewer.ipython.org/gist/fhoffa/6459195

For example, to rank and find the most correlated states in terms of flight delays:

SELECT a.departure_state, b.departure_state, CORR(a.avg, b.avg) corr, COUNT(*) c
FROM (
  SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
  FROM [bigquery-samples:airline_ontime_data.flights]
  GROUP BY 1, 2
  HAVING c > 5
) a
JOIN (
  SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
  FROM [bigquery-samples:airline_ontime_data.flights]
  GROUP BY 1, 2
  HAVING c > 5
) b
ON a.date = b.date
WHERE a.departure_state < b.departure_state
GROUP EACH BY 1, 2
HAVING c > 5
ORDER BY corr DESC;

Try it yourself in the next 5 minutes! Here's a quick getting-started tutorial: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/