1 vote

I am doing some analytics using Solr, specifically its faceting and pivot functionality, on a large set of log files. I have a large log file that I have indexed in Solr along the lines of:

        Keyword  Visits log_date_ISO
    1       red    1938   2013-01-01
    2      blue     435   2013-02-01
    3     green     318   2013-04-01
    4  red blue     279   2013-01-01

I then run a query and facet by 'log_date_ISO' to get, for each date, the count of keywords that contain the query term. Two questions:

(1) Is there a way to sum the visits per keyword for each date? What I really want is to sum the visits across all keywords that contain the query:

-> e.g. if I ran the query 'red' against the above, I would want the date 2013-01-01 to have a count of 1938 + 279 = 2217 (i.e. the sum of the visits associated with the keywords that contain the query 'red') rather than 2 (i.e. the count of the keywords containing the query).

(2) Is there a way to normalise by monthly query volume?

-> e.g. if the query volume for '2013-01-01' was 10,000, then the normalised volume for the query 'red' would be 2217/10000 = 0.2217

LAST RESORT: If these are not possible, I will pre-process the log file using pandas/python to group by date, then by keyword, and then normalise - but I was wondering whether this is possible in Solr.

Thanks in advance.

2 Answers

1 vote
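
Assuming the data from the question has been loaded into a DataFrame df, for example (a minimal sketch; the 1-based index is just to match the question's row numbering):

In [9]: import pandas as pd

In [10]: df = pd.DataFrame(
   ....:     {'Keyword': ['red', 'blue', 'green', 'red blue'],
   ....:      'Visits': [1938, 435, 318, 279],
   ....:      'log_date_ISO': ['2013-01-01', '2013-02-01',
   ....:                       '2013-04-01', '2013-01-01']},
   ....:     index=[1, 2, 3, 4])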

Here's one way (similar to Dan Allen's answer here): pop the 'Keyword' column, split each entry on whitespace, and stack the result so that every individual keyword gets its own row, keyed by the original row's index:

In [11]: keywords = df.pop('Keyword').apply(lambda x: pd.Series(x.split())).stack()

In [12]: keywords.index = keywords.index.droplevel(-1)

In [13]: keywords.name = 'Keyword'

In [14]: df1 = df.join(keywords)

In [15]: df1
Out[15]:
   Visits  log_date_ISO  Keyword
1    1938    2013-01-01      red
2     435    2013-02-01     blue
3     318    2013-04-01    green
4     279    2013-01-01      red
4     279    2013-01-01     blue

Then you can do the relevant groupby:

In [16]: df1.groupby(['log_date_ISO', 'Keyword']).sum()
Out[16]:
                        Visits
log_date_ISO  Keyword
2013-01-01    blue         279
              red         2217
2013-02-01    blue         435
2013-04-01    green        318
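
To answer (1) for a single query term, filter df1 before grouping; for 'red' this reproduces the 2217 from the question:

In [17]: df1[df1['Keyword'] == 'red'].groupby('log_date_ISO')['Visits'].sum()
Out[17]:
log_date_ISO
2013-01-01    2217
Name: Visits, dtype: int64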

To get the visits as a fraction of each date's total (avoiding double counting once multi-word keywords are split into several rows), I'd do a transform first:

df['VisitsPercentage'] = df.groupby('log_date_ISO')['Visits'].transform(lambda x: x / x.sum())

# follow the same steps as above

In [21]: df2 = df.join(keywords)

In [22]: df2
Out[22]:
   Visits  log_date_ISO  VisitsPercentage  Keyword
1    1938    2013-01-01          0.874154      red
2     435    2013-02-01          1.000000     blue
3     318    2013-04-01          1.000000    green
4     279    2013-01-01          0.125846      red
4     279    2013-01-01          0.125846     blue
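
Grouping the percentages then gives each keyword's share of its date's total visits (note this normalises by the visits present in the indexed data, not by an external monthly query volume):

In [23]: df2.groupby(['log_date_ISO', 'Keyword'])['VisitsPercentage'].sum()
Out[23]:
log_date_ISO  Keyword
2013-01-01    blue       0.125846
              red        1.000000
2013-02-01    blue       1.000000
2013-04-01    green      1.000000
Name: VisitsPercentage, dtype: float64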

0 votes

One can use Solr to group records by one field and sum another field within each group, using

(1) Facets/pivots (group the data by a specified field)

(2) StatsComponent (calculates statistics, including the sum, for a specified field)

The call I made was the following (the field names differ from those in the question: 'Keyword' is called 'q_string', 'Visits' is called 'q_visits', and 'log_date_ISO' is called 'q_date' below):

    http://localhost:8983/solr/select?q=neuron&stats=true&stats.field=q_visits&rows=1&indent=true&stats.facet=q_date

This provides basic statistics for the 'q_visits' field broken down by date, including the sum, which is the specific value I was interested in:

<double name="min">1.0</double>
<double name="max">435.0</double>
<long name="count">263</long>
<long name="missing">0</long>
<double name="sum">845.0</double>
<double name="sumOfSquares">192917.0</double>
<double name="mean">3.2129277566539924</double>
<double name="stddev">26.94368427501248</double>

The field over which the statistics are gathered is declared as type float in schema.xml (if it is declared as a string, the sum, stddev, and mean will not be shown).