1 vote

I am doing some analytics using Solr, specifically its faceting and pivot functionality, on a large set of log files. I have a large log file that I have indexed in Solr along the lines of:

        Keyword  Visits log_date_ISO
    1       red    1938   2013-01-01
    2      blue     435   2013-02-01
    3     green     318   2013-04-01
    4  red blue     279   2013-01-01

I then run a query and facet by 'log_date_ISO' to get, for each date, the count of keywords that contain the query term. Two questions:

(1) Is there a way to sum the visits per keyword for each date? What I really want is to sum the visits across all keywords that contain the query:

-> e.g. if I ran the query 'red' against the above, I would want the date 2013-01-01 to have a count of 1938 + 279 = 2217 (i.e. the sum of the visits associated with the keywords that contain the query 'red') rather than 2 (i.e. the count of the keywords containing the query).

(2) Is there a way to normalise by monthly query volume?

-> e.g. if the query volume for '2013-01-01' was 10,000, then the normalised volume for the query 'red' would be 2217/10000 = 0.2217

LAST RESORT: If these are not possible, I will pre-process the log file using pandas/python to group by date, then by keyword, and then normalise - but I was wondering whether this is possible in Solr.

Thanks in advance.

2 Answers

1 vote
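
Assuming the data from the question has been loaded into a DataFrame df, for example (a minimal sketch; the 1-based index is just to match the question's row numbering):

In [9]: import pandas as pd

In [10]: df = pd.DataFrame(
   ....:     {'Keyword': ['red', 'blue', 'green', 'red blue'],
   ....:      'Visits': [1938, 435, 318, 279],
   ....:      'log_date_ISO': ['2013-01-01', '2013-02-01',
   ....:                       '2013-04-01', '2013-01-01']},
   ....:     index=[1, 2, 3, 4])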

Here's one way (similar to Dan Allen's answer here): pop the 'Keyword' column, split each entry on whitespace, and stack the result so that every individual keyword gets its own row, keyed by the original row's index:

In [11]: keywords = df.pop('Keyword').apply(lambda x: pd.Series(x.split())).stack()

In [12]: keywords.index = keywords.index.droplevel(-1)

In [13]: keywords.name = 'Keyword'

In [14]: df1 = df.join(keywords)

In [15]: df1
Out[15]:
   Visits  log_date_ISO  Keyword
1    1938    2013-01-01      red
2     435    2013-02-01     blue
3     318    2013-04-01    green
4     279    2013-01-01      red
4     279    2013-01-01     blue

Then you can do the relevant groupby:

In [16]: df1.groupby(['log_date_ISO', 'Keyword']).sum()
Out[16]:
                        Visits
log_date_ISO  Keyword
2013-01-01    blue         279
              red         2217
2013-02-01    blue         435
2013-04-01    green        318
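
To answer (1) for a single query term, filter df1 before grouping; for 'red' this reproduces the 2217 from the question:

In [17]: df1[df1['Keyword'] == 'red'].groupby('log_date_ISO')['Visits'].sum()
Out[17]:
log_date_ISO
2013-01-01    2217
Name: Visits, dtype: int64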

To get the visits as a fraction of each date's total (avoiding double counting once multi-word keywords are split into several rows), I'd do a transform first:

df['VisitsPercentage'] = df.groupby('log_date_ISO')['Visits'].transform(lambda x: x / x.sum())

# follow the same steps as above

In [21]: df2 = df.join(keywords)

In [22]: df2
Out[22]:
   Visits  log_date_ISO  VisitsPercentage  Keyword
1    1938    2013-01-01          0.874154      red
2     435    2013-02-01          1.000000     blue
3     318    2013-04-01          1.000000    green
4     279    2013-01-01          0.125846      red
4     279    2013-01-01          0.125846     blue
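
Grouping the percentages then gives each keyword's share of its date's total visits (note this normalises by the visits present in the indexed data, not by an external monthly query volume):

In [23]: df2.groupby(['log_date_ISO', 'Keyword'])['VisitsPercentage'].sum()
Out[23]:
log_date_ISO  Keyword
2013-01-01    blue       0.125846
              red        1.000000
2013-02-01    blue       1.000000
2013-04-01    green      1.000000
Name: VisitsPercentage, dtype: float64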

0 votes

One can use Solr to group records by one field and sum another field within each group, using

(1) Facets/pivots (group the data by a specified field)

(2) StatsComponent (calculates statistics, including the sum, for a specified field)

The call I made was the following (the field names differ from those in the question: 'Keyword' is called 'q_string', 'Visits' is called 'q_visits', and 'log_date_ISO' is called 'q_date' below):

    http://localhost:8983/solr/select?q=neuron&stats=true&stats.field=q_visits&rows=1&indent=true&stats.facet=q_date

This provides basic statistics for the 'q_visits' field broken down by date, including the sum, which is the specific value I was interested in:

<double name="min">1.0</double>
<double name="max">435.0</double>
<long name="count">263</long>
<long name="missing">0</long>
<double name="sum">845.0</double>
<double name="sumOfSquares">192917.0</double>
<double name="mean">3.2129277566539924</double>
<double name="stddev">26.94368427501248</double>

The field over which the statistics are gathered is declared as type float in schema.xml (if it is declared as a string, the sum, stddev, and mean will not be shown).