0
votes

I'm visualizing a dataset that has, for instance, a categorical field and a temporal field. I want to create a strip chart that shows the temporal distribution of the different categories sorted in 'ascending'/'descending' order depending on their cardinality. This can simply be achieved with altair:

import pandas as pd
import altair as alt

data = {0:{'Name':'Mary', 'Sport':'Tennis', 'competition':'2018/06/01'},
    1:{'Name':'Cal', 'Sport':'Tennis','competition':'2018/06/05'},
    2:{'Name':'John', 'Sport':'Tennis','competition':'2018/05/28'},
    3:{'Name':'Jane', 'Sport':'Tennis','competition':'2018/05/20'},
    4:{'Name':'Bob', 'Sport':'Golf','competition':'2018/03/01'},
    5:{'Name':'Jerry', 'Sport':'Golf','competition':'2018/03/03'},
    6:{'Name':'Gustavo', 'Sport':'Golf','competition':'2018/02/28'},
    7:{'Name':'Walter', 'Sport':'Swimming','competition':'2018/01/01'},
    8:{'Name':'Jessy', 'Sport':'Swimming','competition':'2018/01/03'},
    9:{'Name':'Patric', 'Sport':'Running','competition':'2018/02/01'},
    10:{'Name':'John', 'Sport':'Shooting','competition':'2018/04/01'}}

df = pd.DataFrame(data).T

alt.Chart(df).mark_tick().encode(
    x='yearmonthdate(competition):T',
    y=alt.Y('Sport:N',
        sort=alt.SortField(field='count(Sport:N)', order='ascending', op='sum')
    ),
)

Strip plot

Now suppose I'm interested only in the first three most numerous categories. Following the accepted solution for "Altair/Vega-Lite bar chart: filter top K bars from aggregated field", this time the plot does't show up:

alt.Chart(df).mark_tick().encode(
    x='yearmonthdate(competition):T',
    y=alt.Y('Sport:N',
        sort=alt.SortField(field='count', order='ascending', op='sum')
    ),
).transform_aggregate(
    count='count()',
    groupby=['Sport']
).transform_window(
    window=[{'op': 'rank', 'as': 'rank'}],
    sort=[{'field': 'count', 'order': 'descending'}]
).transform_filter('datum.rank <= 3')

Filtered Strip plot

Notice that even the y-labels order isn't as expected.

1

1 Answers

0
votes

Reading (and understanding) the documentation more in depth, I think I can state that what I asked is currently (June 2018) unfeasible with altair/Vega-Lite. Here it is my explanation...

Performing an aggregate transform on the data, is equivalent of adding a GROUP BY clause on a SQL query so we are no more able to associate to an encoded channel any “original” data field in its “unaggregated” form: when I try to refer to competition in the x channel this is therefore undefined.

I could try to "selfjoin" using the lookup transform but, even in this case, the final result isn’t what I was looking for because this is equivalent to a left join so I get just one value for each aggregated class.

alt.Chart(df).mark_tick().encode(
    x=alt.X(field='competition',type='temporal', timeUnit='yearmonthdate'),
    y=alt.Y('Sport:N',
        sort=alt.SortField(field='count', order='ascending', op='sum')
    ),
).transform_aggregate(
    countX='count()',
    groupby=['Sport']
).transform_window(
    window=[{'op': 'rank', 'as': 'rank'}],
    sort=[{'field': 'countX', 'order': 'descending'}]
).transform_filter('datum.rank <= 3').transform_lookup(
    lookup='Sport',
    from_=alt.LookupData(data=df, key='Sport',
                         fields=['competition'])
)

Plot after join

I discovered that what is necessary to achieve the result I want, is currently supported in Vega but not in Vega-Lite nor Altair: it is the JoinAggregate transform that “extends” original data with the result of one or more aggregations.

For the following input data:

[
  {"foo": 1, "bar": 1},
  {"foo": 1, "bar": 2},
  {"foo": null, "bar": 3}
]

The join aggregate transform:

{
  "type": "joinaggregate",
  "fields": ["foo", "bar", "bar"],
  "ops": ["valid", "sum", "median"],
  "as": ["v", "s", "m"]
}

produces the output:

[
  {"foo": 1, "bar": 1, "v": 2, "s": 6, "m": 2},
  {"foo": 1, "bar": 2, "v": 2, "s": 6, "m": 2},
  {"foo": null, "bar": 3, "v": 2, "s": 6, "m": 2}
]