I am trying to build some Vega-Lite visualizations from my data set of data sets. The fields are: record_id, subject, tag. record_id is a unique identifier for a data set, but each data set can have multiple subjects and multiple tags, so there is one row per combination of subject and tag for each data set. I want a bar chart showing, for each tag, how many data sets carry that tag. There are hundreds of tags, far too many for one bar chart, so I want to limit it to the top K tags, ranked by how many data sets they appear on.
I tried to follow the "Top-K Plot With Others in Vega-Lite" example, which plots the top K directors by aggregate worldwide gross. But maybe there's a simpler way to do this when I'm selecting the top K by the same measure I'm plotting? I am also open to different ways of showing the same relationship.
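To make the intended result concrete, here is a minimal sketch in plain JavaScript of the computation described above: count distinct record_id values per tag, then keep the K largest. The sample rows and the function name `topKTags` are hypothetical and only illustrate the data shape; this is not the Vega-Lite spec itself.

```javascript
// Hypothetical sample rows: one row per (record_id, subject, tag) combination.
const sampleRows = [
  { record_id: 1, subject: "health", tag: "survey" },
  { record_id: 1, subject: "economics", tag: "survey" },
  { record_id: 1, subject: "health", tag: "census" },
  { record_id: 2, subject: "health", tag: "survey" },
  { record_id: 3, subject: "transport", tag: "gps" },
];

// Count distinct record_id values per tag and return the k most common tags.
function topKTags(rows, k) {
  const idsByTag = new Map();
  for (const { record_id, tag } of rows) {
    if (!idsByTag.has(tag)) idsByTag.set(tag, new Set());
    // A Set ignores repeats, so a data set with several subjects counts once.
    idsByTag.get(tag).add(record_id);
  }
  return [...idsByTag]
    .map(([tag, ids]) => ({ tag, tag_count: ids.size }))
    .sort((a, b) => b.tag_count - a.tag_count)
    .slice(0, k);
}

console.log(topKTags(sampleRows, 2));
```

With these sample rows, "survey" is counted for two data sets even though it appears on three rows, which is the distinction between row counts and distinct record_id counts that the chart needs to respect.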
VegaLite({
  data: {values: data},
  title: "Top k Tags",
  mark: {type: "bar", tooltip: null},
  transform: [
    { aggregate: {
        op: "distinct",
        field: "record_id",
        as: "tag_count"},
      groupby: ["tag"]}, // aggregate on "tag" field and count within the groups
    { window: [
        { op: "row_number",
          as: "tag_rank"}],
      sort: [{
        field: ["tag_count"],
        order: "descending" }]},
    { filter: `datum.tag_rank < 21`}
  ],
  encoding: {
    x: {
      aggregate: "distinct",
      field: "record_id",
      type: "quantitative",
      axis: {title: "Data Sets with this Tag"}
    },
    y: {
      field: "tag",
      type: "nominal",
      sort: { op: "distinct", field: "record_id", order: "descending" }
    }
  }
})
I expect to see a horizontal bar chart with 20 bars, with values between 100 and 1632 (I know from doing the same analysis in pandas that tag counts range from 1 to 1632).
I see the right number of bars, but the x-axis goes from 0 to 1.0 and every bar extends to 1.0.
