
I am trying to build some Vega-Lite visualizations of my data set of data sets. Its fields are record_id, subject, and tag. record_id uniquely identifies a data set, but each data set can have multiple subjects and multiple tags, so there is one row per combination of subject and tag for each data set. I want a bar chart showing, for each tag, how many data sets carry that tag. There are hundreds of tags, too many for one bar chart, so I want to limit it to the top K tags, i.e. the ones that show up most often.

I tried to follow the "Top-K Plot With Others in Vega-Lite" example, which plots the top K directors by aggregate worldwide gross. But maybe there is a simpler way when I'm selecting the top K by the same measure I'm plotting? I am also open to different ways of showing the same relationship.

VegaLite({
      data: {values: data},
      title: "Top k Tags",
      mark: {type: "bar", tooltip: null},
      transform: [
        { aggregate: { 
           op: "distinct", 
           field: "record_id", 
           as: "tag_count"}, 
          groupby: ["tag"]},    // aggregate on "tag" field and count within the groups         
        { window: [
          { op: "row_number", 
            as: "tag_rank"}], 
          sort: [{ 
            field: ["tag_count"], 
            order: "descending" }]},
        { filter: `datum.tag_rank < 21`}     
      ],
      encoding: {
        x: {
          aggregate: "distinct",
          field: "record_id", 
          type: "quantitative", 
          axis: {title: "Data Sets with this Tag"}
        },
        y: {
          field: "tag",
          type: "nominal",
          sort: { op: "distinct", field: "record_id", order: "descending" }
        }
      }
    })

I expect to see a horizontal bar chart with 20 bars whose values range from about 100 up to 1632. (I know from doing the same analysis in pandas that tag counts are between 1 and 1632.)

I see the right number of bars, but the x-axis goes from 0 to 1.0 and every bar extends to 1.0.


1 Answer


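Instead of using {field: "record_id", aggregate: "distinct"} for the x encoding, use the value you already computed in the transform, {field: "tag_count"}, and your chart will work as expected. After the aggregate transform there is only one row per tag and record_id is no longer a field in the data, which is why re-aggregating it in the encoding gives 1 for every bar. The y sort refers to record_id as well, so it should be based on tag_count too.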
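For reference, here is a minimal sketch of the spec with that change applied. The sample rows are hypothetical stand-ins for the real data. Beyond the x encoding, this sketch also assumes a reasonably recent Vega-Lite: it writes aggregate as an array, gives the window sort field as a plain string, and sorts y with the "-x" shorthand. Treat it as one way to write this, not the only one.

```javascript
// Hypothetical sample rows: one row per (record_id, subject, tag) combination.
const data = [
  {record_id: 1, subject: "a", tag: "climate"},
  {record_id: 1, subject: "b", tag: "climate"},
  {record_id: 2, subject: "a", tag: "climate"},
  {record_id: 2, subject: "a", tag: "health"},
  {record_id: 3, subject: "c", tag: "transit"}
];

const spec = {
  data: {values: data},
  title: "Top k Tags",
  mark: {type: "bar", tooltip: null},
  transform: [
    // One row per tag, carrying the number of distinct data sets with that tag.
    {
      aggregate: [{op: "distinct", field: "record_id", as: "tag_count"}],
      groupby: ["tag"]
    },
    // Rank tags from most to least used.
    {
      window: [{op: "row_number", as: "tag_rank"}],
      sort: [{field: "tag_count", order: "descending"}]
    },
    // Keep the 20 highest-ranked tags.
    {filter: "datum.tag_rank < 21"}
  ],
  encoding: {
    // Plot the precomputed count directly; no aggregate in the encoding.
    x: {
      field: "tag_count",
      type: "quantitative",
      axis: {title: "Data Sets with this Tag"}
    },
    // "-x" orders the bars by descending x value.
    y: {field: "tag", type: "nominal", sort: "-x"}
  }
};

// In the notebook from the question, this would be rendered with VegaLite(spec).
```

Because the ranking and the bar length both come from tag_count, there is no need for the joinaggregate step used in the "with Others" example.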

Edit: here's what your chart looks like with the data you provided in the comment, using this approach: vega editor link
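If you want to check the bar lengths against numbers computed outside Vega-Lite (you mention doing this in pandas), the same distinct-count-then-top-K logic can be sketched in plain JavaScript. The function name topKTags and the sample rows are made up for illustration.

```javascript
// Count distinct record_ids per tag and return the k most-used tags.
function topKTags(rows, k) {
  const idsByTag = new Map();
  for (const {record_id, tag} of rows) {
    if (!idsByTag.has(tag)) idsByTag.set(tag, new Set());
    idsByTag.get(tag).add(record_id); // a Set ignores repeated record_ids
  }
  return [...idsByTag]
    .map(([tag, ids]) => ({tag, tag_count: ids.size}))
    .sort((a, b) => b.tag_count - a.tag_count)
    .slice(0, k);
}

// Hypothetical rows: record 1 appears twice under "climate" (two subjects)
// but is counted only once for that tag.
const sampleRows = [
  {record_id: 1, subject: "a", tag: "climate"},
  {record_id: 1, subject: "b", tag: "climate"},
  {record_id: 2, subject: "a", tag: "climate"},
  {record_id: 2, subject: "a", tag: "health"},
  {record_id: 3, subject: "c", tag: "transit"}
];

const top = topKTags(sampleRows, 2);
console.log(top);
```

The tag_count values it returns should match the x values of the corresponding bars.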
