What's the tidy way of having tibble columns of class tibble (instead of class list or data.frame)?
It's clearly possible to have columns of class data.frame in tibbles (see
example below), but none of the "tidy ways of data manipulation" (i.e.
dplyr::mutate() or purrr::map*_df()) seem to work for me when trying to cast the columns to tibble instead of data.frame
Current ouput of jsonlite::fromJSON()
# 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Desired result
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Why having data.frame columns can be very misleading
https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/
Related
- Recursively ensuring tibbles instead of data frames when parsing/manipulating nested JSON
- Ensure that data frames become tibbles when reading MongoDB data with {mongolite}
Example
Example data
library(magrittr)
json <- '[
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 1,
"z": true
},
{
"x": "B",
"y": 2,
"z": false
}
]
}
},
"schema": "0.0.1"
},
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 10,
"z": false
},
{
"x": "B",
"y": 20,
"z": true
}
]
}
},
"schema": "0.0.1"
}
]'
When visualizing this, you'll see that there's a subtle but important distinction between objects (which map to data.frames) and array (which map to lists):
Parsing JSON and converting to tibble
x <- json %>%
jsonlite::fromJSON() %>%
tibble::as_tibble()
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
So it's clearly possible to have columns that are of class data.frame.
Casting data.frame to tibble columns: "the bad way"
But I'd like tibbles instead of data frames, so let's try the only thing I got to work: explicit re-assigning the respective list levels, or data frame/tibble columns, to be more precise:
# Make a copy so we don't mess with the initial state of `x`
y <- x
y$levelOne <- y$levelOne %>%
tibble::as_tibble()
y$levelOne$levelTwo <- y$levelOne$levelTwo %>%
tibble::as_tibble()
y$levelOne$levelTwo$levelThree <- y$levelOne$levelTwo$levelThree %>%
purrr::map(tibble::as_tibble)
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
That works, but is not in line with "tidy data manipulation pipes".
Casting data.frame to tibble columns: "the better way" (trying and failing)
# Yet another copy so we can compare:
z <- x
# Just to check that this works
z$levelOne %>%
tibble::as_tibble()
# # A tibble: 2 x 1
# levelTwo$levelThree
# <list>
# 1 <df[,3] [2 × 3]>
# 2 <df[,3] [2 × 3]>
# Trying to get this to work with `dplzr::mutate()` fails:
z %>%
dplyr::mutate(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
dplyr::transmute(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
# Same goes for `{purrr}`:
z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map_df(tibble::as_tibble)
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
tibble::add_column(levelOne = z$levelOne %>% tibble::as_tibble())
# Error: Can't add duplicate columns with `add_column()`:
# * Column `levelOne` already exists in `.data`.
# Works, but not what I want:
z %>%
tibble::add_column(test = z$levelOne %>% tibble::as_tibble()) %>%
str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 4 variables:
# [...]
# $ test :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
The only thing that worked (is not what we want)
Wrapping tibble::as_tibble() by purrr::map() seems to work, but the result is clearly not what we want as we duplicate everything below levelOne (compare to desired output above)
# Works, but not what I want:
z_new <- z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map(tibble::as_tibble)
)
z_new %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
EDIT (follow-up investigation)
Got it to work with Hendrik's help!
Still, IMO this topic raises some interesting follow-up questions regarding
whether or not one should - or even could - do it any other way if the primary
goal is to end up with tidy nested tibbles that play nicely with
tidyr::unnset() and tidyr::nest() (see comments in Hendrik's answer below).
As to the proposed approach in https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/: I might be overlooking something obvious, but I think it only works for JSON docs with a single document.
First, let's modify df_to_tibble() (see Hendrik's answer below) to only turn
"leaf" data frames into tibbles while turning "branch" data frames into lists:
leaf_df_to_tibble <- function(x) {
if (is.data.frame(x)) {
if (!any(purrr::map_lgl(x, is.list))) {
# Only captures "leaf" DFs:
tibble::as_tibble(x)
} else {
as.list(x)
}
} else {
x
}
}
This would give us results that are in line with the proposed way in the blog post, but only for "single object" JSON docs as illustrated below
df <- json %>% jsonlite::fromJSON()
# Only take the first object from the parsed JSON:
df_subset <- df[1, ]
Transforming df_subset:
df_subset_tibble <- purrr::reduce(
0:purrr::vec_depth(df_subset),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df_subset
) %>%
tibble::as_tibble()
df_subset_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 3 variables:
# $ labels :List of 1
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 1
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 1
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# $ schema : chr "0.0.1"
Transforming df:
df_tibble <- purrr::reduce(
0:purrr::vec_depth(df),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df
) %>%
tibble::as_tibble()
df_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
As we see, "listifying" nested JSON structures actually may results in copying
the "leafs". It just doesn't jump at you as long as n = 1 (number of JSON
docs), but strikes you as soon as n > 1.

jsonlite::fromJSON(), but it unfortunately maps todata.frameinstead oftibblecolumns) - Rappsterdata.framecolumns can be very misleading: hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble - Rappster