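[Posted]: 2017-12-16 17:16:53
[Question]:
I have a dataset with 2 columns. Each cell of the "WEBDATA" column contains a list. This is my first time working with a dataset that contains lists, and I'm stuck...
My dataset looks like this: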
WORD | WEBDATA
Home | list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
Baby | list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
Dog | list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
Food | list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
When I inspect the contents of a single cell of the WEBDATA column, it returns:
> dataset$WEBDATA[[1]]
Domain
1 website1.com
2 mysuperwebsite.com
3 bestwebsite.uk
Url
1 https://www.website1.com/product2/
2 https://www.mysuperwebsite.com/productB/
3 https://www.bestwebsite.uk/product67/
To confirm that it is a list and to see what it looks like, I tried this:
class(dataset$WEBDATA)
[1] "list"
testdataset <- data.frame(dataset$WEBDATA[[2]])
Domain | Url
1 website1.com | https://www.website1.com/product2/
2 mysuperwebsite.com | https://www.mysuperwebsite.com/productB/
3 bestwebsite.uk | https://www.bestwebsite.uk/product67/
My goal is to split each WEBDATA list into rows.
The final dataset should look like this:
WORD | Number | Domain | Url
Home | 1 | website1.com | https://www.website1.com/product2/
Home | 2 | mysuperwebsite.com | https://www.mysuperwebsite.com/productB/
Home | 3 | bestwebsite.uk | https://www.bestwebsite.uk/product67/
Baby | 1 | websitezz.uk | https://www.websitezz.uk/page/
Baby | 2 | websiteabc.com | https://www.websiteabc.com/post/
Baby | 3 | thewebsite.com | https://www.thewebsite.com/post75/
I thought of the strsplit() function, but with lists I really don't know how to make it work. Can you help?
Here is a sample dataset that you can paste into R:
theDataReconstituted <- structure(list(
WORD = structure(c(8L, 7L, 6L, 10L, 9L), .Label = c("dog dood", "dog foo", "dog food uk", "dog foof", "dogfood", "burns dog food", "canagan dog food", "dog food", "skinners dog food", "wainwrights dog food" ), class = "factor"),
WEBDATA = list(
structure(list(
Domain = structure(c(1L, 2L, 2L), .Label = c("pet-supermarket.co.uk", "petsathome.com" ), class = "factor"),
Url = structure(c(3L, 1L, 2L), .Label = c("petsathome.com/shop/en/pets/dog/dog-food-and-treats", "petsathome.com/shop/en/pets/dog/dog-food-and-treats/dry-dog-food", "pet-supermarket.co.uk/Dog/Dog-Food-Treats/Dog-Food/c/PSGB00070" ), class = "factor")),
.Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)),
structure(list(
Domain = structure(c(1L, 1L, 1L), .Label = "canagan.co.uk", class = "factor"),
Url = structure(c(1L, 3L, 2L), .Label = c("canagan.co.uk/", "canagan.co.uk/products-cat.html", "canagan.co.uk/products.html" ), class = "factor")),
.Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)),
structure(list(
Domain = structure(c(1L, 1L, 2L), .Label = c("burnspet.co.uk", "petsathome.com"), class = "factor"),
Url = structure(1:3, .Label = c("burnspet.co.uk/", "burnspet.co.uk/burns-dog-food-products/", "petsathome.com/shop/en/pets/merch-groups/burns" ), class = "factor")),
.Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)),
structure(list(
Domain = structure(c(1L, 1L, 1L), .Label = "petsathome.com", class = "factor"),
Url = structure(c(2L, 3L, 1L), .Label = c("petsathome.com/shop/en/pets/merch-groups/feature/wainwrights-dog-food", "petsathome.com/shop/en/pets/merch-groups/mg-004", "petsathome.com/shop/en/pets/merch-groups/wainwrights-dog-" ), class = "factor")),
.Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)),
structure(list(
Domain = structure(c(1L, 1L, 1L), .Label = "skinnerspetfoods.co.uk", class = "factor"),
Url = structure(c(1L, 3L, 2L), .Label = c("skinnerspetfoods.co.uk/", "skinnerspetfoods.co.uk/our-range/", "skinnerspetfoods.co.uk/product-category/field-trial-range/" ), class = "factor")),
.Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)))),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame" ),
.Names = c("WORD", "WEBDATA"))
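[Comments]:
- Could you edit in the result of calling dput on a representative sample of the data? You have nested list columns, so otherwise nobody can reproduce the exact situation.
- How do you associate Home with Website1.com and so on? Those websites seem to belong to the second item, the one associated with Baby.
- Website1.com is in the list on the same row as Home. Thanks for noticing; the code above had an error, which I have now edited.
- @Remi As @alistaire requested, please include a subset of the dput(dataset) output (perhaps the subset shown in your post).
- Just library(tidyverse); theDataReconstituted %>% unnest() %>% group_by(WORD) %>% mutate(Number = row_number()) will do. You'll get some warnings about coercing factors to character, but that won't cause any problems.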
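For readers who want to try the approach from the last comment, here is a minimal sketch. It assumes tidyr 1.0.0 or later, where unnest() expects the list column to be named through `cols`. The `toy` tibble is a hypothetical stand-in built from the example values in the question, not the poster's real data, and the base R variant at the end is an added alternative rather than something suggested in the thread.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data: WEBDATA is a list column
# whose elements are small data frames.
toy <- tibble(
  WORD = c("Home", "Baby"),
  WEBDATA = list(
    data.frame(Domain = c("website1.com", "mysuperwebsite.com"),
               Url = c("https://www.website1.com/product2/",
                       "https://www.mysuperwebsite.com/productB/"),
               stringsAsFactors = FALSE),
    data.frame(Domain = "websitezz.uk",
               Url = "https://www.websitezz.uk/page/",
               stringsAsFactors = FALSE)
  )
)

# tidyverse route: expand each nested data frame into rows, then number
# the rows within each WORD.
result <- toy %>%
  unnest(cols = WEBDATA) %>%
  group_by(WORD) %>%
  mutate(Number = row_number()) %>%
  ungroup() %>%
  select(WORD, Number, Domain, Url)

print(result)

# Base R alternative: build one data frame per row of `toy`, then stack them.
pieces <- lapply(seq_along(toy$WEBDATA), function(i) {
  df <- toy$WEBDATA[[i]]
  data.frame(WORD = toy$WORD[i], Number = seq_len(nrow(df)), df,
             stringsAsFactors = FALSE)
})
base_result <- do.call(rbind, pieces)

print(base_result)
```

With the dput data from the question, the same pipeline would start from theDataReconstituted instead of `toy`. The Domain and Url columns there are factors with different levels in each nested data frame, which is presumably what triggers the coercion warnings mentioned in the comment.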