【Posted】: 2016-11-20 00:57:45
【Question】:
I am playing with the Yelp dataset and want to filter the set of businesses by category.
I import the JSON file into R:
yelp_business = stream_in(file("yelp_academic_dataset_business.json"))
which produces the following data frame:
'data.frame': 77445 obs. of 15 variables:
$ business_id : chr "5UmKMjUEUNdYWqANhGckJw" "UsFtqoBl7naz8AVUBZMjQQ" "3eu6MEFlq2Dg7bQh8QbdOg" "cE27W9VPgO88Qxe4ol6y_g" ...
$ full_address : chr "4734 Lebanon Church Rd\nDravosburg, PA 15034" "202 McClure St\nDravosburg, PA 15034" "1 Ravine St\nDravosburg, PA 15034" "1530 Hamilton Rd\nBethel Park, PA 15234" ...
$ hours :'data.frame': 77445 obs. of 7 variables:
..$ Friday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Tuesday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Thursday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Wednesday:'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Monday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Sunday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr NA NA NA NA ...
.. ..$ open : chr NA NA NA NA ...
..$ Saturday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr NA NA NA NA ...
.. ..$ open : chr NA NA NA NA ...
$ open : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ categories :List of 77445
..$ : chr "Fast Food" "Restaurants"
..$ : chr "Nightlife"
..$ : chr "Auto Repair" "Automotive"
..$ : chr "Active Life" "Mini Golf" "Golf"
..$ : chr "Shopping" "Home Services" "Internet Service Providers" "Mobile Phones" ...
..$ : chr "Bars" "American (New)" "Nightlife" "Lounges" ...
..$ : chr "Active Life" "Trainers" "Fitness & Instruction"
..$ : chr "Bars" "American (Traditional)" "Nightlife" "Restaurants"
..$ : chr "Auto Repair" "Automotive" "Tires"
..$ : chr "Active Life" "Mini Golf"
..$ : chr "Home Services" "Contractors"
..$ : chr "Veterinarians" "Pets"
..$ : chr "Libraries" "Public Services & Government"
..$ : chr "Automotive" "Auto Parts & Supplies"
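I now want to filter the rows by business category, keeping every business whose category list contains a category that includes "food".
However, if I simply try this: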
input ="food"
engage = filter(yelp_business, grepl(input, categories))
I get the following error:
Error: data_frames can only contain 1d atomic vectors and lists
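I first suspected that the nested structure was part of the cause. However, tidyjson did not help either, because categories is a list, not a data frame inside the main data frame.
Does anyone know how to solve this? All I need is a list of the business IDs of all food businesses, which I will then use to filter the Yelp review JSON file and extract the written reviews.
Thank you very much for any help with this!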
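【Comments】:
- Try yelp_business$categories <- unlist(yelp_business$categories)
- Thanks Pierre, I tried that too, but the problem is that each row has a different number of category tags. Unlisting produces 227451 values instead of the required 77445 rows, so I get the following error: Error in `$<-.data.frame`(`*tmp*`, "categories", value = c("Fast Food", : replacement has 227451 rows, data has 77445
- That is not the problem. See grepl("a", list(c("a", "b"), "c")). The problem is the nested data frame above.
- Could you add dput(yelp_business[1:2, 1:5])?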
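As the comments suggest, the error most likely comes from the nested hours data frame rather than from the categories list column. Below is a minimal sketch of one possible fix, assuming the nested columns are first flattened with jsonlite::flatten() and the list column is then tested row by row. The toy data frame, the business IDs b1 to b3 and the has_food helper are hypothetical stand-ins for yelp_business. ignore.case = TRUE is an addition, so that "food" also matches "Fast Food".

```r
library(jsonlite)
library(dplyr)

# Hypothetical three-row stand-in for yelp_business: a nested data frame
# column (hours) plus a list column (categories) of varying length.
toy <- fromJSON('[
  {"business_id": "b1",
   "hours": {"Friday": {"open": "11:00", "close": "21:00"}},
   "categories": ["Fast Food", "Restaurants"]},
  {"business_id": "b2",
   "hours": {"Friday": {"open": "18:00", "close": "02:00"}},
   "categories": ["Nightlife"]},
  {"business_id": "b3",
   "hours": {"Friday": {"open": "08:00", "close": "20:00"}},
   "categories": ["Food", "Grocery"]}
]')

# Turn the nested data frame columns into ordinary columns.
# jsonlite:: is spelled out because other packages also export flatten().
flat <- jsonlite::flatten(toy)

# TRUE when at least one tag of a business matches the pattern
# (hypothetical helper, case-insensitive by assumption).
has_food <- function(tags, pattern) {
  any(grepl(pattern, tags, ignore.case = TRUE))
}

input <- "food"
keep <- vapply(flat$categories, has_food, logical(1), pattern = input)

# One logical per row, so the row count is preserved (unlike unlist()).
engage <- filter(flat, keep)
food_ids <- engage$business_id
print(food_ids)
```

On the real data the same two steps would apply to yelp_business instead of toy. The resulting food_ids vector could then be used to subset the review data, for example with business_id %in% food_ids.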
Tags: json r filter nested dplyr