【问题标题】:How to parse this data type in R?如何在 R 中解析这种数据类型?
【发布时间】:2019-10-06 08:13:31
【问题描述】:

我正在尝试在 R 中解析这种数据类型,以便在 Tableau 中获得可视化效果(使用 Tableau 进行了尝试,但它非常复杂并且没有给出好的结果)。

以下是我尝试解析的数据类型的示例:

{"010010208022":215,"010010205002":195,"010010208012":184,"010010208021":165,"010010206001":132,"010010205003":110,"010010211001":100,"010010208024":93,"010010201002":91,"010010205001":84,"010010210002":82,"010510310002":75,"010010203001":74}

基本上这只是一个条目。我有这种类型的三列。我正在尝试解析每个条目并总结数字(在每个冒号之后),以便为三个变量中的每一个获取一个新列,其中包含每个条目的数字总和。

这是为了在 Tableau 中使用它来使用这些总计获得可视化。

我拥有的数据是一个 Excel 表,其中包括三个变量,这些变量存储为 Excel 中的文本,具有上述数据类型。

R 中导入的数据集如下所示:

> dput(head(dataset))
structure(list(census_block_group = c(10010201001, 10010201002, 
10010202001, 10010202002, 10010203001, 10010203002), date_range_start = c(1538352000, 
1538352000, 1538352000, 1538352000, 1538352000, 1538352000), 
    date_range_end = c(1541030400, 1541030400, 1541030400, 1541030400, 
    1541030400, 1541030400), raw_visit_count = c(54544, 19583, 
    24552, 61089, 41139, 91053), raw_visitor_count = c(14460, 
    2351, 3020, 7123, 5132, 20907), visitor_home_cbgs = c("{\"010010208022\":215,\"010010205002\":195,\"010010208012\":184,\"010010208021\":165,\"010010206001\":132,\"010010205003\":110,\"010010211001\":100,\"010010208024\":93,\"010010201002\":91,\"010010205001\":84,\"010010210002\":82,\"010510310002\":75,\"010010203001\":74,\"010010208011\":73,\"010010204002\":62,\"010510309011\":61,\"010510309012\":61,\"010010207001\":58,\"010010209004\":58,\"010010207002\":55,\"010010209002\":54,\"010010208023\":53,\"011250112001\":53}", 
    "{\"010010208022\":143,\"010010208021\":140,\"010010205002\":116,\"010010206001\":87,\"010010208012\":83,\"010010205003\":56}", 
    "{\"010010208022\":193,\"010010205002\":170,\"010010208012\":112,\"010010206001\":96,\"010010203001\":84,\"010010205003\":82,\"010010208021\":74,\"010010208024\":71,\"010010205001\":67,\"010010201002\":65,\"010010204002\":55,\"010010208011\":55}", 
    "{\"010010205002\":326,\"010010208022\":287,\"010010208012\":184,\"010010206001\":170,\"010010208021\":162,\"010010208024\":145,\"010010205003\":141,\"010010205001\":125,\"010010203001\":121,\"010510310002\":113,\"010010204002\":99,\"010010211001\":96,\"010010207001\":95,\"010010201002\":94,\"010510309011\":94,\"010010208011\":88,\"010010209004\":79,\"010510309012\":78,\"010510310001\":72,\"010010207002\":70,\"010010210002\":69,\"010010208023\":67,\"010510301001\":58,\"010510313001\":57,\"010010202001\":56,\"010010209002\":56,\"010510311002\":52}", 
    "{\"010010208022\":247,\"010010205002\":235,\"010010208024\":135,\"010010208012\":121,\"010010205003\":116,\"010010206001\":111,\"010010208021\":99,\"010010209004\":91,\"010010204002\":86,\"010010205001\":83,\"010010201002\":62,\"010010207002\":61,\"010510310002\":60,\"010010209002\":58,\"010010207001\":56,\"010010208023\":56,\"010010208011\":55,\"010510309011\":54,\"010010204001\":52}", 
    "{\"010010205002\":391,\"010010208022\":328,\"010010208012\":181,\"010010208024\":174,\"010010206001\":169,\"010010205003\":161,\"010510310002\":160,\"010010208021\":158,\"010010203001\":152,\"010010205001\":138,\"010010204002\":130,\"010510309011\":129,\"010510309012\":128,\"010010207001\":113,\"010510310001\":105,\"010010209004\":99,\"010010201002\":96,\"010010211001\":93,\"010510313001\":89,\"010010207002\":85,\"010010208011\":85,\"010510301001\":81,\"010510311002\":78,\"010010209002\":76,\"010010210002\":71,\"010010208023\":70,\"010010202002\":63,\"010510307012\":60,\"010010204001\":59,\"010010209001\":56,\"010010202001\":54,\"010010204003\":54,\"010510301004\":53,\"010010206002\":53,\"010510309023\":51,\"010510301003\":50}"
    ), visitor_work_cbgs = c("{\"010010208022\":233,\"011250112001\":106,\"011010001001\":70,\"011010002001\":70,\"010510313001\":63,\"010010209004\":59,\"010010205002\":54}", 
    "{\"010010208022\":223}", "{\"010010208022\":258}", "{\"010010208022\":262,\"011010002001\":90,\"010510313001\":87,\"011010001001\":86,\"010010209004\":72,\"010010205002\":68,\"010510311001\":57,\"010010202001\":50}", 
    "{\"010010208022\":265,\"010010209004\":66,\"010510313001\":59,\"011010001001\":57,\"011010002001\":55,\"010010202001\":54}", 
    "{\"010010208022\":278,\"011010002001\":110,\"011010001001\":106,\"010510313001\":104,\"010510311001\":82,\"010010209004\":80,\"010010205002\":78,\"011010009002\":61,\"011010054072\":59,\"010010202001\":54,\"011010053021\":53,\"010010207002\":50}"
    ), distance_from_home = c(45094, 7346, 6977, 13528, 11597, 
    16162), related_same_day_brand = c("[\"Winn Dixie\",\"Chick-fil-A\",\"walmart\"]", 
    "[\"Winn Dixie\",\"walmart\",\"Chick-fil-A\"]", "[\"America's Thrift Store\",\"Winn Dixie\",\"Chick-fil-A\"]", 
    "[\"Winn Dixie\",\"walmart\",\"America's Thrift Store\",\"Chick-fil-A\",\"Dollar General\"]", 
    "[\"America's Thrift Store\",\"walmart\",\"Chick-fil-A\"]", 
    "[\"Winn Dixie\",\"walmart\",\"America's Thrift Store\"]"
    ), related_same_month_brand = c("[\"walmart\",\"Dollar General\",\"Chick-fil-A\",\"America's Thrift Store\",\"mcdonalds\",\"Zaxby's\",\"Jack's Family Restaurants\",\"Winn Dixie\",\"Waffle House\",\"Chevron\"]", 
    "[\"walmart\",\"America's Thrift Store\",\"Chick-fil-A\",\"Dollar General\",\"mcdonalds\",\"Winn Dixie\",\"Zaxby's\",\"Wendy's\",\"Sonic\",\"Taco Bell\"]", 
    "[\"walmart\",\"America's Thrift Store\",\"Dollar General\",\"Chick-fil-A\",\"mcdonalds\",\"Winn Dixie\",\"Zaxby's\",\"Sonic\",\"Wendy's\",\"Taco Bell\"]", 
    "[\"walmart\",\"Dollar General\",\"Chick-fil-A\",\"mcdonalds\",\"America's Thrift Store\",\"Winn Dixie\",\"Zaxby's\",\"Sonic\",\"Wendy's\",\"Publix Super Markets\"]", 
    "[\"walmart\",\"Dollar General\",\"mcdonalds\",\"America's Thrift Store\",\"Chick-fil-A\",\"Winn Dixie\",\"Zaxby's\",\"Wendy's\",\"Taco Bell\",\"Sonic\"]", 
    "[\"America's Thrift Store\",\"Dollar General\",\"walmart\",\"Winn Dixie\",\"Chick-fil-A\",\"Zaxby's\",\"Waffle House\",\"Jack's Family Restaurants\",\"Publix Super Markets\",\"Sonic\"]"
    ), top_brands = c("[]", "[]", "[]", "[\"Alabama DMV\",\"Fox's Pizza Den\",\"America's Thrift Store\",\"CrossFit\",\"Habitat for Humanity\"]", 
    "[]", "[\"Winn Dixie\",\"Planet Fitness\",\"America's Thrift Store\",\"Dollar General\",\"fred's\",\"Church's Chicken\",\"Advance Auto Parts\",\"Sunoco\",\"Boost Mobile\",\"Factory Connection\"]"
    ), popularity_by_hour = c("[1840,1666,1562,1541,1735,1977,3045,6630,3445,3681,3529,3894,4224,4292,4531,6463,5084,5378,4814,4162,3394,2918,2615,2074]", 
    "[2201,2137,2121,2108,2118,2175,2465,4458,1723,1664,1483,1471,1585,1586,1740,3185,2166,2566,2441,2339,2359,2319,2300,2255]", 
    "[1280,1233,1220,1199,1206,1288,1887,6141,2267,2196,2028,2059,2239,2337,2793,4935,2234,2685,2327,2052,1829,1616,1433,1340]", 
    "[1518,1326,1272,1199,1285,1539,2641,7010,4374,4681,4997,5268,5662,5346,5733,7468,6253,6882,5971,4754,3619,2757,2292,1799]", 
    "[3318,3261,3232,3212,3213,3309,4079,7363,3271,3115,3113,3187,3305,3445,3601,7069,4687,5433,5104,4775,4394,3905,3744,3429]", 
    "[2109,1738,1979,2393,2579,2953,4381,8526,5665,5653,5859,6539,7119,7046,6950,9672,8220,9003,7618,6015,4763,3713,3230,2629]"
    ), popularity_by_day = c("{\"Monday\":7542,\"Tuesday\":8610,\"Wednesday\":9274,\"Thursday\":7473,\"Friday\":8060,\"Saturday\":7222,\"Sunday\":6363}", 
    "{\"Monday\":3160,\"Tuesday\":3167,\"Wednesday\":3353,\"Thursday\":2721,\"Friday\":2807,\"Saturday\":2315,\"Sunday\":2060}", 
    "{\"Monday\":3688,\"Tuesday\":4192,\"Wednesday\":4221,\"Thursday\":3876,\"Friday\":3887,\"Saturday\":2846,\"Sunday\":1842}", 
    "{\"Monday\":9091,\"Tuesday\":9653,\"Wednesday\":10016,\"Thursday\":8397,\"Friday\":9017,\"Saturday\":7919,\"Sunday\":6996}", 
    "{\"Monday\":6126,\"Tuesday\":6681,\"Wednesday\":6811,\"Thursday\":5935,\"Friday\":6251,\"Saturday\":5275,\"Sunday\":4060}", 
    "{\"Monday\":13680,\"Tuesday\":13569,\"Wednesday\":14006,\"Thursday\":11815,\"Friday\":13480,\"Saturday\":13697,\"Sunday\":10806}"
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

将 excel 导入 R 后,变量被存储为字符。

然后我运行 toJSON 命令将它们转换为 JSON 类型。在 toJSON 命令之后运行 typeof 命令时,结果又是 character

我现在得到的错误与在列表 (res) 上运行的 sum 命令有关。我得到的错误是:

Error in sum(unlist(res_home_visitor)) : 
  invalid 'type' (character) of argument

unlist 命令的结果是:

visitor_home_cbgs1                                                                                                                                                                                                                                                                                "{\"010010208022\":215,\"010010205002\":195,\"010010208012\":184,\"010010208021\":165,\"010010206001\":132,\"010010205003\":110,\"010010211001\":100,\"010010208024\":93"   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                visitor_home_cbgs2    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "{\"010010208022\":143,\"010010208021\":140,\"010010205002\":116,\"010010206001\":87,\"010010208012\":83,\"010010205003\":56}"

现在我是编码新手,我已经查看了上述数据类型,但我不确定它是称为关联数组还是 JSON(因此是问题中的标签)。

我研究和尝试的是使用 Tableau 中的拆分功能。但是,它会产生问题,因为条目的大小不同并且很多条目都很大,因此拆分功能会创建很多列,这会导致 Tableau 出现滞后和很多麻烦(由于对列数等等..)

** 这是我想要获得的示例:

{"010479561003":52,"010479561002":52, "110478541030":33}

对于上述条目,总数应为:52+52+33= 137

这个总数 (137) 将被放入一个名为“totals”的新列中,这是可视化所需的。

【问题讨论】:

  • “在将 excel 导入 R 时” - 你是怎么做的,你的 R 会话中的对象是什么,它是带有列表列的 data.frame 吗?如果我们有这个对象的示例,它会更容易提供帮助。
  • 我基本上使用了 R 中的“导入数据集”,它基本上运行以下命令:library(readxl) dataset <- read_excel("D:/path/dataset.xlsx", sheet = "cbg")。运行typeof(dataset) 命令后,结果为list
  • 好的,接下来如果您将dput(head(dataset)) 的输出添加到您的问题中会有所帮助。这将提供我可以用来复制并粘贴到我自己的 R 会话中的代码。然后我可以使用与您完全相同的数据。
  • 更新了问题。
  • 您包含的dput 在我复制和粘贴时出错;有没有省略括号?

标签: r json parsing tableau-api associative-array


【解决方案1】:

您有一个data.frame,其中一些列是 JSON。我的理解是你想独立对待每一行。因此,您可以逐行sumfromJSON 的结果。

在这个例子中,我使用library(dplyr)

  1. rowwise() - 将代码应用于每一行
  2. mutate() - 创建一个新列,这是sum( unlist( fromJSON() ) ) 的结果

library(dplyr)
library(jsonlite)

res <- dataset %>%
  rowwise() %>%
  mutate(visitor_home_cbgs_sum = sum( unlist (jsonlite::fromJSON( visitor_home_cbgs ) ) ) ) %>%
  mutate(visitor_work_cbgs_sum = sum( unlist (jsonlite::fromJSON( visitor_work_cbgs ) ) ) ) 

现在我们可以通过仅选择原始 JSON 列和我们新创建的 sum 列来查看结果

res %>%
  select(visitor_home_cbgs, visitor_home_cbgs_sum, visitor_work_cbgs, visitor_work_cbgs_sum)

# Source: local data frame [6 x 4]
# Groups: <by row>
#   
#   # A tibble: 6 x 4
#   visitor_home_cbgs                     visitor_home_cbg… visitor_work_cbgs                    visitor_work_cbg…
# <chr>                                             <int> <chr>                                            <int>
# 1 "{\"010010208022\":215,\"01001020500…              2188 "{\"010010208022\":233,\"0112501120…               655
# 2 "{\"010010208022\":143,\"01001020802…               625 "{\"010010208022\":223}"                           223
# 3 "{\"010010208022\":193,\"01001020500…              1124 "{\"010010208022\":258}"                           258
# 4 "{\"010010205002\":326,\"01001020802…              3054 "{\"010010208022\":262,\"0110100020…               772
# 5 "{\"010010208022\":247,\"01001020500…              1838 "{\"010010208022\":265,\"0100102090…               556
# 6 "{\"010010205002\":391,\"01001020802…              4093 "{\"010010208022\":278,\"0110100020…              1115

在此示例中,您只有两列要更新,因此我将保留其中的两个 mutate() 语句以明确说明其工作原理。

如果您有许多/未知数量的包含 JSON 值的列,您可能需要考虑另一种方法。

【讨论】:

  • 非常感谢您的快速回复。我现在就试试这个,让你知道它是否有效。
  • 这些列在 Excel 中保存为文本。但是,在将 excel 导入 R 并检查 R 列的类型后,结果为列表类型。而且 fromJSON 的命令当然不接受。你觉得在 list 类型上尝试 toJSON 命令将其转换为 JSON 值得一试吗?
  • @R.A 我已经用你的新 dataset 对象更新了我的答案
  • 非常感谢您在这里的帮助。我希望这行得通。我实际上有 3 列与上面的数据类型相同。但是,如果这适用于那里可用的两列,那应该是完美的。
  • 是的,只需对第三列使用相同的代码,您就会被设置
猜你喜欢
  • 1970-01-01
  • 2021-07-05
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2019-02-01
  • 1970-01-01
  • 2020-07-28
相关资源
最近更新 更多