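[Posted]: 2019-05-27 14:03:40
[Question]:
I am trying to clean my data so that every row directly below a row containing "gamecentre-playbyplay-event" is labelled as the goal, every row directly below a "goal" row is labelled as the primary assist, and every row directly below a "primary assist" row is labelled as the secondary assist.
The data looks like this: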
mydata
# A tibble: 15 x 1
value
<chr>
1 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby"
2 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
3 "<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
4 "<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
5 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"
6 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
7 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
8 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"
9 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
10 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
11 "<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
12 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"
13 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
14 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
15 "<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
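There are a few catches here.
- I need to set conditions so that the rows are labelled correctly.
- If there is no "secondary assist" row, that value is labelled NA.
- If there is no "primary assist" row, that value is also labelled NA.
I am trying to use dplyr::lag() for this, but it gets confusing when there is no primary or secondary assist and I want NAs instead.
Here is the basis of what I have so far: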
library(dplyr)
library(stringr)
# keep only the rows directly below an event row (the goal rows)
goals <- mydata %>%
  filter(dplyr::lag(str_detect(value, "gamecentre-playbyplay-event team-border"), 1))
goals
# A tibble: 4 x 1
value
<chr>
1 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re
2 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re
3 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re
4 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re
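The same lag() idea can probably be stretched to label every row rather than only the goals. The following is a minimal sketch, not a confirmed solution: `toy` is a hypothetical, shortened stand-in for `mydata` with the same row pattern, and `is_event` and `role` are names made up for illustration. It relies on `case_when()` testing its conditions in order, so a more recent event row always wins over an older one, and on `NA` conditions (from the first rows of a lagged vector) being treated as non-matches.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data: shortened strings, same event/link pattern as mydata
toy <- tibble(value = c(
  '<div class="gamecentre-playbyplay-event">',
  '<a href="/players/14695">', '<a href="/players/16639">', '<a href="/players/17027">',
  '<div class="gamecentre-playbyplay-event">',
  '<a href="/players/17453">', '<a href="/players/14639">',
  '<div class="gamecentre-playbyplay-event">',
  '<a href="/players/18061">', '<a href="/players/14752">', '<a href="/players/17522">'
))

labelled <- toy %>%
  mutate(
    is_event = str_detect(value, "gamecentre-playbyplay-event"),
    # Conditions are checked top to bottom; the first TRUE one decides the label
    role = case_when(
      is_event               ~ "event",
      dplyr::lag(is_event, 1) ~ "goal",
      dplyr::lag(is_event, 2) ~ "primary_assist",
      dplyr::lag(is_event, 3) ~ "secondary_assist"
    )
  )

print(labelled)
```

In this sketch the second event has only two rows below it, so it gets a "goal" and a "primary_assist" label and no "secondary_assist" row at all; the missing value only becomes an explicit NA once the data is reshaped to one row per goal.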
This is what I would like my data to look like at the end of all this. I think dplyr::lag() is the way to go, but I'm not sure.
# A tibble: 4 x 3
goal primary_assist secondary_assist
<chr> <chr> <chr>
1 "<a href=\"/players/14695\" class=\"gam~ "<a href=\"/players/16639\" class=\"gamecent~ "<a href=\"/players/17027\" class=\"gamecentr~
2 "<a href=\"/players/17453\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ NA
3 "<a href=\"/players/18061\" class=\"gam~ "<a href=\"/players/14752\" class=\"gamecent~ "<a href=\"/players/17522\" class=\"gamecentr~
4 "<a href=\"/players/14752\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ "<a href=\"/players/14757\" class=\"gamecentr~
Any ideas?
Input:
mydata <- structure(list(value = c("<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby",
"<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby",
"<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby",
"<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby",
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
)), .Names = "value", class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -15L))
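For reference, one way the wide layout described above might be reached without lag() is to number the events with `cumsum()` and then pick the first, second and third row of each group by position. This is only a sketch under assumptions: `toy` is a hypothetical shortened stand-in for the input, and `goal_id` and `wide` are illustrative names. It leans on the base R behaviour that indexing a character vector past its end (for example `value[3]` on a two-element vector) returns `NA`.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data: shortened strings, same event/link pattern as the input
toy <- tibble(value = c(
  '<div class="gamecentre-playbyplay-event">',
  '<a href="/players/14695">', '<a href="/players/16639">', '<a href="/players/17027">',
  '<div class="gamecentre-playbyplay-event">',
  '<a href="/players/17453">', '<a href="/players/14639">',
  '<div class="gamecentre-playbyplay-event">',
  '<a href="/players/18061">', '<a href="/players/14752">', '<a href="/players/17522">'
))

wide <- toy %>%
  # Each event row starts a new group; the rows below it share its id
  mutate(goal_id = cumsum(str_detect(value, "gamecentre-playbyplay-event"))) %>%
  # Drop the event rows themselves, keeping only the player links
  filter(!str_detect(value, "gamecentre-playbyplay-event")) %>%
  group_by(goal_id) %>%
  # Out-of-range positions yield NA, which covers missing assists
  summarise(
    goal = value[1],
    primary_assist = value[2],
    secondary_assist = value[3]
  )

print(wide)
```

Swapping `toy` for `mydata` should give one row per event with NA in `secondary_assist` for the second goal, assuming every event row is followed by at least a goal row and by no more than three link rows.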
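[Comments]:
- What if you have multiple secondary assists?
- That is not possible. The example I gave covers the edge cases well. Each goal has at most 1 goal scorer/primary assist/secondary assist, but there may be no primary assist or no secondary assist.
Tags: r dplyr data-cleaning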