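[Posted]: 2021-08-06 10:13:02
[Question]:
I'm practicing scraping and data cleaning, and I have a table that I scraped from Wikipedia. I'm trying to mutate the table to add a column that strips the commas from an existing column and returns the values as numbers. All I get is a column of NAs.
Here is my output: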
> library(dplyr)
> library(rvest)
>
> pg <- read_html("https://en.wikipedia.org/wiki/Rugby_World_Cup")
> rugby <- pg %>% html_table(., fill = T)
>
> rugby_table <- rugby[[3]]
>
> rugby_table
# A tibble: 9 x 8
Year `Host(s)` `Total attendance` Matches `Avg attendance` `% change in avg att.` `Stadium capacity` `Attendance as % o~
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60%
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79%
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77%
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83%
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83%
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92%
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85%
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95%
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90%
>
> rugby_table2 <- rugby %>%
+ .[[3]] %>%
+ tbl_df %>%
+ mutate(Attendance=as.numeric(gsub("[^0-9.-]+","",'Total attendance')))
>
> rugby_table2
# A tibble: 9 x 9
Year `Host(s)` `Total attendance` Matches `Avg attendance` `% change in avg~ `Stadium capaci~ `Attendance as~ Attendance
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60% NA
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79% NA
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77% NA
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83% NA
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83% NA
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92% NA
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85% NA
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95% NA
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90% NA
Any ideas?
[Discussion]:
Tags: r data-cleaning rvest gsub dplyr
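A likely cause, judging from the transcript alone: inside `mutate()`, `'Total attendance'` in single quotes is a string literal rather than a reference to the column. `gsub()` then strips every character from the text "Total attendance" itself, leaving an empty string, and `as.numeric("")` is `NA`, which is recycled down the whole column. The sketch below assumes that diagnosis and uses a small hand-built tibble (two values copied from the output above) instead of the live Wikipedia page; the names `literal_result` and `rugby_table2` are only for illustration.

```r
library(dplyr)

# Stand-in for the scraped table; values copied from the printed output
rugby_table <- tibble(
  Year = c(1987L, 1991L),
  `Total attendance` = c("604,500", "1,007,760")
)

# Single quotes: gsub() cleans the string "Total attendance" itself,
# which contains no digits, so the result is "" and then NA
literal_result <- as.numeric(gsub("[^0-9.-]+", "", 'Total attendance'))

# Backticks: the name refers to the column, so each value is cleaned
rugby_table2 <- rugby_table %>%
  mutate(Attendance = as.numeric(gsub("[^0-9.-]+", "", `Total attendance`)))

rugby_table2
```

With backticks, the same `gsub()` pattern from the question should apply unchanged to the real scraped table.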