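【Question title】: Reformatting a table in R
【Posted】: 2020-05-02 10:13:48
【Question description】:

I am extracting information from a table on a website. The table comes out as a single column, with each day name followed by its times (see below).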

1. Saturday
2. 4:00 PM
3. 5:30 PM
4. Sunday
5. 8:30 AM
6. 10:00 AM

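I really need it to come out like this instead (see below), with the day repeated next to each time. I don't think I can get html_table() to do the conversion itself, but I'm hoping someone knows how to reformat it in R.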

1. Saturday    4:00 PM
2. Saturday    5:30 PM
3. Sunday      8:30 AM
4. Sunday      10:00 AM

Here is the code I'm using:

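library(rvest)   # read_html(), html_table(); also re-exports %>%
library(tidyr)   # unnest()
library(tibble)  # tibble()

urls <- 'https://www.life.church/edmond/'

# Read the tables on a page and attach the source URL as column x
times <- function(x){ 
  try( x %>%
         read_html() %>%
         html_table(header = FALSE) %>%
         data.frame(x))
}

# Apply function to the urls
m <- lapply(urls, times)

# Convert to a dataframe 
data <- data.frame(unnest(tibble(m), cols = m))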

【Question discussion】:

    Tags: r web-scraping html-table reformatting


    【Solution 1】:

    This is what I would do:

    library(dplyr)
    library(xml2)
    library(rvest)
    library(tidyr)
    library(purrr)
    
    times <- function(x){ 
      try(
        x %>%
          read_html() %>%
          html_table(header = FALSE) %>% 
          flatten() %>% 
          as_tibble()
      )
    }
    
    urls <- c('https://www.life.church/edmond/', 'https://www.life.church/fortworth/')
    
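    lapply(urls, times) %>% 
      set_names(urls) %>%                 # name each result by its URL
      bind_rows(.id = "URL") %>%          # stack the tables, keeping the URL as a column
      # Day names split into "" (Time) and the name (Day); time rows keep Day = NA
      separate(X1, into = c("Time", "Day"), sep = "(?=^\\D)") %>% 
      fill(Day) %>%                       # carry each day name down to its time rows
      filter(Time != "") %>%              # drop the day-header rows
      select(URL, Day, Time)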
    
    # A tibble: 16 x 3
       URL                                Day       Time    
       <chr>                              <chr>     <chr>   
     1 https://www.life.church/edmond/    Saturday  4:00 PM 
     2 https://www.life.church/edmond/    Saturday  5:30 PM 
     3 https://www.life.church/edmond/    Sunday    8:30 AM 
     4 https://www.life.church/edmond/    Sunday    10:00 AM
     5 https://www.life.church/edmond/    Sunday    11:30 AM
     6 https://www.life.church/edmond/    Sunday    1:00 PM 
     7 https://www.life.church/edmond/    Sunday    4:00 PM 
     8 https://www.life.church/edmond/    Sunday    5:30 PM 
     9 https://www.life.church/edmond/    Wednesday 7:00 PM 
    10 https://www.life.church/fortworth/ Saturday  4:00 PM 
    11 https://www.life.church/fortworth/ Saturday  5:30 PM 
    12 https://www.life.church/fortworth/ Sunday    8:30 AM 
    13 https://www.life.church/fortworth/ Sunday    10:00 AM
    14 https://www.life.church/fortworth/ Sunday    11:30 AM
    15 https://www.life.church/fortworth/ Sunday    1:00 PM 
    16 https://www.life.church/fortworth/ Wednesday 7:00 PM
    

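    separate() uses a lookahead regular expression to move the entries that start with a non-digit (the day names) into the new column Day, leaving an empty Time for those rows. Entries that start with a digit stay in Time and get NA for Day, which fill() then replaces with the day name above them.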
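
To make the reshaping logic easier to follow without hitting the website, here is a minimal sketch on a small in-memory table. It is not the answer's exact method: instead of the lookahead in separate(), it flags the day rows with grepl() and then uses the same fill() and filter() idea. The `raw` tibble is a hypothetical stand-in for the scraped column X1, built from the six values shown in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the scraped single-column table (column X1),
# using the six values shown in the question.
raw <- tibble(X1 = c("Saturday", "4:00 PM", "5:30 PM",
                     "Sunday", "8:30 AM", "10:00 AM"))

schedule <- raw %>%
  # A row that starts with a non-digit is a day header; time rows get NA.
  mutate(Day = if_else(grepl("^\\D", X1), X1, NA_character_)) %>%
  # Carry each day name down onto the time rows beneath it.
  fill(Day) %>%
  # Drop the header rows themselves, leaving only the time rows.
  filter(X1 != Day) %>%
  select(Day, Time = X1)

print(schedule)
```

This should give one row per time with its day alongside, matching the layout requested in the question. The approach assumes every day name starts with a letter and every time starts with a digit, as in the sample data.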

    【Discussion】:

    • You're awesome! Thank you!