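【Question title】: Creating new columns in multiple dataframes in R
【Posted】: 2021-04-30 10:30:40
【Question】:

Following up on my previous question, I am working with a large number of data frames in R, each with a different number of columns. I want to harmonize these datasets so that they all have the same columns, with NA values in the newly added ones. I have written a loop, but I am not sure how to update the actual data frames.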

first_df   = data.frame(matrix(rnorm(20), nrow=10))
second_df  = data.frame(matrix(rnorm(20), nrow=4))
third_df   = data.frame(matrix(rnorm(20), nrow=5))

library(tidyverse)

min_max <- mget(ls(pattern = "_df")) %>%
  map_dbl(ncol) %>%
  enframe() %>%
  arrange(value) %>%
  slice(1, n())

min_max

# A tibble: 2 x 2
#  name      value
#  <chr>     <dbl>
#1 first_df      2
#2 second_df     5

diff <- setdiff(names(get(min_max$name[2])), names(get(min_max$name[1])))

for (col_name in diff) {

  # all dataframes whose names contain "_df"
  for (df_index in 1:length(ls(pattern = "_df"))) {

    # capturing the dataframe
    data = get(ls(pattern = "_df")[df_index])

    if (!(col_name %in% names(data))) {
      data[, col_name] <- NA
    }
    # I don't know how to update the real datasets
    # get(ls(pattern = "_df")[df_index]) <- data
  }
}

【Comments】:

    Tags: r list purrr


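    【Solution 1】:

    After a quick search, the solution is the assign() function.

    Here is your reprex with assign() added. I have also read that it is useful to collect the data frames into a list; then, I think, you can modify each one through its name or position in the list.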

    first_df   = data.frame(matrix(rnorm(20), nrow=10))
    second_df  = data.frame(matrix(rnorm(20), nrow=4))
    third_df   = data.frame(matrix(rnorm(20), nrow=5))
    
    library(tidyverse)
    
    min_max <- mget(ls(pattern = "_df")) %>%
      map_dbl(ncol) %>%
      enframe() %>%
      arrange(value) %>%
      slice(1, n())
    
    min_max
    
    diff <- setdiff(names(get(min_max$name[2])), names(get(min_max$name[1])))
    
    # loop over every column that the narrowest data frame lacks
    for (col_name in diff) {

      # all data frames whose names contain "_df"
      for (df_name in ls(pattern = "_df")) {

        # fetch a copy of the data frame by name
        data <- get(df_name)

        if (!(col_name %in% names(data))) {
          data[, col_name] <- NA
          # write the modified copy back under the original name
          assign(df_name, data)
        }
      }
    }
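
The list idea mentioned above can be sketched in base R as follows. This is only an illustration under assumptions: the list name `dfs` and the helper `pad_columns()` are made up for this example, and the column set is taken as the union of all column names rather than the difference between the widest and narrowest data frame.

```r
# Hypothetical example data: three data frames with 2, 5 and 4 columns
set.seed(1)
dfs <- list(
  first_df  = data.frame(matrix(rnorm(20), nrow = 10)),
  second_df = data.frame(matrix(rnorm(20), nrow = 4)),
  third_df  = data.frame(matrix(rnorm(20), nrow = 5))
)

# Union of all column names found anywhere in the list
all_cols <- unique(unlist(lapply(dfs, names)))

# Hypothetical helper: add each missing column as NA, then put the
# columns in a common order
pad_columns <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- NA
  }
  df[cols]
}

dfs <- lapply(dfs, pad_columns, cols = all_cols)

# Number of columns per data frame after padding
sapply(dfs, ncol)
```

Working on the list avoids get() and assign() entirely. If the separate objects are still needed afterwards, list2env(dfs, envir = globalenv()) writes each element back under its list name.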
    

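    【Comments】:

      【Solution 2】:

      Here is an alternative that avoids the loops; it uses dplyr::bind_rows() to stack the data frames into one that has the widest set of columns, filling in NA where needed.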

      first_df   = data.frame(matrix(rnorm(20), nrow=10))
      second_df  = data.frame(matrix(rnorm(20), nrow=4))
      third_df   = data.frame(matrix(rnorm(20), nrow=5))
      
      library(tidyverse)
      
      df_names <- ls(pattern = "_df")
      df_list <- mget(df_names)
      
      new_df_list <-
        df_list %>%
        bind_rows(.id = "id") %>%       # put together with biggest number of columns
        group_split(id) %>%             # break down to list 
        set_names(df_names) %>%
        map(~ dplyr::select(.x, -id))   # remove the id column 
      
      # save each df back to global environment
      list2env(new_df_list, globalenv())
      
      # check
      first_df
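
To see why the approach above pads the narrower data frames, here is a minimal sketch of how bind_rows() treats inputs with different columns. The toy data frames `a` and `b` are assumptions made up for this illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical toy inputs: `a` has one column, `b` has two
a <- data.frame(X1 = 1:2)
b <- data.frame(X1 = 3L, X2 = "z")

# Columns are matched by name; rows coming from `a` get NA in X2,
# and .id records which list element each row came from
stacked <- bind_rows(list(a = a, b = b), .id = "id")
stacked
```

Splitting the stacked result by `id` then yields one data frame per input, each carrying the full set of columns.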
      

      【Comments】:
