【Question title】:Random merging between 2 dataframes R
【Posted】:2021-07-22 23:16:20
【Question】:

My first data frame looks like this:

Department Municipality Location Lat. Long.
ANTIOQUIA MEDELLIN PALMITAS 6.343341 -75.69004
ANTIOQUIA MEDELLIN SANTA ELENA 6.209718 -75.50191
ANTIOQUIA MEDELLIN ALTAVISTA 6.223150 -75.62856

And the second data frame:

Department_Name Municipality_Name
ANTIOQUIA MEDELLIN
ANTIOQUIA MEDELLIN

I would like to merge the two data frames randomly, like this:

Department_Name Municipality_Name Location Lat Long.
ANTIOQUIA MEDELLIN SANTA ELENA 6.209718 -75.50191
ANTIOQUIA MEDELLIN PALMITAS 6.343341 -75.69004

Following this thread: "Join data frames and select random row when there are multiple matches", this is what I tried:

library(dplyr)

df2<-subset(df2, select=c(Department_Name, Municipality_Name, Location,Long., Lat.))

df2 <- df2 %>% rename(Department = Department_Name, Municipality=Municipality_Name)

df1[df2, on = .(Department, Municipality, Location,Long., Lat.),
   {ri <- sample(.N, 1L)
   .(Department = Department[ri], Municipality = Municipality[ri])}, by = .EACHI]

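Error in sample(.N, 1L): object '.N' not found

My programming background is not strong enough to understand the code provided in that thread, so it would be great if someone could help me fix this error!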

【Comments】:

    Tags: r dataframe merge


    【Solution 1】:

    One way using dplyr:

    library(dplyr)
    
    df2 %>%
      count(Department_Name, Municipality_Name) %>%
      left_join(df1, by = c('Department_Name' = 'Department', 
                            'Municipality_Name' = 'Municipality')) %>%
      group_by(Department_Name, Municipality_Name) %>%
      sample_n(first(n)) %>%
      ungroup
    

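    【Comments】:

    • It works for my first dataset, but when I try another dataset (124k observations) I get the following error message: "Erreur: size must be less than or equal to 1 (size of data), set replace = TRUE to use sampling with replacement." It appears after sample_n(first(n)). Do you know where it comes from?
    • I think this is because df2 has more rows than there are corresponding rows in df1. You can add replace = TRUE in sample_n.
    【Solution 2】: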
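
    To illustrate the comment above, here is a minimal sketch of Solution 1 with replace = TRUE added. The toy data is an assumption built from the question's sample rows: df2 is given 5 rows for a municipality that has only 3 locations in df1, which is the situation that triggers the error without replacement.

```r
library(dplyr)

# Toy data (assumption): 3 locations in df1, but 5 rows to fill in df2
df1 <- data.frame(
  Department   = "ANTIOQUIA",
  Municipality = "MEDELLIN",
  Location     = c("PALMITAS", "SANTA ELENA", "ALTAVISTA"),
  Lat.         = c(6.343341, 6.209718, 6.223150),
  Long.        = c(-75.69004, -75.50191, -75.62856)
)
df2 <- data.frame(
  Department_Name   = rep("ANTIOQUIA", 5),
  Municipality_Name = rep("MEDELLIN", 5)
)

set.seed(1)  # only to make the draw reproducible
merged <- df2 %>%
  count(Department_Name, Municipality_Name) %>%          # n = rows needed per pair
  left_join(df1, by = c("Department_Name" = "Department",
                        "Municipality_Name" = "Municipality")) %>%
  group_by(Department_Name, Municipality_Name) %>%
  sample_n(first(n), replace = TRUE) %>%                 # allow repeated locations
  ungroup()

nrow(merged)
```

    With replace = TRUE the same location can be drawn more than once, so this trades uniqueness for being able to fill every row of df2.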

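    The second data frame is a proper subset of the first: joining it adds no information that the first data frame does not already contain. So you can achieve your goal simply by sampling observations: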

    iris[sample(x = nrow(iris), size = 5, replace = FALSE), ]
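
    Applied to the question's data, the same idea might look like the sketch below. It assumes df2 contains a single Department/Municipality pair that matches every row of df1, and n_needed is a hypothetical stand-in for nrow(df2); with several municipalities, a grouped approach would be needed instead.

```r
# Toy copy of the question's first data frame
df1 <- data.frame(
  Department   = "ANTIOQUIA",
  Municipality = "MEDELLIN",
  Location     = c("PALMITAS", "SANTA ELENA", "ALTAVISTA"),
  Lat.         = c(6.343341, 6.209718, 6.223150),
  Long.        = c(-75.69004, -75.50191, -75.62856)
)

n_needed <- 2  # assumption: stands in for nrow(df2)

set.seed(123)  # only to make the draw reproducible
sampled <- df1[sample(x = nrow(df1), size = n_needed, replace = FALSE), ]

# Rename the key columns to match the desired output
names(sampled)[1:2] <- c("Department_Name", "Municipality_Name")
sampled
```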
    

    【Comments】:

      【Solution 3】:

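      If you are using dplyr, relying on sample_n to draw a random sample from a data frame and on left_join to do the merge may give you code that is easier to interpret.

      Here I provide slightly different example data frames: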

      library(dplyr)
      
      df_veredas <- #This is a sample dataframe with info for veredas
        data.frame(departamento = c("ANTIOQUIA", "ANTIOQUIA", "ANTIOQUIA", "CUNDINAMARCA", "CUNDINAMARCA", "CUNDINAMARCA"), 
                   municipio = c("MEDELLIN", "MEDELLIN", "MEDELLIN", "GUADUAS", "GUADUAS", "GUADUAS"), 
                   vereda = c("PALMITAS", "SANTA ELENA", "ALTAVISTA", "CEDRALES", "EL DIAMANTE", "CARRAPAL"),
                   lat = c(6.343341, 6.209718, 6.22315, 5.05369106131653, 5.03379856537084, 5.26723603834365), 
                   long = c(-75.69004, -75.50191, -74.6004510276649, -74.6904256, -74.5475269556119, -74.5892936214298))
      
      df_municipios <- # This is a sample data frame with info for municipalities
        data.frame(Department_Name = c("ANTIOQUIA", "CUNDINAMARCA", "ATLÁNTICO"), 
                   Municipality_Name = c("MEDELLIN", "GUADUAS", "BARRANQUILLA"),
                   DIVIPOLA = c("05001", "25320", "08001" ))
      
      # Below is where the sampling and merging happen.
      sample_n( # This is where the sampling occurs: 2 random observations from df_veredas
        tbl = df_veredas,
        size = 2,
        replace = FALSE) %>%
        left_join( # This is where the merge happens
          df_municipios, # Merges the sampled df_veredas with df_municipios
          by = c("departamento" = "Department_Name", # Joining by department, which is named differently in each table
                 "municipio"    = "Municipality_Name" # And also joining by municipality, which is also named differently in each table
      ))
      
        departamento municipio   vereda      lat      long DIVIPOLA
      1 CUNDINAMARCA   GUADUAS CEDRALES 5.053691 -74.69043    25320
      2 CUNDINAMARCA   GUADUAS CARRAPAL 5.267236 -74.58929    25320
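
      The example above samples 2 rows from the whole table, so both can land in the same municipality. If the goal is one random vereda per municipality, a possible variant is sketched below. It swaps in group_by plus slice_sample (available in dplyr 1.0.0 and later), which is not part of the original answer, and uses a shortened copy of the sample data.

```r
library(dplyr)

# Shortened copies of the sample data frames above
df_veredas <- data.frame(
  departamento = c("ANTIOQUIA", "ANTIOQUIA", "CUNDINAMARCA", "CUNDINAMARCA"),
  municipio    = c("MEDELLIN", "MEDELLIN", "GUADUAS", "GUADUAS"),
  vereda       = c("PALMITAS", "SANTA ELENA", "CEDRALES", "CARRAPAL")
)
df_municipios <- data.frame(
  Department_Name   = c("ANTIOQUIA", "CUNDINAMARCA"),
  Municipality_Name = c("MEDELLIN", "GUADUAS"),
  DIVIPOLA          = c("05001", "25320")
)

one_per_municipio <- df_veredas %>%
  group_by(departamento, municipio) %>%
  slice_sample(n = 1) %>%   # one random vereda within each municipality
  ungroup() %>%
  left_join(df_municipios,
            by = c("departamento" = "Department_Name",
                   "municipio"    = "Municipality_Name"))

one_per_municipio
```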
      

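      【Comments】:

        【Solution 4】:
        library(data.table)
        # .N and .EACHI only work on data.tables, so convert both data frames first
        # (df2 is assumed to already use the column names Department and Municipality)
        df1_bis <- data.table(df1)
        df2_bis <- data.table(df2)
        try <- data.frame(df1_bis[df2_bis, on = .(Department, Municipality),
                   {ri <- sample(.N, 1L)
                   .(Long. = Long.[ri], Lat. = Lat.[ri])}, by = .EACHI])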
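
        For context, .N and by = .EACHI are data.table features, which is why the question's attempt on plain data frames failed with "object '.N' not found". Below is a self-contained sketch of the same join on toy data taken from the question; it also returns Location, which is an addition to the code above, and the name res is hypothetical.

```r
library(data.table)

# Toy data from the question, built directly as data.tables
df1 <- data.table(
  Department   = "ANTIOQUIA",
  Municipality = "MEDELLIN",
  Location     = c("PALMITAS", "SANTA ELENA", "ALTAVISTA"),
  Lat.         = c(6.343341, 6.209718, 6.223150),
  Long.        = c(-75.69004, -75.50191, -75.62856)
)
df2 <- data.table(
  Department   = c("ANTIOQUIA", "ANTIOQUIA"),
  Municipality = c("MEDELLIN", "MEDELLIN")
)

set.seed(42)  # only to make the draw reproducible
# For each row of df2, pick one random row among its matches in df1
res <- df1[df2, on = .(Department, Municipality),
           {ri <- sample(.N, 1L)
            .(Location = Location[ri], Lat. = Lat.[ri], Long. = Long.[ri])},
           by = .EACHI]
res
```

        Each row of df2 is drawn independently here, so two rows for the same municipality can end up with the same location, unlike the without-replacement sampling in Solution 1.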
        

        【Comments】:
