How to join two data frames that have a common column?

【Question title】: How to join two data frames that have a common column?
【Posted】: 2015-08-20 08:34:12
【Question description】:

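I want to add a new column based on the combination of values in two other columns.

For example, suppose I have a data frame like the one below: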

library(dplyr)
library(minpack.lm)
library(broom)
No  =  c(replicate(1,rep(letters[1:6],each=10)))
ACME <- as.character(rep(rep(c(78,110),each=10),times=3))
ARGON <- as.character(rep(rep(c(256,320,384),each=20),times=1))
V <- rep(c(seq(2,40,length.out=5),seq(-2,-40,length.out=5)),times=1)
DQ0 = c(replicate(2, sort(runif(10,0.001,1))))
direc <- rep(rep(c("North","South"),each=5),times=6)

df <- data.frame(No,ACME,ARGON,V,DQ0,direc)


>df
    No ACME ARGON     V        DQ0 direc
1    a   78   256   2.0 0.07532351 North
2    a   78   256  11.5 0.13785481 North
3    a   78   256  21.0 0.27397961 North
4    a   78   256  30.5 0.44296243 North
5    a   78   256  40.0 0.45721902 North
6    a   78   256  -2.0 0.68077463 North
7    a   78   256 -11.5 0.68764283 North
8    a   78   256 -21.0 0.76284209 North
9    a   78   256 -30.5 0.81040056 North
10   a   78   256 -40.0 0.95336230 North
11   b  110   256   2.0 0.04190305 South
12   b  110   256  11.5 0.17484353 South
13   b  110   256  21.0 0.22409319 South
----------------

I use the nlsLM function from the minpack.lm package to fit this df.

-> fitting part

nls_fit=nlsLM(DQ0~ifelse(df$direc=="North"&V<J1, exp((-t_pw)/f0*exp(-del1*(1-V/J1)^2)),1)*ifelse(df$direc=="South"&V>J2, exp((-t_pw)/f0*exp(-del2*(1-V/J2)^2)),1)
            ,data=df,start=c(del1=1,J1=15,del2=1,J2=-15),trace=T)

After fitting, I want to create a new data frame df_new with a new column named adress.

df_new <- df %>%
  group_by(No) %>%
  do(data.frame(model = tidy(nls_fit))) %>% # tidy the fit result; this yields the "model.term" and "model.estimate" columns, which are renamed in the next step
  select_("delta" = "model.term", "value" = "model.estimate") %>%
  filter(delta %in% c("del1", "del2")) %>% # keep only some of the fitted parameters
  mutate(adress = interaction(ACME, ARGON)) %>% # this part is not working
  ungroup

I get this error message:

Error: incompatible size (%d), expecting %d (group size) or 1

Without the mutate step, I end up with this output:

df_new

    No delta    value
1   a  del1 1.479056
2   a  del2 1.016404
3   b  del1 1.479056
4   b  del2 1.016404
5   c  del1 1.479056
6   c  del2 1.016404
7   d  del1 1.479056
8   d  del2 1.016404
9   e  del1 1.479056
10  e  del2 1.016404
11  f  del1 1.479056
12  f  del2 1.016404

I would like to get something like this:

    No delta  value    adress
1   a  del1 1.479056   78.256
2   a  del2 1.016404   78.256
3   b  del1 1.479056  110.256
4   b  del2 1.016404  110.256
5   c  del1 1.479056   78.320
6   c  del2 1.016404   78.320
7   d  del1 1.479056  110.320
8   d  del2 1.016404  110.320
9   e  del1 1.479056   78.384
10  e  del2 1.141958   78.384
11  f  del1 1.019201  110.384
12  f  del2 1.141958  110.384

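【Comments on the question】:

Where does nls_fit come from? Please include the packages you use.
@Jaap Do you want me to add the fitting part? nls_fit comes from the minpack.lm package. I fitted some columns of df and left them out here because they are not relevant to the question. I put the output df_new here.
@Jaap OK, I have added the relevant packages.
It is best to post a reproducible example. Code that cannot be reproduced will not help you get an answer, so it would be great if you also included the nls_fit object.
OK, I see. But in that case you can leave quite a lot of information out of the question, because it is not needed. What you are actually asking is how to join two data frames. See my answer for a solution.

Tags: r dataframe dplyr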

【Solution 1】:

What you really want is a join between df_new and df. You can use, for example, data.table:

library(data.table) #v1.9.5+
setDT(df_new)[df, adr:=adress, on="No"]

If you want to use the latest version on CRAN, you can do this instead:

setDT(df_new, key="No")[setDT(df, key="No"), adr:=adress]

Both give the following result:

> dt_new
    No delta    value     adr
 1:  a  del1 1.479056  78.256
 2:  a  del2 1.016404  78.256
 3:  b  del1 1.479056 110.256
 4:  b  del2 1.016404 110.256
 5:  c  del1 1.479056  78.320
 6:  c  del2 1.016404  78.320
 7:  d  del1 1.479056 110.320
 8:  d  del2 1.016404 110.320
 9:  e  del1 1.479056  78.384
10:  e  del2 1.016404  78.384
11:  f  del1 1.479056 110.384
12:  f  del2 1.016404 110.384

The dplyr approach:

df_new2 <- df %>% select(No, adress) %>% group_by(No) %>% 
  summarise(adr = unique(adress)) %>% 
  left_join(df_new, ., by="No")

This gives the same result:

> identical(df_new2, setDF(df_new))
[1] TRUE

Note: I used the development version of data.table.
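
The join described above can be tried on toy data. The following is only a minimal sketch, not the answerer's code: the small df and df_new are stand-ins with made-up contents, the name lookup is hypothetical, and paste() is swapped in for interaction() so that adress is a plain character column. It assumes dplyr is installed.

```r
library(dplyr)

# Toy stand-ins for the question's data (illustrative values only)
df <- data.frame(
  No    = rep(letters[1:6], each = 10),
  ACME  = as.character(rep(rep(c(78, 110), each = 10), times = 3)),
  ARGON = as.character(rep(c(256, 320, 384), each = 20)),
  stringsAsFactors = FALSE
)
df_new <- data.frame(
  No    = rep(letters[1:6], each = 2),
  delta = rep(c("del1", "del2"), times = 6),
  stringsAsFactors = FALSE
)

# Build a lookup table with one row per group: No -> adress
# (hypothetical helper name; paste() replaces interaction() here)
lookup <- df %>%
  mutate(adress = paste(ACME, ARGON, sep = ".")) %>%
  distinct(No, adress)

# Join on the common column; every row of df_new is kept
result <- left_join(df_new, lookup, by = "No")
```

Reducing df to one row per No before joining keeps the result at the same number of rows as df_new; joining against the full df would repeat each row once per matching row in df.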

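【Comments】:

Thank you very much. Could we also do this inside df_new with mutate? Also, why do I get the same fit results in all groups, even though they are replicated in the reproducible example?
I mean that del1 and del2 should be different.
@aoronbarlow I added a dplyr approach. I'm not sure what you mean by "del1 and del2 should be different". They are the same in the resulting data frame / data table because the join is only on No. Joining on delta as well is not possible, because that variable is not part of df.
Thanks for adding the dplyr approach as well, and sorry for the misunderstanding. What I meant is that the fit results from the tidy function give the same del1 and del2 values for every group a:f. If you have an idea, you can comment on this question: link (stackoverflow.com/questions/32107224/…)
Now I realize that dt_new <- setDT(df_new)[df, adr:=adress, on="No"] does not work. It says: Error in `[.data.table`(setDT(df_new), df, `:=`(adr, adress), on = "No") : unused argument (on = "No")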