【Question Title】:Using spread with duplicate identifiers for rows giving error
【Posted】:2024-07-02 16:10:02
【Question Description】:

My data looks like this:

df <- read.table(header = TRUE, text =
        "GeneID    Gene_Name   Species    Paralogues    Domains   Functional_Diversity
         1234      DDR1        hsapiens   14            2         8.597482
         5678      CSNK1E      celegans   70            4         8.154788
         9104      FGF1        ggalus     3             0         5.455874
         4575      FGF1        hsapiens   4             6         6.745845")

I need it to look like:

   Gene_Name    hsapiens    celegans    ggalus
   DDR1         8.597482    NA          NA
   CSNK1E       NA          8.154788    NA
   FGF1         6.745845    NA          5.455874

I tried using:

library(tidyverse)
df %>% 
    select(Gene_Name, Species, Functional_Diversity) %>% 
    spread(Species, Functional_Diversity)

My actual data has 130,000 rows (roughly 14,000 unique gene names) covering 9 species.

When I apply this approach to my actual data, I get:

Error: Duplicate identifiers for rows (16691, 19988), (20938, 21033), (1232, 21150), (2763, 21465), (1911, 20844), (17274, 17657, 18293, 18652, 18726, 19006, 19025), (496, 22555), (17227, 17608, 18211, 18605, 18676, 18967, 19002), (13569, 21807), (10261, 21014, 21607), (20816, 21553), (2244, 22025), (6194, 21910), (12217, 21555), (2936, 21078), (16484, 20911), (12216, 21851), (9289, 21791), (10340, 21752), (1714, 22077), (13216, 22618), (6076, 22371), (14731, 21717), (160, 22472), (11553, 22635), (17183, 17583, 18510, 18608, 18661, 18896, 19108), (138, 20028), (17185, 17584, 18330, 18415, 18500, 18981, 19063), (9726, 22440), (17238, 17617, 18905, 18960, 18996, 19134), (1638, 21645), (4631, 20821), (9162, 22463), (319, 20900), (13600, 22227), (9312, 20011), (14825, 21711, 21764), (3381, 21134), (505, 21133), (5954, 20013), (5948, 21313), (17233, 17612, 18187, 18311, 18411, 18708, 18980), (16953, 20902, 21845), (20710, 22477), (20519, 20973), (10204, 21197, 21213), (2933, 20707), (4302,

【Question Comments】:

Tags: r dataframe tidyr spread biomart


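【Solution 1】:

To see only the rows that have "duplicate identifiers", that is, rows sharing the same Gene_Name and Species combination, you can use...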

df %>% 
  group_by(Gene_Name, Species) %>% 
  mutate(n = n()) %>% 
  filter(n > 1)
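
As a self-contained sketch of that check, the toy data frame below uses hypothetical values (not the asker's data) with one Gene_Name/Species pair repeated on purpose. It shows how the filter isolates exactly the rows that spread cannot fit into a single cell. Here `add_count()` stands in as a shortcut for `group_by()` followed by `mutate(n = n())`; the names `toy` and `dupes` are illustrative only.

```r
library(dplyr)

# Hypothetical toy data: the FGF1/hsapiens pair appears twice on purpose.
toy <- data.frame(
  Gene_Name = c("DDR1", "FGF1", "FGF1", "FGF1"),
  Species = c("hsapiens", "hsapiens", "hsapiens", "celegans"),
  Functional_Diversity = c(8.6, 6.7, 7.1, 5.5)
)

dupes <- toy %>%
  add_count(Gene_Name, Species) %>%  # adds column n = size of each Gene_Name/Species group
  filter(n > 1)                      # keep only combinations that occur more than once

print(dupes)
```

Only the two FGF1/hsapiens rows should remain, since every other combination occurs once.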

To make sure spread works even when some rows have duplicate identifiers, you can add a row-number column, which guarantees that every row is unique...

df %>% 
  select(Gene_Name, Species, Functional_Diversity) %>% 
  mutate(row = row_number()) %>% 
  spread(Species, Functional_Diversity)
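
A self-contained sketch of what this workaround produces, again on hypothetical toy data. Because `row` becomes part of the identifier, the result keeps one output row per input row rather than one row per gene. If one row per gene is what you need, the duplicates have to be combined in some way. The second half shows one option, `tidyr::pivot_wider()` with `values_fn = mean`. This is a different technique from the answer above, it assumes that averaging duplicate values is acceptable for your data, and passing a single function to `values_fn` assumes tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the FGF1/hsapiens pair appears twice on purpose.
toy <- data.frame(
  Gene_Name = c("DDR1", "FGF1", "FGF1", "FGF1"),
  Species = c("hsapiens", "hsapiens", "hsapiens", "celegans"),
  Functional_Diversity = c(8.6, 6.7, 7.1, 5.5)
)

# Workaround from the answer: a unique row id lets spread() run,
# but each input row stays on its own output row.
by_row <- toy %>%
  mutate(row = row_number()) %>%
  spread(Species, Functional_Diversity)

# Alternative (assumption: averaging duplicates is acceptable):
# collapse duplicates while widening, giving one row per gene.
by_gene <- toy %>%
  pivot_wider(
    names_from = Species,
    values_from = Functional_Diversity,
    values_fn = mean
  )

print(by_row)
print(by_gene)
```

Whether to average, take the maximum, or keep the duplicates separate depends on what the repeated Gene_Name/Species entries mean in the source data, so it is worth inspecting them first.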

【Comments】: