【Question title】: Data wrangling into long format in R
【Posted】: 2021-11-03 14:45:38
【Question】:

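I have a source dataset from a Nature article. I would like to know how to extract the values in rows 4 and 12 into long format, together with their associated group (i.e. Inefficient/Efficient).

This is the code I use to import the data into R.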


# load the required libraries 
library(ggsignif) 
library(readxl) 
library(svglite) 
library(tidyverse) 
library(tidyr) 
library(dplyr) 

# The paper from which the figure is taken is Tasdogan et al. (2020)
# Metabolic heterogeneity confers differences in melanoma metastatic potential 

# The figure is 2b and can be accessed at 
# https://www.nature.com/articles/s41586-019-1847-2#MOESM3 

# The link to the raw data used in the article is given below and directly imported for plotting 

url <-'https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1847-2/MediaObjects/41586_2019_1847_MOESM3_ESM.xlsx' 

#create a dataframe from the Excel data 
temp <- tempfile() 

download.file(url, temp, mode='wb') 

myData <- read_excel(path = temp) 

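I don't know how to insert an image of the dataset, but running the code above should display it. I need columns 2-31 for Efficient and columns 2 to 37 for Inefficient.

I hope this is enough information for people to understand what I mean.

【Comments】: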

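  • Hi Jago, two questions: 1. Are your column specifications correct? The ranges overlap. 2. By "long data format", do you mean a data frame with two columns, Efficient and Inefficient?
  • @AdriaanNeringBögel Sorry, I realise I wrote columns when I meant rows. I'd like the data in 2 columns. The first column should be titled Group and the second Value. Below the header there should be 66 rows (i.e. 2-67): the first 30 labelled Efficient and the last 36 labelled Inefficient. The second column should hold the corresponding values taken from rows 4 and 12 of the original data frame (i.e. myData). I hope this helps lol.
  • The order of the labels in the Group column doesn't really matter, as long as the corresponding values are correct. Cheers.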

Tags: r data-wrangling


【Solution 1】:

This data really isn't well structured for generic reading like this, but I'll give it a try:

### myData <- read_excel(...)
Data_wide<- myData[c(2:4,10:12), c(2:37)] 
tmp <- as.data.frame(t(Data_wide))
head(tmp)
#             V1 V2                    V3          V4 V5                  V6
# ...2 Efficient #1   0.47699999999999998 Inefficient #1 0.48499999999999999
# ...3 Efficient #2                 0.376 Inefficient #2 0.47399999999999998
# ...4 Efficient #3                 0.496 Inefficient #3 0.48799999999999999
# ...5 Efficient #4   0.32500000000000001 Inefficient #4 0.45600000000000002
# ...6 Efficient #5 8.8999999999999996E-2 Inefficient #5 0.53100000000000003
# ...7 Efficient #6 4.5999999999999999E-2 Inefficient #6               0.318
tmp <- rbind(tmp[,1:3], setNames(tmp[,4:6], names(tmp)[1:3]))
head(tmp)
#             V1 V2                    V3
# ...2 Efficient #1   0.47699999999999998
# ...3 Efficient #2                 0.376
# ...4 Efficient #3                 0.496
# ...5 Efficient #4   0.32500000000000001
# ...6 Efficient #5 8.8999999999999996E-2
# ...7 Efficient #6 4.5999999999999999E-2
tmp <- tmp[complete.cases(tmp),]
tmp$V3 <- as.numeric(tmp$V3)
rownames(tmp) <- NULL
head(tmp,3); tail(tmp,3)
#          V1 V2    V3
# 1 Efficient #1 0.477
# 2 Efficient #2 0.376
# 3 Efficient #3 0.496
#             V1  V2     V3
# 64 Inefficient #34 0.2451
# 65 Inefficient #35 0.2450
# 66 Inefficient #36 0.2529

With this structure, you can subset (drop V2, though I wonder why you feel it's unimportant) and rename (colnames(tmp) <- c(...)).

【Comments】:
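
The subset-and-rename step could look like the minimal sketch below. It assumes `tmp` has the three columns V1, V2, V3 shown in the printed output. The toy `tmp` here is only a stand-in for the real data, and the names Group and Value are taken from the asker's comment rather than from this answer.

```r
# Toy stand-in for the `tmp` built above (values copied from the printed output)
tmp <- data.frame(
  V1 = c("Efficient", "Efficient", "Inefficient"),
  V2 = c("#1", "#2", "#36"),
  V3 = c(0.477, 0.376, 0.2529)
)

# Drop the mouse-ID column (V2) and rename the two remaining columns
long <- tmp[, c("V1", "V3")]
colnames(long) <- c("Group", "Value")
long
```

Selecting columns by name rather than by position keeps the step valid even if the column order changes upstream.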

    【Solution 2】:

    While it may not be pretty, I believe this is a solution that uses only the readxl and tidyverse packages:

    # Select first set of rows with group and value
    set1 <- 
      myData %>% 
      filter(row_number() %in% c(2, 4))
    
    # Select second set of rows with group and value
    set2 <- 
      myData %>% 
      filter(row_number() %in% c(10, 12))
    
    # Join both sets of data, so that all group labels are in one row and all values are in one row.
    left_join(set1, set2, by = "Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion") %>% 
      # pivot the table to a long format with group labels and value labels in separate columns
      pivot_longer(cols = !`Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion`) %>% 
      # pivot wider so that the group labels and the values end up in separate columns
      pivot_wider(names_from = `Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion`, values_from = value) %>% 
      # Remove old column names/numbers
      select(-name)
    
    # A tibble: 72 x 2
       Group       `Glucose m+6`      
       <chr>       <chr>              
     1 Inefficient 0.48499999999999999
     2 Inefficient 0.47399999999999998
     3 Inefficient 0.48799999999999999
     4 Inefficient 0.45600000000000002
     5 Inefficient 0.53100000000000003
     6 Inefficient 0.318              
     7 Inefficient 0.26600000000000001
     8 Inefficient 0.30399999999999999
     9 Inefficient 0.309              
    10 Inefficient 0.33               
    # ... with 62 more rows
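
The tibble printed above has 72 rows of character values, whereas the asker expected 66 numeric values. A possible clean-up step is sketched below. It assumes the extra rows are NA padding from the shorter Efficient row, which the printed output does not confirm. It also assumes the pipeline result is stored in a hypothetical object `res`, replaced here by a toy tibble.

```r
library(dplyr)

# Hypothetical stand-in for the pipeline result above (not the real data)
res <- tibble(
  Group = c("Efficient", "Inefficient", NA),
  `Glucose m+6` = c("0.47699999999999998", "0.48499999999999999", NA)
)

clean <- res %>%
  # drop rows where either the group label or the value is missing
  filter(!is.na(Group), !is.na(`Glucose m+6`)) %>%
  # keep Group and store the value as a number under the name Value
  transmute(Group, Value = as.numeric(`Glucose m+6`))

clean
```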
    
    

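    【Comments】:

      【Solution 3】:

      A neat way to solve this is with the tidyxl and unpivotr libraries. They may look rather complicated at first, but this is probably the cleanest way to handle Excel files. I've left some comments to help you through it.

      I suggest you take a look at the unpivotr vignettes.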

      # libraries
      library(tidyverse) 
      library(tidyxl)
      library(unpivotr)
      
      # download data
      url <-'https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1847-2/MediaObjects/41586_2019_1847_MOESM3_ESM.xlsx' 
      temp <- tempfile() 
      download.file(url, temp, mode='wb') 
      
      
      # read excel file
      myData <- xlsx_cells(path = temp)
      
      # select the sheet
      figure1a <- myData %>% filter(sheet == "Figure 1 A")
      
      # you can visualize data in an excel-like format with 
      # View(rectify(figure1a))
      
      # since the sheet is composed of two tables
      # get the top-left corner of each table (where in the first column you find Group)
      corners <- figure1a %>% filter(character == "Group")
      
      # partition the spreadsheet based on the corners you just got
      # select the rows you will need
      partitions <- figure1a %>% filter(row %in% c(3:5, 11:13)) %>% partition(corners)
      
      # get the two partitions and edit them
      # with purrr::map it will be easy
      df <- partitions$cells %>% 
        
        # the first column for each partition shows the headers
        map(behead, "left", "header") %>%
        
        # the first row for each partition shows the Group: Efficient/Inefficient
        map(behead, "up", "Group") %>%
                  
        # the second row for each partition shows the mouse id
        # and bind the edited partitions together
        map_dfr(behead, "up", "Mouse_ID") %>%
                  
        # select the columns we need
        select(Group, Mouse_ID, Glucose_m6 = numeric)
      
      # the final result
      df
      #> # A tibble: 66 x 3
      #>    Group     Mouse_ID Glucose_m6
      #>    <chr>     <chr>         <dbl>
      #>  1 Efficient #1            0.477
      #>  2 Efficient #2            0.376
      #>  3 Efficient #3            0.496
      #>  4 Efficient #4            0.325
      #>  5 Efficient #5            0.089
      #>  6 Efficient #6            0.046
      #>  7 Efficient #7            0.213
      #>  8 Efficient #8            0.082
      #>  9 Efficient #9            0.359
      #> 10 Efficient #10           0.306
      #> # ... with 56 more rows
      

      Created on 2021-11-04 by the reprex package (v2.0.0)

      【Comments】:
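
A quick sanity check on the result is to count the rows per group and compare them with the 30 Efficient and 36 Inefficient rows the asker described. The sketch below runs on a toy `df` with placeholder values, not the real data; only the column names match the output above.

```r
library(dplyr)

# Toy stand-in with the group sizes described by the asker (30 and 36)
df <- tibble(
  Group = rep(c("Efficient", "Inefficient"), times = c(30, 36)),
  Glucose_m6 = seq(0.01, 0.66, length.out = 66)  # placeholder values only
)

# One row per group, with the number of observations in column n
counts <- df %>% count(Group)
counts
```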
