Data wrangling into long format in R: answers

【Question title】: Data wrangling into long format in R
【Posted】: 2021-11-03 14:45:38
【Question】:

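I have the source dataset from a Nature article. I would like to know how to extract the values in rows 4 and 12 into a long data format, together with their assigned group (i.e. Inefficient/Efficient).

This is the code I used to import the data into R.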


# load the required libraries 
library(ggsignif) 
library(readxl) 
library(svglite) 
library(tidyverse) 
library(tidyr) 
library(dplyr) 

# The paper from which the figure is taken is Tasdogan et al. (2020)
# Metabolic heterogeneity confers differences in melanoma metastatic potential 

# The figure is 2b and can be accessed at 
# https://www.nature.com/articles/s41586-019-1847-2#MOESM3 

# The link to the raw data used in the article is given below and directly imported for plotting

url <-'https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1847-2/MediaObjects/41586_2019_1847_MOESM3_ESM.xlsx' 

#create a dataframe from the Excel data 
temp <- tempfile() 

download.file(url, temp, mode='wb') 

myData <- read_excel(path = temp)

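I don't know how to insert an image of the dataset, but it should appear when you run the code above. I need columns 2-31 for Efficient and columns 2 to 37 for Inefficient.

I hope this is enough information for people to understand what I mean.

【Question discussion】: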

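Hi Jago, two questions: 1. Are your column specifications correct? The ranges overlap. 2. By "long data format", do you mean a data frame with two columns: Efficient and Inefficient?
@AdriaanNeringBögel Sorry, I realise I wrote columns when I meant rows. I want the data in 2 columns. The first column should be headed Group and the second Value. There should then be 66 rows (i.e. 2-67); the first 30 should be labelled Efficient and the remaining 36 Inefficient. The second column should hold the corresponding values taken from rows 4 and 12 of the original data frame (i.e. myData). I hope this helps lol.
The order of the labels in the Group column doesn't really matter, as long as the corresponding values are correct. Cheers.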
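
To make the requested layout concrete, here is a minimal base R sketch of the target shape. It uses only the first three values of each group, copied from the output printed in the answers; the object names `efficient`, `inefficient` and `target` are made up for illustration and are not part of the original post.

```r
# Illustrative only: the first three values per group, as printed in the
# answers' output. The real result would have 30 + 36 = 66 rows.
efficient   <- c(0.477, 0.376, 0.496)
inefficient <- c(0.485, 0.474, 0.488)

# One row per measurement: a Group label and the matching Value.
target <- data.frame(
  Group = rep(c("Efficient", "Inefficient"),
              times = c(length(efficient), length(inefficient))),
  Value = c(efficient, inefficient)
)
target
```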

Tags: r data-wrangling

【Solution 1】:

This data isn't really structured well for a generic read like this, but here is an attempt:

### myData <- read_excel(...)
Data_wide <- myData[c(2:4, 10:12), 2:37]
tmp <- as.data.frame(t(Data_wide))
head(tmp)
#             V1 V2                    V3          V4 V5                  V6
# ...2 Efficient #1   0.47699999999999998 Inefficient #1 0.48499999999999999
# ...3 Efficient #2                 0.376 Inefficient #2 0.47399999999999998
# ...4 Efficient #3                 0.496 Inefficient #3 0.48799999999999999
# ...5 Efficient #4   0.32500000000000001 Inefficient #4 0.45600000000000002
# ...6 Efficient #5 8.8999999999999996E-2 Inefficient #5 0.53100000000000003
# ...7 Efficient #6 4.5999999999999999E-2 Inefficient #6               0.318
tmp <- rbind(tmp[,1:3], setNames(tmp[,4:6], names(tmp)[1:3]))
head(tmp)
#             V1 V2                    V3
# ...2 Efficient #1   0.47699999999999998
# ...3 Efficient #2                 0.376
# ...4 Efficient #3                 0.496
# ...5 Efficient #4   0.32500000000000001
# ...6 Efficient #5 8.8999999999999996E-2
# ...7 Efficient #6 4.5999999999999999E-2
tmp <- tmp[complete.cases(tmp),]
tmp$V3 <- as.numeric(tmp$V3)
rownames(tmp) <- NULL
head(tmp,3); tail(tmp,3)
#          V1 V2    V3
# 1 Efficient #1 0.477
# 2 Efficient #2 0.376
# 3 Efficient #3 0.496
#             V1  V2     V3
# 64 Inefficient #34 0.2451
# 65 Inefficient #35 0.2450
# 66 Inefficient #36 0.2529

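With this structure, you can subset (drop V2, though I wonder why you think it isn't important) and rename (colnames(tmp) <- c(...)).

【Discussion】:

【Solution 2】:

It may not be pretty, but I believe this is a solution that uses only the readxl and tidyverse packages: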
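
The subsetting and renaming step could look roughly like the sketch below. It runs on a three-row stand-in for `tmp` rather than the downloaded file; the object name `long` and the column names `Group` and `Value` (taken from the asker's comment) are assumptions, not part of the original answer.

```r
# Small stand-in for the `tmp` data frame built above (three rows only).
tmp <- data.frame(
  V1 = c("Efficient", "Efficient", "Inefficient"),
  V2 = c("#1", "#2", "#1"),
  V3 = c(0.477, 0.376, 0.485),
  stringsAsFactors = FALSE
)

# Keep the group label and the value, dropping the mouse ID in V2.
long <- tmp[, c("V1", "V3")]

# Rename to the headers requested in the question's comments.
colnames(long) <- c("Group", "Value")
long
```

Keeping `V2` instead would only require adding it to the column selection and supplying a third name.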

# Select first set of rows with group and value
set1 <- 
  myData %>% 
  filter(row_number() %in% c(2, 4))

# Select second set of rows with group and value
set2 <- 
  myData %>% 
  filter(row_number() %in% c(10, 12))

# Join both sets of data, so that all group labels are in one row and all values are in one row.
left_join(set1, set2, by = "Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion") %>% 
  # pivot the table to a long format, with one row per original column and label
  pivot_longer(cols = !`Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion`) %>% 
  # pivot wider so that the group label and the value end up in separate columns
  pivot_wider(names_from = `Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion`, values_from = value) %>% 
  # Remove old column names/numbers
  select(-name)

# A tibble: 72 x 2
   Group       `Glucose m+6`      
   <chr>       <chr>              
 1 Inefficient 0.48499999999999999
 2 Inefficient 0.47399999999999998
 3 Inefficient 0.48799999999999999
 4 Inefficient 0.45600000000000002
 5 Inefficient 0.53100000000000003
 6 Inefficient 0.318              
 7 Inefficient 0.26600000000000001
 8 Inefficient 0.30399999999999999
 9 Inefficient 0.309              
10 Inefficient 0.33               
# ... with 62 more rows
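
The printed result has 72 rows and stores the values as text, whereas the question asks for 66 numeric rows. Assuming the extra rows come from empty padding cells that show up as `NA` (this was not checked against the real file), a possible clean-up step is sketched below on a small stand-in tibble; the names `res` and `clean` are hypothetical.

```r
library(dplyr)

# Stand-in for the pivoted result above: values are character strings and
# an empty spreadsheet cell is assumed to appear as an NA row.
res <- tibble::tibble(
  Group = c("Inefficient", "Inefficient", NA),
  `Glucose m+6` = c("0.48499999999999999", "0.318", NA)
)

clean <- res %>%
  # drop rows that carry no group label
  filter(!is.na(Group)) %>%
  # convert the text values to numbers
  mutate(`Glucose m+6` = as.numeric(`Glucose m+6`))

clean
```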

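【Discussion】:

【Solution 3】:

A neat way to solve the problem is to use the tidyxl and unpivotr libraries. They may look rather complicated at first, but this is probably the cleanest way to handle Excel files. I've left some comments to help you through it.

I recommend taking a look at the unpivotr vignettes.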

# libraries
library(tidyverse) 
library(tidyxl)
library(unpivotr)

# download data
url <-'https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1847-2/MediaObjects/41586_2019_1847_MOESM3_ESM.xlsx' 
temp <- tempfile() 
download.file(url, temp, mode='wb') 


# read excel file
myData <- xlsx_cells(path = temp)

# select the sheet
figure1a <- myData %>% filter(sheet == "Figure 1 A")

# you can visualize data in an excel-like format with 
# View(rectify(figure1a))

# the sheet is made up of two tables, so
# get the top-left corner of each table (the cell in the first column that contains "Group")
corners <- figure1a %>% filter(character == "Group")

# partition the spreadsheet based on the corners you just got
# select the rows you will need
partitions <- figure1a %>% filter(row %in% c(3:5, 11:13)) %>% partition(corners)

# get the two partitions and edit them
# with purrr::map it will be easy
df <- partitions$cells %>% 
  
  # the first column for each partition shows the headers
  map(behead, "left", "header") %>%
  
  # the first row for each partition shows the Group: Efficient/Inefficient
  map(behead, "up", "Group") %>%
            
  # the second row for each partition shows the mouse id
  # and bind the edited partitions together
  map_dfr(behead, "up", "Mouse_ID") %>%
            
  # select the columns we need
  select(Group, Mouse_ID, Glucose_m6 = numeric)

# the final result
df
#> # A tibble: 66 x 3
#>    Group     Mouse_ID Glucose_m6
#>    <chr>     <chr>         <dbl>
#>  1 Efficient #1            0.477
#>  2 Efficient #2            0.376
#>  3 Efficient #3            0.496
#>  4 Efficient #4            0.325
#>  5 Efficient #5            0.089
#>  6 Efficient #6            0.046
#>  7 Efficient #7            0.213
#>  8 Efficient #8            0.082
#>  9 Efficient #9            0.359
#> 10 Efficient #10           0.306
#> # ... with 56 more rows

Created on 2021-11-04 by the reprex package (v2.0.0)

【Discussion】: