Break region into smaller regions based on cutoff: answers

【Question title】: Break region into smaller regions based on cutoff
【Posted】: 2016-09-12 21:07:51
【Question】:

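I assume this is a fairly simple programming problem, but I have been struggling with it, perhaps mostly because I don't know the right words to search for.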

Given a set of "ranges" (in the form of 1. a set of numbers as below, 2. IRanges, or 3. GenomicRanges), I would like to split it into a set of smaller ranges.

Example beginning:

Chr    Start     End
1        1        10000
2        1        5000

Example interval size: 2000

New dataset:

Chr    Start    End
1        1       2000
1        2001    4000
1        4001    6000
1        6001    8000
1        8001    10000
2        1       2000
2        2001    4000
2        4001    5000

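I am doing this in R. I know I could generate these simply with seq, but I would like to be able to do it from a list/df of regions, rather than doing it manually every time I have a new list of regions.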

Here is an example I made using seq:

Given 22 chromosomes, loop through them and break each one into segments

# initialize an empty data frame to collect the results
Regions <- data.frame(Chromosome = c(), Start = c(), End = c())
# for each chromosome, do the following
for(i in unique(Chromosomes$Chromosome)){
     # largest end coordinate recorded for this chromosome
     chr_end <- max(Chromosomes$End[Chromosomes$Chromosome == i])

     # create a sequence from the minimum start to the max end by some value
     breaks <- seq(min(Chromosomes$Start[Chromosomes$Chromosome == i]), chr_end, by=2000000)

     # put this into a data frame; each window ends just before the next start
     database <- data.frame(Chromosome = i, Start = breaks, End = c(breaks[-1]-1, chr_end))

     # bind with what we already have
     Regions <- rbind(Regions, database)
     rm(database)
}

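This works fine, but I was wondering whether something already built into a package can do this as a one-liner, or more flexibly, since this approach has its limitations.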

【Question discussion】:

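So, to be clear, your goal is a function that takes the data frame you show as "Example beginning" plus a parameter breaks = 2000 and outputs the "New dataset"? If so, I agree. You can do this quite easily with seq: just make it depend on variables and wrap it in function(){}, and you have your own custom function.
I would go with a seq-based solution like this; may I ask why we want to do this?
Perhaps something like library(dplyr); library(tidyr); breaks <- 2000L; df %>% group_by(Chr) %>% expand(Start = seq(Start, End, breaks), End = End) %>% mutate(End = if_else(Start+breaks>End, End, as.integer(Start+breaks-1))). But there are probably more elegant solutions to this problem.
I am working on a seq-based solution and will post it once it works properly, but I was hoping for something simpler. Thanks! @zx8754, I will loop through these regions and apply a custom function I wrote to each of them. It is not relevant to this question, but I will be calculating the frequency of different events in each region so that I can plot the frequencies per region.
You posted a question in which the interval width is a constant 2000 for all chromosomes. If you want something more complex, you need to define the "dimensions" of the extra complexity, which are missing so far. For example, if the width may differ per chromosome, add a new column to your example. Please edit the question rather than replying in comments.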
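
The seq-based wrapper suggested in the comments could be sketched roughly as follows. This is only an illustration in base R: the function name split_regions is hypothetical (it does not come from any package), and the column names Chr, Start and End are assumed from the question's example table.

```r
# Hypothetical helper (not from any package): split each row of `regions`
# into consecutive windows of at most `width` positions.
split_regions <- function(regions, width) {
  pieces <- Map(function(chr, start, end) {
    # window starts, spaced `width` apart
    starts <- seq(start, end, by = width)
    # each window spans `width` positions; the last one is capped at `end`
    ends <- pmin(starts + width - 1, end)
    data.frame(Chr = chr, Start = starts, End = ends)
  }, regions$Chr, regions$Start, regions$End)
  # stack the per-region data frames into one
  do.call(rbind, pieces)
}

regions <- data.frame(Chr = c(1, 2), Start = c(1, 1), End = c(10000, 5000))
result <- split_regions(regions, 2000)
print(result, row.names = FALSE)
```

Because the window width is an argument, the same function can be reused for any new list of regions without editing the loop by hand.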

Tags: r bioinformatics bioconductor iranges

【Solution 1】:

Using the R / Bioconductor package GenomicRanges, here are your initial ranges

library(GenomicRanges)
rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))

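Then create sliding windows across the genome, first generating a list (one set of tiles per chromosome) and then unlisting it to get the format in your question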

> windows = slidingWindows(rngs, width=2000, step=2000)
> unlist(windows)
GRanges object with 8 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 [   1,  2000]      *
  [2]        1 [2001,  4000]      *
  [3]        1 [4001,  6000]      *
  [4]        1 [6001,  8000]      *
  [5]        1 [8001, 10000]      *
  [6]        2 [   1,  2000]      *
  [7]        2 [2001,  4000]      *
  [8]        2 [4001,  5000]      *

  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Use as(df, "GRanges") or as(unlist(windows), "data.frame") to coerce from / to a data.frame.
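
A minimal sketch of that round trip, assuming GenomicRanges is installed and that makeGRangesFromDataFrame() picks up the Chr / Start / End column names from the question's example (my understanding is that it matches common column names case-insensitively, but treat that as an assumption and check the help page):

```r
library(GenomicRanges)

# Column names Chr/Start/End are taken from the question's example table.
df <- data.frame(Chr = c(1, 2), Start = c(1, 1), End = c(10000, 5000))

# data.frame -> GRanges
gr <- makeGRangesFromDataFrame(df)

# GRanges -> data.frame (columns: seqnames, start, end, width, strand)
back <- as.data.frame(gr)
back
```

The data frame that comes back uses the GRanges column names (seqnames, start, end) rather than the original Chr / Start / End, so rename the columns if the original layout is needed.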

See ?"slidingWindows,GenomicRanges-method" for help (tab completion is your friend, ?"slidingW<tab>).

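Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile does something similar, but it adjusts the widths of the ranges to be approximately equal while spanning the width of the GRange. Here is a poor man's version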

windows <- function(gr, width, withMcols=FALSE) {
    # window starts for each range, spaced `width` apart
    starts <- Map(seq, start(gr), end(gr), by=width)
    # each window ends just before the next start; the last ends at the range end
    ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
                starts, end(gr))
    # repeat seqnames and strand once per window
    chr <- rep(seqnames(gr), lengths(starts))
    strand <- rep(strand(gr), lengths(starts))
    result <- GRanges(chr, IRanges(unlist(starts), unlist(ends)), strand)
    seqinfo(result) <- seqinfo(gr)
    if (withMcols) {
        # carry each parent range's metadata columns over to its windows
        idx <- rep(seq_along(gr), lengths(starts))
        mcols(result) = mcols(gr)[idx,,drop=FALSE]
    }
    result
}

Invoke it as

> windows(rngs, 2000)
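
To illustrate how equal-width tiling of the kind attributed to tile above differs from fixed-width windows, here is a base-R sketch of the arithmetic only. The helper equal_tiles is hypothetical; it is not the code that tile runs, and tile's exact rounding of boundaries may differ.

```r
# Illustration only: mimics the idea of equal-width tiling in base R.
# This is NOT tile()'s implementation; boundary rounding may differ.
equal_tiles <- function(start, end, width) {
  # number of tiles needed so that none is wider than `width`
  n <- ceiling((end - start + 1) / width)
  # n + 1 evenly spaced boundaries, rounded to whole positions
  bounds <- round(seq(start - 1, end, length.out = n + 1))
  data.frame(Start = bounds[-(n + 1)] + 1, End = bounds[-1])
}

# Chromosome 2 from the question: 5000 positions, requested width 2000
chr2 <- equal_tiles(1, 5000, 2000)
chr2_widths <- chr2$End - chr2$Start + 1
chr2_widths
```

With fixed-width windows, chromosome 2 ends with a shorter 1000-wide piece; with equal-width tiling it is instead cut into three pieces of roughly 1667 positions each.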

If this approach is useful, please consider asking follow-up questions on the Bioconductor support site.

【Discussion】:

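I knew there was some function for this in GRanges, but I could not easily google it or find it in the manual. Would you mind adding a link to the manual page that documents this function? Is it part of GenomicRanges/IRanges? I cannot seem to find it with ??slidingWindows.
@zx8754 Oops, sorry, this seems to be available only in the devel version of GenomicRanges; I have provided an interim solution in the answer.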