Find overlapping or completely nested ranges and flag them

【Question title】: find overlaps or completely nested ranges and flag them
【Posted】: 2019-08-27 17:55:11
【Question description】:

import pandas as pd
df = pd.DataFrame({'region_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], 'start' : [1913, 46430576, 52899183, 58456122, 62925929, 65313395, 65511483, 65957829], 'stop' : [90207973, 90088654, 90088654, 74708723, 84585795, 90081985, 90096995, 83611443], 'chr':[1, 1, 1, 1, 1, 1, 1, 2]})

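After sorting by start from smallest to largest, I want to find ranges that overlap or are completely nested within one another among consecutive start-stop pairs, but only when those pairs share the same chr.

The output should look like this:

This is what I have so far: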

df = df.sort_values(by=['chr', 'start'], ascending=[True, True])
for i in range(1,len(df['region_name'])):
    if df['critical_error'][i] == True:
        continue
    for j in range(0,i):
        if df['start'][i] <= df['stop'][j] and df['stop'][i] <= df['stop'][j] and df['chr'][i] == df['chr'][j]:
            df['overlap'][i] ='no overlap, nested with region %s' % df['region_name'][j]
            break
        elif df['start'][i] < df['stop'][j] and df['chr'][i] == df['chr'][j]:
            df['overlap'][i] = 'overlap within region ' + df['region_name'][j]
        else:
            continue

The code above misses a lot of cases. Any help is appreciated, thanks!

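【Question discussion】:

Hi! This isn't a Python implementation, but I noticed you have genomic data, and bedtools is actually a very powerful, efficient tool that can do what you need. You can then follow up with a quick script.

Tags: python pandas python-2.7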

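【Solution 1】:

I didn't understand this part:

...if chr is the same for consecutive start-stop pairs.

I still wrote you some code that reproduces your given table in some respects. If you clarify your point, I may update this answer. Maybe it still helps you, and you can fill in the missing parts yourself: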

import pandas as pd
import numpy as np

df = pd.DataFrame({'region_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], 'start' : [1913, 46430576, 52899183, 58456122, 62925929, 65313395, 65511483, 65957829], 'stop' : [90207973, 90088654, 90088654, 74708723, 84585795, 90081985, 90096995, 83611443], 'chr':[1, 1, 1, 1, 1, 1, 1, 2]})

# store texts for each row in that list
overlaps_texts = []

# iterate over all rows
for i, row in df.iterrows():
    # extract entries' data
    # access by label so the result does not depend on the column order
    start, stop, ch = row['start'], row['stop'], row['chr']

    # Check if I am completely inside (nested into something)
    # Note that this will always return indexers where each entry is True or False
    # So nested will be something like [False, False, True, ...] where True means
    # that start > start_other AND stop < stop_other (="I am nested")
    nested = ((start > df.loc[:, 'start']) & (stop < df.loc[:, 'stop']))

    # hanging out left
    overlap_1 = ((stop > df.loc[:, 'start']) &
                 (stop < df.loc[:, 'stop'])
                 )

    # starting before stop of other but ending after (hanging out right)
    overlap_2 = ((start < df.loc[:, 'stop']) & (start > df.loc[:, 'start']))

    # one of both overlaps good
    overlap = (overlap_1 | overlap_2) & ~nested

    # identical chr? I didn't get that part. That may be different for your application
    overlap &= df.loc[:, 'chr'] == ch
    nested &= df.loc[:, 'chr'] == ch

    # generate text
    text = ''

    # check if any nestings
    if np.any(nested):
        nested_indices = [*filter(lambda x: x[1], zip(range(len(nested)), nested))]
        text = "I am nested within: "
        region_names = []
        for index, _ in nested_indices:
            region_names.append(df['region_name'].iloc[index])

        text += ", ".join(region_names)+"; "

    # check if any overlaps (obviously one can write that more DRY), since it repeats the pattern from above
    if np.any(overlap):
        overlap_indices = [*filter(lambda x: x[1], zip(range(len(overlap)), overlap))]
        text += "I overlap: "
        region_names = []
        for index, _ in overlap_indices:
            region_names.append(df['region_name'].iloc[index])
        text += ", ".join(region_names)

    if text == '':
        text = 'I am not nested nor do I overlap something'

    overlaps_texts.append(text)

df.loc[:, 'overlap'] = overlaps_texts

print(df)

Output:

   start                       ...                                                                 overlap
0      1913                       ...                              I am not nested nor do I overlap something
1  46430576                       ...                                     I am nested within: A; I overlap: G
2  52899183                       ...                                  I am nested within: A; I overlap: B, G
3  58456122                       ...                         I am nested within: A, B, C; I overlap: E, F, G
4  62925929                       ...                         I am nested within: A, B, C; I overlap: D, F, G
5  65313395                       ...                         I am nested within: A, B, C; I overlap: D, E, G
6  65511483                       ...                         I am nested within: A; I overlap: B, C, D, E, F
7  65957829                       ...                              I am not nested nor do I overlap something
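
The per-row tests in the answer can also be read as a rule for a single pair of intervals. The following is a minimal sketch under that reading, not the answerer's code: classify_pair is a hypothetical helper name, and it assumes the same strict inequalities and the same chr restriction as the code above.

```python
def classify_pair(a, b):
    """Classify interval a relative to interval b.

    a and b are (start, stop, chr) tuples. Returns 'nested' if a lies
    strictly inside b, 'overlap' if a partially overlaps b, else None.
    """
    a_start, a_stop, a_chr = a
    b_start, b_stop, b_chr = b

    # Intervals on different chr values are never compared
    if a_chr != b_chr:
        return None

    # a is completely inside b
    if a_start > b_start and a_stop < b_stop:
        return 'nested'

    # a ends inside b, or a starts inside b
    ends_inside = b_start < a_stop < b_stop
    starts_inside = b_start < a_start < b_stop
    if ends_inside or starts_inside:
        return 'overlap'

    return None


# Rows A, B, G and H from the question's data
A = (1913, 90207973, 1)
B = (46430576, 90088654, 1)
G = (65511483, 90096995, 1)
H = (65957829, 83611443, 2)

print(classify_pair(B, A))  # nested
print(classify_pair(B, G))  # overlap
print(classify_pair(H, A))  # None (different chr)
```

Because the comparisons are strict, two ranges that share an endpoint (such as B and C, which have the same stop) are reported as overlapping rather than nested, which matches the output table above.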

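【Discussion】:

Hi, thanks for the code so far. By the same chr I mean that the last row (the last start-stop pair) has chr 2 while the rest have chr 1, so it neither overlaps nor is nested with the rest, which your answer does output.
OK, the code counts a nesting/overlap if and only if the rows also have the same chr field. If that's what you want, go for it. If you have any other questions, feel free to ask.
Sorry, I couldn't reproduce your solution. I get an invalid type comparison error on the line nested = ((start > df.loc[:, 'start']) & (stop < df.loc[:, 'stop'])). I have also edited df in the original question to include a region_name column.
You have to adapt to the changed dataframe: start, stop, ch = row[:3] must be changed to match your new layout.
Yes, the asterisk belongs there. But you seem to be using Python 2. There you can replace both expressions with this pattern: overlap_indices = [x for x in zip(range(len(overlap)), overlap) if x[1]]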
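
The list-comprehension pattern from the last comment can be tried on a small boolean mask. This is a minimal sketch: mask is a made-up stand-in for the overlap or nested Series in the answer, and positions is a hypothetical name.

```python
import pandas as pd

# Hypothetical boolean mask standing in for `overlap` or `nested`
mask = pd.Series([False, True, True, False])

# List comprehension instead of [*filter(...)]; this syntax is valid
# in both Python 2 and Python 3
flagged = [x for x in zip(range(len(mask)), mask) if x[1]]

# Keep only the positions of the True entries
positions = [index for index, _ in flagged]
print(positions)
```

The star-unpacking form [*filter(...)] is only accepted by Python 3.5 and later, which is why the comprehension is the portable choice here.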