【Question title】: Find the shortest substring whose replacement makes the string contain equal number of each character
【Posted】: 2016-10-01 11:39:45
【Question】:

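I have a string of length n made up of the letters A, G, C and T. The string is stable if it contains an equal number of A, G, C and T (n/4 of each). I need to find the minimum length of a substring that, when replaced, makes the string stable. Here is the full description of the problem: link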

Suppose s1=AAGAAGAA

Since n=8, ideally it should have 2 As, 2 Ts, 2 Gs and 2 Cs. It has 4 As too many. So we need a substring that contains at least 4 As.
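The surplus in this example can be checked with a few lines of Python. This is only a minimal sketch: the helper name `excess_counts` is made up for illustration and does not appear in the original code, and it assumes the string length is a multiple of 4.

```python
from collections import Counter

def excess_counts(s):
    """Return {letter: surplus} for letters occurring more than len(s) // 4 times."""
    target = len(s) // 4              # ideal count of each of A, G, C, T
    counts = Counter(s)
    return {ch: counts[ch] - target for ch in "AGCT" if counts[ch] > target}

print(excess_counts("AAGAAGAA"))      # {'A': 4}
```

The sum of the surpluses is a lower bound on the answer, which is why the search below starts at that length.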

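I start by taking 4-character substrings from the left; if none of them works, I increment a variable mnum (i.e. look at 5-character substrings, and so on).

We get AAGAA as the answer. But it is too slow.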

 from collections import Counter
 import sys

 n = int(input())          # length of the string
 s1 = input()
 s = Counter(s1)
 le = n // 4               # ideal count of each letter
 comp = {'A': le, 'G': le, 'C': le, 'T': le}   # target count of every letter
 s.subtract(comp)          # how far each letter is above (+) or below (-) the target
 a = []                    # surplus amounts, e.g. [4] for AAGAAGAA
 b = []                    # the corresponding letters, e.g. ['A']
 for x in s.keys():
     if s[x] > 0:
         b.append(x)
         a.append(s[x])
 mnum = sum(a)             # minimum substring length to start with
 if mnum == 0:
     print(0)
     sys.exit()
 while mnum <= n:          # if no substring of length mnum works, try mnum + 1, and so on
     for i in range(n - mnum + 1):          # every substring of length mnum in s1
         window = s1[i:i + mnum]
         # the substring must hold at least the surplus of every excess letter
         if all(window.count(b[j]) >= a[j] for j in range(len(a))):
             print(mnum)
             sys.exit()
     mnum += 1

【Comments on the question】:

    Tags: python python-3.x string algorithm optimization


    【Solution 1】:

    The minimal substring can be found in O(N) time and O(N) space.

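    First, count the frequency fr[i] of each character in the input of length n. Now, the most important thing to realize is the necessary and sufficient condition for a substring to be a candidate: it must contain every excessive character with a frequency of at least fr[i] - n/4. Otherwise it would be impossible to replace the missing characters. So our task is to go through every such substring and pick the one with the minimal length.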
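The condition above can be written as a small predicate. This is a sketch under the assumptions that the input uses only the letters A, G, C and T and that n is divisible by 4; `is_candidate` and its parameters are illustrative names, not part of the answer.

```python
from collections import Counter

def is_candidate(s, start, end):
    """True if replacing s[start:end] can make s stable.

    The slice must contain at least fr[ch] - n/4 copies of every
    letter ch that is in excess in the whole string.
    """
    target = len(s) // 4
    total = Counter(s)                  # fr[ch] for the whole string
    inside = Counter(s[start:end])      # counts inside the substring
    return all(inside[ch] >= total[ch] - target for ch in "AGCT")

print(is_candidate("GAAAAAAA", 1, 6))   # True: five of the seven As are covered
print(is_candidate("GAAAAAAA", 0, 5))   # False: only four As are covered
```

For letters that are not in excess the right-hand side is zero or negative, so they never block a substring.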

    But how do we find all of them efficiently?

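    Initially, minLength is n. We introduce two pointer indexes, left and right (both initially 0), which delimit a substring of the original string str running from left to right. Then we increment right until the frequency of every excessive character in str[left:right] is at least fr[i] - n/4. But that is not all, because str[left:right] may contain unnecessary characters on its left side (for example, characters that are not excessive and can therefore be dropped). So, as long as str[left:right] still contains enough excessive characters, we increment left. When that is done, we update minLength if it is greater than right - left. We repeat this process until right >= n.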

    Let's consider an example. Let GAAAAAAA be the input string. Then the algorithm steps are as follows:

    1. Count the frequency of each character:

    ['G'] = 1, ['A'] = 7, ['T'] = 0, ['C'] = 0 ('A' is excessive here)
    

    2. Now traverse the original string:

    Step#1: |G|AAAAAAA
        substr = 'G' - no excessive chars (left = 0, right = 0) 
    Step#2: |GA|AAAAAA
        substr = 'GA' - 1 excessive char, we need 5 (left = 0, right = 1)
    Step#3: |GAA|AAAAA
        substr = 'GAA' - 2 excessive chars, we need 5 (left = 0, right = 2)
    Step#4: |GAAA|AAAA
        substr = 'GAAA' - 3 excessive chars, we need 5 (left = 0, right = 3)
    Step#5: |GAAAA|AAA
        substr = 'GAAAA' - 4 excessive chars, we need 5 (left = 0, right = 4)
    Step#6: |GAAAAA|AA
        substr = 'GAAAAA' - 5 excessive chars, nice, but can we remove something from the left? 'G' is not excessive anyway. (left = 0, right = 5)
    Step#7: G|AAAAA|AA
        substr = 'AAAAA' - 5 excessive chars, and it's smaller now. minLength = 5 (left = 1, right = 5)
    Step#8: G|AAAAAA|A
        substr = 'AAAAAA' - 6 excessive chars, nice, but can we reduce the substr? There's a redundant 'A' (left = 1, right = 6)
    Step#9: GA|AAAAA|A
        substr = 'AAAAA' - 5 excessive chars, nice, minLength = 5 (left = 2, right = 6)
    Step#10: GA|AAAAAA|
        substr = 'AAAAAA' - 6 excessive chars, nice, but can we reduce the substr? There's a redundant 'A' (left = 2, right = 7)
    Step#11: GAA|AAAAA|
        substr = 'AAAAA' - 5 excessive chars, nice, minLength = 5 (left = 3, right = 7)
    Step#12: That's it, as right >= 8
    

    Or the complete code below:

    from collections import Counter

    n = int(input())
    gene = input().strip()
    char_counts = Counter(gene)   # frequency of each character in the whole string

    n_by_4 = n // 4
    min_length = n
    left = 0
    right = 0                     # the window is gene[left:right], right exclusive

    substring_counts = Counter()
    while right < n:
        substring_counts[gene[right]] += 1
        right += 1

        has_enough_excessive_chars = True
        for ch in "ACTG":
            diff = char_counts[ch] - n_by_4
            # the char cannot be used to replace other items
            if (diff > 0) and (substring_counts[ch] < diff):
                has_enough_excessive_chars = False
                break

        if has_enough_excessive_chars:
            # drop characters from the left while the window still holds enough of them
            while left < right and substring_counts[gene[left]] > (char_counts[gene[left]] - n_by_4):
                substring_counts[gene[left]] -= 1
                left += 1

            min_length = min(min_length, right - left)

    print(min_length)
    

    【Discussion】:

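    • Nice logic, but could you elaborate on how you came up with it? Or, if this is a well-known algorithm, I cannot think of another example of it; could you point to where it comes from?
    • @cxz Maybe there is an existing algorithm, but I am not aware of one. How did I come up with it? Every candidate substring must have enough redundant characters to replace. So we do not have to check all possible substrings, only those that have enough redundant characters, and that is achievable in O(n).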
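One way to see why the scan is linear: left and right only ever move forward, so together they advance at most 2n times. The sketch below wraps the same two-pointer loop in a function and counts the pointer moves. The function name `min_window_with_moves` and the move counter are additions for illustration only, and the window here is tracked with an inclusive right index.

```python
from collections import Counter

def min_window_with_moves(gene):
    """Return (min_length, pointer_moves) for the two-pointer scan."""
    n = len(gene)
    target = n // 4
    total = Counter(gene)
    # how many copies of each surplus letter the window must contain
    need = {ch: total[ch] - target for ch in "ACGT" if total[ch] > target}
    window = Counter()
    left = moves = 0
    best = n
    for right in range(n):
        window[gene[right]] += 1
        moves += 1                                   # right advanced once
        if all(window[ch] >= k for ch, k in need.items()):
            # drop characters from the left while the window stays sufficient
            while left <= right and window[gene[left]] > need.get(gene[left], 0):
                window[gene[left]] -= 1
                left += 1
                moves += 1                           # left advanced once
            best = min(best, right - left + 1)
    return best, moves

length, moves = min_window_with_moves("GAAAAAAA")
print(length, moves <= 2 * 8)                        # 5 True
```

Each iteration does a constant amount of work over the four letters, so the total cost stays proportional to n.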
    【Solution 2】:

    Here is a solution that has had only limited testing. It should give you some ideas on how to improve your code.

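    from collections import Counter
    import sys
    import math

    n = int(input())
    s1 = input()
    s = Counter(s1)           # counts of the letters outside the current window

    # already stable: nothing has to be replaced
    if all(e <= n/4 for e in s.values()):
        print(0)
        sys.exit(0)

    result = math.inf
    out = 0                   # left edge of the window
    for mnum in range(n):     # mnum is the right edge of the window (inclusive)
        s[s1[mnum]] -= 1      # s1[mnum] enters the window
        # the window is valid while no letter outside it occurs more than n/4 times
        while all(e <= n/4 for e in s.values()) and out <= mnum:
            result = min(result, mnum - out + 1)
            s[s1[out]] += 1   # shrink the window from the left
            out += 1

    print(result)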
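Because this solution was only lightly tested, one option is to compare it with a brute-force search on short random strings. The following is a sketch of such a check, not part of the original answer: `fast` re-wraps the loop above as a function, and `brute` is a hypothetical reference implementation written only for this comparison.

```python
import random
from collections import Counter

def fast(s1):
    """The sliding-window loop from this answer, wrapped in a function."""
    n = len(s1)
    limit = n // 4
    outside = Counter(s1)                 # counts of letters outside the window
    if all(c <= limit for c in outside.values()):
        return 0
    result = n
    out = 0
    for i in range(n):
        outside[s1[i]] -= 1               # s1[i] enters the window
        while out <= i and all(c <= limit for c in outside.values()):
            result = min(result, i - out + 1)
            outside[s1[out]] += 1         # s1[out] leaves the window
            out += 1
    return result

def brute(s1):
    """Try every substring, shortest first (slow; only for checking)."""
    n = len(s1)
    limit = n // 4
    for length in range(n + 1):
        for start in range(n - length + 1):
            rest = Counter(s1[:start] + s1[start + length:])
            if all(c <= limit for c in rest.values()):
                return length
    return n

random.seed(0)
for _ in range(200):
    s = "".join(random.choice("ACGT") for _ in range(4 * random.randint(1, 4)))
    assert fast(s) == brute(s), s
print(fast("AAGAAGAA"), fast("GAAAAAAA"))   # 5 5
```

The random strings are kept short (4 to 16 characters) so that the brute-force side stays quick.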
    

    【Discussion】:
