省略一组名称答案

【问题标题】：Ellipsizing a set of names省略一组名称
【发布时间】：2011-06-27 04:10:28
【问题描述】：

好的，我敢肯定，某个地方的某个人已经为此提出了一个算法，所以我想在我自己去（重新）发明它之前先问问。

我有一个任意（用户输入的）非空文本字符串的列表。每个字符串可以是任意长度（0 除外），并且它们都是唯一的。我想将它们显示给用户，但我想将它们修剪为我决定的某个固定长度，并用省略号 (...) 替换其中的一部分。问题是我希望所有输出字符串都是唯一的。

例如，如果我有字符串：

Microsoft Internet Explorer 6
Microsoft Internet Explorer 7
Microsoft Internet Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
谷歌浏览器 14

然后我不想修剪字符串的末端，因为那是唯一的部分（不想显示“Microsoft Internet ...”3次），但可以剪掉中间部分：

微软...rer 6
微软...rer 7
微软...rer 8
Mozilla Firefox 3
Mozilla Firefox 4
谷歌浏览器 14

其他时候，中间部分可能是独一无二的，我想修剪末端：

公司会议纪要，2010 年 5 月 25 日 -- 仅供内部使用
公司会议纪要，2010 年 6 月 24 日 -- 仅供内部使用
公司会议纪要，2010 年 7 月 23 日 -- 仅供内部使用

可能变成：

公司会议纪要，2010 年 5 月 25 日...
公司会议纪要，2010 年 6 月 24 日...
公司会议纪要，2010 年 7 月 23 日...

我想它可能永远不应该省略字符串的 very 开头，即使否则会被允许，因为这看起来很奇怪。而且我猜它可能会在字符串中省略一个以上的位置，但在合理范围内 - 也许 2 次就可以了，但 3 次或更多似乎过多。或者，次数可能不如剩余块的大小重要：省略号之间少于大约 5 个字符将毫无意义。

输入（数量和大小）不会太大，因此性能不是主要问题（好吧，只要算法不尝试一些愚蠢的事情，比如枚举所有可能的字符串，直到找到一个集合有效！）。

我想这些要求看起来很具体，但我实际上相当宽松——我只是想描述一下我的想法。

以前有过这样的事情吗？是否有一些现有的算法或库可以做到这一点？我用谷歌搜索了一些，但到目前为止没有发现任何类似的东西（但也许我只是不擅长谷歌搜索）。我必须相信某个地方已经有人想要解决这个问题了！

【问题讨论】：

标签： algorithm ellipsis

【解决方案1】：

听起来像是longest common substring problem.的一个应用程序

用省略号替换所有字符串共有的最长子字符串。如果字符串仍然太长并且允许您有另一个省略号，请重复。

您必须意识到，您可能无法将给定的字符串集“省略”到足以满足长度要求的程度。

【讨论】：

嗯，这不是一个糟糕的起点，但我认为这不是我想要的。也许我没有选择我的例子来说明这一点，但我不要求省略号只替换相等的子字符串：只是输出字符串是唯一的。例如，如果给定两个输入“Herzkreislaufwiederbelebung”和“Geschwindigkeitsbegrenzung”，并且我想修剪到长度=12（包括点），则返回“Herzkreis ...”和“Geschwind ...”就可以了。
@Ken 听起来你可以把他们打死。
@Ken - 对，你的例子很清楚，但我想我的想法有点模糊。我偏离了轨道，试图提出一些不能被充分缩短但仍然保持独特性的例子。

【解决方案2】：

对字符串进行排序。保留每个字符串的前 X 个字符。如果此前缀对于前后的字符串不是唯一的，则前进直到找到唯一字符（与之前和之后的字符串相比）。（如果没有找到唯一字符，则字符串没有唯一部分，请参阅帖子底部）在这些唯一字符之前和之后添加省略号。

请注意，这仍然可能看起来很有趣：

Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook

我不知道您希望使用哪种语言来执行此操作，但这里有一个 Python 实现。

def unique_index(before, current, after, size):
    '''Returns the index of the first part of _current_ of length _size_ that is 
        unique to it, _before_, and _after_. If _current_ has no part unique to it,
        _before_, and _after_, it returns the _size_ letters at the end of _current_'''
    before_unique = False
    after_unique = False
    for i in range(len(current)-size):
        #this will be incorrect in the case mentioned below
        if i > len(before)-1 or before[i] != current[i]:
            before_unique = True
        if i > len(after)-1 or after[i] != current[i]:
            after_unique = True
        if before_unique and after_unique:
            return i

    return len(current)-size

def ellipsize(entries, prefix_size, max_string_length):
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.

    #If you want to preserve order then make a copy and make a mapping from the copy to the original
    entries.sort()

    ellipsized = []

    # you could probably remove all this indexing with something out of itertools
    for i in range(len(entries)):
        current = entries[i]

        #entry is already short enough, don't need to truncate
        if len(current) <= max_string_length:
            ellipsized.append(current)
            continue

        #grab empty strings if there's no string before/after
        if i == 0:
            before = ''
        else:
            before = entries[i-1]
        if i == len(entries)-1:
            after = ''
        else:
            after = entries[i+1]

        #Is the prefix unique? If so, we're done.
        current_prefix = entries[i][:prefix_size]    
        if not before.startswith(current_prefix) and not after.startswith(current_prefix):
            ellipsized.append(current[:max_string_length] + '...') #again, possibly -3

        #Otherwise find the unique part after the prefix if it exists.
        else:
            index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
            if index == prefix_size:
                header = ''
            else:
                header = '...'
            if index + non_prefix_size == len(current):
                trailer = ''
            else:
                trailer = '...'
            ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
    return ellipsized

另外，你提到字符串本身是独一无二的，但它们都有独特的部分吗？例如，“Microsoft”和“Microsoft Internet Explorer 7”是两个不同的字符串，但第一个字符串没有与第二个字符串不同的部分。如果是这种情况，那么您必须在规范中添加一些内容，以说明如何使这种情况明确。（如果将“Xicrosoft”、“MXcrosoft”、“MiXrosoft”等添加到这两个字符串的混合中，则没有个唯一的字符串比原始字符串短来表示“Microsoft”）（另一种思考方式：如果您有所有可能的 X 字母字符串，则不能将它们全部压缩为 X-1 或更少的字符串。就像没有压缩方法可以压缩 all 输入一样，因为这是本质上是一种压缩方法。）

原始帖子的结果：

>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
    print entry

Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
    print entry

Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...

【讨论】：

我不明白。哪个字符串的前 X 个字符？独特的字符在哪里？这对只有 2 个字符串但每个字符都是唯一的情况（上图）有何帮助？
我只是在我的答案中添加了很多内容来充实它。
我仍在处理代码，但压缩注释很奇怪。这本质上是一种有损压缩方法，有损压缩绝对可以压缩所有输入。这种情况有点复杂，因为我希望输出是唯一的，但是这里输入令牌的压缩完全取决于输入中的其他令牌，并且给定某些合理的约束（例如，输入的数量总是会消失与可能的字符串数量相比很小），这似乎并不是天生不可能的。
@Ken 是的，我没想到它是一种有损压缩算法。但事实仍然存在，您无法将 >2^n 输入转换为小于 n 位的所有唯一值。
@Ken 再想一想，这并不是真正的有损压缩算法。您希望输出的每个字符串都引用原始字符串中的 single unique 字符串。我断言，对于您在输出中使用的每个缩短的字符串来引用输入中的原始字符串，我可以将单个字符串添加到输入中，以便输出字符串引用原始输入和添加的输入。