How to sort an alphanumeric array in Ruby: answers

【Question title】: How to sort an alphanumeric array in ruby
【Posted】: 2011-07-25 17:46:33
【Question】:

How can I sort array data alphanumerically in Ruby?

Suppose my array is a = [test_0_1, test_0_2, test_0_3, test_0_4, test_0_5, test_0_6, test_0_7, test_0_8, test_0_9, test_1_0, test_1_1, test_1_2, test_1_3, test_1_4, test_1_5, test_1_6, test_1_7, test_1_8, test_1_9, test_1_10, test_1_11, test_1_12, test_1_13, test_1_14, ...........test_1_121...............]

I want my output to be:

.
.
.
test_1_121
.
.
.
test_1_14
test_1_13
test_1_12
test_1_11
test_1_10
test_1_9
test_1_8
test_1_7
test_1_6
test_1_5
test_1_4
test_1_3
test_1_2
test_1_1
test_0_10
test_0_9
test_0_8
test_0_7
test_0_6
test_0_5
test_0_4
test_0_3
test_0_2
test_0_1

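【Question comments】:

Because the sort you need is not a straight comparison of two values, you will want to use sort_by. The overhead of manipulating strings or digging into objects inside the comparison block can kill the performance of sort.

Tags: ruby sorting alphanumeric natural-sort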

【Solution 1】:

A generic algorithm for sorting strings that contain non-padded sequence numbers at arbitrary positions.

padding = 4
list.sort{|a,b|
  a,b = [a,b].map{|s| s.gsub(/\d+/){|m| "0"*(padding - m.size) + m } }
  a<=>b
}

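Here padding is the field length you want numbers to have during the comparison. Any number found in a string is zero-padded before the comparison if it has fewer than padding digits, which yields the expected sort order.

To produce the result user682932 asked for, simply append .reverse after the sort block, which flips the natural (ascending) order into descending order.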
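As a rough illustration of the padding idea, here is a minimal sketch that builds the padded key once per element with sort_by. The helper name padded_key and the sample list are assumptions made for this example, and rjust is swapped in for the string multiplication used above (it leaves a digit run unchanged when it is already wider than padding).

```ruby
# Hypothetical helper: zero-pad every run of digits to `padding` characters.
def padded_key(str, padding = 4)
  # rjust returns the digits unchanged if they are already wider than `padding`
  str.gsub(/\d+/) { |digits| digits.rjust(padding, "0") }
end

# Hypothetical sample data in the shape used by the question.
list = %w[test_1_9 test_0_2 test_1_121 test_1_10 test_0_10]

# Ascending natural order, then flipped to descending as the question asks.
puts list.sort_by { |s| padded_key(s) }.reverse
```

With sort_by the key is computed once per element instead of twice per comparison.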

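By pre-looping over the strings you can, of course, dynamically find the maximum number of digits in the list and use that instead of hard-coding an arbitrary padding length, but that requires more processing (slower) and more code. For example: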

padding = list.reduce(0){|max,s| 
  x = s.scan(/\d+/).map{|m|m.size}.max
  (x||0) > max ? x : max
}
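Putting the two steps together might look like the sketch below. The method names max_digit_run and natural_sort_padded are made up for this example, and rjust is again used for the padding itself.

```ruby
# Hypothetical helper: length of the longest run of digits in the list (0 if none).
def max_digit_run(list)
  list.flat_map { |s| s.scan(/\d+/) }.map(&:size).max || 0
end

# Hypothetical helper: natural ascending sort using a padding derived from the data.
def natural_sort_padded(list)
  padding = max_digit_run(list)
  list.sort_by { |s| s.gsub(/\d+/) { |m| m.rjust(padding, "0") } }
end

puts natural_sort_padded(%w[file10 readme file12345 file9])
```

Because the padding is derived from the data, no digit run can be longer than the padding.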

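【Comments】:

I like the padding idea, but it fails in two cases: 1. Digit sequences longer than the padding. 2. Long strings like "1.1.1.1..." combined with a large padding, which multiplies their size so much that they may not fit in memory.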
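The first case can be reproduced with a small sketch. The helper name pad_strict is an assumption for this example; its body is the padding expression from the answer, which ends up multiplying "0" by a negative count when a digit run is longer than the padding.

```ruby
# Hypothetical helper wrapping the padding expression from the answer.
def pad_strict(str, padding)
  str.gsub(/\d+/) { |m| "0" * (padding - m.size) + m }
end

puts pad_strict("test_1_121", 4)   # every digit run fits into 4 characters

begin
  pad_strict("test_1_12345", 4)    # "12345" is longer than the padding
rescue ArgumentError => e
  puts "padding too small: #{e.message}"
end
```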

【Solution 2】:

If you just sort as strings, you will not get the correct ordering between "test_2" and "test_10", for example. Do this instead:

a.sort_by { |s| s.scan(/\d+/).map { |n| n.to_i } }.reverse
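To make the difference concrete, here is a small sketch comparing a plain string sort with the numeric key. The helper name numeric_key and the sample names are assumptions for this example.

```ruby
# Hypothetical helper: every run of digits in the string, as integers.
def numeric_key(str)
  str.scan(/\d+/).map(&:to_i)
end

names = %w[test_10 test_2 test_1]

p names.sort                             # plain string comparison
p names.sort_by { |s| numeric_key(s) }   # numeric comparison of the digit runs
```

Note that this key ignores the non-digit parts entirely, so it only suits lists where those parts are identical.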

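【Comments】:

【Solution 3】:

You can pass a block to the sort function to customize the ordering. The problem in your case is that your numbers are not zero-padded, so this method zero-pads the numeric parts and then sorts on them, producing the order you want.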

a.sort { |a,b|
  ap = a.split('_')
  a = ap[0] + "%05d" % ap[1] + "%05d" % ap[2]
  bp = b.split('_')
  b = bp[0] + "%05d" % bp[1] + "%05d" % bp[2]
  b <=> a
}

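【Comments】:

【Solution 4】:

Processing time for sort routines can vary a lot. Benchmarking variations like these can quickly home in on the fastest way of doing things: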

#!/usr/bin/env ruby

ary = %w[
    test_0_1  test_0_2   test_0_3 test_0_4 test_0_5  test_0_6  test_0_7
    test_0_8  test_0_9   test_1_0 test_1_1 test_1_2  test_1_3  test_1_4  test_1_5
    test_1_6  test_1_7   test_1_8 test_1_9 test_1_10 test_1_11 test_1_12 test_1_13
    test_1_14 test_1_121
]

require 'ap'
ap ary.sort_by { |v| a,b,c = v.split(/_+/); [a, b.to_i, c.to_i] }.reverse

And its output:

>> [
>>     [ 0] "test_1_121",
>>     [ 1] "test_1_14",
>>     [ 2] "test_1_13",
>>     [ 3] "test_1_12",
>>     [ 4] "test_1_11",
>>     [ 5] "test_1_10",
>>     [ 6] "test_1_9",
>>     [ 7] "test_1_8",
>>     [ 8] "test_1_7",
>>     [ 9] "test_1_6",
>>     [10] "test_1_5",
>>     [11] "test_1_4",
>>     [12] "test_1_3",
>>     [13] "test_1_2",
>>     [14] "test_1_1",
>>     [15] "test_1_0",
>>     [16] "test_0_9",
>>     [17] "test_0_8",
>>     [18] "test_0_7",
>>     [19] "test_0_6",
>>     [20] "test_0_5",
>>     [21] "test_0_4",
>>     [22] "test_0_3",
>>     [23] "test_0_2",
>>     [24] "test_0_1"
>> ]

Testing the algorithms for speed shows:

require 'benchmark'

n = 50_000
Benchmark.bm(8) do |x|
  x.report('sort1') { n.times { ary.sort { |a,b| b <=> a }         } }
  x.report('sort2') { n.times { ary.sort { |a,b| a <=> b }.reverse } }
  x.report('sort3') { n.times { ary.sort { |a,b|
                                  ap = a.split('_')
                                  a = ap[0] + "%05d" % ap[1] + "%05d" % ap[2]
                                  bp = b.split('_')
                                  b = bp[0] + "%05d" % bp[1] + "%05d" % bp[2]
                                  b <=> a
                                } } }

  x.report('sort_by1') { n.times { ary.sort_by { |s| s                                               }         } }
  x.report('sort_by2') { n.times { ary.sort_by { |s| s                                               }.reverse } }
  x.report('sort_by3') { n.times { ary.sort_by { |s| s.scan(/\d+/).map{ |s| s.to_i }                 }.reverse } }
  x.report('sort_by4') { n.times { ary.sort_by { |v| a = v.split(/_+/); [a[0], a[1].to_i, a[2].to_i] }.reverse } }
  x.report('sort_by5') { n.times { ary.sort_by { |v| a,b,c = v.split(/_+/); [a, b.to_i, c.to_i]      }.reverse } }
end


>>               user     system      total        real
>> sort1     0.900000   0.010000   0.910000 (  0.919115)
>> sort2     0.880000   0.000000   0.880000 (  0.893920)
>> sort3    43.840000   0.070000  43.910000 ( 45.970928)
>> sort_by1  0.870000   0.010000   0.880000 (  1.077598)
>> sort_by2  0.820000   0.000000   0.820000 (  0.858309)
>> sort_by3  7.060000   0.020000   7.080000 (  7.623183)
>> sort_by4  6.800000   0.000000   6.800000 (  6.827472)
>> sort_by5  6.730000   0.000000   6.730000 (  6.762403)
>>

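sort1, sort2, sort_by1 and sort_by2 help establish baselines for sort, sort_by and reverse.

sort3 and sort_by3 are two of the other answers on this page. sort_by4 and sort_by5 are two spins on how I would do it, with sort_by5 being the fastest I came up with after a few minutes of tinkering.

This shows how small differences in the algorithm affect the running time. With more iterations, or larger arrays being sorted, the difference becomes more extreme.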
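One likely reason for the gap between sort3 and the sort_by variants is how often the key gets computed: a sort block derives it for both operands on every comparison, while sort_by derives it once per element. The sketch below counts the calls; the method name count_key_calls and the sample data are assumptions for this example.

```ruby
# Hypothetical helper: count how often the key is computed by sort vs sort_by.
def count_key_calls(ary)
  calls = Hash.new(0)
  key = lambda do |label, v|
    calls[label] += 1
    prefix, major, minor = v.split('_')
    [prefix, major.to_i, minor.to_i]
  end

  # sort: the key is rebuilt for both operands of every comparison
  ary.sort { |x, y| key.call(:sort, x) <=> key.call(:sort, y) }
  # sort_by: the key is built exactly once per element
  ary.sort_by { |v| key.call(:sort_by, v) }

  calls
end

p count_key_calls(%w[test_1_9 test_0_2 test_1_121 test_1_10 test_0_10 test_0_1])
```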

【Comments】:

【Solution 5】:

Similar to @ctcherry's answer, but faster:

a.sort_by {|s| "%s%05i%05i" % s.split('_') }.reverse
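A quick sketch of how this one-liner relates to the integer-array key from the benchmark answer above. The method names by_format and by_int_array are made up for this example. Both agree on the question's data, but the fixed width of five digits means they can diverge once a field has six or more digits.

```ruby
# Hypothetical wrapper around the format-based key from this answer.
def by_format(ary)
  ary.sort_by { |s| "%s%05i%05i" % s.split('_') }.reverse
end

# Hypothetical wrapper around the integer-array key from the benchmark answer.
def by_int_array(ary)
  ary.sort_by { |v| a, b, c = v.split('_'); [a, b.to_i, c.to_i] }.reverse
end

small = %w[test_1_9 test_0_2 test_1_121 test_1_10]
p by_format(small) == by_int_array(small)

# A six-digit field no longer fits the %05i width.
wide = %w[test_1_99999 test_1_100000]
p by_format(wide)
p by_int_array(wide)
```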

Edit: my tests:

require 'benchmark'
ary = []
100_000.times { ary << "test_#{rand(1000)}_#{rand(1000)}" }
ary.uniq!; puts "Size: #{ary.size}"

Benchmark.bm(5) do |x|
  x.report("sort1") do
    ary.sort_by {|e| "%s%05i%05i" % e.split('_') }.reverse
  end
  x.report("sort2") do
    ary.sort { |a,b|
      ap = a.split('_')
      a = ap[0] + "%05d" % ap[1] + "%05d" % ap[2]
      bp = b.split('_')
      b = bp[0] + "%05d" % bp[1] + "%05d" % bp[2]
      b <=> a
    } 
  end
  x.report("sort3") do
    ary.sort_by { |v| a, b, c = v.split(/_+/); [a, b.to_i, c.to_i] }.reverse
  end
end

Output:

Size: 95166

           user     system      total        real
sort1  3.401000   0.000000   3.401000 (  3.394194)
sort2 94.880000   0.624000  95.504000 ( 95.722475)
sort3  3.494000   0.000000   3.494000 (  3.501201)

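【Comments】:

Why not show your tests? Benchmark results are very useful in explaining why something is faster.

【Solution 6】:

Posting here a more general way to perform a natural decimal sort in Ruby. The following is inspired by my code for sorting "like Xcode" from https://github.com/CocoaPods/Xcodeproj/blob/ca7b41deb38f43c14d066f62a55edcd53876cd07/lib/xcodeproj/project/object/helpers/sort_helper.rb, which itself is loosely inspired by https://rosettacode.org/wiki/Natural_sorting#Ruby.

Even if it is clear that we want "10" to come after "2" for a natural decimal sort, there are other aspects to consider, with several possible alternative behaviors:

How do we treat equality like "001"/"01": do we keep the original array order, or do we have a fallback logic? (Below, the choice is to do a second pass with a strict ordering logic when the first pass finds equality.)
Do we ignore consecutive spaces for sorting, or does each space character count? (Below, the choice is to ignore consecutive spaces on the first pass, and to do a strict comparison on the equality pass.)
Same question for other special characters. (Below, the choice is to make any non-space, non-digit character count individually.)
Do we ignore case or not; is "a" before or after "A"? (Below, the choice is to ignore case on the first pass, and to have "a" before "A" on the equality pass.)

With those considerations:

It means that we should almost certainly use scan instead of split, because we potentially have three kinds of substrings to compare (digits, spaces, everything else).
It means that we should almost certainly work with a Comparable class and def <=>(other), because it is not possible to simply map each substring to something else that would compare differently depending on context (the first pass versus the equality pass).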
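To see what the scan-based split produces, here is a small sketch of the first-pass tokenization, using the same regular expression as the implementation in this answer. The method name natural_tokens is an assumption for this example.

```ruby
# Hypothetical helper: split a string into digit runs, whitespace runs and the rest.
def natural_tokens(str)
  str.scan(/\d+|\s+|[^\d\s]+/).map do |tok|
    case tok
    when /\A\d/ then Integer(tok, 10) # digit run, compared as an integer
    when /\A\s/ then ' '              # any run of whitespace collapses to one space
    else tok.downcase                 # everything else, case ignored
    end
  end
end

p natural_tokens("A1B001")
p natural_tokens("A01B1")
p natural_tokens("a   2")
```

"A1B001" and "A01B1" produce the same tokens, which is why a second, stricter pass is needed to order them deterministically.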

This results in a somewhat lengthy implementation, but it works nicely for edge cases:

  # Wrapper for a string that performs a natural decimal sort (alphanumeric).
  # @example
  #   arrayOfFilenames.sort_by { |s| NaturalSortString.new(s) }
  class NaturalSortString
    include Comparable
    attr_reader :str_fallback, :ints_and_strings, :ints_and_strings_fallback, :str_pattern

    def initialize(str)
      # fallback pass: case is inverted
      @str_fallback = str.swapcase
      # first pass: digits are used as integers, spaces are compacted, case is ignored
      @ints_and_strings = str.scan(/\d+|\s+|[^\d\s]+/).map do |s|
        case s
        when /\d/ then Integer(s, 10)
        when /\s/ then ' '
        else s.downcase
        end
      end
      # second pass: digits are inverted, case is inverted
      @ints_and_strings_fallback = @str_fallback.scan(/\d+|\D+/).map do |s|
        case s
        when /\d/ then Integer(s.reverse, 10)
        else s
        end
      end
      # comparing patterns
      @str_pattern = @ints_and_strings.map { |el| el.is_a?(Integer) ? :i : :s }.join
    end

    def <=>(other)
      if str_pattern.start_with?(other.str_pattern) || other.str_pattern.start_with?(str_pattern)
        compare = ints_and_strings <=> other.ints_and_strings
        if compare != 0
          # we sort naturally (literal ints, spaces simplified, case ignored)
          compare
        else
          # natural equality, we use the fallback sort (int reversed, case swapped)
          ints_and_strings_fallback <=> other.ints_and_strings_fallback
        end
      else
        # type mismatch, we sort alphabetically (case swapped)
        str_fallback <=> other.str_fallback
      end
    end
  end

Usage

Example 1:

arrayOfFilenames.sort_by { |s| NaturalSortString.new(s) }

Example 2:

arrayOfFilenames.sort! do |x, y|
  NaturalSortString.new(x) <=> NaturalSortString.new(y)
end

You can find my test cases at https://github.com/CocoaPods/Xcodeproj/blob/ca7b41deb38f43c14d066f62a55edcd53876cd07/spec/project/object/helpers/sort_helper_spec.rb; I used this reference for the ordering: [ ' a', ' a', '0.1.1', '0.1.01', '0.1.2', '0.1.10', '1', '01', '1a', '2', '2 a', '10', 'a', 'a', 'a ', 'a2', 'a1', 'A1B001', 'A01B1', ]

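Of course, feel free to customize your own sorting logic now.

【Comments】:

Could this somehow be applied to Dir.entries(src).sort.each do |item|, where the file names look like 1886–7 Los Angeles City Directory. Bynon. p221. Suzzalo.jpg, and I want to naturally sort on the number after the p? The numbers can go up to 9999 and may include other characters such as pi, but for my purposes anything other than the digits can be ignored, though those files still need to be included (ending up at the end or the beginning). I can imagine grabbing all the file names into an array and going from there, but I am not good at Ruby, so a more direct way would be preferred. Maybe there is some Ruby magic for this too.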
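One possible direction for this, offered only as a sketch: pull out the digits that follow a standalone "p" and sort on that integer. The helper name page_number and the sample file names are assumptions for this example; names without such a number get 0 and therefore sort to the beginning.

```ruby
# Hypothetical helper: the integer following a word-initial "p", or 0 if there is none.
def page_number(filename)
  filename[/\bp(\d+)/, 1].to_i
end

# Hypothetical file names in the shape described in the comment.
files = [
  "Directory. Bynon. p221. Suzzalo.jpg",
  "Directory. Bynon. p1000. Suzzalo.jpg",
  "Directory. Bynon. cover. Suzzalo.jpg",
  "Directory. Bynon. p9. Suzzalo.jpg",
]

puts files.sort_by { |f| page_number(f) }
```

The same block could be passed to sort_by on the array returned by Dir.entries(src) in place of the plain sort.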

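【Solution 7】:

I looked at the Wikipedia page for the Unix sort function; the GNU version of it has a -V flag, which sorts "version strings" generically. (I take this to mean strings that mix digits and non-digits, where you want the numeric parts sorted numerically and the non-numeric parts sorted lexically.)

The article states that:

The GNU implementation has a -V --version-sort option which is a natural sort of (version) numbers within text. Two text strings that are to be compared are split into blocks of letters and blocks of digits. Blocks of letters are compared alpha-numerically, and blocks of digits are compared numerically (i.e., skipping leading zeros, more digits means larger, otherwise the leftmost digits that differ determine the result). Blocks are compared left-to-right and the first non-equal block in that loop decides which text is larger. This happens to work for IP addresses, Debian package version strings and similar tasks where numbers of variable length are embedded in strings.

sawa's solution works a bit like this, but does not sort on the non-numeric parts.

So it seems useful to post a solution somewhere between Coeur's and sawa's, which works like GNU sort -V: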

a.sort_by do |r|
  # Split the field into array of [<string>, nil] or [nil, <number>] pairs
  r.to_s.scan(/(\D+)|(\d+)/).map do |m|
    s,n = m
    n ? n.to_i : s.to_s # Convert number strings to integers
  end.to_a
end
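A short usage sketch of this key, with one caveat worth knowing. The helper name version_key and the sample strings are assumptions for this example. Because the key arrays mix Integers and Strings, two strings whose blocks differ in type at the same position (one starts with a letter, the other with a digit) cannot be compared, and sort_by raises.

```ruby
# Hypothetical helper: the same key as above, extracted into a method.
def version_key(str)
  str.to_s.scan(/(\D+)|(\d+)/).map { |s, n| n ? n.to_i : s }
end

p %w[a10 a9 a100].sort_by { |s| version_key(s) }

begin
  # ["a", 1] and [1, "a"] compare a String with an Integer at index 0.
  %w[a1 1a].sort_by { |s| version_key(s) }
rescue ArgumentError => e
  puts "cannot compare: #{e.message}"
end
```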

In my case I wanted to sort a TSV file by fields in this way, so as a bonus, here is the script for that case as well:

require 'csv'

# Sorts a tab-delimited file input on STDIN, sortin

opts = {
  headers:true,
  col_sep: "\t",
  liberal_parsing: true,
}

table = CSV.new($stdin, **opts)


# Emulate unix's sort -V: split each field into an array of string or
# numeric values, and sort by those in turn. So for example, A10
# sorts above A100.
sorted_ary = table.sort_by do |r|
  r.fields.map do |f|
    # Split the field into array of [<string>, nil] or [nil, <number>] values
    f.to_s.scan(/(\D+)|(\d+)/).map do |m|
      s,n = m
      n ? n.to_i : s.to_s # Convert number strings to integers
    end.to_a
  end
end

puts CSV::Table.new(sorted_ary).to_csv(**opts)

(As an aside: another solution here uses Gem::Version for sorting, but that only seems to work with well-formed Gem version strings.)
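For reference, a minimal sketch of the Gem::Version approach and of the limitation mentioned. The method name gem_version_sort is an assumption for this example; Gem::Version itself ships with RubyGems, which is normally loaded by default.

```ruby
require 'rubygems' # usually already loaded; kept for explicitness

# Hypothetical helper: sort dotted version strings using Gem::Version.
def gem_version_sort(versions)
  versions.sort_by { |v| Gem::Version.new(v) }
end

p gem_version_sort(%w[1.10 1.2 1.9])

begin
  # Strings in the question's shape are not valid gem versions.
  Gem::Version.new("test_1_2")
rescue ArgumentError => e
  puts "not a gem version: #{e.message}"
end
```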

【Comments】:

Be careful though: TSV is sometimes subtly different from CSV with a \t separator: stackoverflow.com/questions/4404787/…

【Solution 8】:

By the looks of it, you want to use the sort function and/or the reverse function.

ruby-1.9.2-p136 :009 > a = ["abc_1", "abc_11", "abc_2", "abc_3", "abc_22"]
 => ["abc_1", "abc_11", "abc_2", "abc_3", "abc_22"] 

ruby-1.9.2-p136 :010 > a.sort
 => ["abc_1", "abc_11", "abc_2", "abc_22", "abc_3"] 
ruby-1.9.2-p136 :011 > a.sort.reverse
 => ["abc_3", "abc_22", "abc_2", "abc_11", "abc_1"]

【Comments】:

Try this with a = [test_0_1, test_0_2, test_0_3, test_0_4, test_0_5, test_0_6, test_0_7, test_0_8, test_0_9, test_1_0, test_1_1, test_1_2, test_1_3, test_1_4, test_1_5, test_1_6, test_1_7, test_1_8, test_1_9, test_1_10, test_1_11, test_1_12, test_1_13, test_1_14, ............test_1_121......], which has two underscores. It does not work.

【Solution 9】:

OK, judging by your output, it looks like you just want it reversed, so use reverse():

a.reverse

【Comments】: