Ruby regex into array of hashes but need to drop a key/val pair (answers)

【Question title】: Ruby regex into array of hashes but need to drop a key/val pair
【Posted】: 2014-01-30 19:30:05
【Question description】:

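I'm trying to parse a file in which each line contains a name followed by a hierarchy path. I want to take the named regex matches, turn the group names into hash keys, and store the matched text as the values. Each hash is pushed onto an array, so after parsing the whole file I end up with an array of hashes. This part of the code works, except that I now need to handle bad paths with a duplicated hierarchy (top_* is always the top level). It appears that if I use named backreferences in Ruby, I need to name all of the backreferences. I have the match working in Rubular, but now the p1 backreference shows up in my resulting hash.

Question: What's the easiest way to not include the p1 key/value pair in the hash? My method is used in other places, so we can't assume that p1 always exists. Am I stuck with deleting the key/value pair from every hash in the array after calling the s_ary_to_hash method?

NOTE: I'm keeping this question open to solve the specific problem of ignoring certain hash keys in my method. The regex issue is now in this ticket: Ruby regex - using optional named backreferences

UPDATE: The regex issue is solved; hier is now always stored in the named "hier" group. The only item remaining is to figure out how to drop the "p1" key/value pair, if it exists, before creating the hash.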

Example file:

name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops  top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog

Expected output:

[{:name => "name1", :hier => "top_cat/mouse/dog/elephant/horse"},
 {:name => "new12", :hier => "top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
 {:name => "tops",  :hier => "top_bat/car[0]"},
 {:name => "ab123", :hier => "top_2/top_1/top_3/top_4/dog"}]

Code snippet:

def s_ary_to_hash(ary, regex)
  retary = Array.new
  ary.each {|x| (retary << Hash[regex.match(x).names.map{|key| key.to_sym}.zip(regex.match(x).captures)]) if regex.match(x)}
  return retary
end

regex = %r{(?<name>\w+) (?<p1>[\w\/\[\]]+)?(?<hier>(\k<p1>.*)|((?<= ).*$))}
h_ary = s_ary_to_hash(File.readlines(filename), regex)

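【Question comments】:

Do you have .html/.xml files? If so, use nokogiri.
Agreed, but this isn't HTML or XML... it's a dump from another program that I can't touch.
Greg, regardless of which regex you use, consider replacing the three lines of s_ary_to_hash with ary.each_with_object([]) { |x, retary| .... }.
Good tip, I'd forgotten about the |x, retary| way of collecting results. Thanks!
@Greg, when I tested your code I didn't get the last line of your expected output. Could you check?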
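
The each_with_object suggestion in these comments could look roughly like the sketch below. It keeps the behavior of the original method but calls regex.match only once per line. The regex and the input lines here are simplified stand-ins chosen for illustration, not the ones from the question.

```ruby
# Sketch of s_ary_to_hash rewritten with each_with_object.
def s_ary_to_hash(ary, regex)
  ary.each_with_object([]) do |x, retary|
    m = regex.match(x)      # match each line only once
    next unless m           # skip lines that do not match
    retary << Hash[m.names.map(&:to_sym).zip(m.captures)]
  end
end

# Simplified stand-in regex and data (assumptions for this example only).
regex = /(?<name>\w+) (?<hier>\S+)/
lines = ["name1 top_cat/mouse", "not-a-match", "tops top_bat/car[0]"]
h_ary = s_ary_to_hash(lines, regex)
# h_ary holds two hashes, one per matching line, with keys :name and :hier
```

each_with_object returns the accumulator itself, so no explicit return is needed.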

Tags: ruby regex hash

【Solution 1】:

What about this regex?

^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$

Demo

http://rubular.com/r/awEP9Mz1kB

Example code

def s_ary_to_hash(ary, regex, mappings)
   retary = Array.new

   for item in ary
      tmp = regex.match(item)
      if tmp then
         hash = Hash.new
         retary.push(hash)
         mappings.each { |mapping|
            mapping.map { |key, groups|
              for group in groups
                 if tmp[group] then
                     hash[key] = tmp[group]
                     break
                 end
              end 
            }
         }
      end
   end

  return retary
end

regex = %r{^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$}
h_ary = s_ary_to_hash(
   File.readlines(filename), 
   regex,
   [ 
      {:name => ['name']},
      {:hier => ['hier','p1']}
   ]
)

puts h_ary

Output

{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse\r"}
{:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool\r"}
{:name=>"tops", :hier=>"top_bat/car[0]"}

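Discussion

Since Ruby 2.0.0 doesn't support branch reset, I built a solution that adds some functionality to the s_ary_to_hash function. It now accepts a third parameter indicating how to build the final array of hashes.

The third parameter is an array of hashes. Each hash in this array has one key (K), which corresponds to a key in the hashes of the final array. K is associated with an array of named groups to read from the regex that was passed in (the second parameter of s_ary_to_hash).

If a group equals nil, s_ary_to_hash skips it and moves on to the next group.

If all groups equal nil, K is left out of that line's hash in the final array. Feel free to modify s_ary_to_hash if this isn't the behavior you want.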
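
The fallback rule described above (first non-nil group wins, and a key is omitted when every group is nil) can be sketched compactly. Everything in this example is illustrative: pick_groups is a hypothetical helper, the mappings are passed as a single hash rather than an array of one-key hashes, and the regex is a toy pattern rather than the one from the answer.

```ruby
# Hypothetical helper: build one hash from a MatchData using fallback groups.
def pick_groups(match, mappings)
  mappings.each_with_object({}) do |(key, groups), hash|
    group = groups.find { |g| match[g] }  # first group that actually matched
    hash[key] = match[group] if group     # all groups nil: key is left out
  end
end

# Toy regex with two optional groups (an assumption for this example).
regex    = /\A(?<name>\w+)(?: (?<first>\d+))?(?: (?<second>[a-z]+))?\z/
mappings = { name: ['name'], value: ['first', 'second'] }

a = pick_groups(regex.match('id1 42'),  mappings)  # :value comes from 'first'
b = pick_groups(regex.match('id2 abc'), mappings)  # :value falls back to 'second'
c = pick_groups(regex.match('id3'),     mappings)  # :value is omitted entirely
```

With this shape, a group such as p1 never appears in the result unless a mapping explicitly names it.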

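【Comments】:

Alex, when I run this I get an array of three hashes. The first two are the same as in the "expected output". The third, however, is {:name=>"tops", :hier=>"top_bat/car[0]"}]. Greg, is the "expected output" correct?
@CarySwoveland Do you need to drop top_ in the third hash?
Oops, I had a typo in my expected output. It's corrected now.
Alex, that regex doesn't meet the criterion of removing the duplicated hierarchy.
Hi, :p1 is still showing up in my hash (tested in irb). The functionality of my regex isn't the issue; the issue is that the p1 key/value pair gets stored in my hash. Original question --> What's the easiest way to not include the p1 key/value pair in the hash?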
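
One possible way to address the request in the last comment is to filter the group names inside the method before the hash is built. The sketch below assumes a hypothetical optional third parameter, skip, which is not part of the original method; the regex is a simplified stand-in. Because the names are removed by array difference, nothing breaks when the regex has no p1 group at all.

```ruby
# Hypothetical variant of s_ary_to_hash: `skip` lists group names to leave out.
def s_ary_to_hash(ary, regex, skip = [])
  ary.each_with_object([]) do |x, retary|
    m = regex.match(x)
    next unless m
    names = m.names - skip.map(&:to_s)   # drop unwanted groups, if present
    retary << Hash[names.map { |n| [n.to_sym, m[n]] }]
  end
end

# Simplified stand-in regex (an assumption for this example).
regex = /(?<name>\w+) (?<p1>top_\w+)\/(?<hier>.*)/
lines = ["name1 top_cat/mouse/dog"]

with_p1    = s_ary_to_hash(lines, regex)         # hashes include :p1
without_p1 = s_ary_to_hash(lines, regex, [:p1])  # hashes contain only :name and :hier
```

Existing callers that pass only two arguments keep their current behavior, since skip defaults to an empty list.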

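【Solution 2】:

Edit: I've changed the method s_ary_to_hash to conform to what I now understand to be the criterion for excluding directories: a directory d is excluded if there is a downstream directory with the same name, or with the same name followed by a non-negative integer in brackets. I've applied this to all directories, though I may have misunderstood the question; perhaps it should apply only to the first.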

data =<<THE_END
name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops  top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
THE_END

text = data.split("\n")

def s_ary_to_hash(ary)
  ary.map do |s| 
    name, _, downstream_path = s.partition(' ').map(&:strip)
    arr = []
    downstream_dirs = downstream_path.split('/')
    while downstream_dirs.any? do
      dir = downstream_dirs.shift
      arr << dir unless downstream_dirs.any? { |d|
        d == dir || d =~ /#{dir}\[\d+\]/ }
    end     
    { name: name, hier: arr.join('/') }
  end   
end

s_ary_to_hash(text)
  # => [{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse"},
  #     {:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
  #     {:name=>"tops", :hier=>"top_bat/car[0]"},
  #     {:name=>"ab123", :hier=>"top_2/top_1/top_3/top_4/dog"}]

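The exclusion criterion is implemented in downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }, where dir is the directory being tested and downstream_dirs is an array of all downstream directories. (When dir is the last directory, downstream_dirs is empty.) Localizing it this way makes it easy to test and change the exclusion criterion. You could shorten this to a single regex and/or make it a method: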

def exclude_dir?(dir, downstream_dirs)
  downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }
end
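
As an illustration of how easy the criterion is to change once it lives in its own method, here is a sketch of a stricter variant. It is an assumption layered on top of the answer, not part of it: exclude_dir_strict? and dedupe_path are hypothetical names, and the pattern is escaped with Regexp.escape and anchored so that a directory such as "at" is not excluded merely because "hat[1]" appears downstream.

```ruby
# Hypothetical stricter variant: escape dir and anchor the pattern.
def exclude_dir_strict?(dir, downstream_dirs)
  indexed = /\A#{Regexp.escape(dir)}\[\d+\]\z/
  downstream_dirs.any? { |d| d == dir || d =~ indexed }
end

# Hypothetical helper applying the criterion to a whole path.
def dedupe_path(path)
  dirs = path.split('/')
  kept = []
  until dirs.empty?
    dir = dirs.shift                                   # dirs now holds only downstream entries
    kept << dir unless exclude_dir_strict?(dir, dirs)
  end
  kept.join('/')
end

dedupe_path('top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse')
# => "top_cat/mouse/dog/elephant/horse"
```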

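【Comments】:

Hi, Cary. I want to take the first element of the second string (elements are delimited by slashes) and see whether it occurs anywhere in the rest of the string. If it does, I want my returned match to contain everything from the second occurrence onward. If not, I want my match to contain the entire original second string.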

【Solution 3】:

Here is a non-regex solution:

result = string.each_line.map do |line|
  name, path = line.split(' ')
  path = path.split('/')
  last_occur_of_root = path.rindex(path.first)
  path = path[last_occur_of_root..-1]
  {name: name, hier: path.join('/')}
end

【Comments】: