【Question Title】: Regexp to filter a table
【Posted】: 2013-02-26 07:02:23
【Question】:

OK, I have a table that is output by some open-source software, but it isn't output in an actual table format, e.g.

<table>
  <thead>
     <td>Heading</td>
  </thead>
  <tbody>
    <tr>
       <td>Content</td>
    </tr>
  </tbody>
</table>

Instead, whoever developed the software thought it would be a good idea to output the table like this

+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1  | HEADING 2   | ETC   | ANOTHER     | HEADING3   | HEADING4     | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS        AGENTS:21  |  total|        total|       total|         total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+

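So I can't build a web scraper to get the data, or at least I don't know whether I can, because the whole thing is wrapped in a <pre> </pre> tag. Instead, I've been trying to get the job done with Ruby and a regex. So far I've managed to match all the leading | characters and the +-------+----- border, but that's it: it seems I have to repeat the pattern across the whole line, and it doesn't want to repeat itself. Enough talk; here is the code I'm using so far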

text.lines.to_a.each do |line|
  line.sub(/^\| |^\+*-*\+*\-*/) do |match|
    puts "Regexp Match: " << match
  end
  STDIN.getc
  puts "New Line " << line
end

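For example, the output for the first line is just +-----------------+---------- The target is CSV format, so I use gsub to replace the remaining | with ,

I can use PHP or Ruby, so any answer is very welcome

【Comments】: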

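  • Use an HTML parser to extract the text inside the pre tag, then use substrings to extract the data (I assume the columns are at fixed positions). If the column widths are fixed within one table but differ between tables, you can analyze the header to work out the width of each column.
  • Here is an example of what the report currently looks like: s7.postimage.org/gicwtx9xn/vicidial.png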
  • @nhahtdh The column widths aren't fixed, I wish they were. The joys of working with half-finished code -.-
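  • Can | appear in the content? If | never appears in the content, you can split on |. By fixed width I mean that each column's width is fixed (different columns may have different widths, but every row of a given column must have the same width).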
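
To make the fixed-position idea from the comments concrete, here is a minimal Ruby sketch. It assumes every row lines up exactly with the +---+ border, which may not hold for the real report. The sample table and the names TABLE, column_ranges and parse_fixed_width are made up for illustration. The column offsets are read from the border line, and each row is sliced at those offsets rather than split on |, so a | inside a cell survives.

```ruby
# Hypothetical sample: a well-aligned table where every row matches the border.
TABLE = <<~TEXT
  +-------+--------------+------+
  | NAME  | NOTE         | QTY  |
  +-------+--------------+------+
  | alpha | first item   | 1    |
  | beta  | a | b        | 22   |
  +-------+--------------+------+
TEXT

# Find the offset of every "+" in a border line; each adjacent pair
# of offsets delimits one column.
def column_ranges(border)
  plus_positions = (0...border.length).select { |i| border[i] == '+' }
  plus_positions.each_cons(2).map { |left, right| (left + 1)...right }
end

# Slice each data row at the column offsets instead of splitting on "|",
# so a "|" inside a cell does not break the row apart.
def parse_fixed_width(text)
  lines = text.lines.map(&:chomp)
  ranges = column_ranges(lines.first)
  lines.reject { |line| line.start_with?('+') }
       .map { |line| ranges.map { |range| line[range].to_s.strip } }
end

rows = parse_fixed_width(TABLE)
rows.each { |row| p row }
```

If the rows do not line up with the border, this slicing will cut cells in the wrong place, and splitting on | is the safer choice.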

Tags: php ruby regex ruby-on-rails-3 codeigniter


【Answer 1】:

This may not be clean enough, but it works for this example :) Ruby:

@text = <<END
+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1  | HEADING 2   | ETC   | ANOTHER     | HEADING3   | HEADING4     | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
| content   | more content | cont  | More more   | content    | content 2.0  | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS        AGENTS:21  |  total|        total|       total|         total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+
END
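# Capture the text between the leading "| " and the trailing "|" of each row.
# Border lines start with "+", so they never match.
inner_rows = @text.scan(/^\|\s(.*)\|$/).flatten
# Split each captured row on "|" and trim the padding around every cell.
arr = inner_rows.map { |row| row.split('|').map(&:strip) }
# Print one array of cells per table row.
arr.each do |cells|
  p cells
end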

【Comments】:

    【Answer 2】:

    Here is a complete Ruby solution. You'll need to manually add the | to the last line, though.

    require 'builder'
    
    table = '+------------+-------------+-------+-------------+------------+---------------+----------+
    | HEADING 1  | HEADING 2   | ETC   | ANOTHER     | HEADING3   | HEADING4     | SML |
    +------------+-------------+-------+-------------+------------+---------------+----------+
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    +------------+-------------+-------+-------------+------------+--------------+----------+
    | TOTALS        AGENTS:21  |  total|        total|       total|         total| total|
    +------------+-------------+-------+-------------+------------+--------------+----------+';
    
    def parse_table(table)
      rows = []
      table.each_line do |line|
        next if line.match /^\+/
        rows << line.split(/\s*\|\s*/).reject(&:empty?) 
      end
      rows
    end
    
    def html_row(xml, columns)
      xml.tr do
        columns.each do |column|
          xml.td column
        end
      end
    end
    
    def html_table(rows)
      head_row = rows.first
      body_rows = rows[1..-1]
    
      xml = Builder::XmlMarkup.new :indent => 2
      xml.table do
        xml.thead do
          html_row xml, head_row
        end
        xml.tbody do
          body_rows.each do |body_row|
            html_row xml, body_row
          end
        end
      end.to_s
    end
    
    
    rows = parse_table(table)
    html = html_table(rows)
    puts html
    

    Output:

    <table>
      <thead>
        <tr>
          <td>HEADING 1</td>
          <td>HEADING 2</td>
          <td>ETC</td>
          <td>ANOTHER</td>
          <td>HEADING3</td>
          <td>HEADING4</td>
          <td>SML</td>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>content</td>
          <td>more content</td>
          <td>cont</td>
          <td>More more</td>
          <td>content</td>
          <td>content 2.0</td>
          <td>litl</td>
        </tr>
        <tr>
          <td>TOTALS        AGENTS:21</td>
          <td>total</td>
          <td>total</td>
          <td>total</td>
          <td>total</td>
          <td>total</td>
        </tr>
      </tbody>
    </table>
    

    【Comments】:

    • @paddle Wow. Or is there a gem?
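    • So your desired output is CSV?
    • Take a look at the csv class in the Ruby standard library
    • @paddle Yes. I was going to use fastcsv to help at first, but it seems to have been deprecated?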
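
Following the suggestion above to use the standard library's CSV class, here is a rough sketch of going from the ASCII table straight to CSV text. The sample input and the names SAMPLE and table_to_csv are hypothetical. It assumes every data row starts and ends with |, and that | never appears inside a cell.

```ruby
require 'csv'

# Hypothetical sample input in the same "+---+" / "|" layout as the question.
SAMPLE = <<~TEXT
  +---------+--------------+------+
  | HEADING | NOTE         | QTY  |
  +---------+--------------+------+
  | content | more content | 1    |
  | content | a, b         | 2    |
  +---------+--------------+------+
TEXT

# Turn the ASCII table into CSV text: keep only lines that start with "|",
# split them on "|" plus surrounding whitespace, and let CSV handle quoting.
def table_to_csv(text)
  CSV.generate do |csv|
    text.each_line do |line|
      next unless line.start_with?('|')
      # A limit of -1 keeps trailing empty strings, so [1..-2] removes exactly
      # the two empty pieces outside the outer pipes and keeps empty cells.
      csv << line.strip.split(/\s*\|\s*/, -1)[1..-2]
    end
  end
end

csv_text = table_to_csv(SAMPLE)
puts csv_text
```

Letting CSV build each line, instead of replacing | with , through gsub, means a cell that itself contains a comma gets quoted rather than silently becoming two columns.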
    【Answer 3】:

    Check this out:

    $table = '+------------+-------------+-------+-------------+------------+---------------+----------+
    | HEADING 1  | HEADING 2   | ETC   | ANOTHER     | HEADING3   | HEADING4     | SML |
    +------------+-------------+-------+-------------+------------+---------------+----------+
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    | content   | more content | cont  | More more   | content    | content 2.0  | litl |
    +------------+-------------+-------+-------------+------------+--------------+----------+
    | TOTALS        AGENTS:21  |  total|        total|       total|         total| total|
    +------------+-------------+-------+-------------+------------+--------------+----------+';
    
    $lines = preg_split('/\r\n|\r|\n/', $table);
    $array = array();
    foreach($lines as $line){
      if(!preg_match('/\+-+\+/', $line)){
        $array[] = preg_split('/\s*\|\s*/', trim($line, '| '));
      }
    }
    
    print_r($array);
    

    Output:

    Array
    (
        [0] => Array
            (
                [0] => HEADING 1
                [1] => HEADING 2
                [2] => ETC
                [3] => ANOTHER
                [4] => HEADING3
                [5] => HEADING4
                [6] => SML
            )
    
        [1] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [2] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [3] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [4] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [5] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [6] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [7] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [8] => Array
            (
                [0] => content
                [1] => more content
                [2] => cont
                [3] => More more
                [4] => content
                [5] => content 2.0
                [6] => litl
            )
    
        [9] => Array
            (
                [0] => TOTALS        AGENTS:21
                [1] => total
                [2] => total
                [3] => total
                [4] => total
                [5] => total
            )
    
    )
    

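    Hope this helps :)

    【Comments】:

    • What's with all the global variables? What's the point of using them here?
    【Answer 4】:

    For the main job of getting the fields out of the table, use split with a pattern on each line: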

    line.split(/\s*\|\s*/)
    

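    This splits the line into an array on each | and any surrounding whitespace. Discard the first and last elements of the array, since the pattern also matches the leading and trailing |.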
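
As a small illustration of the split described above, on a made-up row (the variable names are just for the example): Ruby's String#split drops trailing empty strings by default, so in practice only the leading empty element tends to be left over, and rejecting empty strings covers both ends.

```ruby
# Hypothetical row in the same layout as the question's table.
line = "| content   | more content | cont  |\n"

fields = line.split(/\s*\|\s*/)
# The text before the leading "|" is empty, so the first element is "".
# String#split drops trailing empty strings by default, so there may be
# no empty element at the end; rejecting empties covers both cases.
cells = fields.reject(&:empty?)
p cells
```

Note that rejecting empty strings would also drop a genuinely empty cell in the middle of a row, so this only suits tables where every cell has content.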

    【Comments】:
