【Question title】: What's a semantically-correct way to parse CSV from SQL Server 2008?
【Posted】: 2013-01-12 11:24:45
【Question】:

I have a CSV dump from SQL Server 2008 that contains lines like these:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00

parse_dbenhur is pretty, but can it be rewritten to support fields that contain both commas and quotes? parse_ugly is ugly.

# @dbenhur's excellent answer, which works 100% for what i originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end

def parse_ugly(line)
  dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
  fields = []
  open = false
  dumb_fields.each_with_index do |v, i|
    open ? fields.last.concat(v) : fields.push(v)
    open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
  end
  fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end

lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'

require 'csv'
lines.each do |line|
  puts
  puts line
  begin
    c = CSV.parse_line(line)
    puts "#{c.to_csv.chomp} (size #{c.length})"
  rescue
    puts "FasterCSV says: #{$!}"
  end
  a = parse_ugly(line)
  puts "#{a.to_csv.chomp} (size #{a.length})"
  b = parse_dbenhur(line)
  puts "#{b.to_csv.chomp} (size #{b.length})"
end

Here is the output when I run it:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)

Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)

Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)

Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)

Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
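For contrast, here is a minimal sketch of what standard CSV quoting looks like for the first sample row, using Ruby's own `csv` library. The names `SAMPLE_FIELDS`, `RAW_DUMP_LINE` and `malformed?` are made up for this illustration, and the field values are assumed to be what the dump was meant to contain.

```ruby
require 'csv'

# Assumed intended field values for the first sample row (hypothetical constant).
SAMPLE_FIELDS = [
  'Plumbing',
  '196222006P',
  'REPLACE LEAD WATER SERVICE W/1" COPPER',
  '1996-08-09 00:00:00'
].freeze

# The same row exactly as it appears in the SQL Server dump.
RAW_DUMP_LINE = 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'

# True when the stock parser rejects the line (hypothetical helper).
def malformed?(line)
  CSV.parse_line(line)
  false
rescue CSV::MalformedCSVError
  true
end

# Standard quoting wraps a field that contains a quote in quotes
# and doubles each embedded quote.
puts SAMPLE_FIELDS.to_csv
# Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00

p CSV.parse_line(SAMPLE_FIELDS.to_csv) == SAMPLE_FIELDS  # round trip succeeds
p malformed?(RAW_DUMP_LINE)                              # the dump line is rejected
```

This is why only the last sample line, where every embedded quote is doubled inside a quoted field, gets through `CSV.parse_line` unchanged.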

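Update

Note that the CSV uses double quotes when a field contains a comma.

Update 2

It's OK if the commas are stripped out of the fields in question... my parse_ugly method doesn't preserve them.

Update 3

I learned from the client that it is SQL Server 2008 that is exporting this strange CSV. It has been reported to Microsoft here and here.

Update 4

@dbenhur's answer works perfectly for what I originally asked, but it pointed out that I had neglected to show lines with both commas and quotes. I will accept @dbenhur's answer, but I hope it can be improved to work on all of the lines above.

Hopefully final update

This code works (and I would consider it "semantically correct"):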

QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/

def parse_sql_server_2008_csv_line(line)
  line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end

Adapted from the answers by @dbenhur and by @ghostdog74 in How can I process a CSV file with “bad commas”?

【Comments】:

    Tags: ruby regex csv


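    【Solution 1】:

    The following uses a regexp and String#scan. I observe that in the broken CSV format you're dealing with, " only has quoting properties when it comes at the beginning or end of a field.

    Scan moves through the string, successively matching the regexp, so the regexp can assume that its starting match point is the beginning of a field. We construct the regexp so that it can match either a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). Whichever alternative matches, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

    Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element and pick the alternative that captured with matches[0] || matches[1].

    None of your example lines shows a field that contains both a comma and a quote. I have no idea how that would be legally represented, and this code probably won't recognize such a field correctly.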

    SEP = /(?:,|\Z)/
    QUOTED = /"([^"]*)"/
    UNQUOTED = /([^,]*)/
    
    FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
    
    def ugly_parse line
      line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
    end
    
    lines.each do |l|
      puts l
      puts ugly_parse(l).inspect
      puts
    end
    
    # Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
    # ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
    # 
    # Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
    # ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
    # 
    # Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
    # ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
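
    To make the scan step concrete, here is a small sketch that prints the raw tuples before the trailing empty match is dropped. It reuses the three patterns from this answer under hypothetical DEMO_ names so that nothing defined above is overwritten; the line 'a,"b, c",d' is a made-up sample, while the last call uses a line from the updated question.

```ruby
# Same patterns as in the answer, under hypothetical DEMO_ names.
DEMO_SEP      = /(?:,|\Z)/
DEMO_QUOTED   = /"([^"]*)"/
DEMO_UNQUOTED = /([^,]*)/
DEMO_FIELD    = /(?:#{DEMO_QUOTED}|#{DEMO_UNQUOTED})#{DEMO_SEP}/

# Drops the trailing empty match, then picks whichever group captured.
def demo_parse(line)
  line.scan(DEMO_FIELD)[0...-1].map { |quoted, unquoted| quoted || unquoted }
end

# Each tuple is [quoted_capture, unquoted_capture]; the last one is the
# zero-length match at the end of the string.
p 'a,"b, c",d'.scan(DEMO_FIELD)
# => [[nil, "a"], ["b, c", nil], [nil, "d"], [nil, ""]]

p demo_parse('a,"b, c",d')
# => ["a", "b, c", "d"]

# A field with both a comma and inner quotes is not matched by QUOTED,
# so it falls back to UNQUOTED and is split at the comma.
p demo_parse('Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00')
# => ["Electrical", "197135021E", "\"SERVICE", " \"OUTLETS\"\"", "1997-05-15 00:00:00"]
```

    The last result has five elements instead of four, which matches the "size 5" lines in the question's output for parse_dbenhur.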
    

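    【Comments】:

    • Hi @dbenhur, your answer works 100% for what I originally asked, and I will accept it. How do you think it could be enhanced to support the edge cases I added above?
    • Be sure to check out the "Hopefully final update" above.
    • @SeamusAbshere That's some crazy output. I'm having a hard time coming up with a rule that makes sense for all of those variants. Can your client move to a proper database export format? :(
    • Hey @dbenhur, take a look at my "final update". I believe this is pure SQL Server 2008 output. My solution is heavily based on yours and works on the 260,000 lines of sample data that I've tried. Whew!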
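    【Solution 2】:

    If your CSV never uses double quotes as a legitimate quote character, adjust the options passed to CSV to include :quote_char => "\0", and then you can do this (strings wrapped for clarity)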

    1.9.3p327 > puts 'Construction,197133031B,"MORGAN SHOES" ALT,
                      1997-05-13 00:00:00'.parse_csv(:quote_char => "\0")
    Construction
    197133031B
    "MORGAN SHOES" ALT
    1997-05-13 00:00:00
    
    1.9.3p327 > puts 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,
                      1996-08-09 00:00:00'.parse_csv(:quote_char => "\0")
    Plumbing
    196222006P
    REPLACE LEAD WATER SERVICE W/1" COPPER
    1996-08-09 00:00:00
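
    As a rough sketch of where this approach stops working: with the quote character set to "\0", every comma is treated as a separator, so a field that the dump quoted because it contains a comma is split in two. The helper name parse_without_quoting is hypothetical, and the second sample line comes from the updated question.

```ruby
require 'csv'

# Hypothetical helper: treat double quotes as ordinary characters.
def parse_without_quoting(line)
  CSV.parse_line(line, quote_char: "\0")
end

# Quotes that are just part of the text survive untouched.
p parse_without_quoting('Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00')
# => ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]

# A field quoted because it contains a comma is split at that comma.
p parse_without_quoting('Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00').length
# => 5
```

    So this option fits dumps where quotes never act as delimiters, which, per the question's updates, is not the case here.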
    

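    【Comments】:

    • Hey, thanks for your answer. Unfortunately, the dump does use double quotes as a quote character (but only when a field contains a comma). Please see my updated question.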