PowerShell expensive parsing: answers

【Question Title】: PowerShell expensive parsing
【Posted】: 2009-03-24 06:37:16
【Question】:

Here is a short excerpt from a script I am writing:

Get-Content $tempDir\$todaysLog | Where-Object { $_ -match "" } |
    ForEach-Object -Process {
    $fields = [regex]::split($_,'@|\s+')
    Add-Content -Path $importSource2\$todaysLog -value ($($fields[0]) + "`t"  + $($fields[1]) + "`t" + $($fields[2]) + " " + $($fields[3])+ "`t" + "<*sender*@"+($($fields[5])) + "`t" + "<*recipient*@"+($($fields[7])))
    }

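Apologies for the line wrapping. Essentially, it tokenizes each line of the file into an array, then writes out selected elements with some extra text around them. The goal is to replace sensitive sender/recipient information with meaningless placeholders.

Here is a sample of the log file I am parsing: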

10.197.71.28 SG 02012009 00:00:00

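Obviously I have replaced the address information in the sample. The code above works well, although I realise it is very expensive. Can anyone think of something cheaper, perhaps a Select-String that replaces the text instead of tokenizing and rewriting it?

Cheers

【Comments】:

Tags: syntax powershell

【Solution 1】: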

cat $tempDir\$todaysLog |
  %{ [regex]::Replace($_, "[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}\s<\[')[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}'\]>)", '*sender*$1*recipients*$2', "IgnoreCase") } > $importSource2\$todaysLog

The log entries must look like the sample line (in particular the sender@kpmg.com part).
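To see what that single Replace call does to one entry, here is a minimal sketch run against one line instead of the whole file. The sample line is taken from another answer in this thread, and it assumes a tab before the sender address and a single space before `<['`.

```powershell
# Sample entry (assumed format: tab before the sender, one space before "<['").
$line = "10.197.71.28 SG  02012009 00:00:00`tsender@kpmg.com <['recip@kpmg.com.sg']>"

# Same pattern as above: the two capture groups keep the domains,
# while the local parts in front of each "@" are dropped.
$pattern = "[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}\s<\[')[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}'\]>)"

$masked = [regex]::Replace($line, $pattern, '*sender*$1*recipients*$2', "IgnoreCase")
$masked
```

Everything before the sender address is left untouched, because the pattern only matches from the sender's local part onward.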

Edit: I ran some benchmarks on a 1 Mo file (about 15,000 lines like the sample):

Andy Walker's solution (using split) -> 18.44 s

Measure-Command {

Get-Content $tempDir\$todaysLog | Where-Object { $_ -match "" } |
    ForEach-Object -Process {
    $fields = [regex]::split($_,'@|\s+')
    Add-Content -Path $importSource2\$todaysLog -value ($($fields[0]) + "`t"  + $($fields[1]) + "`t" + $($fields[2]) + " " + $($fields[3])+ "`t" + "<*sender*@"+($($fields[5])) + "`t" + "<*recipient*@"+($($fields[7])))
    }

}

Dangph's solution (using replace) -> 18.16 s

Measure-Command {

Get-Content $tempDir\$todaysLog | Where-Object { $_ -match "" } |
    ForEach-Object -Process {
    $s2 = $_ -replace "\t[^@\t']+@", "`t*sender*@"
    $s3 = $s2 -replace "\<\['.+@", "<['*recipient*@"
    Add-Content -Path $importSource2\$todaysLog -value $s3
    }

}

Madgnome's solution (using regex) -> 6.16 s

Measure-Command {

cat $tempDir\$todaysLog |
  %{ [regex]::Replace($_, "[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}\s<\[')[A-Z0-9._%+-]+(@[A-Z0-9.-]+\.[A-Z]{2,4}'\]>)", '*sender*$1*recipients*$2', "IgnoreCase") } > $importSource2\$todaysLog

}
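One possible explanation for the gap, offered as an assumption and not as something measured here: the first two versions call Add-Content once per line, which opens and closes the output file each time, whereas the third sends the whole pipeline through a single redirection. Below is a minimal sketch of the -replace approach that writes the output file only once. The temp file names are hypothetical stand-ins for $tempDir\$todaysLog and $importSource2\$todaysLog.

```powershell
# Hypothetical stand-ins for the real input and output paths.
$inFile  = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-in.log'
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-out.log'

# Two sample lines in the assumed format (tab before the sender address).
$line = "10.197.71.28 SG  02012009 00:00:00`tsender@kpmg.com <['recip@kpmg.com.sg']>"
Set-Content -Path $inFile -Value $line, $line

# Transform each line in the pipeline, then let Set-Content write the file once.
Get-Content $inFile |
    ForEach-Object {
        ($_ -replace "\t[^@\t']+@", "`t*sender*@") -replace "\<\['.+@", "<['*recipient*@"
    } |
    Set-Content -Path $outFile

Get-Content $outFile
```

Wrapping the pipeline in Measure-Command, as in the benchmarks above, would show whether this makes a difference on a real log.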

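【Comments】:

Interesting results. What does "1 mo" mean? How many lines is that? I would like to know what size of file Andy Walker is dealing with.
To time a command, use Measure-Command like this: Measure-Command { 1..1000 }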
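As a small illustration of that suggestion: Measure-Command returns a TimeSpan, so the elapsed time can be read from its properties. The script block here is an arbitrary placeholder workload.

```powershell
# Time a script block; its own output is discarded and a TimeSpan is returned.
$elapsed = Measure-Command { 1..1000 | ForEach-Object { $_ * 2 } }

# Report the elapsed time in seconds.
"{0:N2} s" -f $elapsed.TotalSeconds
```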

【Solution 2】:

$s1 = "10.197.71.28 SG  02012009 00:00:00   sender@kpmg.com <['recip@kpmg.com.sg']>"
$s2 = $s1 -replace "\t[^@\t']+@", "`t*sender*@"
$s3 = $s2 -replace "\<\['.+@", "<['*recipient*@"
write-host $s3

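I am assuming that all log entries look like the sample line. If they do not, we may need something more sophisticated.

Note that if you copy and paste the code above, you may need to manually reinsert the tab character before "sender" in the first line.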
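One way to sidestep the copy-paste problem is to write the tab as a `t escape inside the double-quoted string. This is a sketch of the same two replacements, assuming the tab sits directly before the sender address:

```powershell
# The `t escape inserts a literal tab, so nothing is lost when the code is pasted.
$s1 = "10.197.71.28 SG  02012009 00:00:00`tsender@kpmg.com <['recip@kpmg.com.sg']>"

# Mask the sender's local part (the text between the tab and the first "@").
$s2 = $s1 -replace "\t[^@\t']+@", "`t*sender*@"

# Mask the recipient's local part (the text between "<['" and the "@").
$s3 = $s2 -replace "\<\['.+@", "<['*recipient*@"

Write-Host $s3
```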

【Comments】:

【Solution 3】:

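You should avoid using PowerShell as the engine for parsing large log files. I would use logparser.exe (your entries are space-delimited and can be converted to CSV), then use Import-Csv in PowerShell to recreate PowerShell objects. From there you can remove and replace fields on a per-object basis. PowerShell is the glue, not the fuel. Using it to parse large logs of any size is not completely foolish, but it is expensive for both you and the CPU. Lee Holmes has an excellent Convert-TextObject.ps1 in his book examples at http://examples.oreilly.com/9780596528492/, but you still need some kind of log-parsing engine to do the heavy lifting.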
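A rough sketch of the Import-Csv half of that workflow follows. The conversion step is not shown; the CSV content and its column names (ClientIp, Site, Date, Time, Sender, Recipient) are assumptions made up for illustration, not the actual output of logparser.exe.

```powershell
# Hypothetical CSV standing in for the output of a log-to-CSV conversion step.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-log.csv'
@'
ClientIp,Site,Date,Time,Sender,Recipient
10.197.71.28,SG,02012009,00:00:00,sender@kpmg.com,recip@kpmg.com.sg
'@ | Set-Content -Path $csvPath

# Import-Csv turns each row into an object with one property per column.
$rows = Import-Csv -Path $csvPath

# Mask the local part of each address on a per-object basis.
foreach ($row in $rows) {
    $row.Sender    = $row.Sender    -replace '^[^@]+', '*sender*'
    $row.Recipient = $row.Recipient -replace '^[^@]+', '*recipient*'
}

$rows | Format-Table -AutoSize
```

The masked objects could then be written back out with Export-Csv or formatted into whatever layout the import target expects.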

【Comments】: