Powershell - Optimizing a very, very large csv and text file search and replace: answers

【Question title】: Powershell - Optimizing a very, very large csv and text file search and replace
【Posted】: 2014-03-14 16:58:41
【Question】:

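I have a directory with ~3000 text files in it, and I periodically search and replace across those text files as I transition a program to a new server.

Each text file averages ~3000 lines, and I need to search the files for 300 to 1000 terms at a time.

I'm replacing the server prefix associated with the string I'm searching for. So for each of the csv entries, I'm looking for Search_String and \\Old_Server\"Search_String", and making sure that after the program completes, the result is "\\New_Server\Search_String".

I cobbled together a powershell program, and it works. But it's so slow that I've never seen it run to completion.

Any suggestions for making it faster?

Edit 1: I changed the get-content as suggested, but it still takes 3 minutes to search two files (~8000 lines) for 9 separate search terms. I must still be screwing something up; a notepad++ search and replace would still be faster, even done manually 9 times.

I'm not sure how to get rid of the first (Get-Content), because I want to make a backup copy of each file before making any changes to it.

Edit 2: So this is an order of magnitude faster; it searches a file in maybe 10 seconds. But now it doesn't write the changes to the files, and it only searches the first file in the directory! I didn't change that code, so I don't know why it broke.

Edit 3: Success! I adapted a solution posted below to make it much faster. It now searches each file in a few seconds. I may reverse the loop order, so that it loads a file into an array and then searches and replaces every entry in the CSV, rather than the other way around. I'll post that if I get it working.

The final script is below for reference.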

#get input from the user
$old = Read-Host 'Enter the old cimplicity qualifier (F24, IRF3 etc)'
$new = Read-Host 'Enter the new cimplicity qualifier (CB3, F24_2 etc)'
$DirName = Get-Date -format "yyyy_MM_dd_hh_mm"

New-Item -ItemType directory -Path $DirName -force
New-Item "$DirName\log.txt" -ItemType file -force -Value "`nMatched CTX files on $dirname`n"
$logfile = "$DirName\log.txt"

$VerbosePreference = "SilentlyContinue"


$points = import-csv SearchAndReplace.csv -header find #Import CSV File
#$ctxfiles = Get-ChildItem . -include *.ctx | select -expand fullname #Import local directory of CTX Files

$points | foreach-object { #For each row of points in the CSV file
    $findvar = $_.find #Store column 1 as string to search for  

    $OldQualifiedPoint = "\\\\"+$old+"\\" + $findvar #Double each individual backslash so the regex engine treats it as a literal backslash
    $NewQualifiedPoint = "\\"+$new+"\" + $findvar #escape slashes are NOT required on the new string
    $DuplicateNew = "\\\\" + $new + "\\" + "\\\\" + $new + "\\"
    $QualifiedNew = "\\" + $new + "\"

    dir . *.ctx | #Grab all CTX Files 
     select -expand fullname | #grab all of those file names and...
      foreach {#iterate through each file
                $DateTime = Get-Date -Format "hh:mm:ss"
                $FileName = $_
                Write-Host "$DateTime - $FindVar - Checking $FileName"
                $FileCopied = 0
                #Check file contents, and copy matching files to newly created directory
                If (Select-String -Path $_ -Pattern $findvar -Quiet ) {
                   If (!($FileCopied)) {
                        Copy $FileName -Destination $DirName
                        $FileCopied = 1
                        Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
                        Write-Host "$DateTime - Found $Findvar in $filename"
                    }

                    $FileContent = Get-Content $Filename -ReadCount 0
                    $FileContent =
                    $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
                    $FileContent | Set-Content $FileName
                }
           }
    }

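【Comments】:

You're still using get-content in your conditional check, so it will still take a long time. It would be faster to just do the replace, then check whether anything changed and output that as your "XX found".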
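The "replace first, then check whether anything changed" idea from the comment above can be sketched roughly as follows. This is only an illustration: the function name `Update-FileIfChanged`, the sample qualifiers `F24`/`CB3`, and the point name `POINT_A` are hypothetical, and the sketch assumes the files are plain text that `[System.IO.File]::ReadAllText` can decode (it writes back UTF-8 without a BOM, which may not suit every file).

```powershell
# Hypothetical helper: read the file once, run the replacement, and only
# write it back when the text actually changed.
function Update-FileIfChanged {
    param(
        [string]$Path,        # full path to the file
        [string]$Pattern,     # regex pattern to search for
        [string]$Replacement  # replacement text
    )
    $original = [System.IO.File]::ReadAllText($Path)
    $updated  = $original -replace $Pattern, $Replacement
    # -cne compares case-sensitively, so a case-only change still counts
    if ($updated -cne $original) {
        [System.IO.File]::WriteAllText($Path, $updated)
        return $true
    }
    return $false
}

# Demo on a temporary file with made-up content
$demoPath = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($demoPath, 'server=\\F24\POINT_A')

$pattern     = [regex]::Escape('\\F24\POINT_A')   # escape the backslashes for the regex
$firstPass   = Update-FileIfChanged -Path $demoPath -Pattern $pattern -Replacement '\\CB3\POINT_A'
$secondPass  = Update-FileIfChanged -Path $demoPath -Pattern $pattern -Replacement '\\CB3\POINT_A'
$demoContent = [System.IO.File]::ReadAllText($demoPath)
Remove-Item $demoPath
```

The boolean return value can drive the backup copy and the log line, so no separate search pass over the file is needed.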

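Tags: search powershell csv optimization text

【Solution 1】:

If I'm reading this right, you should be able to read a 3000-line file into memory and do those replaces as an array operation, eliminating the need to iterate through each line. You can also chain those replace operations into a single command.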

dir . *.ctx | #Grab all CTX Files 
     select -expand fullname | #grab all of those file names and...
      foreach {#iterate through each file
                $DateTime = Get-Date -Format "hh:mm:ss"
                $FileName = $_
                Write-Host "$DateTime - $FindVar - Checking $FileName"
                #Check file contents, and copy matching files to newly created directory
                If (Select-String -Path $_ -Pattern $findvar -Quiet ) {
                    Copy $FileName -Destination $DirName
                    Add-Content $logfile "`n$DateTime - Found $Findvar in $filename"
                    Write-Host "$DateTime - Found $Findvar in $filename"

                    $FileContent = Get-Content $Filename -ReadCount 0
                    $FileContent =
                      $FileContent -replace $OldQualifiedPoint,$NewQualifiedPoint -replace $findvar,$NewQualifiedPoint -replace $DuplicateNew,$QualifiedNew
                     $FileContent | Set-Content $FileName
                }
           }

On a side note, Select-String takes a file path as an argument, so you don't have to do Get-Content and then pipe that to Select-String.
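As a small illustration of the array-operation point: `-replace` applied to an array of strings returns a new array with the replacement applied to every element, and several `-replace` operations can be chained in one expression. The sample lines and the qualifiers `F24`/`CB3` below are made up for the sketch.

```powershell
# Three made-up lines standing in for the contents of a CTX file
$lines = @('tag=\\F24\POINT_A', 'unrelated line', 'tag=\\F24\POINT_B')

# Escape the old prefix so the backslashes are treated literally by the regex
$oldPrefix = [regex]::Escape('\\F24\')
$patternA  = $oldPrefix + 'POINT_A'
$patternB  = $oldPrefix + 'POINT_B'

# One chained expression; each -replace runs over the whole array
$result = $lines -replace $patternA, '\\CB3\POINT_A' -replace $patternB, '\\CB3\POINT_B'
```

`Get-Content $FileName -ReadCount 0` hands back the whole file as a single array like `$lines`, which is why no explicit per-line loop is needed.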

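【Comments】:

I see what you mean. I wasn't aware of the readcount parameter, and -r 0 didn't mean much to me. It makes a world of difference, cool cool.
Great! That works well, and it's fast. Since I'm now loading the whole file into an array, I think it would be faster at this stage to change the loop order. Currently I pull one csv entry and then search all the files. Opening a file and then searching for all the CSV entries would probably be faster. Thanks!
In that case, I'd get rid of the Select-String test altogether and just run each file through the CSV collection. That's probably faster than going back and running another Select-String on every iteration of the CSV loop.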
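One way the inverted loop discussed in these comments might look is sketched below: each file's text is loaded once and every CSV term is handled in a single pass. This is not the asker's final script. The function name `Convert-QualifiedPoints` is hypothetical, and combining all terms into one regex alternation (with a lookbehind so already-converted points are left alone) is a swapped-in technique rather than the chained `-replace` used above.

```powershell
# Hypothetical helper: qualify every term in $Terms with \\$New\, whether it
# appears bare or prefixed with \\$Old\, and skip terms already under \\$New\.
function Convert-QualifiedPoints {
    param(
        [string]$Text,
        [string[]]$Terms,
        [string]$Old,
        [string]$New
    )
    $oldPrefix = [regex]::Escape('\\' + $Old + '\')
    $newPrefix = [regex]::Escape('\\' + $New + '\')

    # Longest terms first, so that POINT_10 is tried before POINT_1
    $alternation = ($Terms |
        Sort-Object Length -Descending |
        ForEach-Object { [regex]::Escape($_) }) -join '|'

    # (?<!new) : do not touch a term that already has the new prefix
    # (?:old)? : swallow the old prefix when it is present
    # (terms)  : capture the term itself so it can be re-emitted as $1
    $pattern = '(?<!' + $newPrefix + ')(?:' + $oldPrefix + ')?(' + $alternation + ')'

    # '$' is special in a replacement string, so double any '$' in the prefix
    $replacement = ('\\' + $New + '\').Replace('$', '$$') + '$1'

    $Text -replace $pattern, $replacement
}

# Demo with made-up data: a bare term, an old-prefixed term, an already-new term
$sample    = 'a POINT_A b \\F24\POINT_A c \\CB3\POINT_A'
$converted = Convert-QualifiedPoints -Text $sample -Terms 'POINT_A' -Old 'F24' -New 'CB3'
```

In a real run the function would be called once per file, for example on the result of `[System.IO.File]::ReadAllText($_.FullName)` inside a `Get-ChildItem . -Filter *.ctx | ForEach-Object { ... }` loop, with `$Terms` taken from the `find` column of the imported CSV. Like the original script, it matches terms anywhere in the text, including inside longer identifiers.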

【Solution 2】:

Yes, you can make it faster by not using Get-Content... use a StreamReader instead.

$file = New-Object System.IO.StreamReader -Arg "test.txt"
while (($line = $file.ReadLine()) -ne $null) {
    # $line has your line
}
$file.dispose()

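【Comments】:

while($line = $file.ReadLine()) will stop at the first blank line. Comparing against $null is better.
Yep; most of the files have blank lines scattered throughout :P
Using .readline() is still doing it one line at a time. These are only 3000-line files, so you should be able to read the whole file into memory and do the replace as an array operation, e.g. (Get-Content $FileName -r 0) -replace
@mjolinor Idk how this relates to my post... Maybe consider commenting on the original question, or even posting your own answer.
I think each of the files can be loaded into memory pretty quickly; none of them is more than a few megs. But like I said, it was too slow to be usable. I know very little about powershell, but someone suggested I read line by line.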
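The blank-line point from the first comment can be seen in a small sketch. It uses a `System.IO.StringReader` over made-up text in place of a real file, purely so the example is self-contained; `ReadLine()` behaves the same way on a `StreamReader`.

```powershell
# Made-up text with an empty line in the middle
$text = "first`n`nthird"

# Truthiness test: an empty string is falsy, so the loop ends at the blank line
$reader = New-Object System.IO.StringReader -ArgumentList $text
$truthyCount = 0
while ($line = $reader.ReadLine()) { $truthyCount++ }
$reader.Dispose()

# Explicit $null test: the loop only ends when ReadLine() reports end of input
$reader = New-Object System.IO.StringReader -ArgumentList $text
$nullCount = 0
while ($null -ne ($line = $reader.ReadLine())) { $nullCount++ }
$reader.Dispose()

"truthiness loop read $truthyCount line(s); null-check loop read $nullCount line(s)"
```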

【Solution 3】:

I wanted to use PowerShell for this, and created a script like the one below:

$filepath = "input.csv"
$newfilepath = "input_fixed.csv"

filter num2x { $_ -replace "aaa","bbb" }
measure-command {
    Get-Content -ReadCount 1000 $filepath | num2x | add-content $newfilepath
}

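It took 19 minutes to process a 6.5 GB file on my laptop. The code above reads the file in batches (using ReadCount) and uses a filter, which should optimize performance.

But then I tried FART, and it did the same thing in 3 minutes! Quite a difference!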
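To make the batching in the script above more concrete, here is a small sketch of what `-ReadCount` does: `Get-Content` sends arrays of up to that many lines down the pipeline, and `-replace` then works on each array as a whole. The temporary file and its 2500 made-up rows are only there to keep the example self-contained.

```powershell
# Build a throwaway 2500-line file
$path = [System.IO.Path]::GetTempFileName()
1..2500 | ForEach-Object { "row $_,aaa" } | Set-Content $path

# Each pipeline object is one batch (an array of lines), not a single line
$batchSizes = @(Get-Content $path -ReadCount 1000 | ForEach-Object { $_.Count })

# -replace is applied to a whole batch at a time
$fixed = Get-Content $path -ReadCount 1000 | ForEach-Object { $_ -replace 'aaa', 'bbb' }

Remove-Item $path
```

A larger `-ReadCount` means fewer pipeline objects at the cost of more memory per batch; which value is fastest for a given file is something to measure rather than assume.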
