Using perl to parse a file and insert specific values into a database

[Question title]: Using perl to parse a file and insert specific values into a database
[Posted]: 2010-04-22 16:02:07
[Question]:

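Disclaimer: I'm new to perl scripting, and this is partly a learning exercise (but still a project for work). I'm also much stronger at shell scripting, so my examples may be formatted with that mindset (but I'd like to create them in perl). Sorry in advance for the verbosity; I want to make sure I get my point across at least somewhat clearly.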

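I have a text file (a reference guide) that is a Word document converted to text, then converted from Windows to UNIX format in Notepad++. The file is uniform in that each section has the same fields/formatting/tables.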

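What I plan to do is basically grab each section, keyed by its unique batch job name, and put all of the values into a database (or maybe just an Excel file), so that all of the fields for each job can be searched and edited much more easily than in the Word file, and so that a web interface could perhaps be created later on.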

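So what I'd like to do is grab each section by doing something like:
sed -n '/job_name_1_regex/,/job_name_2_regex/p' file.txt -- how would I write this in a perl script?
(grab the section in its entirety, then break it down further from there)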

To read the file in the script I have open FORMAT_FILE, 'test_format.txt'; and then use foreach $line (<FORMAT_FILE>) to parse the file line by line. -- Is there a better way?

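My next problem is that, since I converted from a Word doc with tables, the original looks like:

Table Heading 1      Table Heading 2
Heading 1/Value 1    Heading 2/Value 1
Heading 1/Value 2    Heading 2/Value 2

But in the text file it looks like:

Table Heading 1
Table Heading 2
Heading 1/Value 1
Heading 1/Value 2
Heading 2/Value 1
Heading 2/Value 2

So I want to use "Heading 1" and "Heading 2" as column names and put their respective values under them. I'm just not sure how to get the values that belong to each heading out of the text file. The values for Heading 1 always start at the line number of Heading 1 plus 2 (Heading 1, Heading 2, then the first value for Heading 1). I know this can be done easily in awk/sed; I'm just not sure how to handle it inside a perl script.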

---EDIT---
For this, I was thinking of building arrays like:

my @heading1 = ($value1, $value2);  # etc.
my @heading2 = ($value1, $value2);  # etc.

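I just need to be able to associate the right values with the right heading. So the values for heading1 start on the line after heading2. It would be like saying (in shell):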

x=$(grep -n "Heading 1" file.txt | cut -d":" -f1) #gets the line that "Heading 1" is on in the file
(( x = x+2 )) #adds 2 to the line (where the values will start)
#print values from file.txt from the line where they start to the
#last one (I'll figure that out at some point before this)
sed -n "$x,$last_line_of_values p" file.txt

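That's super hacked together for now, just to try to spell out what I want to do... let me know if I can clear anything up...
---/EDIT---

Once I have all the right values, linking this to a database may also be an issue; I haven't started looking at how perl interacts with databases yet.

Sorry if this is a bit rambling... it isn't fully formed in my head yet.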

[Comments]:

Tags: perl sed

[Solution 1]:

http://perlmeme.org/tutorials/connect_to_db.html

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $driver = "mysql";   # Database driver type
my $database = "test";  # Database name
my $user = "";          # Database user name
my $password = "";      # Database user password

my $dbh = DBI->connect(
    "DBI:$driver:$database",
    $user, $password,
    {
        RaiseError => 1,
        PrintError => 1,
    }
) or die $DBI::errstr;

my $sth = $dbh->prepare("
        INSERT INTO test 
                    (col1, col2)
             VALUES (?, ?)
    ") or die $dbh->errstr;

my $intable = 0;
open my $file, '<', 'file.txt' or die "can't open file: $!"; # three-argument open, read-only
while (<$file>)  {
  if (/job_name_1_regex/../job_name_2_regex/) { # job 1 section
    $intable = 1 if /Table Heading 1/; # table start
    if ($intable) {
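      my $next_line = <$file>; # heading 2 line
      last unless defined $next_line; # stop if the file ends mid-pair
      chomp; chomp $next_line; # strip trailing newlines from $_ and $next_line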
      $sth->execute($_, $next_line) or die $dbh->errstr;
    }
  }
}
close $file or die "can't close file $!";
$dbh->disconnect;
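
The if (/job_name_1_regex/../job_name_2_regex/) test above relies on Perl's range (flip-flop) operator, which is the closest counterpart to sed's /start/,/end/ address range. Here is a minimal sketch of just that part, with no database involved. The sample text and the JOB_ONE / JOB_TWO markers are made up for illustration, and the in-memory filehandle only stands in for file.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for file.txt.
my $text = <<'END';
intro line
JOB_ONE
Table Heading 1
value a
JOB_TWO
Table Heading 1
value b
END

# Open a read-only filehandle on the in-memory string.
open my $fh, '<', \$text or die "can't open in-memory file: $!";

my @section;
while (my $line = <$fh>) {
    chomp $line;
    # True from the line matching the first pattern through the line
    # matching the second pattern (both inclusive), like sed's /a/,/b/.
    if ($line =~ /^JOB_ONE$/ .. $line =~ /^JOB_TWO$/) {
        push @section, $line;
    }
}
close $fh or die "can't close in-memory file: $!";

print "$_\n" for @section;
```

Lines after the closing marker are skipped until the first pattern matches again, so the same loop would pick up a later section that starts with the same marker.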

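[Comments]:

Awesome, that makes the database connection process much clearer... Could you explain what the line 'chomp; chomp $next_line;' does? I'm just trying to get a good handle on everything and on why certain things are done.
@Sean: chomp removes a trailing $/ (usually a newline) from a string (if no argument is given, it works on the $_ variable).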
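
To make the two forms concrete, here is a small sketch of chomp with and without an argument. The variable names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

# With an argument, chomp strips one trailing newline from that variable.
my $named = "heading 2 value\n";
chomp $named;

# Without an argument, chomp works on $_. Inside this loop $_ is an
# alias for each array element, so the array itself is modified.
my @lines = ("first\n", "second");
for (@lines) {
    chomp;
}

print "[$named]\n";
print "[$_]\n" for @lines;
```

A string that has no trailing newline, like "second" here, is left unchanged.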

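[Solution 2]:

There are several things in this post... First, basic "best practices":

Use modern perl. Start your scripts with:
use strict; use warnings;
Don't use global filehandles; use lexical filehandles (declare them in a variable).
Always check the return value of open.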

open my $file, '<', '/some/file' or die "can't open file: $!";

Then, about the pattern matching: I don't really understand your example, but I guess you want something like:

foreach my $line ( <$file> ) {
    if ( $line =~ /regexp1/) { 
    # do something...
    }

}
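
On the "is there a better way" part of the question: foreach over <$file> reads the whole file into a list first, whereas a while loop reads one line per iteration. A minimal sketch of the while form follows. The "Job:" line format and the sample text are assumptions made up for this example, not something from the actual reference guide.

```perl
use strict;
use warnings;

# Hypothetical sample data; a real script would open the file by name.
my $text = "alpha\nJob: nightly_backup\nbeta\nJob: weekly_report\n";
open my $fh, '<', \$text or die "can't open in-memory file: $!";

my @jobs;
# while reads a single line per iteration; Perl implicitly tests the
# assignment with defined(), so a line containing just "0" does not
# end the loop early.
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^Job:\s+(\S+)/) {
        push @jobs, $1;   # keep the captured job name
    }
}
close $fh or die "can't close in-memory file: $!";

print "$_\n" for @jobs;
```

For a small reference guide the difference hardly matters, but the while form keeps memory use flat on large files.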

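EDIT: About the table, I guess the best approach is to build two arrays, one per column. If I understand correctly, as you read the file you need to split each line and put the first part into the @col1 array and the second part into the @col2 array. The simple and straightforward way is to use two temporary variables: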

my ( $val1, $val2 ) = split /\s+/, $line;
push @col1, $val1;
push @col2, $val2;
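
The split above fits when both values sit on the same line. If the converted text instead stacks the headings first and then each column's values in turn, as in the layout shown in the question, the lines can be regrouped by position. This is only a sketch under two assumptions that are not confirmed by the post: the number of headings is known in advance, and every column has the same number of values. The names $heading_count and %column are made up for illustration.

```perl
use strict;
use warnings;

# Lines of one table as they appear in the converted text file.
my @lines = (
    'Table Heading 1',
    'Table Heading 2',
    'Heading 1/Value 1',
    'Heading 1/Value 2',
    'Heading 2/Value 1',
    'Heading 2/Value 2',
);

my $heading_count = 2;   # assumption: known number of columns

my @headings   = @lines[0 .. $heading_count - 1];
my @values     = @lines[$heading_count .. $#lines];
my $per_column = @values / $heading_count;   # assumes equal-sized columns

# Map each heading to an array reference holding its block of values.
my %column;
for my $i (0 .. $#headings) {
    my $start = $i * $per_column;
    $column{ $headings[$i] } = [ @values[$start .. $start + $per_column - 1] ];
}

for my $heading (@headings) {
    print "$heading: @{ $column{$heading} }\n";
}
```

A hash of arrays keyed by heading keeps each value tied to its column name, which carries over directly to choosing column names for an INSERT statement.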

[Comments]:

Thanks waz, I updated the post about the tables to try to explain it better.