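Parsing a numbered transcript into XML

【Question title】: Parsing a numbered transcript into XML
【Posted】: 2012-03-11 03:56:54
【Question】:

I want to build a scraper that parses the transcripts from the Leveson Inquiry, which are published as plain text in the following format: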

         1                                      Thursday, 2 February 2012

         2   (10.00 am)

         3   LORD JUSTICE LEVESON:  Good morning.

         4   MR BARR:  Good morning, sir.  We're going to start today

         5       with witnesses from the mobile phone companies,

         6       Mr Blendis from Everything Everywhere, Mr Hughes from

         7       Vodafone and Mr Gorham from Telefonica.

         8   LORD JUSTICE LEVESON:  Very good.

         9   MR BARR:  We're going to listen to them all together, sir.

        10       Can I ask that the gentlemen are sworn in, please.

        11                   MR JAMES BLENDIS (affirmed)

        12                     MR ADRIAN GORHAM (sworn)

        13                      MR MARK HUGHES (sworn)

        14                       Questions by MR BARR

        15   MR BARR:  Can I start, please, Mr Hughes, with you.  Could

        16       you tell us the position that you hold and a little bit

        17       about your professional background, please?

        18   MR HUGHES:  Yes, sure.  I'm currently head of fraud risk and

        19       security for Vodafone UK.  I have been in that position

        20       since August 2011 and I've worked in the fraud risk and

        21       security department in Vodafone since October 2006.

        22   Q.  Mr Gorham, if I could ask you the same question, please.

        23   MR GORHAM:  I'm the head of fraud and security for

        24       Telefonica O2, I've been in that role for ten years and

        25       have been in the industry for 13.


                                         1

(Full example)

Ultimately I want to build an XML file structured like this:

<hearing date="2012-02-02" time="10:00">
    <quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
    <quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
    <quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>

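...Any help?

(Also note that "MR BARR:" at some point becomes simply "Q.".)

Many thanks!

【Comments on the question】:

I don't really know how to start. It seems I need to a. parse out the line numbers, page numbers and whitespace, b. put all the quotes into variables, and c. turn everything into XML. Above all, I'm looking for advice on how to approach this problem.
Parsing plain text is hard. You will probably have to rely on regularities in the text: find the lines that start with [name]: (perhaps with a regular expression), then fold all the text up to the next such line into "what that person said".
I've got it working, just one question: what do you want to do with comments? For example: (A short break)
@Robjong, brilliant! If you could create a new tag for them, i.e. <event page="n" line="n">A short break</event>, that would be awesome.
There are still some things to sort out, though; this is just a proof of concept that can be extended, and it's a bit messy, haha. By the way, maybe validate the XML once it's done?

Tags: php xml regex web-scraping scraperwiki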

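【Solution 1】:

Let me start by saying that this is not a foolproof script, and I have quite likely forgotten or overlooked something. It is a proof of concept for you to improve and extend, or just to get an idea from.

There is enough regularity in the layout of the text for us to work with. All the script does is split the transcript into an array of lines and match those lines against a few patterns, to recognise the regularities and work out what each piece of data is.

Example script: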

<?php
/*
Proof of Concept : Transcript to XML by Robjong

? :
    - action on date change (what to do when the date changes?)
    - what to do with lines like "MR MARK HUGHES (sworn)" (make it a note?!)
    - what to do with lines like "Questions by MR BARR" (make it a note?!)
    - detect events/notes in quotes better? (e.g: MR BLENDIS: (Nods head).)


Notes :

    - desperately needs error checking/handling!!!! (for now it just got in the way)
    - it might well be that the configuration of PHP will cause file_get_contents to fail,
      try curl or download it manually and read the local file
    - if you are running PHP < 5.2.4, change the \h in the pattern to \s or [\t ]

*/

# basic usage
// get the transcript as plain text
$txt = file_get_contents( 'http://www.levesoninquiry.org.uk/wp-content/uploads/2012/02/Transcript-of-Morning-Hearing-2-February-2012.txt' );
// convert transcript to XML
$xml = transcriptToXML_beta( $txt );
// we have the transcript as XML, now what?
file_put_contents( 'transcript.xml', $xml ); // let's write it to a file
echo $xml;


function transcriptToXML_beta( $string ) { // beta is just to emphasize lack of thorough testing
    if( !is_string( $string ) || trim( $string ) === '' ) { // nothing to parse, e.g. file_get_contents() returned false
        return false;
    }
    $lines = explode( "\n", $string ); // split text into an array of lines

    // these vars will hold the data we need to build our XML
    $date = ''; // the date of the transcript
    $time = ''; // transcript time
    $page = 1; // this will hold the current page number

    $linenr = ''; // this will hold the line nr
    $speaker = ''; // the name of the speaker
    $text = ''; // transcribed text attributed to the speaker
    $new = false; // will be true if a new item has been matched
    $event = ''; // this will hold notes that are in a quote but need to be stored separately (events)

    $xml = ''; // this will be the XML string
    $i = 0; // count the lines to display actual line number for debugging
    foreach( $lines as $line ) { // loop over lines
        $i++;
        if( !preg_match( "/[[:graph:]]/", $line ) ) { // line is empty, does not contain printable characters....
            continue; // ....so we skip to the next one
        }

        if( preg_match( "/^\h*\d+\h+(?P<date>[a-z]+,\h+\d+\h+[a-z]+\h\d{4})\s*$/i", $line, $match ) ) { # it looks like a date
            $date = $match['date']; // set date
            $speaker = ''; // reset vars
            $text = '';
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*\d+\h+([A-Z]+(?:\s+[A-Z]+){0,4}\h+\(.*?\)|(?i:questions\h+by)[A-Z\h]+)\s*$/", $line, $match ) ) { # (queued) event, uppercase text followed by text between parentheses
            $event .= "    <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry to queue, to be added after current quote
            continue;// no need to handle this line any further
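        } elseif( preg_match( "/^\h*(\d*)\h*\(\h*(?P<time>\d{1,2}[:.]\d{1,2}\h*[ap]m)\)\s*$/i", $line, $match ) ) { # seems we have a time entry
            if( $new && $speaker ) { // flush the open quote first, otherwise the reset below would silently drop it
                $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n" . $event;
                $event = ''; // clear queue
                $new = false;
            }
            $time = $match['time']; // set time
            $xml .= "    <time page=\"{$page}\" line=\"{$match[1]}\">" . strtoupper( str_replace( '.', ':', $match['time'] ) ) . "</time>\n"; // add time as entry
            $speaker = ''; // reset vars
            $text = '';
            continue;// no need to handle this line any further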
        } elseif( preg_match( "/^\h*(\d+)\s*$/", $line, $match ) ) { # line has just one or more digits, we assume it's a page number
            if( (int) $match[1] < $page ) { // a number lower than the current page cannot be its footer, so it is an 'empty line' that only has a line number
                continue;
            }
            $page = (int) $match[1] + 1; // set pagenr, add one because the nr is at the bottom of the page
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*\d+\s+\(([[:print:]]+)\)\s*$/", $line, $match ) && !$speaker ) { # note, text is between parentheses
            $xml .= "    <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*\d+\h+([A-Z\h]+\(.*?\))\s*$/", $line, $match ) && !$speaker ) { # note, uppercase text followed by text between parentheses, only if not in quote
            $xml .= "    <event type=\"note\" speaker=\"\" page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note (group 1 holds the whole note text)
            continue;// no need to handle this line any further
        } elseif( preg_match("/^\h*(?P<linenr>\d+)\h+(?P<speaker>(?:\h[A-Z]+(?:\h[A-Z]+){0,4}))[:.]\h*(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # new speaker entry
            if( $new && $speaker ) { // if we have one open we need to add it first
                $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
                $new = false; // reset
                if( $event ) { // if we have a queued note we need to add that too
                    $xml .= $event; // add entry to XML string
                    $event = ''; // clear queue
                }
            }
            $speaker = trim( $match['speaker'] ); // assign match to speaker var
            $linenr = (int) $match['linenr']; // assign line number
            $text = trim( $match['text'] ); // assign text
            $new = true; // set new match bool
        } elseif( preg_match( "/^\h*(?P<linenr>\d+)\h+(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # follow up text
            $text .= ' ' . trim( $match['text'] ); // append text
        } else { # unknown line (add a check for lines that only have a line number, or relax the page number comparison above)
            // not sure what kind of line this is... add it as a separate 'type'?!
            trigger_error( "Unable to parse line {$i} \"{$line}\"" ); // throw exception / trigger error
            continue; // no need to handle this line any further
        }

        if( !$new && $speaker ) {
            $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
            $speaker = ''; // reset vars
            $text = '';
            $new = false;
            if( $event ) { // if we have a queued note we need to add that too
                $xml .= $event; // add entry to XML string
                $event = ''; // clear queue
            }
        }
    }

    // if we have a match open we need to handle it, this might happen because we do not assign the match in the same iteration as we matched it
    if( $new ) {
        $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
    }

    if( !trim( $xml ) ) { // no text found so $xml is still an empty string
        return false;
    }

    $date = new DateTime( $date ); // instantiate DateTime with the date from the transcript
    $date = $date->format( 'Y-m-d' ); // format date
    // now we need to wrap the nodes in a root node
    $xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hearing date=\"{$date}\">\n{$xml}</hearing>\n";

    return $xml; // return the XML
}
?>

I will update the comments and the script later today.
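
Because the script builds its XML by concatenating strings, a stray `&` or `<` in the transcript text would produce a malformed document. Below is a minimal sketch of escaping text and then checking well-formedness with PHP's DOM extension. `escapeXmlText` and `xmlErrors` are hypothetical helper names that are not part of the script above, and the "Marks & Spencer" string is made-up test data.

```php
<?php
// Hypothetical helpers, not part of the original script.

// Escape a string for use as XML text content or inside a double-quoted attribute.
function escapeXmlText( $text ) {
    return htmlspecialchars( $text, ENT_COMPAT, 'UTF-8' );
}

// Return the libxml error messages for a string; an empty array means it parsed as well-formed XML.
function xmlErrors( $xml ) {
    $previous = libxml_use_internal_errors( true ); // collect errors instead of emitting warnings
    $doc = new DOMDocument();
    $doc->loadXML( $xml );
    $errors = array();
    foreach( libxml_get_errors() as $error ) {
        $errors[] = trim( $error->message ) . " (line {$error->line})";
    }
    libxml_clear_errors();
    libxml_use_internal_errors( $previous ); // restore the previous setting
    return $errors;
}

// Made-up test data: the same entry with a raw and with an escaped ampersand.
$bad  = "<hearing><entry>Marks & Spencer</entry></hearing>";
$good = "<hearing><entry>" . escapeXmlText( 'Marks & Spencer' ) . "</entry></hearing>";

print_r( xmlErrors( $bad ) );  // lists the parser's complaint about the raw ampersand
print_r( xmlErrors( $good ) ); // empty array: the escaped version is well-formed
```

One way to use this would be to pass `$text` through `escapeXmlText()` before it is written into an `<entry>`, and to run `xmlErrors()` on the return value of `transcriptToXML_beta()` before saving the file.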

Output sample:

<hearing date="2012-02-02"> 
    <time page="1" line="2">10:00 AM</time> 
    <entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="3">Good morning.</entry> 
    <entry type="quote" speaker="MR BARR" page="1" line="4">Good morning, sir.  We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</entry> 
    <entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="8">Very good.</entry> 
    <entry type="quote" speaker="MR BARR" page="1" line="9">We're going to listen to them all together, sir. Can I ask that the gentlemen are sworn in, please.</entry> 
    <event page="1" line="9">MR JAMES BLENDIS (affirmed)</event> 
    <event page="1" line="9">MR ADRIAN GORHAM (sworn)</event> 
    <event page="1" line="9">MR MARK HUGHES (sworn)</event> 
    <event page="1" line="9">Questions by MR BARR</event>
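
The question asks for speaker names such as "Lord Justice Leveson" and notes that "MR BARR:" later becomes "Q.". Here is a rough sketch of a post-processing step for both points. It assumes the parsed items are available as arrays rather than as an XML string, and that every "Q." belongs to the person named in the most recent "Questions by ..." line; both are assumptions. `prettySpeaker` and `resolveSpeakers` are hypothetical names.

```php
<?php
// Hypothetical post-processing helpers, not part of the script above.

// "LORD JUSTICE LEVESON" -> "Lord Justice Leveson" (naive: "MCDONALD" would become "Mcdonald")
function prettySpeaker( $speaker ) {
    return ucwords( strtolower( trim( $speaker ) ) );
}

// Replace the speaker "Q" with the examiner named in the latest "Questions by ..." event.
function resolveSpeakers( array $items ) {
    $examiner = '';
    $resolved = array();
    foreach( $items as $item ) {
        if( $item['type'] === 'event' ) {
            if( preg_match( '/^questions\h+by\h+(.+)$/i', $item['text'], $m ) ) {
                $examiner = trim( $m[1] ); // remember who is asking the questions
            }
        } else {
            if( $item['speaker'] === 'Q' && $examiner !== '' ) {
                $item['speaker'] = $examiner;
            }
            $item['speaker'] = prettySpeaker( $item['speaker'] );
        }
        $resolved[] = $item;
    }
    return $resolved;
}

// Items shaped after the sample page in the question (texts shortened).
$items = array(
    array( 'type' => 'event', 'text' => 'Questions by MR BARR' ),
    array( 'type' => 'quote', 'speaker' => 'MR BARR', 'text' => 'Can I start, please, Mr Hughes, with you.' ),
    array( 'type' => 'quote', 'speaker' => 'MR HUGHES', 'text' => 'Yes, sure.' ),
    array( 'type' => 'quote', 'speaker' => 'Q', 'text' => 'Mr Gorham, if I could ask you the same question, please.' ),
);

foreach( resolveSpeakers( $items ) as $item ) {
    if( $item['type'] === 'quote' ) {
        echo $item['speaker'] . ': ' . $item['text'] . "\n"; // the last line is attributed to Mr Barr, not Q
    }
}
```

Answers marked "A." are not handled here, because with several witnesses sworn in together the transcript layout alone does not say who is answering.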

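By the way, just out of curiosity, what do you need this for?

【Comments】:

Wow, epic! I'm working on a scraper/view that presents all the Leveson testimony in a more accessible way: think speech bubbles and permalinks. I'll give your function a shot later today and will report back with a link to my project. Thanks!
Nicely done! It throws a few exceptions because of empty lines (I'll add handling for those later), but otherwise it works great. Thanks again! See: scraperwiki.com/scrapers/leveson_inquiry_transcript_scraper
Good, I've updated the example to ignore lines that contain only a line number.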
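
A line that contains nothing but a number is ambiguous: it can be a line number with no text after it, or the page number at the bottom of a page. In the sample page in the question, line numbers are indented by fewer than ten spaces while the page number is centred at roughly forty. A possible sketch that separates the two by indentation follows. `classifyNumberOnlyLine` is a hypothetical helper, and the threshold of 20 columns is an assumption based on that single sample page, which also assumes the indentation uses spaces rather than tabs.

```php
<?php
// Hypothetical helper, not part of the script above.
// $footerIndent is an assumed threshold taken from the sample page in the question.
function classifyNumberOnlyLine( $line, $footerIndent = 20 ) {
    if( !preg_match( '/^(\h*)(\d+)\s*$/', $line, $m ) ) {
        return null; // not a number-only line
    }
    $kind = strlen( $m[1] ) >= $footerIndent ? 'page' : 'line';
    return array( 'kind' => $kind, 'number' => (int) $m[2] );
}

$footer = str_repeat( ' ', 41 ) . '1';  // centred page number
$blank  = str_repeat( ' ', 8 ) . '10';  // line number with no text after it

print_r( classifyNumberOnlyLine( $footer ) ); // classified as a page number
print_r( classifyNumberOnlyLine( $blank ) );  // classified as a bare line number
var_dump( classifyNumberOnlyLine( '         3   LORD JUSTICE LEVESON:  Good morning.' ) ); // NULL, the line has text
```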

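【Solution 2】:

This is a very hard problem in general, and beyond the scope of StackOverflow. That said, if I had to do it, I would take an iterative approach:

1. Identify the regularities in the text layout and design a candidate grammar.
2. Write a parser using that grammar; the parsing should be very strict and reject (with an error message) anything that does not match.
3. Run it over the whole text.
4. Inspect the output and the mismatches, revise the grammar, and identify special cases.
5. Go back to step 3.

As for the details of those steps, only you can decide what it takes to get the result you want. Also, any solution will need manual intervention, before or after, to clean up low-frequency inconsistencies.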
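
The strict parse-and-inspect loop described above could be sketched roughly like this. The `$grammar` table and the `classifyLines` function are hypothetical, and the three patterns are simplified illustrations rather than a complete grammar for the transcripts. The point is only that unmatched lines are collected for review instead of being guessed at.

```php
<?php
// Hypothetical candidate grammar: one named pattern per line type, first match wins.
$grammar = array(
    'time'    => '/^\h*\d+\h+\(\d{1,2}[.:]\d{2}\h*[ap]m\)\s*$/i',
    'speaker' => '/^\h*\d+\h+(?:[A-Z]+\h)*[A-Z]+[:.]\h+\S.*$/',
    'text'    => '/^\h*\d+\h+\S.*$/',
);

// Classify every non-empty line; anything the grammar does not cover is reported, not parsed.
function classifyLines( array $lines, array $grammar ) {
    $matched = array();
    $unmatched = array();
    foreach( $lines as $index => $line ) {
        if( trim( $line ) === '' ) {
            continue; // skip empty lines
        }
        $type = null;
        foreach( $grammar as $name => $pattern ) {
            if( preg_match( $pattern, $line ) ) {
                $type = $name;
                break;
            }
        }
        if( $type === null ) {
            $unmatched[$index + 1] = $line; // keyed by 1-based input line number
        } else {
            $matched[$index + 1] = $type;
        }
    }
    return array( $matched, $unmatched );
}

// A few lines taken from the sample page in the question.
$sample = array(
    '         2   (10.00 am)',
    '         3   LORD JUSTICE LEVESON:  Good morning.',
    '         5       with witnesses from the mobile phone companies,',
    '                                         1',
);

list( $matched, $unmatched ) = classifyLines( $sample, $grammar );
print_r( $matched );   // time, speaker and text lines
print_r( $unmatched ); // the bare page number is reported for review
```

Each pass over the full text then yields a list of unmatched lines; a new pattern is added to the grammar for each recurring kind, and the run is repeated until the remainder is small enough to fix by hand.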

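【Comments】:

Thanks for your reply, and apologies if the question really is out of scope for SO. I know this is a very big task (and I have never done anything like it before), so I'm asking with an open mind for any advice. On that note: given how hard it is, do you think it's even worth it?
Whether it's "worth it" depends entirely on a cost-benefit analysis. Don't forget to include the intangible benefits, such as the fun of designing and programming, and the valuable experience you may gain.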