【Question title】: How to transform structured text files into a PHP multidimensional array
【Posted】: 2013-08-19 16:21:53
【Question description】:

I have 100 files, each containing x news articles. Each article is made up of sections identified by the following abbreviations:

HD BY WC PD SN SC PG LA CY LP TD CO IN NS RE IPC PUB AN

where [LP] and [TD] can contain any number of paragraphs.

A typical article looks like this:

HD Corporate News: Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
WC 421 words
PD 12 July 2011
SN The Wall Street Journal
SC J
PG B7
LA English
CY (Copyright (c) 2011, Dow Jones & Company, Inc.) 

LP 

Alcoa Inc.'s profit more than doubled in the second quarter, but the giant 
aluminum producer managed only to meet analysts' recently lowered forecasts.

Alcoa serves as a bellwether for U.S. corporate earnings because it is the 
first major company to report and draws demand from a wide range of 
industries.

TD 

The results marked an early test of how corporate optimism is holding up 
in the face of bleak economic news.

License this article from Dow Jones Reprint 
Service[http://www.djreprints.com/link/link.html?FACTIVA=wjco20110712000115]

CO 
almam : ALCOA Inc

IN 
i2245 : Aluminum | i22 : Primary Metals | i224 : Non-ferrous Metals | imet 
  : Metals/Mining

NS 
c15 : Performance | c151 : Earnings | c1521 : Analyst 
Comment/Recommendation | ccat : Corporate/Industrial News | c152 : 
Earnings Projections | ncat : Content Types | nfact : Factiva Filters | 
nfce : FC&E Exclusion Filter | nfcpin : FC&E Industry News Filter

RE 
usa : United States | use : Northeast U.S. | uspa : Pennsylvania | namz : 
North America

IPC 
DJCS | EWR | BSC | NND | CNS | LMJ | TPT

PUB 
Dow Jones & Company, Inc.

AN 
Document J000000020110712e77c00035

After each article there are 4 line breaks before the next article starts. I need to put these articles into an array, like this:

$articles = array(
  [0] => array (
    [HD] => Corporate News: Alcoa earnings Soar; Outlook...
    [BY] => By James R. Hagerty...
    ...
    [AN] => Document J000000020110712e77c00035
  )
)

【Question comments】:

Tags: php regex


【Solution 1】:

One approach: use explode to separate each block, then a regular expression to extract the fields:

$pattern = <<<'LOD'
~
# definition
(?<fieldname> (?:HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)$ ){0}

# pattern
\G(?<key>\g<fieldname>) \s+
(?<value>
    .+ 
    (?: \R{1,2} (?!\g<fieldname>) .+ )*+
)
(?:\R{1,3}|\z)
~xm
LOD;
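// $text is assumed to hold the contents of one file (for example, read with
// file_get_contents()). Articles are separated by four CRLF line breaks.
$subjects = explode("\r\n\r\n\r\n\r\n", $text);
$result = array();

foreach ($subjects as $i => $subject) {
    // PREG_SET_ORDER yields one array per field, with the named groups "key" and "value"
    if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $result[$i][$match['key']] = $match['value'];
        }
    }
}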
echo '<pre>', print_r($result, true);
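
The comments under this answer trace a zero-match result to line endings, so here is a hedged sketch of a line-based alternative that normalizes line endings first. It is not the regex approach above: the function name parseArticles is hypothetical, and it assumes that articles are separated by at least four consecutive newlines and that no paragraph line starts with a field code followed by whitespace (such a line would be mistaken for a new field).

```php
<?php
// Hypothetical helper, not part of the answer above: a line-based alternative.
function parseArticles(string $text): array
{
    $codes = ['HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY', 'LP', 'TD',
              'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN'];

    // Normalize \r\n and lone \r to \n so the rest does not depend on the file's origin.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    $articles = [];
    // Assumption: four or more consecutive newlines separate two articles.
    foreach (preg_split('~\n{4,}~', trim($text)) as $block) {
        $article = [];
        $current = null;
        foreach (explode("\n", $block) as $line) {
            // A field starts with a known code followed by whitespace or the end of the line.
            if (preg_match('~^([A-Z]{2,3})(?:\s+(.*))?$~', $line, $m)
                && in_array($m[1], $codes, true)) {
                $current = $m[1];
                $article[$current] = trim($m[2] ?? '');
            } elseif ($current !== null) {
                // Any other line continues the current field (paragraphs in LP and TD).
                $article[$current] .= "\n" . $line;
            }
        }
        if ($article !== []) {
            // Trim the blank lines that surround multi-line values.
            $articles[] = array_map('trim', $article);
        }
    }
    return $articles;
}

// Usage with a tiny made-up sample (two articles, CRLF line endings).
$sample = "HD First title\r\nLP \r\n\r\nPara one.\r\n\r\nPara two.\r\nAN Document 1"
        . "\r\n\r\n\r\n\r\n"
        . "HD Second title\r\nAN Document 2";
$parsed = parseArticles($sample);
print_r($parsed);
```

Because the value is read from the same line as the code when present, this sketch accepts both layouts: a code alone on its line, and a code followed by its value as in the sample article.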

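Pattern details:

The pattern is divided into two parts:

In the definition part, I write a subpattern named fieldname so that it can be reused later in the main pattern. The {0} quantifier means it matches nothing at the place where it is defined. This subpattern also checks, with the $ anchor, that each field name sits at the end of a line.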
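
As a minimal illustration of that define-then-call idiom, the sketch below uses a made-up subpattern named code and made-up test strings; only the {0} definition and the \g<name> call mirror the answer.

```php
<?php
// Hypothetical mini-example: a subpattern defined once with {0}, then reused via \g<name>.
$pattern = '~
    (?<code> [A-Z]{2,3} ){0}     # definition only: {0} means it matches nothing here
    ^ \g<code> - \g<code> $      # the main pattern calls the definition twice
~x';

var_dump(preg_match($pattern, 'HD-PUB'));  // int(1)
var_dump(preg_match($pattern, 'HD-pub'));  // int(0)
```

Keeping the alternation in one named definition means the list of field codes is written once, even though the main pattern needs it both for the key and for the negative lookahead.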

Main pattern:

\G                        # forces the match to start exactly where the previous
                          # match ended, or at the start of the string (no gap)
(?<key> \g<fieldname> )   # a capturing group named "key" for the field name
\s+                       # one or more whitespace characters
(?<value>                 # open a capturing group named "value" for the
                          # field content
    .+                    # any characters except newlines, 1 or more times
    (?:                   # open a non-capturing group
        \R{1,2}           # one or two newlines, to allow paragraphs (LP & TD)
        (?!\g<fieldname>) # but not followed by a field name (only a check)
        .+                # the next line of content
    )*+                   # close the group and repeat it 0 or more times (possessive)
)                         # close the capturing group "value"
(?:\R{1,3}|\z)            # between 1 and 3 newlines, or the end of the
                          # string (necessary for contiguous matches)
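
To see what the \G anchor buys, here is a small sketch on a made-up string; the variable names are illustrative only. With \G the scan stops at the first gap, while the unanchored pattern skips over it.

```php
<?php
// With \G, each match must start exactly where the previous one ended.
$text = "AB12CD34--EF56";

preg_match_all('~\G[A-Z]{2}\d{2}~', $text, $anchored);
preg_match_all('~[A-Z]{2}\d{2}~', $text, $free);

print_r($anchored[0]); // AB12, CD34: the scan stops at the "--" gap
print_r($free[0]);     // AB12, CD34, EF56
```

In the answer's pattern this is why an unexpected line makes the remaining fields of an article disappear rather than being matched out of order.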

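Global modifiers:

  • x (extended mode): whitespace and inline comments starting with # are ignored in the pattern.
  • m (multiline mode): ^ matches at the start of a line, $ matches at the end of a line.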
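
A short sketch of both modifiers at work, using two "code : description" lines borrowed from the sample article; the variable names are made up. The compact and the extended pattern are meant to be equivalent, the second one only being easier to read.

```php
<?php
$lines = "almam : ALCOA Inc\nusa : United States";

$compact  = '~^(\w+)\s*:\s*(.+)$~m';
$extended = '~
    ^ (\w+)       # code at the start of a line (needs the m modifier)
    \s* : \s*     # separator; these literal spaces are ignored in x mode
    (.+) $        # description up to the end of the line
~xm';

preg_match_all($compact,  $lines, $a, PREG_SET_ORDER);
preg_match_all($extended, $lines, $b, PREG_SET_ORDER);

var_dump($a === $b);             // bool(true)
print_r(array_column($a, 2, 1)); // almam => ALCOA Inc, usa => United States
```

Without the m modifier, ^ and $ would only anchor at the start and end of the whole string, so neither pattern would match the two-line input.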

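【Comments】:

  • You may want to link to the Heredoc string reference. If someone pastes this into something like Dreamweaver, they will get all sorts of errors. php.net/manual/en/…
  • Thanks! But running it on the example in the question returns zero matches. $subjects contains one document (the $text given in the question), but the pattern does not match anything?
  • @Pr0no: I have tested it with the text sample and it works fine. I will post the data sample.
  • I have updated the question to reflect your answer. I can't get it to work. Where am I going wrong?
  • @Pr0no: The first delimiter ~ must be on the line immediately after 'LOD', with no spaces or tabs before it. Since your text file uses \r\n line breaks, you must replace \n with \r\n in the pattern; see the edit.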