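【Question title】: XML file parsing: how to start
【Posted】: 2014-10-25 17:36:12
【Question】:

I need some help parsing an XML file. This is my first time doing this kind of task, and I would appreciate any advice or help. I have a huge file like this: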

<Response success="true" start_row="0" num_rows="100" total_rows="100">
<ncbi-genes>
    <ncbi-gene>
        <acronym>Accn1</acronym>
        <alias-tags>BNC1 BNaC1 ACIC2 ASIC2 Mdeg BNaC1a</alias-tags>
        <data-sets>
            <data-set>
                <blue-channel nil="true"/>
                <delegate type="boolean">true</delegate>
                <specimen>
                    <chemotherapy nil="true"/>
                    <donor-id type="integer">9456</donor-id>
                    <donor>
                        <age-id type="integer">1</age-id>
                        <condition-description>TS26</condition-description>
                        <age>
                            <age-group-id type="integer">1</age-group-id>
                            <days type="float">18.5</days>
                        </age>
                    </donor>
                </specimen>
                <differential-expression-rankings type="array">
                    <differential-expression-ranking>
                        <structure>
                            <acronym>PPH</acronym>
                            <name>prepontine hindbrain</name>
                        </structure>
                    </differential-expression-ranking>
                    <differential-expression-ranking>
                        <structure>
                            <acronym>p3</acronym>
                            <name>prosomere 3</name>
                        </structure>
                    </differential-expression-ranking>
                </differential-expression-rankings>
            </data-set>
            <data-set>
              (...same fields as before...)
            </data-set>
        </data-sets>
   </ncbi-gene>
</ncbi-genes>

I need to extract:

1) {ncbi-genes}->{ncbi-gene}->{acronym}

2) {ncbi-genes}->{ncbi-gene}->{data-sets}->{data-set}->{specimen}->{donor}->{age}->{days}

3) {ncbi-genes}->{ncbi-gene}->{data-sets}->{data-set}->...->{structure}->{name}

The file contains 100 entries, and each of these fields can occur more than once within an entry.

What I have tried:

#!/usr/bin/perl -w
use strict;
use warnings;
#use XML::Parser;
use LWP::Simple;  # used to fetch the chatterbox ticker
use XML::Simple;
use Data::Dumper;

my $file1 = 'file.xml';
my $xml = new XML::Simple;

my $data = $xml->XMLin($file1, ForceArray => 1);
print Dumper($data); ## This prints all data OK

#To print the acronym field
foreach my $genelist (@{$data->{ncbi-genes}}) {
    print $genelist;
    my $curr_gene= $genelist->{ncbi-gene};
    print $curr_gene->{acronym} . "\n"
}

This loop does not work. I thought it was because of the "-" in ncbi-genes, so I renamed that field to NCBIGENES, and now the error is:

Not a HASH reference at xml_parser.pl line 19.
HASH(0x29d7ca0)

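So the problem is in how I access the hash... As I said, I am new to this kind of data, and this is my first time using the XML modules, so any advice that helps me get oriented is much appreciated.

Thanks in advance.

【Comments】: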

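  • This script does not have 26 lines. Which line is raising the error?
  • Have you read the "Status of this module" section in the XML::Simple documentation?
  • Side note: quote the key, as in $genelist->{'ncbi-gene'}, so that you do not have to rename the elements.
  • @cucurbit XML::Twig is fairly user-friendly, though you may find it better to start with XML::LibXML.
  • "Huge" is meaningless. It could mean anything from 1 MB to 100 GB. That is a big difference, and it may affect how you go about this.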
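
The quoting tip from the comments, together with the "Not a HASH reference" error, can be sketched without XML::Simple at all. Perl reads an unquoted `ncbi-genes` inside `{}` as the subtraction `ncbi - genes`, so hyphenated keys must be quoted. The data structure below is a hand-built stand-in for what `XMLin(..., ForceArray => 1)` is assumed to return for the sample file (every child element wrapped in an array reference); the second acronym is made up for illustration.

```perl
use strict;
use warnings;

# Hand-built stand-in (an assumption) for the structure XML::Simple
# produces with ForceArray => 1: each child element is an array ref.
my $data = {
    'ncbi-genes' => [
        {
            'ncbi-gene' => [
                { acronym => ['Accn1'] },
                { acronym => ['HypotheticalGene'] },    # made-up entry
            ],
        },
    ],
};

my @acronyms;

# Quote the hyphenated keys, and dereference each level as an array.
for my $genelist ( @{ $data->{'ncbi-genes'} } ) {
    for my $gene ( @{ $genelist->{'ncbi-gene'} } ) {
        # 'acronym' is itself an array ref, so take its first element
        push @acronyms, $gene->{acronym}[0];
    }
}

print "$_\n" for @acronyms;
```

Under these assumptions, `$genelist->{'ncbi-gene'}` is an array reference, which is why treating it as a hash fails in the original loop.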

Tags: xml perl


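【Solution 1】:

Here is a simple example that uses XML::LibXML for the parsing. XML::LibXML gives you easy access to XPath, an XML query language that lets you select sets of nodes based on tag names, values, attributes, and/or relationships to other nodes. With XPath it is easy to pick out "all x nodes under y nodes", or "all x nodes with attribute z that are descendants of a node with ID w", or similarly complex queries.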

use strict;
use warnings;
use feature qw(say);
use Data::Dumper;
use XML::LibXML;

my $tree = XML::LibXML->load_xml( IO => \*DATA );

## make sure that we have some genes!
die "Could not find any genes!" if ! $tree->exists('//ncbi-gene');

# for every 'ncbi-gene' node:
for my $gene ( $tree->findnodes('//ncbi-gene') ) {
    my %data;
    # is there an acronym as direct child of the node?
    $data{acronym} = $gene->findvalue('acronym') if $gene->exists('acronym');

    # find the donor age in days using the path specified
    # to get the value of each node, run to_literal on it
    $data{donor_age_days} = [ map { $_->to_literal }
        $gene->findnodes('data-sets/data-set/specimen/donor/age/days') ];

    # find all the 'name' nodes under a 'structure' node that is a descendant of $gene
    $data{structures} = [ map { $_->to_literal }
        $gene->findnodes('descendant::structure/name') ];

    # a path starting with '//' is absolute: this finds any 'name' node under a
    # 'structure' node anywhere in the document, not just under $gene
    $data{all_structures} = [ map { $_->to_literal }
        $gene->findnodes('//structure/name') ];

    # an example of using findvalue on a query that matches several nodes: the
    # text of all matching nodes is concatenated into a single string.
    $data{acronyms_str} = [ $gene->findvalue('//structure/acronym') ];

    say Dumper( \%data );
}

__DATA__
<Response success="true" start_row="0" num_rows="100" total_rows="100">
<ncbi-genes>
    <ncbi-gene>
        <acronym>Accn1</acronym>
        <alias-tags>BNC1 BNaC1 ACIC2 ASIC2 Mdeg BNaC1a</alias-tags>
        <data-sets>
            <data-set>
                <blue-channel nil="true"/>
                <delegate type="boolean">true</delegate>
                <specimen>
                    <chemotherapy nil="true"/>
                    <donor-id type="integer">9456</donor-id>
                    <donor>
                        <age-id type="integer">1</age-id>
                        <condition-description>TS26</condition-description>
                        <age>
                            <age-group-id type="integer">1</age-group-id>
                            <days type="float">18.5</days>
                        </age>
                    </donor>
                </specimen>
                <differential-expression-rankings type="array">
                    <differential-expression-ranking>
                        <structure>
                            <acronym>PPH</acronym>
                            <name>prepontine hindbrain</name>
                        </structure>
                    </differential-expression-ranking>
                    <differential-expression-ranking>
                        <structure>
                            <acronym>p3</acronym>
                            <name>prosomere 3</name>
                        </structure>
                    </differential-expression-ranking>
                </differential-expression-rankings>
            </data-set>
            <data-set>
              (...same fields as before...)
            </data-set>
        </data-sets>
   </ncbi-gene>
   <ncbi-favourite-places>
      <structure>
         <name>Eiffel Tower</name>
      </structure>
   </ncbi-favourite-places>
</ncbi-genes>
</Response>

Output (note that I made some changes to your XML!):

$VAR1 = {
  'acronym' => 'Accn1',
  'donor_age_days' => [
    '18.5'
  ],
  'structures' => [
    'prepontine hindbrain',
    'prosomere 3'
  ],
  'acronyms_str' => [
    'PPHp3'
  ],
  'all_structures' => [
    'prepontine hindbrain',
    'prosomere 3',
    'Eiffel Tower'
  ]
};
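
The difference between `structures` and `all_structures` in the output comes down to relative versus absolute XPath expressions. A minimal sketch on a trimmed-down copy of the sample data: a path that starts with `.//` searches only below the context node, while a path that starts with `//` searches the whole document, even when it is called on `$gene`.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down version of the sample document from the answer.
my $xml = <<'XML';
<ncbi-genes>
  <ncbi-gene>
    <structure><name>prepontine hindbrain</name></structure>
  </ncbi-gene>
  <ncbi-favourite-places>
    <structure><name>Eiffel Tower</name></structure>
  </ncbi-favourite-places>
</ncbi-genes>
XML

my $doc = XML::LibXML->load_xml( string => $xml );
my ($gene) = $doc->findnodes('//ncbi-gene');

# './/' is relative: only descendants of $gene are searched.
my @relative = map { $_->textContent } $gene->findnodes('.//structure/name');

# '//' is absolute: the whole document is searched, $gene is ignored.
my @absolute = map { $_->textContent } $gene->findnodes('//structure/name');

print "relative: ", join( ', ', @relative ), "\n";
print "absolute: ", join( ', ', @absolute ), "\n";
```

Here the relative query returns only "prepontine hindbrain", while the absolute one also picks up "Eiffel Tower".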

zvon.org has some good XPath tutorials that should come in handy when navigating XML documents. Note that libxml2, the library that XML::LibXML is built on, only implements XPath 1.0.
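
As a rough illustration of what XPath 1.0 already covers, the sketch below uses an attribute predicate, a child-value predicate, and the `count()` function on a shortened copy of the sample data. The variable names are only for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened copy of the sample document.
my $xml = <<'XML';
<ncbi-genes>
  <ncbi-gene>
    <acronym>Accn1</acronym>
    <days type="float">18.5</days>
    <structure><acronym>PPH</acronym><name>prepontine hindbrain</name></structure>
    <structure><acronym>p3</acronym><name>prosomere 3</name></structure>
  </ncbi-gene>
</ncbi-genes>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Attribute predicate: 'days' elements whose type attribute is "float".
my @float_days = map { $_->textContent } $doc->findnodes('//days[@type="float"]');

# Child-value predicate: the structure whose acronym is "PPH".
my $pph_name = $doc->findvalue('//structure[acronym="PPH"]/name');

# XPath 1.0 function: how many structure elements are there?
my $structure_count = $doc->findvalue('count(//structure)');

print "float days: @float_days\n";
print "PPH is: $pph_name\n";
print "structures: $structure_count\n";
```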

Here is a quick example of gathering the data for each data-set node:

for my $gene ( $tree->findnodes('//ncbi-gene') ) {
    my $data;
    for my $ds ( $gene->findnodes('data-sets/data-set') ) {

        # get the age in days -- assumes there is only one age per <data-set>
        my $age = $ds->findvalue('specimen/donor/age/days');

        # get the structures associated with that age
        my @structures = map { $_->to_literal }
                     $ds->findnodes('descendant::structure/name');
        # you can now save them however you like--e.g.
        push @{$data->{$age}}, @structures;
    }
    # show what was collected for this gene: age => [ structure names ]
    print Dumper($data);
}
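
A self-contained variant of the same idea, for readers who want to run it directly: it groups structure names by donor age and skips any data-set that has no age. The XML is a simplified, partly made-up sample (the second data-set and its "hypothetical region" are not from the original file), and `%by_age` is just an illustrative name.

```perl
use strict;
use warnings;
use XML::LibXML;

# Simplified sample; the second and third data-sets are invented.
my $xml = <<'XML';
<ncbi-gene>
  <data-sets>
    <data-set>
      <specimen><donor><age><days type="float">18.5</days></age></donor></specimen>
      <structure><name>prepontine hindbrain</name></structure>
      <structure><name>prosomere 3</name></structure>
    </data-set>
    <data-set>
      <specimen><donor><age><days type="float">4</days></age></donor></specimen>
      <structure><name>hypothetical region</name></structure>
    </data-set>
    <data-set/>
  </data-sets>
</ncbi-gene>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

my %by_age;    # age in days => array ref of structure names
for my $ds ( $doc->findnodes('//ncbi-gene/data-sets/data-set') ) {
    my $age = $ds->findvalue('specimen/donor/age/days');
    next unless length $age;    # skip data-sets without an age

    push @{ $by_age{$age} },
        map { $_->textContent } $ds->findnodes('.//structure/name');
}

# Print the groups in ascending order of age.
for my $age ( sort { $a <=> $b } keys %by_age ) {
    print "$age days: ", join( ', ', @{ $by_age{$age} } ), "\n";
}
```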

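【Comments】:

  • Works great, thanks a lot. Now I will try to understand every line :)
  • Is there a way to group the structures by their associated age? I mean, I would like to know which structures each age contains... maybe by saving the age as a key and building an array of structures for that key?
  • Yes. You will probably want to go through each data-set in turn, to make sure that ages are associated with the right structures. I'll add an example to the answer.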