Extract text between two XML tags using sed: answers

【Question title】: Extract text between two XML tags using sed
【Posted】: 2014-11-13 19:56:21
【Question】:

I have XML files similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
        <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
    </doc>
</OnlineCommentary>

I want to run a command on this file that extracts only the text between the opening tag <seg ...> and the closing tag </seg>.

I tried:

sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt

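My questions are:

-- How can I print every <seg id="*">? My command only prints the content of the first tag (<seg id="*">).

-- Is there a way to have, for example, <seg id="1">, <seg id="2">, and <seg id="3"> printed on the same line, while an element that contains only <seg id="1"> is printed on a line of its own?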

【Question comments】:

Tags: xml regex linux shell sed

【Solution 1】:

To print every <seg id=...> element, one per line, including the <seg> tags themselves:

sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
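
If only the text is wanted (which is what the question asks for) rather than the whole element, one option is to move the capture group inside the tags. The following is a minimal sketch, assuming every <seg> element sits on a single line as in the question's file; the function name seg_text and the shortened sample file are made up for illustration.

```shell
#!/bin/sh
# Shortened stand-in for the question's file (assumption: one <seg> per line).
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> You seem quite uptight. </seg>
        <seg id="2"> Come back soon. </seg>
    </doc>
</OnlineCommentary>
EOF

# Hypothetical helper: print the text between <seg id="N"> and </seg>,
# one segment per line. The capture group \(...\) now covers only the text,
# and [0-9]\{1,\} accepts any numeric id instead of the hard-coded "1".
seg_text() {
    sed -n 's:.*<seg id="[0-9]\{1,\}">\(.*\)</seg>.*:\1:p' "$1"
}

seg_text "$sample_file"
```

The spaces just inside the tags are kept as they are, so each output line starts and ends with a space.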

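To print everything on one line, separated by commas: append each match to the hold space instead of printing it, then on the last line retrieve the hold space, replace the newlines with , (and remove the leading , left over from the append operation), and print the result: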

sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:  { s//\1/
   H
   }
$ {g
   s/\n/,/g;s/^,//
   p
   }' XML-file.xml > output.txt
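
The command above joins every segment of the file into a single line. For the second question (one output line per <doc>), the same hold-space idea can be flushed at each </doc> instead of only at the end of the file. This is a sketch under assumptions rather than a tested solution for arbitrary XML: it assumes each <seg> and each </doc> is on its own line, it joins segments with a space instead of a comma, and the function name seg_per_doc and the shortened sample file are hypothetical.

```shell
#!/bin/sh
# Shortened stand-in for the question's file (assumption: one tag per line).
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> You seem quite uptight. </seg>
        <seg id="2"> Come back soon. </seg>
    </doc>
</OnlineCommentary>
EOF

# Hypothetical helper: print the segments of each <doc> on one line.
seg_per_doc() {
    sed -n '
    # On a <seg> line: keep only the text (trimmed) and append it to the hold space.
    /<seg id="[0-9]\{1,\}">/ {
        s:.*<seg id="[0-9]\{1,\}">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*</seg>.*:\1:
        H
    }
    # On </doc>: fetch the collected text, join it, print it, then empty the hold space.
    /<\/doc>/ {
        x
        s/^\n//
        s/\n/ /g
        p
        s/.*//
        x
    }' "$1"
}

seg_per_doc "$sample_file"
```

A <doc> without any <seg> would produce an empty line with this script; filtering those out is left aside here.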

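That said, @Choroba's suggestion to use a proper XML tool is a very good one: it minimizes the risk of picking up unwanted data from the file.

【Comments】:

Thanks for your help!!

【Solution 2】:

Use a proper XML processing tool. For example, in XML::XSH2: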

open file.xml ;
for //doc echo seg/text() ;

【Comments】:

Thanks for your suggestion!!