【Question title】: Use SED to extract value of all input elements with a certain name
【Posted】: 2016-01-11 12:46:11
【Question】:

How can I get the value attribute based on a search of some other attribute?

For example:

<body>
<input name="dummy" value="foo">
<input name="alpha" value="bar">
</body>

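How do I get the value of the input element whose name is "dummy"?

【Comments】:

You can get it with this command: sed -n -E 's/.*input name="dummy" value="([^"]*)".*/\1/p'. However, an html/xml parser is the right tool for this job.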
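
To make the limits of that one-liner concrete, here is a minimal Python sketch that roughly mirrors it: the text is processed line by line and the capture group is printed for each matching line. The helper name `extract_values` and the `reordered` sample are hypothetical, introduced only for illustration.

```python
import re

# Same idea as the sed pattern: name="dummy" must come immediately before value.
PATTERN = re.compile(r'input name="dummy" value="([^"]*)"')

def extract_values(html):
    """Return the captured value from every matching line (like sed -n '.../p')."""
    values = []
    for line in html.splitlines():
        match = PATTERN.search(line)
        if match:
            values.append(match.group(1))
    return values

sample = '''<body>
<input name="dummy" value="foo">
<input name="alpha" value="bar">
</body>'''

print(extract_values(sample))      # ['foo']

# Hypothetical variant with the attributes swapped: the pattern no longer matches.
reordered = '<input value="foo" name="dummy">'
print(extract_values(reordered))   # []
```

The second call shows why the commenter recommends a parser: the pattern depends on the exact attribute order and spacing.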

Tags: html regex bash sed

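【Solution 1】:

Since you are looking for a solution using bash and sed, I assume you want a Linux command-line option.

Use the `hxselect` HTML parsing tool to extract the element; use `sed` to extract the value from the element

I googled "linux bash parse html tool" and found this: https://unix.stackexchange.com/questions/6389/how-to-parse-hundred-html-source-code-files-in-shell

The accepted answer suggests the hxselect tool from the html-xml-utils package, which extracts elements based on CSS selectors. So after installing it (download, unpack, ./configure, make, make install), you can run this command with the given CSS selector: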

hxselect "input[name='dummy']" < example.html

(Assuming example.html contains the sample HTML from the question.) This returns:

<input name="dummy" value="foo"/>

Almost there. We need to extract the value from that line:

hxselect "input[name='dummy']" < example.html | sed -n -e "s/^.*value=['\"]\(.*\)['\"].*/\1/p"

This returns foo.
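
Note that the sed expression above is greedy: if another quoted attribute followed value on the same line, the capture would run on to the last quote. As a hedged sketch of a tighter second stage, the Python snippet below requires the closing quote to be the same character as the opening one. The names `VALUE_RE` and `value_of` are hypothetical, and this is still regex-based attribute parsing, not a real parser.

```python
import re

# value=, an opening quote, the shortest possible run of text,
# then the same quote character again (backreference \1).
VALUE_RE = re.compile(r"""value=(['"])(.*?)\1""")

def value_of(element):
    """Return the value attribute of one serialized element, or None."""
    match = VALUE_RE.search(element)
    return match.group(2) if match else None

print(value_of('<input name="dummy" value="foo"/>'))            # foo
print(value_of('<input value="foo" name="dummy" class="x"/>'))  # foo
print(value_of("<input name='dummy' value='it\"s'/>"))          # it"s
```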

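Why you would/wouldn't want to use this approach

Using regex to parse out the attributes is complicated, and often the wrong way to go
The hxselect tool (in my other answer) is a hassle to install
However, this approach accepts malformed HTML, which is exactly what this answer to the question linked above argues for. Incidentally, that question contains a very thorough discussion of the regex+HTML debate.

【Comments】:

Coming back to this after the fact, I really don't think this is a good answer: it is awkward and overly complicated, and it doesn't follow @Ramana's advice, since it still uses SED to parse the element's attributes. I did more research and answered again with a different approach.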

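【Solution 2】:

Since you asked for SED, I assume you need a command-line option. However, a tool built for HTML parsing is likely to be more effective. The problem with my first answer was that I didn't know of a way to select attribute values in CSS (does anyone else?). With XML, however, you can select attributes just as you select elements. Here is a command-line option using an XML parsing tool.

Treat it as XML; use XPath

Install xmlstarlet with your package manager
Run xmlstarlet sel -t -v "//input[@name='dummy']/@value" example.html (where example.html contains your HTML)
If your HTML is not valid XML, follow xmlstarlet's warnings and make the necessary changes (in this case, <input> must be changed to <input/>)
Run the command again. It returns: foo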
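
The same "treat it as XML" idea can be sketched with Python's standard library, in case xmlstarlet is not available. This is an illustration under assumptions, not a drop-in replacement: the `xhtml` sample is the question's markup with the inputs already self-closed, and ElementTree supports only a subset of XPath, so the attribute is read with `get("value")` instead of an `/@value` step.

```python
import xml.etree.ElementTree as ET

# The question's sample, already corrected to well-formed XML.
xhtml = """<body>
<input name="dummy" value="foo"/>
<input name="alpha" value="bar"/>
</body>"""

root = ET.fromstring(xhtml)

# Select every <input> whose name attribute is 'dummy', then read its value.
values = [el.get("value") for el in root.findall(".//input[@name='dummy']")]
print(values)  # ['foo']
```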

Why you might/might not use this approach

It is much simpler and more robust than hand-rolling a regex html parser, but
it requires well-formed HTML
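
The well-formedness requirement is easy to demonstrate. In this small Python sketch (the helper name `is_well_formed` is hypothetical), the question's original markup is rejected by a strict XML parser because the `<input>` tags are never closed, while the self-closed version parses.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The HTML exactly as given in the question: <input> is left unclosed.
html = """<body>
<input name="dummy" value="foo">
<input name="alpha" value="bar">
</body>"""

# The same markup with each <input ...> turned into <input .../>.
xhtml = html.replace('">', '"/>')

print(is_well_formed(html))   # False
print(is_well_formed(xhtml))  # True
```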

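【Comments】:

【Solution 3】:

Parsing HTML with sed is generally a bad idea, because sed works line by line and HTML does not usually treat newlines as syntactically significant. It is not good if your HTML-processing tool breaks as soon as the HTML is reformatted.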
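
A short sketch of that failure mode, written in Python to imitate sed's line-at-a-time processing (the function name `line_based_extract` and the `wrapped` sample are hypothetical): the same element, merely re-wrapped across two lines, is no longer found.

```python
import re

PATTERN = re.compile(r'<input name="dummy" value="([^"]*)"')

def line_based_extract(html):
    """Apply the pattern to each line separately, as a sed script would."""
    return [m.group(1)
            for line in html.splitlines()
            for m in PATTERN.finditer(line)]

one_line = '<input name="dummy" value="foo">'
# Identical HTML as far as a browser is concerned, but wrapped by a formatter.
wrapped = '<input name="dummy"\n       value="foo">'

print(line_based_extract(one_line))  # ['foo']
print(line_based_extract(wrapped))   # []
```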

Consider using Python instead, which has an HTML push parser in its standard library. For example:

#!/usr/bin/env python3

from html.parser import HTMLParser
from sys import argv

# Our parser. It inherits the standard HTMLParser that does most of
# the work.
class MyParser(HTMLParser):
    # We just hook into the handling of start tags to extract the
    # attribute
    def handle_starttag(self, tag, attrs):
        # Build a dictionary from the attribute list for easier
        # handling
        attrs_dict = dict(attrs)

        # Then, if the tag matches our criteria
        if tag == 'input' \
           and 'name' in attrs_dict \
           and attrs_dict['name'] == 'dummy':
            # Print the value attribute (or an empty string if it
            # doesn't exist)
            print(attrs_dict['value'] if 'value' in attrs_dict else "")

# After we defined the parser, all that's left is to use it. So,
# build one:
p = MyParser()

# And feed a file to it (here: the first command line argument).
# The file is opened in text mode because feed() expects a str.
with open(argv[1]) as f:
    p.feed(f.read())

Save this code as foo.py, then run

python3 foo.py foo.html

where foo.html is your HTML file.
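
Since the question title asks for the value of all input elements with a certain name, a variant that collects the values instead of printing them may be handy. The following is a minimal sketch for Python 3's `html.parser`; the class name `InputValueCollector` and the sample markup are hypothetical. It also shows that the parser copes with a tag wrapped across lines, swapped attribute order, and self-closing syntax.

```python
from html.parser import HTMLParser

class InputValueCollector(HTMLParser):
    """Collect the value of every <input> whose name matches `name`."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs_dict = dict(attrs)
        if tag == "input" and attrs_dict.get("name") == self.name:
            # A missing or valueless attribute is recorded as an empty string.
            self.values.append(attrs_dict.get("value") or "")

html = '''<body>
<input name="dummy"
       value="foo">
<input name="alpha" value="bar">
<input value='baz' name="dummy"/>
</body>'''

collector = InputValueCollector("dummy")
collector.feed(html)
print(collector.values)  # ['foo', 'baz']
```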

【Comments】:
