Use Python lxml to extract information from an HTML webpage: answers

【Question title】: Use python LXML to extract information from html webpage
【Posted】: 2015-10-06 10:50:10
【Question description】:

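I am trying to write a Python script, with the limited knowledge I have, to scrape specific information from a webpage. But I guess my limited knowledge is not enough. I need to extract 7-8 pieces of information. The tags are as follows -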

<a class="ui-magnifier-glass" href="here goes the link that i want to extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>

<a href="link to extract" title="title to extract" rel="category tag" data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>

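If I knew how to extract information from these href tags, I would be able to do the rest myself.

If someone could also help me write code to add this information to a csv file, it would be greatly appreciated.

I have started with this code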

import requests
from lxml import html

url = raw_input('url : ')

page = requests.get(url)
tree = html.fromstring(page.text)
productname = tree.xpath('//h1[@class="product-name"]/text()')
price = tree.xpath('//span[@id="sku-discount-price"]/text()')
print '\n' + productname[0]
print '\n' + price[0]

【Question comments】:

Do you want a BeautifulSoup way of parsing, since you have tagged it here? I think parsing with BeautifulSoup is by far the easiest.

Tags: python html beautifulsoup lxml python-requests

【Solution 1】:

Here is how to extract elements by id using lxml, with a little help from curl:

curl some.html | python extract.py

extract.py:

from lxml import etree
import sys
# grab all elements with id == 'postingbody'
pb = etree.HTML(sys.stdin.read()).xpath("//*[@id='postingbody']")
print(pb)

some.html:

<html>
    <body>
        <div id="nope">nope</div>
        <div id="postingbody">yep</div>
    </body>
</html>

See also:

XPath to select Element by attribute value
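
Applied to the two anchors from the question, selection by attribute value might look like the sketch below. It assumes the class and rel values shown in the question are stable enough to select on, and it uses an inline string (SAMPLE, a hypothetical name) in place of a downloaded page.

```python
from lxml import html

# Markup copied from the question (style attribute omitted for brevity);
# for a live page this string would be requests.get(url).text.
SAMPLE = """
<html><body>
<a class="ui-magnifier-glass" href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"></a>
<a href="link to extract" title="title to extract" rel="category tag"
   data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Ending the path with @name returns the attribute values as strings.
glass_links = tree.xpath('//a[@class="ui-magnifier-glass"]/@href')
category_links = tree.xpath('//a[@rel="category tag"]/@href')
category_titles = tree.xpath('//a[@rel="category tag"]/@title')

# text() returns the text nodes inside the matched elements.
category_texts = tree.xpath('//a[@rel="category tag"]/text()')

print(glass_links, category_links, category_titles, category_texts)
```

Each xpath() call returns a list, so an empty list (rather than an exception) signals that nothing matched.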

【Comments】:

【Solution 2】:

You can use lxml and the csv module to do what you want. lxml supports XPath expressions for selecting the elements you want.

from lxml import etree
from io import StringIO
from csv import DictWriter

# Sample markup from the question, wrapped in a file-like object
f = StringIO('''
    <html><body>
    <a class="ui-magnifier-glass"
       href="here goes the link that i want to extract"
       data-spm-anchor-id="0.0.0.0"
       style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
    ></a>
    <a href="link to extract"
       title="title to extract"
       rel="category tag"
       data-spm-anchor-id="0.0.0.0"
    >or maybe this word instead of title</a>
    </body></html>
''')
doc = etree.parse(f)

data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')

# Iterate through each matched <a> element
for elem in r:
    # You can access the attributes with get
    link = elem.get('href')
    title = elem.get('title')
    # and the text inside the tag is accessible with text
    text = elem.text

    data.append({
        'link': link,
        'title': title,
        'text': text
    })

# newline='' is what the csv module expects for files opened on Python 3
with open('file.csv', 'w', newline='') as csvfile:
    fieldnames = ['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for row in data:
        writer.writerow(row)
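
One caveat when moving from the inline sample to a downloaded page: etree.parse without an explicit parser treats the input as XML, which real-world HTML often is not. A minimal sketch of the difference, assuming a deliberately messy snippet (MESSY, xml_ok and rows are hypothetical names) and using the lenient lxml.html parser that the question's own code already imports:

```python
from lxml import etree, html

# An unclosed <br> is fine in HTML but makes the document invalid XML.
MESSY = ('<html><body><a href="link to extract" title="title to extract">'
         'text<br></a></body></html>')

# The strict XML parser rejects the snippet.
try:
    etree.fromstring(MESSY)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers, and the same get() calls work as before.
tree = html.fromstring(MESSY)
rows = [
    {'link': a.get('href'), 'title': a.get('title'), 'text': a.text_content()}
    for a in tree.xpath('//a')
]

print(xml_ok, rows)
```

The resulting list of dictionaries has the same shape as data above, so it can be passed to DictWriter unchanged.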

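【Comments】:

Thank you so much! Is there a way for me to get all the data in different variables, add them to a dictionary or a list, and then append that to the csv?
It already does that. For clarity I added more comments and refactored it. If you have not done so already, you should run it in an interactive Python session. It lets you step through line by line, see what is happening, and inspect the intermediate state.
Yes, I have run the code. But the problem is that it adds 3 rows of the same data to the csv
Maybe that is because data is a list and it is being used as a dictionary?
The example finds all elements that have the attribute data-spm-anchor-id="0.0.0.0". Since there are two such elements, there is a corresponding number of data rows. The first row is the header row, which tells you what the columns contain; it can be omitted by removing writer.writeheader().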
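
The row count described in the last comment can be checked without touching the file system. The sketch below is an illustration only: it hard-codes the two dictionaries the answer's loop would build for the sample markup and writes them to an in-memory buffer; to_csv and its with_header flag are hypothetical helpers, not part of the answer.

```python
import csv
import io

# The two rows the sample markup yields; missing attributes come back as None.
rows = [
    {'link': 'here goes the link that i want to extract', 'title': None, 'text': None},
    {'link': 'link to extract', 'title': 'title to extract',
     'text': 'or maybe this word instead of title'},
]

def to_csv(rows, with_header=True):
    """Write rows to an in-memory CSV and return the text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['link', 'title', 'text'])
    if with_header:
        writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

lines_with_header = to_csv(rows).splitlines()
lines_without_header = to_csv(rows, with_header=False).splitlines()

# One header line plus one line per matched element, or just the data lines.
print(len(lines_with_header), len(lines_without_header))
```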