HTML scraping using lxml

【Question title】: HTML scraping using lxml
【Posted】: 2018-02-22 05:36:38
【Question】:

I am scraping data using lxml.

This is the inspect-element markup for a single post:

<article id="post-4855" class="post-4855 post type-post status-publish format-standard hentry category-uncategorized">


<header class="entry-header">
    <h1 class="entry-title"><a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark">Cybage..</a></h1>
            <div class="entry-meta">
        <span class="byline"> Posted by <span class="author vcard"><a class="url fn n" href="http://aitplacements.com/author/tpoait/">TPO</a></span></span><span class="posted-on"> on <a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark"><time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></a></span><span class="comments-link"> with <a href="http://aitplacements.com/uncategorized/cybage/#respond">0 Comment</a></span>      </div><!-- .entry-meta -->
        </header><!-- .entry-header -->

<div class="entry-content">
    <p>cybage placement details shared <a href="http://aitplacements.com/uncategorized/cybage/" class="read-more">READ MORE</a></p>
        </div><!-- .entry-content -->

For each such post, I want to extract the title, the post content, and the publication time.

For the example above, the details would be:

{title: "Cybage..",
 post: "cybage placement details shared",
 datetime: "2017-09-13T11:02:32+00:00"
}

What I have managed so far: the site requires a login, and I have logged in successfully.

Extracting the information:

import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/42.0.2311.90'}
url = 'http://aitplacements.com/news/'
page = requests.get(url, headers=headers)
doc = html.fromstring(page.content)
# print(doc)  # prints <Element html at 0x7f59c38d2260>
raw_title = doc.xpath('//h1[@class="entry-title"]/a/@href/text()')
print(raw_title)

raw_title comes back as an empty list, [].

What am I doing wrong?

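【Comments on the question】:

You should take a look at Beautiful Soup. It is well suited to what you need. Or Scrapy, if you need something more advanced (e.g., spiders).
I was getting the empty result because I had been logged out. Logging in again fixed it.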

Tags: python html-parsing lxml

【Solution 1】:

@href refers to the value of the href attribute:

In [14]: doc.xpath('//h1[@class="entry-title"]/a/@href')
Out[14]: ['http://aitplacements.com/uncategorized/cybage/']

What you want is the text of the <a> element:

In [16]: doc.xpath('//h1[@class="entry-title"]/a/text()')
Out[16]: ['Cybage..']

So, use

raw_title = doc.xpath('//h1[@class="entry-title"]/a/text()')
if len(raw_title) > 0:
    raw_title = raw_title[0]
else:
    # handle the case of missing title
    raise ValueError('Missing title')
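
The same idea extends to the other two fields the question asks for. The following is only a sketch, not part of the original answer: it assumes every post sits in its own <article> element shaped like the sample markup in the question, and the names SAMPLE, first, and extract_posts are made up for illustration. It parses a trimmed copy of the sample instead of fetching the live page, so it runs without a login.

```python
import lxml.html as LH

# Trimmed copy of the markup from the question, wrapped in a <div> so the
# fragment has a single root element (an assumption made for this sketch).
SAMPLE = """
<div>
<article id="post-4855" class="post-4855 post type-post">
  <header class="entry-header">
    <h1 class="entry-title"><a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark">Cybage..</a></h1>
    <div class="entry-meta">
      <span class="posted-on"> on <a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark"><time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></a></span>
    </div>
  </header>
  <div class="entry-content">
    <p>cybage placement details shared <a href="http://aitplacements.com/uncategorized/cybage/" class="read-more">READ MORE</a></p>
  </div>
</article>
</div>
"""


def first(nodes, default=None):
    """Return the first XPath result, or a default when the list is empty."""
    return nodes[0] if nodes else default


def extract_posts(doc):
    """Collect title, post text, and datetime for every <article> in doc."""
    posts = []
    for article in doc.xpath('//article'):
        # Paths starting with './/' are evaluated relative to this article.
        title = first(article.xpath('.//h1[@class="entry-title"]/a/text()'))
        # p/text() selects only text nodes directly under <p>, so the
        # "READ MORE" link text inside the nested <a> is left out.
        body = ''.join(
            article.xpath('.//div[@class="entry-content"]/p/text()')
        ).strip()
        # The timestamp is an attribute, so @datetime is the right selector here.
        when = first(article.xpath('.//time/@datetime'))
        posts.append({'title': title, 'post': body, 'datetime': when})
    return posts


doc = LH.fromstring(SAMPLE)
posts = extract_posts(doc)
print(posts)
```

Against the live site, doc would instead come from html.fromstring(page.content) as in the question. A missing field shows up as None (or an empty string for the post text) rather than raising, which may or may not be what you want.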

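【Comments】:

Why is my raw_title empty? Did doc actually fetch the page?
If you are unsure what doc parsed, print LH.tostring(doc, pretty_print=True) (or write it to a file and inspect it there). You get an empty raw_title because a/@href/text() looks for text attached to the href attribute, and there is none. The text is attached to the <a> element.
The problem was that I had been logged out again. That solved it.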
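
The debugging tip in the comments can be sketched as follows. This is an illustration under assumptions, not the commenters' code: LH is taken to be lxml.html imported under that alias, the helper names dump_parsed and has_posts are hypothetical, and the "logged out" markup is an invented stand-in for whatever login page the site really serves.

```python
import lxml.html as LH


def dump_parsed(doc, path=None):
    """Return the parsed tree as pretty-printed text; optionally save it."""
    # encoding='unicode' makes tostring return str instead of bytes.
    text = LH.tostring(doc, pretty_print=True, encoding='unicode')
    if path is not None:
        with open(path, 'w', encoding='utf-8') as fh:
            fh.write(text)
    return text


def has_posts(doc):
    """Heuristic: a page with posts has at least one entry-title heading."""
    # Assumption: the login page contains no <h1 class="entry-title">.
    return bool(doc.xpath('//h1[@class="entry-title"]'))


# Invented stand-ins for the two responses the scraper might receive.
logged_out = LH.fromstring(
    '<html><body><form id="loginform"><input name="log"></form></body></html>'
)
logged_in = LH.fromstring(
    '<html><body><article><h1 class="entry-title">'
    '<a href="#">Cybage..</a></h1></article></body></html>'
)

for name, page_doc in (('logged_out', logged_out), ('logged_in', logged_in)):
    if not has_posts(page_doc):
        # Inspect what was actually parsed before blaming the XPath.
        print(name, 'has no posts; parsed HTML was:')
        print(dump_parsed(page_doc))
```

Running a check like has_posts before extracting makes an expired session visible right away, instead of surfacing later as an empty XPath result.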