How to extract href from an <a> element using lxml CSSSelector? Answers

【Question title】: How to extract href from an <a> element using lxml CSSSelector?
【Posted】: 2018-02-27 17:57:53
【Question description】:

import lxml.html
from lxml.cssselect import CSSSelector

def extract_page_data(html):
    tree = lxml.html.fromstring(html)
    item_sel = CSSSelector('.my-item')
    text_sel = CSSSelector('.my-text-content')
    time_sel = CSSSelector('.time')
    author_sel = CSSSelector('.author-text')
    a_tag = CSSSelector('.a')

    for item in item_sel(tree):
        yield {'href': a_tag(item)[0].text_content(),
               'my pagetext': text_sel(item)[0].text_content(),
               'time': time_sel(item)[0].text_content().strip(),
               'author': author_sel(item)[0].text_content()}

I want to extract the href, but I can't get it with this code.

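【Comments】:

In addition to the solution Sir Anderson has already provided, you also need to change your selector call, e.g. .cssselect() instead of .CSSSelector().
Like this, sir? a_tag = cssselect('.a') And thanks :)
Sorry, I misunderstood you. It looks like your approach is different.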

Tags: python-3.x beautifulsoup lxml lxml.html

【Solution 1】:

Try extracting @href as

'href': a_tag(item)[0].attrib['href']

or

'href': a_tag(item)[0].get('href')

As an option, you can also use XPath

tree.xpath(".//a/@href")
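
One detail the XPath form depends on: the leading . makes the path relative to the element that xpath() is called on. Called on tree, it collects every link in the document; called on each item, it keeps the links grouped per item. A sketch with invented markup (the class name my-item is borrowed from the question, everything else is assumed):

```python
import lxml.html

# Invented sample markup for illustration only.
html = '''
<div>
  <div class="my-item"><a href="/post/1">First</a></div>
  <div class="my-item"><a href="/post/2">Second</a></div>
</div>
'''
tree = lxml.html.fromstring(html)

# Every href under the root element, as a flat list of strings.
all_hrefs = tree.xpath('.//a/@href')

# One href per item: the relative path is evaluated from each item in turn.
per_item = []
for item in tree.xpath('.//div[@class="my-item"]'):
    found = item.xpath('.//a/@href')
    # Guard against items without a link before indexing.
    per_item.append(found[0].strip() if found else None)

print(all_hrefs, per_item)
```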

【Discussion】:

(item).xpath(".//a/@href")[0].strip() This worked, thank you sir :)