How can I translate this XPath expression to BeautifulSoup?

【Question title】: How can I translate this XPath expression to BeautifulSoup?
【Posted】: 2010-12-21 08:20:33
【Question】:

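In answers to a previous question, several people suggested that I use BeautifulSoup for my project. I have been struggling with its documentation and I just cannot parse it. Can somebody point me to the section that would let me translate this expression into a BeautifulSoup expression?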

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

The expression above is from Scrapy. I am trying to apply the regex re('/.a\w+') to the td cells with class altRow to get the links from there.
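
To make the regex part concrete, here is a minimal sketch using only Python's standard `re` module. The href list is a small hand-picked sample of values that appear in this thread, not the real page. The sketch assumes that Scrapy's `.re()` returns the matched substrings and that a compiled pattern passed to BeautifulSoup as `href=` is tested with `search()`; both are my understanding of the libraries rather than something stated in the question.

```python
import re

# Pattern from the question: "/", any one character, "a", then word characters.
pattern = re.compile(r'/.a\w+')

# Hand-picked sample hrefs (an assumption for illustration, not the real page).
hrefs = ["/cabel", "/jzhu", "/diversity/committee", "/diversity/manager"]

# Filtering by search(): keeps every href that contains a match anywhere.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)

# The matched substrings themselves, which is what a regex extraction returns.
substrings = [pattern.search(h).group(0) for h in matches]
print(substrings)
```

Note that "/diversity/manager" passes the filter because "/manager" matches inside it, so the pattern alone does not restrict results to attorney links; the `td` selection has to do that.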

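I would also appreciate pointers to any other tutorials or documentation. I could not find any.

Thanks for your help.

Edit: I am looking at this page: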

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

But if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

For some reason BeautifulSoup does not see these links, but XPath does, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".
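
One way to check, independently of any HTML parser, that the link really is in the raw source is a plain regex over the text. This is only a sketch: the `source` string is a hypothetical excerpt modeled on the two source lines quoted above (the extra anchors are added for illustration), and the pattern here is anchored to the whole attribute value, which is stricter than the search used elsewhere in this thread.

```python
import re

# Hypothetical excerpt modeled on the source lines quoted above; the last two
# anchors are added to show the other quote style and a non-match.
source = """
 <td class="altRow" valign="middle" width="34%">
 <a href='/cabel'>Abel, Christian</a>
 </td>
 <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
 <a href="/diversity/committee">Committee</a>
"""

# href= followed by either quote character, a value that matches /.a\w+ in
# full, and then the same quote character again (\1).
href_re = re.compile(r"""href=(['"])(/.a\w+)\1""")

found = [m.group(2) for m in href_re.finditer(source)]
print(found)
```

If a check like this finds the link in the downloaded text but the parsed tree does not contain it, the problem is in how the page was parsed rather than in the search expression.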

Edit: cobbal: it still does not work. But when I search for this:

>>>soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

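It returns all the links whose second character is "a", but not the lawyer names. So for some reason BeautifulSoup does not see those links (such as "/cabel"). I do not understand why.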

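【Comments】:

Have you tried using double quotes instead of single quotes: <a href="/cabel">...</a>?
As far as I can tell, BeautifulSoup is not parsing the page correctly: soup.contents gives nothing after the tag <a href="https://www.whitecasealumni.com/jsp/Front/login.jsp" target="_blank"> at the start of the document.

Tags: python xpath beautifulsoup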

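【Solution 1】:

One option is to use lxml, which supports XPath by default (I am not familiar with BeautifulSoup, so I cannot say how to do it with that).

Edit:
Try (~~untested~~ tested):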

soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)

I used the documentation at http://www.crummy.com/software/BeautifulSoup/documentation.html

soup should be a BeautifulSoup object:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
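
One difference is worth keeping in mind when comparing the two expressions: in XPath, `//td[@class="altRow"][2]` selects the second altRow cell under each parent (each row), while `findAll('td', 'altRow')[1]` takes the second altRow cell in the whole document. The sketch below emulates the XPath selection with Python 3's standard `html.parser`, purely to illustrate those semantics. It is not BeautifulSoup code, the class name `SecondAltRowLinks` and the sample rows are hypothetical, and it assumes well-formed markup with no unclosed elements.

```python
from html.parser import HTMLParser


class SecondAltRowLinks(HTMLParser):
    """Emulate //td[@class="altRow"][2]/a/@href on well-formed markup."""

    def __init__(self):
        super().__init__()
        # One entry per open element: its tag name, how many altRow <td>
        # children it has had so far, and whether it is a selected <td>.
        self.stack = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1] if self.stack else None
        selected = False
        if parent is not None:
            if tag == "td" and attrs.get("class") == "altRow":
                parent["alt_cells"] += 1
                # [2] means the second altRow cell under the same parent.
                selected = parent["alt_cells"] == 2
            elif tag == "a" and parent["selected"] and "href" in attrs:
                # /a/@href only looks at direct <a> children of that cell.
                self.hrefs.append(attrs["href"])
        self.stack.append({"tag": tag, "alt_cells": 0, "selected": selected})

    def handle_endtag(self, tag):
        # Close the innermost open element with this name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


# Hypothetical sample: two rows, each with two altRow cells.
sample = """
<table>
<tr>
 <td class="altRow"><a href="/first">First</a></td>
 <td class="altRow"><a href='/cabel'>Abel, Christian</a></td>
</tr>
<tr>
 <td class="altRow"><a href="/third">Third</a></td>
 <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
</tr>
</table>
"""

parser = SecondAltRowLinks()
parser.feed(sample)
print(parser.hrefs)
```

On this sample the emulation yields one link per row, whereas indexing a document-wide list with `[1]` would yield only the link from the first row. Whether that matters depends on how the real page lays out its altRow cells.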

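【Comments】:

I am not looking forward to this Windows install (codespeak.net/lxml/installation.html) if I can avoid it. Otherwise lxml looks much better than BeautifulSoup, documentation-wise.
From the BS docs: "Here are some ways to navigate the soup: soup.contents[0].name # u'html'". When I try that, I get: soup.contents[0].name Traceback (most recent call last): File "", line 1, in soup.contents[0].name File "C:\Python26\BeautifulSoup.py", line 427, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) AttributeError: 'NavigableString' object has no attribute 'name'

【Solution 2】:

I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape a few substrings out of some HTML, and pyparsing has some useful methods for doing that. Using this code: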

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova

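【Comments】:

I will definitely try pyparsing. It makes more sense to me than BeautifulSoup.

【Solution 3】:

I just answered this on the Beautiful Soup mailing list, in reply to the email Zeynel sent to the list. Basically, there is an error in the web page that completely kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.

The thread is in the Google Groups archive.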

【Comments】:

【Solution 4】:

You appear to be using BeautifulSoup 3.1.

I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).

I just tested with 3.0.7 and got the results you expect:

>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]

Testing with BeautifulSoup 3.1 gives the results you are seeing. There is probably a malformed tag in the HTML, but I did not spot it in a quick look.

【Comments】: