【Posted】: 2015-12-08 15:20:36
【Question】:
I have a small problem. I am using Python 2.7.8 and trying to extract the text that comes before each <br>. My HTML looks like this:
<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>
<p>1. C99 standard guarantees uniqueness of ____ characters for internal names.<br>
a) 31<br>
b) 63<br>
c) 12<br>
d) 14</p>
<p> more </p>
<p>2. C99 standard guarantess uniqueness of _____ characters for external names.<br>
a) 31<br>
b) 6<br>
c) 12<br>
d) 14</p>
</div>
</body>
</html>
The code I have tried so far picks up the text after a <br> rather than the text before it. Here is the code:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
soup2 = BeautifulSoup(htmls)
for br2 in soup2.findAll('br'):
    next = br2.previousSibling
    if not (next and isinstance(next,NavigableString)):
        continue
    next2 = next.previousSibling
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')
The output I get is:
Found:
a) 31
Found:
b) 63
Found:
c) 12
Found:
d) 14
a) 31
Found:
b) 6
Found:
c) 12
Found:
d) 14
Found:
Any idea what I am doing wrong?
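For reference, in the loop above the check that `next2` is a `br` means only text sitting between two <br> tags is printed, so the question line itself (which has no <br> before it) is skipped. Below is a minimal sketch of one possible way to collect the text in front of every <br>. It is not the BeautifulSoup 3 API used in the question: it assumes Python 3 and swaps in the standard-library `html.parser` module. The class name `TextBeforeBr` and the `sample` string are hypothetical, made up for this illustration.

```python
from html.parser import HTMLParser


class TextBeforeBr(HTMLParser):
    """Hypothetical helper: collect the text directly preceding each <br>."""

    def __init__(self):
        super().__init__()
        self.buffer = []  # text pieces seen since the last tag
        self.found = []   # one entry per <br> that had text before it

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            text = "".join(self.buffer).strip()
            if text:
                self.found.append(text)
        # Any opening tag (br, p, div, ...) starts a fresh text segment.
        self.buffer = []

    def handle_endtag(self, tag):
        # Text followed by a closing tag is not "before a <br>", so drop it.
        self.buffer = []


# Sample input: the first question paragraph from the HTML in the post.
sample = (
    "<p>1. C99 standard guarantees uniqueness of ____ characters "
    "for internal names.<br>\n"
    "a) 31<br>\n"
    "b) 63<br>\n"
    "c) 12<br>\n"
    "d) 14</p>"
)

parser = TextBeforeBr()
parser.feed(sample)
parser.close()

for text in parser.found:
    print("Found:", text)
```

With this sketch the question line and options a) to c) are reported, while `d) 14` is not, because it is followed by `</p>` rather than by a <br>. If only the question line is wanted, taking the first entry collected per paragraph would be one way to narrow it down.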
【Comments】:
-
Anyone??? I am still trying but failing...
-
Well, the list is not in there. Could you explain what you mean?
Tags: python html beautifulsoup html-parsing