【Posted】: 2016-06-21 13:06:18
【Question】:
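I'm pulling lists from web pages, and to give them context I also pull the text that immediately precedes each one. Pulling the tag that comes right before the <ul> or <ol> seems like the best way to do this. So suppose I have this list:
I want to pull out the bullet and the word "Millennials". I'm using these BeautifulSoup functions: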
# pull <ul> tags
# (webpage is the BeautifulSoup object for the parsed page)
def pull_ul(tag):
    return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a
ul_tags = webpage.find_all(pull_ul)
# find text immediately preceding any <ul> tag and prepend it to the <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]
When I print ul_with_context, I get the following:
['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']
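As you can see, "Millennials" was not pulled out. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/ and this is the markup for the bullet:
The <p> and <ul> tags are siblings. Any idea why it isn't pulling the tag that contains the word "Millennials"?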
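One thing that may explain the output above: it starts with '\n', which suggests that previous_sibling is returning the whitespace text node sitting between the closing </p> and the <ul>, not the <p> tag itself. Below is a minimal sketch of that behavior. The HTML string is a made-up stand-in for the real page (the actual markup is not reproduced here), and using find_previous_sibling("p") is one possible workaround, not necessarily the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: a <p> and a <ul>
# that are siblings, separated by a newline as in typical source HTML.
html = "<p>Millennials</p>\n<ul>\n<li>First bullet</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first_ul = soup.find("ul")

# .previous_sibling walks to the nearest node of any kind, including text nodes.
print(repr(first_ul.previous_sibling))  # the newline text node, not the <p>

# find_previous_sibling("p") skips text nodes and returns the nearest earlier <p> sibling.
print(first_ul.find_previous_sibling("p"))

# Same list comprehension as in the question, but using the tag-aware lookup.
# "or ''" guards against a <ul> that has no preceding <p> sibling.
ul_with_context = [
    str(u.find_previous_sibling("p") or "") + str(u)
    for u in soup.find_all("ul")
]
print(ul_with_context[0])
```

Note that find_previous_sibling("p") returns the closest earlier <p> at the same level even if other tags sit in between, so it only gives the "immediately preceding" text when the <p> really is the tag right before the list.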
【Discussion】: