Extracting specific lines from a website

【Question title】: Extracting Specific Lines From a Website
【Posted】: 2016-01-11 00:58:08
【Question】:

</span>
                    <div class="clearB paddingT5px"></div>
                    <small>
                        10/12/2015 5:49:00 PM -  Seeking Alpha
                    </small>
                    <div class="clearB paddingT10px"></div>

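Suppose I have the source code of a website, and part of it looks like this. I am trying to pull out the line between <small> and </small>. There are many such lines throughout the page, each sandwiched between <small> and </small>, and I want to extract all of them.

I am trying to use a regular expression that looks like this: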

regex = '<small>(.+?)</small>'
datestamp = re.compile(regex)
urls = re.findall(datestamp, htmltext)

This returns only a blank. Please advise.

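【Comments】:

Why are you trying to parse HTML with a regular expression? Use an HTML parser!
Use (.+) instead. Your regex is lazy.
BeautifulSoup's select or find_all methods are more efficient.
I agree with jonrsharpe, though. Take a look at this answer: stackoverflow.com/a/1732454/5388440
Maybe this? clips.ua.ac.be/pages/pattern-web#DOM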

Tags: python regex web-scraping beautifulsoup

【Solution 1】:

Here are two ways to solve this.

First, using a regular expression (not recommended):

import re

html = """</span>
    <div class="clearB paddingT5px"></div>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>"""

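# \s* absorbs the newlines and indentation around the text;
# re.S lets "." span newlines in case the content covers several lines
for item in re.findall(r'<small>\s*(.*?)\s*</small>', html, re.I | re.S):
    print('"{}"'.format(item))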

Second, using something like BeautifulSoup to parse the HTML for you:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("small"):
    print('"{}"'.format(item.text.strip()))

Both give the following output:

"10/12/2015 5:49:00 PM -  Seeking Alpha"
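For reference, the pattern in the question most likely comes back empty because "." does not match a newline by default, and in the sample the text sits on its own line between the tags. A minimal Python 3 sketch of that behaviour, reusing the sample markup (the names question_pattern and dotall_pattern are made up for illustration):

```python
import re

html = """</span>
    <div class="clearB paddingT5px"></div>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>"""

# The pattern from the question: "." stops at newlines, so nothing matches here.
question_pattern = re.compile(r"<small>(.+?)</small>")
print(question_pattern.findall(html))

# With re.DOTALL, "." also matches newlines; strip() removes the padding.
dotall_pattern = re.compile(r"<small>(.+?)</small>", re.DOTALL)
matches = [m.strip() for m in dotall_pattern.findall(html)]
print(matches)
```

The lazy quantifier is still needed with re.DOTALL. Without it, a page containing several <small> elements would be captured as one long match running from the first opening tag to the last closing tag.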

【Comments】:

【Solution 2】:

Use xml.etree here. With it, you can fetch a page's HTML with urllib2 and pull out whichever tag you want, like this:

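import urllib.request  # Python 3 replacement for urllib2
from xml.etree import ElementTree

url = "whateverwebpageyouarelookingin"  # placeholder: replace with the page URL
request = urllib.request.Request(url, headers={"Accept": "application/xml"})
with urllib.request.urlopen(request) as u:
    # ElementTree only accepts well-formed XML/XHTML; ordinary HTML may raise ParseError
    tree = ElementTree.parse(u)
rootElem = tree.getroot()
# ".//small" searches every descendant; a bare "small" matches direct children only
yourdata = rootElem.findall(".//small")
print([el.text for el in yourdata])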
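Two caveats apply to this approach: findall("small") only looks at direct children of the root, and ElementTree rejects markup that is not well-formed XML. The Python 3 sketch below illustrates both without touching the network. The fragment in xml_text is a hypothetical, cleaned-up version of the sample, wrapped in a single root element with one nested <small> added for illustration.

```python
from xml.etree import ElementTree

# Hypothetical well-formed fragment (not the real page), with a single root element.
xml_text = """<div>
    <div class="clearB paddingT5px"></div>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <p><small>nested</small></p>
</div>"""

root = ElementTree.fromstring(xml_text)

# "small" matches direct children only; ".//small" walks all descendants.
direct = [el.text.strip() for el in root.findall("small")]
everywhere = [el.text.strip() for el in root.findall(".//small")]
print(direct)
print(everywhere)

# The snippet in the question begins with a stray </span>, so it is not
# well-formed XML and the parser refuses it.
try:
    ElementTree.fromstring("</span><small>text</small>")
    parsed = True
except ElementTree.ParseError:
    parsed = False
print(parsed)
```

For pages that are ordinary HTML rather than XHTML, an HTML parser such as BeautifulSoup (as in Solution 1) is usually the safer choice.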

【Comments】: