Unable to define a regular expression for re.compile and pass it to BeautifulSoup

【Question title】: Unable to define regular expression for re.compile and pass it to Beautifulsoup
【Posted】: 2016-02-24 15:27:52
【Question】:

I am currently practicing the basics of accessing the web with Python. I was following a tutorial on YouTube and ended up with the following code.

from urllib2 import urlopen,  HTTPError
from BeautifulSoup import BeautifulSoup
import re


url="http://getbusinessreviews.org/"
try:
   webpage = urlopen(url).read
except HTTPError, e:  
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
pathFinderTitle = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')
if  webpage:
    if pathFinderTitle:
        findPathTitle = re.findall(pathFinderTitle,webpage)
    else:
        print "unable to get path finder title"

else:
    print "unable to url open "
listIterator =[]
listIterator[:]= range(2,10)

for i in listIterator:
    print findPathTitle[i]

I want to extract "Nutracoster" from the following HTML:

        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>

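I have two questions:

I currently get no results. Can anyone tell me what I am doing wrong? (I suspect my regular expression is not well defined.)
How can I pass this regular expression to BeautifulSoup?

I am still learning, so thanks in advance for bearing with any silly mistakes :D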

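【Comments】:

Answering your question 3: yes. for pathTitle in findPathTitle: ... I suggest learning the Python basics first, before diving into complex topics such as HTML parsing and regular expressions.
Agree with @Jasper. If you want to learn web scraping, I would learn BeautifulSoup first, without regular expressions, because it is easier to debug and understand one new concept than two.
Thanks for the advice, I really appreciate it, but unfortunately this task was assigned to me by my team lead with a short deadline. I need to create a script that scrapes the site above and saves its posts to a CSV file.
P.S. I have already solved part 3 myself, and I do know some basic Python :)
"I need to create a script that scrapes the site above and saves its posts to a CSV file" does not require regular expressions.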

Tags: regex python-2.7 beautifulsoup

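【Solution 1】:

Beautiful Soup does not need regular expressions to select elements: it can extract all <h2> tags with a given attribute on its own.

Also, it is better not to parse HTML with regular expressions (see popular question).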
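To illustrate why regular expressions are fragile here, the following sketch (Python 3, standard library only) runs the pattern from the question against the HTML fragment from the question. The pattern expects `<h2>`, `<a>` and `</h2>` to sit next to each other on one line, whereas the fragment has line breaks and indentation between them, so it finds nothing. The `tolerant` pattern is an illustrative variant, not part of the original answer, and it would break again as soon as the attribute order or the markup changes.

```python
import re

# The fragment from the question, including its line breaks and indentation.
html = """
        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>
"""

# Pattern from the question: no whitespace is allowed between the tags.
original = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')

# Illustrative variant (an assumption, not the answer's method):
# \s* tolerates whitespace between tags, [^"]* stays inside the href value,
# and the non-greedy group stops at the first </a>.
tolerant = re.compile(
    r'<h2 class="entry-title">\s*<a href="[^"]*" rel="bookmark">(.*?)</a>\s*</h2>'
)

original_matches = original.findall(html)
tolerant_matches = tolerant.findall(html)

print(original_matches)  # []
print(tolerant_matches)  # ['Nutracoster']
```

Separately, the code in the question assigns `urlopen(url).read` without calling it, so `webpage` holds a method rather than the page content; `urlopen(url).read()` would be needed before any pattern could be applied to the text.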

Try this small snippet:

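from bs4 import BeautifulSoup as BS
from urllib2 import urlopen, HTTPError, URLError

url = "http://getbusinessreviews.org/"
try:
    webpage = urlopen(url)
except HTTPError as e:
    # The server answered with an error status; annotate 404s and re-raise.
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
except URLError as e:
    # The server could not be reached; re-raise so that the code below
    # never runs with an undefined webpage.
    print e.args
    raise

# The 'lxml' parser requires the lxml package to be installed.
soup = BS(webpage, 'lxml')

## Relevant lines ##
# Select every <h2 class="entry-title"> and print its text content.
for h2 in soup.find_all("h2", attrs={"class": "entry-title"}):
    print h2.text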
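
On the second question (passing a regular expression to Beautiful Soup): `find_all` also accepts a compiled pattern as an attribute filter. The sketch below is a minimal Python 3 illustration that parses an inline copy of the fragment from the question instead of fetching the site, and it uses the built-in `html.parser` so that lxml is not required. The second `<h2>` with class `widget-title` is a hypothetical element added only to show that the filters skip it.

```python
import re

from bs4 import BeautifulSoup

# Inline copy of the fragment from the question, plus one hypothetical
# heading (widget-title) that the filters below should ignore.
html = """
<h2 class="entry-title">
    <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>
</h2>
<h2 class="widget-title">Recent posts</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact class match, equivalent to attrs={"class": "entry-title"}.
entries = soup.find_all("h2", class_="entry-title")
titles = [h2.get_text(strip=True) for h2 in entries]

# The same selection with a compiled regular expression as the class filter.
regex_titles = [
    h2.get_text(strip=True)
    for h2 in soup.find_all("h2", class_=re.compile(r"^entry-"))
]

# The link target of each entry, taken from the first <a> inside the <h2>.
links = [h2.a["href"] for h2 in entries if h2.a is not None]

print(titles)        # ['Nutracoster']
print(regex_titles)  # ['Nutracoster']
print(links)         # ['http://getbusinessreviews.org/nutracoster/']
```

Here the pattern only has to describe an attribute value, not the surrounding markup, so whitespace and line breaks between tags no longer matter.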

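【Comments】:

Thank you very much. I really appreciate your effort. You saved me from putting my effort in the wrong direction. Thanks a lot!!!
@NightGale: Glad to hear that. If you find my answer satisfactory, please accept it, or explain what is missing so that I can expand my answer.
No problem, happy coding!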