Python 2.7 BeautifulSoup, email scraping (answers)

【Question title】: Python 2.7 BeautifulSoup, email scraping
【Posted】: 2017-02-01 08:23:00
【Question】:

Hope you are all well. I am new to Python and am using Python 2.7.

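I am trying to extract only the mailto links from this public business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The emails I am looking for are the ones listed in each widget across the full directory, from A to Z. Unfortunately, the directory has no API. I am using BeautifulSoup, but with no success so far.
Here is my code: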

import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)

tags = soup('a') 

for tag in tags:
    print tag.get('href', None)

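All I get is the site's own URL, such as http://www.tecomdirectory.com, and other hrefs, rather than the mailto links or websites inside the widgets. I also tried replacing soup('a') with soup('target'), but no luck! Can anyone help me?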

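【Comments on the question】:

Hi! Thanks for the reply! I saw "php" in the URL, so I thought some PHP might be involved. If not, sorry! I am still new to coding. Regards
Hello, could you please confirm that no PHP is involved, so that I can edit the question and remove the php tag?

Tags: python python-2.7 web-scraping beautifulsoup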

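【Solution 1】:

You can't just find every anchor; you need to look specifically for "mailto:" in the href. You can use the CSS selector a[href^=mailto:] to find the anchor tags whose href starts with mailto: :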

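import requests
from bs4 import BeautifulSoup

# Fetch the directory page (pass the URL to requests) and parse the response body.
soup = BeautifulSoup(requests.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content, "html.parser")

# Collect the href of every anchor whose href starts with "mailto:".
# On BeautifulSoup 4.7+ the value must be quoted: a[href^="mailto:"]
print([a["href"] for a in soup.select("a[href^=mailto:]")])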

Or, to extract the link text instead:

print([a.text for a in soup.select("a[href^=mailto:]")])

With find_all("a"), you need a regular expression to achieve the same thing:

import re

soup.find_all("a", href=re.compile(r"^mailto:"))
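
As a minimal offline sketch of the regex approach, the snippet below runs find_all against a small inline HTML sample instead of the live site. The markup, the addresses, and the helper name extract_emails are all made up for illustration; the real directory page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one directory page.
SAMPLE_HTML = """
<div class="widget">
  <a href="http://www.example.com">Example Co</a>
  <a href="mailto:info@example.com">info@example.com</a>
</div>
<div class="widget">
  <a href="mailto:sales@example.org?subject=Hello">Contact sales</a>
  <a name="no-href">anchor without href</a>
</div>
"""


def extract_emails(html):
    """Return the addresses found in mailto: links, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    emails = []
    # Anchors without an href, or with a non-mailto href, are skipped.
    for a in soup.find_all("a", href=re.compile(r"^mailto:")):
        # Drop the "mailto:" scheme and any "?subject=..." query part.
        address = a["href"][len("mailto:"):].split("?", 1)[0]
        emails.append(address)
    return emails


emails = extract_emails(SAMPLE_HTML)
print(emails)
```

Stripping the scheme and query string is an extra step not in the answer above; leave it out if you want the raw href values.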

【Comments】:

I modified the code: ' import urllib import requests from bs4 import BeautifulSoup website = 'www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search' html = urllib.urlopen('http://'+ website).read() soup = BeautifulSoup(requests.get(html).content) tags = soup('a') for tag in tags: print([a["href"] for a in soup.select("a[href^=mailto:]")]) ' but I am getting an error. The traceback ends with: requests.exceptions.InvalidSchema!
Yes, because you are passing the HTML to requests. Pass the URL and forget urllib, or just use urllib and forget requests.
Hi Padraic! Thanks for your patience. I modified the code, removed urllib, and passed the URL to requests. Here is the code: ' import requests from bs4 import BeautifulSoup soup = BeautifulSoup(requests.get('tecomdirectory.com/…) tags = soup('a') for tag in tags: print([a["href"] for a in soup.select("a[href^=mailto:]")]) ' but I get empty lists printed.
Open the URL in a browser and you will see why.
FYI, if you are using BeautifulSoup 4.7+, the selected answer will not work because it raises a SelectorSyntaxError. In 4.7+ you need to quote the attribute value, because : is not part of a valid CSS identifier.
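
To illustrate the last comment, here is a small sketch of the quoted form of the selector, run against an inline sample rather than the live site. The HTML and the addresses are invented for the example; only the selector syntax is the point.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; not taken from the real directory.
QUOTED_SAMPLE_HTML = """
<ul>
  <li><a href="mailto:hello@example.com">Say hello</a></li>
  <li><a href="http://www.example.com">Website</a></li>
  <li><a href="mailto:jobs@example.net"> jobs@example.net </a></li>
</ul>
"""

quoted_soup = BeautifulSoup(QUOTED_SAMPLE_HTML, "html.parser")

# The attribute value is quoted because ":" cannot appear in a bare
# CSS identifier.
mailto_links = quoted_soup.select('a[href^="mailto:"]')

hrefs = [a["href"] for a in mailto_links]
texts = [a.get_text(strip=True) for a in mailto_links]

print(hrefs)
print(texts)
```

Using single quotes for the Python string keeps the double quotes inside the selector readable without escaping.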