Crawl only internal links from a website using Python

【Question title】: Crawl only internal links from a website using python
【Posted】: 2019-06-06 02:38:30
【Question description】:

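I'm trying to write a crawler that only processes a website's internal links. I'm using Python 2.7, BeautifulSoup and requests, and I need all internal links (both absolute and relative).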

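My client asked me for a website crawler, but I want it to crawl internal links only. I need it to ignore jpg/png/gif and other non-page URLs, so that it only processes pages.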

import requests
from bs4 import BeautifulSoup

def processUrl(url):
    if not url in checkedUrls:
        try:
            # Only fetch the body when the server reports an HTML page.
            if 'text/html' in requests.head(url).headers['Content-Type']:
                req=requests.get(url)
                if req.status_code==200:
                    print url
                    checkedUrls.append(url)
                    html=BeautifulSoup(req.text,'html.parser')
                    pages=html.find_all('a')
                    for page in pages:
                        # Follows every link, internal or external.
                        url=page.get('href')
                        processUrl(url)
        except:
            pass

checkedUrls=[]
url='http://sampleurl.com'
processUrl(url)

【Comments】:

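There isn't really a question here; it reads more like "can you check my code?", which you could also post elsewhere. If you have an actual URL or code that produces output, and you can show that the output is not what you want, please provide it. Have you tested it with a URL? Does it not work? How does it not work? Also, I'm not familiar with it, but I often see people mention Scrapy when they try to do this. More examples here
You just need to add some extra logic before you start crawling to check whether the domain is the same. If it is not, just return to the caller.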
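
The check suggested in the last comment could look roughly like this. It is a minimal sketch, not the commenter's code: `is_internal` is a hypothetical helper name, and it assumes that "same domain" means the URL's hostname equals the site's hostname or is a subdomain of it. The import falls back to the `urlparse` module on Python 2.7.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def is_internal(url, domain):
    """Return True if url's hostname is domain or a subdomain of it."""
    host = urlparse(url).hostname  # lower-cased host without port, or None
    if host is None:
        return False  # relative links and malformed URLs have no hostname
    domain = domain.lower()
    return host == domain or host.endswith('.' + domain)


print(is_internal('http://sampleurl.com/about', 'sampleurl.com'))  # True
print(is_internal('https://socialnetworksite.com/link?site=sampleurl.com',
                  'sampleurl.com'))                                # False
```

Relative links have no hostname, so they would need to be resolved against the page URL (for example with `urljoin`) before a check like this is applied.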

Tags: python-2.7 beautifulsoup python-requests web-crawler

【Solution 1】:

Here is your code, plus the logic I described in my comment above.

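import requests
from bs4 import BeautifulSoup

def processUrl(url, domain, checkedUrls=None):
    # Create the list on the first call instead of sharing a mutable default.
    if checkedUrls is None:
        checkedUrls = []

    # Skip missing hrefs and URLs that do not contain the domain string.
    if not url or domain not in url:
        return checkedUrls

    if url not in checkedUrls:
        try:
            # Only fetch the body when the server reports an HTML page.
            if 'text/html' in requests.head(url).headers['Content-Type']:
                req = requests.get(url)
                if req.status_code == 200:
                    print url
                    checkedUrls.append(url)
                    html = BeautifulSoup(req.text, 'html.parser')
                    pages = html.find_all('a')
                    for page in pages:
                        # Pass the domain and the shared list down the recursion.
                        processUrl(page.get('href'), domain, checkedUrls)
        except Exception:
            pass

    return checkedUrls


checkedUrls = []
domain = 'sampleurl.com'
url = 'http://sampleurl.com'
checkedUrls = processUrl(url, domain, checkedUrls)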
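
The question also asks for relative links and for image URLs to be skipped. A substring check on the domain does not cover either case, since a relative href such as `/about` does not contain the domain string. One possible way to normalize hrefs before recursing is sketched below; `internal_links`, `SKIP_EXTENSIONS`, and the sample hrefs are illustrative names and data, not part of the original answer.

```python
try:
    from urllib.parse import urljoin, urldefrag, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urldefrag, urlparse      # Python 2.7

# Assumed list of file extensions to ignore; extend as needed.
SKIP_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep same-host http(s) pages."""
    site_host = urlparse(page_url).hostname
    result = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        absolute = urldefrag(urljoin(page_url, href))[0]  # drop "#fragment"
        parts = urlparse(absolute)
        if parts.scheme not in ('http', 'https'):
            continue  # skips mailto:, javascript:, etc.
        if parts.hostname != site_host:
            continue  # external link
        if parts.path.lower().endswith(SKIP_EXTENSIONS):
            continue  # image file, not a page
        if absolute not in result:
            result.append(absolute)
    return result


hrefs = ['/about', 'contact.html#top', 'http://sampleurl.com/logo.png',
         'https://socialnetworksite.com/link?site=sampleurl.com',
         'mailto:me@sampleurl.com', None]
print(internal_links('http://sampleurl.com/index.html', hrefs))
```

The extension filter only avoids some obviously unnecessary requests; the `Content-Type` check from the `HEAD` request remains the more reliable test of whether a URL is an HTML page.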

【Comments】:

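I've already tried that. It fails when it finds social network links that reference the domain.
I'm not aware of any Selenium configuration option that lets you automatically crawl within the same domain. It sounds like you need more specialized handling for the social network case. If checking the domain doesn't work, then maybe you could try checking the IP address.
As a side note, your social network links probably fail because your domain is included as a query parameter in the hyperlink, i.e. https://socialnetworksite.com/link?site=sampleurl.com. If that is the case, you only need to search for the full string http://sampleurl.com. Special characters in URLs are usually escaped, so your string search would not match those.
NameError: global name 'checkedUrls' is not defined
@Mostafa Thanks for pointing that out. The variable was not given global scope. I've updated the function signature.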
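
To make the comment about the social network link concrete, the following sketch compares the two substring searches on the example link. The percent-encoded variant is an assumption about how such a share link might be escaped; it is built here with the standard library's `quote`.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2.7

domain = 'sampleurl.com'
full = 'http://sampleurl.com'

# The example share link from the comment: the domain appears as a query value.
plain_link = 'https://socialnetworksite.com/link?site=sampleurl.com'
# Hypothetical variant where the whole site URL is passed and percent-encoded.
encoded_link = 'https://socialnetworksite.com/link?site=' + quote(full, safe='')

print(encoded_link)
print(domain in plain_link)    # True: the bare domain matches the external link
print(full in plain_link)      # False: the full URL does not
print(domain in encoded_link)  # True: the bare domain still matches
print(full in encoded_link)    # False: "://" was escaped to "%3A%2F%2F"
```

Searching for the full `http://sampleurl.com` avoids both false positives here, but it would still match an unescaped `?site=http://sampleurl.com` and would miss `https://` or `www.` variants of the site, so comparing parsed hostnames is probably the safer test.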