Find hyperlinks of a page in Python without BeautifulSoup [closed]: answers

【Question title】: Find hyperlinks of a page in python without BeautifulSoup [closed]
【Posted】: 2016-03-18 02:27:49
【Question】:

What I am trying to do is find all the hyperlinks of a web page. This is what I have so far, but it does not work.

from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the adress of webpage to find the hyperlinks")
    try:
        webpage = urlopen(address)
        link =  findHyperLinks(webpage)
        print("The hyperlinks are", link)

        webpage.close()
    except Exception as exceptObj:
        print("Error:" , str(exceptObj))

main()

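【Comments】:

Open a web browser, navigate to a page, then right-click and view the source. Then press Ctrl+F and search for <a href>. That is one of your problems.
No, I can only use urlopen; we haven't covered XPath in class yet.
What about regular expressions?
I'm not really sure. The example guide our professor gave us in class did not find hyperlinks; instead, it showed us how to find different page titles.
@JonathonReinhart It didn't show anything when I did that.

Tags: python regex web-scraping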

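【Answer 1】:

There are multiple problems in your code. One of them is that you are searching for links whose href attribute is present, empty, and the only attribute on the tag: <a href>. Real links look like <a href="...">, usually with a value and often with other attributes, so that exact string almost never appears in a page.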
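
To illustrate the point about <a href>, here is a minimal sketch that stays within plain string methods. The sample line and the helper name extract_hrefs are made up for illustration, and it assumes double-quoted href values only:

```python
# Hypothetical sample line; real pages vary widely.
sample = ('<p>See <a class="nav" href="/about.html">About</a> and '
          '<a href="http://example.com/">Example</a></p>')

# The literal text searched for in the question is not present.
print("<a href>" in sample)  # False

def extract_hrefs(line):
    """Collect every double-quoted href value in a line using str.find only."""
    marker = 'href="'
    hrefs = []
    pos = line.find(marker)
    while pos != -1:
        start = pos + len(marker)
        end = line.find('"', start)
        if end == -1:  # unterminated quote: stop rather than guess
            break
        hrefs.append(line[start:end])
        # continue searching after the closing quote
        pos = line.find(marker, end)
    return hrefs

print(extract_hrefs(sample))  # ['/about.html', 'http://example.com/']
```

Note that this collects every match instead of returning after the first one, and that it also picks up href attributes on tags other than <a> (for example <link>), so treat it as a starting point only.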

In any case, things become much simpler and more reliable if you use an HTML parser (well, to parse HTML). An example using BeautifulSoup:

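from bs4 import BeautifulSoup
from urllib.request import urlopen

address = "http://example.com"  # placeholder; substitute the page you want to scan

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urlopen(address), "html.parser")
# href=True keeps only <a> tags that actually have an href attribute
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text())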

【Comments】:

I am not allowed to use BeautifulSoup in this code.

【Answer 2】:

Without BeautifulSoup, you can use a regular expression and a simple function.

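from urllib.request import urlopen
from urllib.parse import urljoin
import re

def find_link(url):
    response = urlopen(url)
    # Decode the bytes; str(bytes) would give the "b'...'" repr instead.
    # UTF-8 is an assumption; undecodable bytes are replaced.
    res = response.read().decode("utf-8", errors="replace")
    # Only matches when href is the first attribute: <a href="...">
    links = re.findall('(?<=<a href=")[^"]*', res)

    for x in links:
        # skip empty values and page bookmarks, like #about
        if not x or x.startswith('#'):
            continue

        # resolve relative URLs, like /about.html, against the page URL;
        # also be careful with redirects and add more flexible
        # processing, if needed
        x = urljoin(url, x)

        print(x)

find_link('http://cnn.com')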
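
If the restriction is only on third-party packages, the standard library's html.parser module may be an option. It copes with attribute order, single quotes, and upper-case tags that the regular expression above would miss. The following is a rough sketch, not part of the original answer: the class name LinkCollector, the helper collect_links, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs, value is None for bare attributes
        if tag != "a":
            return
        for name, value in attrs:
            # skip missing/empty values and page bookmarks such as #top
            if name == "href" and value and not value.startswith("#"):
                self.links.append(urljoin(self.base_url, value))

def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical sample markup
sample_html = '''<a class="x" href='/about.html'>About</a>
<a href="#top">Top</a>
<A HREF="http://example.org/page">Ext</A>
<a name="anchor">no href</a>'''

print(collect_links(sample_html, "http://example.com/index.html"))
# ['http://example.com/about.html', 'http://example.org/page']
```

For a live page, the decoded text from urlopen(url).read().decode("utf-8", errors="replace") could be passed as the html argument, with url as the base; the UTF-8 encoding is again an assumption.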

【Comments】:

Thanks, this really helped.