Python recursive function returns an extra 'None' once complete (answers)

【Question title】: Python recursive function returns extra 'None' once complete
【Posted】: 2021-08-24 07:27:06
【Question description】:

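I'm writing a web crawler for a school project. It catalogs every valid URL found on a page, and can follow each URL to the next page and do the same there, up to a set number of layers.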

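Quick code intent:

The function takes a BeautifulSoup object, a url (indicating where it started), the current layer count, and the maximum layer depth
It checks the page for all href attributes
Each time an href is found that contains a valid url (one that starts with http, https, HTTP, or HTTPS; I know this may not be the perfect check, but for now it's what I'm working with), it is appended to the 'results' list
Each time a valid URL is found, the layer is incremented by 1 and recursiveLinkSearch() is called again
When the layer limit is reached, or no hrefs remain, the results list is returned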
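
The URL check described in the list above could be made somewhat stricter with the standard library. The following is only a sketch under assumptions: the helper name `is_http_url` is hypothetical and does not appear in the original code, and requiring a non-empty host is a design choice rather than something the question asks for.

```python
from urllib.parse import urlparse


def is_http_url(href):
    """Return True if href is an absolute http(s) URL with a host part."""
    parsed = urlparse(href)
    # urlparse lowercases the scheme, so "HTTP://..." is accepted as well.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://example.com/page"))    # True
print(is_http_url("mailto:someone@example.com"))  # False
print(is_http_url("/relative/path"))              # False
```

Unlike a check on the first four characters, this rejects strings such as "httpfoo" that merely begin with the letters "http".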

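I'm very out of practice with recursion, and I'm running into a problem where Python adds 'None' to the 'results' list at the end of the recursion.

This link [https://stackoverflow.com/questions/61691657/python-recursive-function-returns-none] suggests it may have to do with where I exit the function. Because of the nested for loop, I'm also not sure the recursion is running properly.

Any help or insight on recursion exit strategies is much appreciated.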

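import requests
from bs4 import BeautifulSoup

# NOTE: `colors` is the asker's own helper (not shown in the post),
# presumably holding terminal color codes.


def curlURL(url):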
    # beautify with BS
    soup = BeautifulSoup(requests.get(url, timeout=3).text, "html.parser")
    return soup


def recursiveLinkSearch(soup, url, layer, depth):
    results = []
    # for each 'href' found on the page, check if it is a URL
    for a in soup.find_all(href=True):
        try:
            # for every href found, check if contains http or https
            if any(stringStartsWith in a.get('href')[0:4] for stringStartsWith in ["http", "https", "HTTP", "HTTPS"]) \
                    and a.get('href') != url and layer < depth:

                print(f"Found URL: {a.get('href')}")
                print(f"LOG: {colors.yellow}Current Layer: {layer}{colors.end}")
                results.append(a.get('href'))
                # BUG: adds an extra "None" type to the end of each list
                results.append(recursiveLinkSearch(curlURL(a.get('href')), a.get('href'), layer+1, depth))
        # Exceptions Stack
        except requests.exceptions.InvalidSchema:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Invalid Url Detected{colors.end}")
        except requests.exceptions.ConnectTimeout:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Connection Timeout. Passing...")
        except requests.exceptions.SSLError:
            print(f"{a.get('href')}")
            print(f"{colors.bad}SSL Certificate Error.  Passing...")
        except requests.exceptions.ReadTimeout:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Read Timeout.  Passing...")
    # exit recursion
    if results != []:
        print(f"LOG: {results[-1]}")
        return results

【Question comments】:

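If no values are added to the results list, your recursiveLinkSearch function doesn't explicitly return anything, which means it implicitly returns None. Your `if results != []` test is the problem. The code needs to return something in every case (or the calling code needs to be prepared to test for None and not append it to the results list). Also, note the difference between list append and extend. You may want extend.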
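
Both points in this comment can be seen in a small standalone sketch. The function names `collect` and `collect_fixed` are hypothetical and only mimic the exit logic of the question's code; they are not part of it.

```python
def collect(items):
    """Mimics the question's exit logic: returns only when the list is non-empty."""
    results = [item for item in items if item.startswith("http")]
    if results != []:
        return results
    # Falling off the end of the function returns None implicitly.


def collect_fixed(items):
    """Always returns a list, even when it is empty."""
    return [item for item in items if item.startswith("http")]


print(collect(["mailto:a@b.c"]))        # None
print(collect_fixed(["mailto:a@b.c"]))  # []

nested = ["http://a"]
nested.append(["http://b", "http://c"])  # adds the whole list as ONE element
flat = ["http://a"]
flat.extend(["http://b", "http://c"])    # adds each element separately

print(nested)  # ['http://a', ['http://b', 'http://c']]
print(flat)    # ['http://a', 'http://b', 'http://c']
```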
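Thanks, I appreciate the insight on extend. The nested lists are meant to keep track of where each link was found. The use case: a user crawls 3 layers deep and may want to know that a tag was found at link 2 on layer 2, or link 4 on layer 3. What I'm after is a multidimensional array data structure, but if there's a better way to index this I'd love to know.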
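
One possible alternative to nested lists for this use case is a small tree of dictionaries, where each node records its URL, its layer, and its children. This is only a sketch: the names `make_node` and `find_paths` and the example URLs are hypothetical, and nothing in the thread prescribes this structure.

```python
def make_node(url, layer):
    """Create a tree node for one crawled page."""
    return {"url": url, "layer": layer, "children": []}


def find_paths(node, target, path=()):
    """Yield the chain of URLs from the root to each occurrence of target."""
    path = path + (node["url"],)
    if node["url"] == target:
        yield path
    for child in node["children"]:
        yield from find_paths(child, target, path)


# Build a tiny three-layer tree by hand.
root = make_node("https://example.com", 0)
child = make_node("https://example.com/a", 1)
grandchild = make_node("https://example.com/b", 2)
child["children"].append(grandchild)
root["children"].append(child)

print(list(find_paths(root, "https://example.com/b")))
# [('https://example.com', 'https://example.com/a', 'https://example.com/b')]
```

Each path shows through which pages a link was reached, which is the information the nested lists were meant to preserve.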

Tags: python recursion web-scraping

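【Solution 1】:

This isn't a recursion problem. At the end, under `if results != []:`, you print something and return results. Otherwise your function simply ends and returns nothing. But in Python, if you append the value of a function call that returned nothing, you get None. So whenever the results of a recursive call are empty, you get None.

You can check what you are about to append before appending it, or, if you end up with None after appending, pop() it.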
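
A minimal sketch of the first suggestion, checking the value before appending it, might look like the following. It is not the asker's final code. To keep it runnable without network access, page fetching is replaced by a hypothetical `fetch_links` callable and an in-memory `PAGES` dictionary, and the names `recursive_link_search` and `fake_fetch` are assumptions as well.

```python
def recursive_link_search(fetch_links, url, layer, depth):
    """Return the links reachable from url as nested lists; never returns None."""
    results = []
    if layer >= depth:
        return results  # an empty list, not None
    for href in fetch_links(url):
        if not href.lower().startswith(("http://", "https://")) or href == url:
            continue
        results.append(href)
        children = recursive_link_search(fetch_links, href, layer + 1, depth)
        if children:  # only append non-empty child lists
            results.append(children)
    return results  # always returned, even when empty


# Hypothetical in-memory "site": page URL -> hrefs found on that page.
PAGES = {
    "http://a": ["http://b", "/relative", "http://c"],
    "http://b": ["http://d"],
    "http://c": [],
    "http://d": [],
}


def fake_fetch(url):
    """Stand-in for downloading and parsing a page."""
    return PAGES.get(url, [])


print(recursive_link_search(fake_fetch, "http://a", 0, 2))
# ['http://b', ['http://d'], 'http://c']
```

Returning an empty list in every branch means the caller never has to test for None, and the `if children:` check keeps empty lists out of the nested result.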

【Comments】:

Exactly! Thanks for the insight. Now I can clean up the empty lists being passed back. Cheers!