【Question title】: Can't differentiate the two expressions supposed to work in the same way
【Posted】: 2019-10-24 03:24:00
【Question description】:

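A few days ago I created this post to ask how I could make my script loop so that, for each of a few links, it checks up to four times whether my defined title (which is supposed to be extracted from each link) is empty. If the title is still empty after that, the script should break out of the loop and move on to the next link to repeat the same steps.

This is how I got it working: by changing fetch_data(link) to return fetch_data(link), and by placing counter=0 outside the while loop but inside the if statement.

Rectified script: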

import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4"
]
counter = 0

def fetch_data(link):
    global counter
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        title = soup.select_one("p.tcode").text
    except AttributeError: title = ""

    if not title:
        while counter<=3:
            time.sleep(1)
            print("trying {} times".format(counter))
            counter += 1
            return fetch_data(link) #First fix
        counter=0 #Second fix

    print("tried with this link:",link)

if __name__ == '__main__':
    for link in links:
        fetch_data(link)

This is the output the above script produces (as desired):

trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4

I deliberately used a wrong selector in my script so that it always meets the condition I've defined above.

Why should I use return fetch_data(link) instead of fetch_data(link), given that most of the time the two expressions work identically?

【Comments】:

  • Side note: your explicit return here only applies to the failure case.

Tags: python python-3.x web-scraping conditional-statements


【Solution 1】:

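If the title cannot be fetched, the while loop inside your function starts a recursive call. With return fetch_data(link) this works because, whenever the counter is less than or equal to 3 (while counter<=3), the function exits on its first pass through the loop body and therefore never reaches the line below the loop that resets the counter (counter=0). Since counter is a global variable and grows by exactly 1 per recursion level, you get at most 4 levels of recursion below the initial call: once counter is greater than 3, the function no longer enters the while loop that would call fetch_data(link) again.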

fetch_data (counter=0)
  --> fetch_data (counter=1)
    --> fetch_data (counter=2)
      --> fetch_data (counter=3)
        --> fetch_data (counter=4) 
        - not go into while loop, reset counter, print url
        - return to above function
      - return to above function
    - return to above function
  - return to above function

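If you use fetch_data(link) instead, the function still starts a recursive call inside the while loop. However, it does not exit right afterwards, and the deepest call resets the counter to 0. This is dangerous: once the counter has reached 4 in one call and control returns to the while loop of the previous call, that loop does not break. It keeps starting additional recursive calls, because counter is now 0, which is <= 3, i.e.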

fetch_data (counter=0)
  --> fetch_data (counter=1)
    --> fetch_data (counter=2)
      --> fetch_data (counter=3)
        --> fetch_data (counter=4) 
        - not go into while loop, !!!reset counter!!!, print url
        - return to above function
      - not return to above function call
      - since counter = 0, continue the while loop
        --> fetch_data (counter=1)
          --> fetch_data (counter=2)
            --> fetch_data (counter=3)
...
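To watch this runaway behaviour safely, here is a hedged, network-free sketch of the variant without return. As before, the empty title stands in for the wrong selector. `MAX_CALLS`, `TooManyCalls`, and `calls` are hypothetical additions that only exist to stop the demonstration; without them the recursion would continue until Python raises RecursionError.

```python
MAX_CALLS = 50  # hypothetical safety cap so the demonstration terminates
counter = 0
calls = 0  # hypothetical bookkeeping, not part of the original script


class TooManyCalls(Exception):
    """Raised by the demo once the cap on invocations is exceeded."""


def fetch_data(link):
    global counter, calls
    calls += 1
    if calls > MAX_CALLS:
        raise TooManyCalls(link)
    title = ""  # assumption: the selector never matches, so the title is empty

    if not title:
        while counter <= 3:
            counter += 1
            # No 'return': when the inner call comes back it has already
            # reset counter to 0, so this loop condition is true again.
            fetch_data(link)
        counter = 0


try:
    fetch_data("link-1")
    finished = True
except TooManyCalls:
    finished = False

print(finished, calls)  # False 51
```

Every call that returns normally has just set counter to 0, so any frame waiting inside the while loop always sees a true condition and can never leave it. That is why the first link is never completed.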

【Comments】:

  • This makes a lot of sense now @VietHTran. Thanks for the clear illustration.