【Posted】:2019-10-24 03:24:00
【Question】:
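A few days ago I created this post, looking for a way to make my script loop over a few links and check, up to four times per link, whether the title I defined (which should be extracted from each link) comes back empty. If the title is still empty after that, the script should break out of the loop and move on to the next link to repeat the same process.
This is how I got it working: by changing fetch_data(link) to return fetch_data(link), and by resetting counter=0 outside the while loop but inside the if statement.
Rectified script: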
import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4"
]
counter = 0

def fetch_data(link):
    global counter
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        title = soup.select_one("p.tcode").text
    except AttributeError: title = ""

    if not title:
        while counter <= 3:
            time.sleep(1)
            print("trying {} times".format(counter))
            counter += 1
            return fetch_data(link) #First fix
        counter = 0 #Second fix

    print("tried with this link:", link)

if __name__ == '__main__':
    for link in links:
        fetch_data(link)
This is the output the above script produces (as desired):
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4
I deliberately used a wrong selector in my script so that it meets the condition I've defined above.
Why should I use return fetch_data(link) instead of fetch_data(link), given that the two expressions behave the same way most of the time?
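One way to see the difference is a network-free stand-in for the recursive retry. The sketch below is an illustration only, not the script above: the function names, the explicit `attempt` parameter (used in place of the global `counter`) and the `log` list are all assumptions made for the example. It suggests that without `return`, every outer call resumes after the nested call finishes and runs the final line again, whereas `return` leaves the current call for good.

```python
def fetch_without_return(attempt, log, max_attempts=4):
    """Hypothetical stand-in: the recursive call's result is ignored."""
    if attempt < max_attempts:
        log.append("trying {} times".format(attempt))
        # No return: once the nested call finishes, this frame keeps going
        # and falls through to the final line below.
        fetch_without_return(attempt + 1, log, max_attempts)
    log.append("tried with this link")


def fetch_with_return(attempt, log, max_attempts=4):
    """Hypothetical stand-in: the recursive call ends the current frame."""
    if attempt < max_attempts:
        log.append("trying {} times".format(attempt))
        # return: nothing after this line runs in this frame.
        return fetch_with_return(attempt + 1, log, max_attempts)
    log.append("tried with this link")


log_a, log_b = [], []
fetch_without_return(0, log_a)
fetch_with_return(0, log_b)

print(log_a.count("tried with this link"))  # 5: once per frame as the calls unwind
print(log_b.count("tried with this link"))  # 1: only the deepest call reaches it
```

In the original script the effect is arguably stronger, because the global `counter` is reset to 0 in the deepest call: without `return`, each outer frame would find `counter<=3` true again and keep retrying.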
【Comments】:
- Side note: your explicit return here only applies to the failure case.
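Following on from the side note, the same retry behaviour could also be written as a plain loop, so that every path returns explicitly and no global counter or recursion is needed. This is a minimal sketch under stated assumptions, not the asker's method: `fetch_title_with_retries`, the `get_title` callable and the `delay` parameter are hypothetical names introduced here so the example runs without network access. In a real script `get_title` would wrap the `requests`/`BeautifulSoup` lookup.

```python
import time


def fetch_title_with_retries(link, get_title, max_tries=4, delay=1.0):
    """Call get_title(link) up to max_tries times.

    Returns the first non-empty title, or "" if every attempt came back empty.
    get_title is a hypothetical callable supplied by the caller.
    """
    for attempt in range(max_tries):
        title = get_title(link)
        if title:
            return title  # success path returns explicitly
        print("trying {} times".format(attempt))
        time.sleep(delay)
    print("tried with this link:", link)
    return ""  # failure path returns explicitly as well


def always_empty(link):
    # Stand-in for a scrape whose selector never matches.
    return ""


result = fetch_title_with_retries("https://example.com", always_empty, delay=0)
```

Because the attempt number lives in the `for` loop, nothing has to be reset between links.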
Tags: python python-3.x web-scraping conditional-statements