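[Posted]: 2022-08-19 16:08:50
[Question]:
I'm writing a script that scrapes a newsletter for URLs. The newsletter contains some irrelevant URLs (e.g. article links, mailto links, social links). I added some logic to remove those links, but for some reason not all of them are removed. Here is my code: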
from os import remove
from turtle import clear
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
termSheetLinks = []
for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))
for link in termSheetLinks:
    if "fortune.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "forbes.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "twitter.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
print(termSheetLinks)
When I ran it recently, this was my output, even though I tried to remove every link containing "fortune.com":
['https://fortune.com/company/blackstone-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://fortune.com/company/tpg?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://casproviders.org/asd-guidelines/', 'https://fortune.com/company/carlyle-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5', 'mailto:termsheet@fortune.com', 'https://www.afresh.com/', 'https://www.geopagos.com/', 'https://montana-renewables.com/', 'https://descarteslabs.com/', 'https://www.dealer-pay.com/', 'https://www.sequeldm.com/', 'https://pueblo-mechanical.com/', 'https://dealcloud.com/future-proof-your-firm/', 'https://apartmentdata.com/', 'https://www.irobot.com/', 'https://www.martin-bencher.com/', 'https://cell-matters.com/', 'https://www.lever.co/', 'https://www.sigulerguff.com/']
Any help would be greatly appreciated!
-
`if "fortune.com" in link in termSheetLinks:` Why are you using the second `in`?
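For context on that condition: Python chains `in` the same way it chains comparison operators, so `a in b in c` is evaluated as `(a in b) and (b in c)`. A minimal sketch of what that means here, using a made-up two-element list (`links` and `other` are illustrative names, not from the question):

```python
# Hypothetical sample data; the real list comes from the scraped page.
links = ["https://fortune.com/a", "https://www.afresh.com/"]
link = links[0]

# The chained form from the question...
chained = "fortune.com" in link in links
# ...is equivalent to two separate membership tests joined by `and`.
expanded = ("fortune.com" in link) and (link in links)
# Inside `for link in links`, the second test is normally true, so this is enough.
simple = "fortune.com" in link

# A string that matches the domain but is not an element of the list fails the chain.
other = "https://fortune.com/b"
chained_other = "fortune.com" in other in links
```

So the second `in` is redundant rather than harmful; it is not what causes links to be left behind.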
-
You are modifying the termSheetLinks list inside the for loop that iterates over it, which causes elements to be skipped.
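To illustrate the skipping, here is a small sketch on a hard-coded list (the sample URLs and the names `links`, `buggy`, `BLOCKED`, and `filtered` are assumptions for the demo, not part of the original script). It first reproduces the problem, then shows one common way around it: building a new list instead of removing from the one being iterated.

```python
# Hypothetical sample data standing in for the scraped links.
links = [
    "https://fortune.com/company/blackstone-group",
    "https://fortune.com/company/tpg",
    "https://casproviders.org/asd-guidelines/",
    "mailto:termsheet@fortune.com",
    "https://www.afresh.com/",
]

# Buggy pattern: removing from the list while iterating over it.
# After each removal the remaining items shift left, so the loop's
# internal index steps over the element that moved into the freed slot.
buggy = list(links)
for link in buggy:
    if "fortune.com" in link:
        buggy.remove(link)

# One possible fix: build a new list and leave the original untouched.
BLOCKED = ("fortune.com", "forbes.com", "twitter.com")
filtered = [
    link for link in links
    if not any(domain in link for domain in BLOCKED)
]

print(buggy)     # the second fortune.com link is still present
print(filtered)  # only links that match none of the blocked domains
```

Iterating over a copy (`for link in list(termSheetLinks):`) would also avoid the skipping, but the list comprehension keeps the filter in one place and makes it easy to add more domains.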
Tags: python html web-scraping beautifulsoup