【Question Title】: Python REGEX remove string containing substring
【Posted】: 2022-08-19 16:08:50
【Question】:

I'm writing a script that scrapes a newsletter for URLs. The newsletter contains some irrelevant URLs (e.g. article links, mailto links, social links, etc.). I added some logic to remove those links, but for some reason not all of them are being removed. Here is my code:

from os import remove
from turtle import clear
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

for link in termSheetLinks:
    if "fortune.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "forbes.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "twitter.com" in link in termSheetLinks:
        termSheetLinks.remove(link)

print(termSheetLinks)

This was my output when I last ran it, even though I'm trying to remove every link containing "fortune.com":

['https://fortune.com/company/blackstone-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://fortune.com/company/tpg?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://casproviders.org/asd-guidelines/', 'https://fortune.com/company/carlyle-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5', 'mailto:termsheet@fortune.com', 'https://www.afresh.com/', 'https://www.geopagos.com/', 'https://montana-renewables.com/', 'https://descarteslabs.com/', 'https://www.dealer-pay.com/', 'https://www.sequeldm.com/', 'https://pueblo-mechanical.com/', 'https://dealcloud.com/future-proof-your-firm/', 'https://apartmentdata.com/', 'https://www.irobot.com/', 'https://www.martin-bencher.com/', 'https://cell-matters.com/', 'https://www.lever.co/', 'https://www.sigulerguff.com/']

Any help would be greatly appreciated!

  • `if "fortune.com" in link in termSheetLinks:` — why are you using the second `in`?
  • You are actually modifying the termSheetLinks list inside the for loop, which causes entries to be skipped.
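The skip behavior the second comment describes can be reproduced with a tiny standalone sketch (the sample list here is made up for illustration):

```python
# list.remove() inside a for loop shifts the remaining items left,
# so the iterator's next index steps over the element that moved
# into the removed slot.
links = ["a-fortune.com", "b-fortune.com", "c.com"]
for link in links:
    if "fortune.com" in link:
        links.remove(link)
print(links)  # ['b-fortune.com', 'c.com'] - the second match survived
```

Note also that `"fortune.com" in link in termSheetLinks` is a chained comparison: it evaluates as `("fortune.com" in link) and (link in termSheetLinks)`, so the second `in` is redundant here.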

Tags: python html web-scraping beautifulsoup


【Solution 1】:

In my opinion this does not need regex - instead of removing the urls, append only those that do not contain your substrings, e.g. with a list comprehension:

[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a") if not any(x in companyURL.get('href') for x in ["fortune.com","forbes.com","twitter.com"])]

Example

from bs4 import BeautifulSoup
import requests

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")

myList = ["fortune.com","forbes.com","twitter.com"]
[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a") 
     if not any(x in companyURL.get('href') for x in myList)]

Output

['https://casproviders.org/asd-guidelines/',
 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5',
 'https://www.afresh.com/',
 'https://www.geopagos.com/',
 'https://montana-renewables.com/',
 'https://descarteslabs.com/',
 'https://www.dealer-pay.com/',
 'https://www.sequeldm.com/',
 'https://pueblo-mechanical.com/',
 'https://dealcloud.com/future-proof-your-firm/',
 'https://apartmentdata.com/',
 'https://www.irobot.com/',
 'https://www.martin-bencher.com/',
 'https://cell-matters.com/',
 'https://www.lever.co/',
 'https://www.sigulerguff.com/']
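Since the question title mentions regex, the same filter can also be written with a single compiled pattern. A sketch (the sample URLs below are abbreviated from the question's output):

```python
import re

# One compiled pattern matching any of the unwanted domains;
# re.search() looks for the pattern anywhere in the URL.
pattern = re.compile(r"fortune\.com|forbes\.com|twitter\.com")
urls = [
    "https://fortune.com/company/tpg?utm_source=email",
    "https://casproviders.org/asd-guidelines/",
    "mailto:termsheet@fortune.com",
]
filtered = [u for u in urls if not pattern.search(u)]
print(filtered)  # ['https://casproviders.org/asd-guidelines/']
```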

【Discussion】:

    【Solution 2】:

    Removing the links after the for loop finishes means no entries are skipped.

    from bs4 import BeautifulSoup
    import requests
    
    termSheet = "https://fortune.com/newsletter/termsheet"
    html = requests.get(termSheet)
    htmlParser = BeautifulSoup(html.text, "html.parser")
    termSheetLinks = []
    
    for companyURL in htmlParser.select("table#templateBody p > a"):
        termSheetLinks.append(companyURL.get('href'))
    
    lRemove = []
    for link in termSheetLinks:
        if "fortune.com" in link:
            lRemove.append(link)
        if "forbes.com" in link:
            lRemove.append(link)
        if "twitter.com" in link:
            lRemove.append(link)
    for l in lRemove:
        termSheetLinks.remove(l)
    
    print(termSheetLinks)
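An equivalent fix, for comparison (not part of the answer above), is to iterate over a shallow copy so that `remove()` on the original list cannot disturb the iteration:

```python
# Iterating over list(termSheetLinks) - a copy - leaves the original
# free to be mutated mid-loop without skipping entries.
termSheetLinks = ["https://fortune.com/a", "https://ok.com/", "https://twitter.com/x"]
for link in list(termSheetLinks):
    if any(d in link for d in ("fortune.com", "forbes.com", "twitter.com")):
        termSheetLinks.remove(link)
print(termSheetLinks)  # ['https://ok.com/']
```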
    

    【Discussion】:
