【Question title】: Webscraping with Beautifulsoup and Python not working
【Posted】: 2021-05-10 12:15:38
【Question description】:

I am trying to get a list of website addresses from the following page: https://www.wer-zu-wem.de/dienstleister/filmstudios.html

My code:

import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.wer-zu-wem.de/dienstleister/filmstudios.html")
src = result.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all('a', {'class': 'col-md-4 col-lg-5 col-xl-4 text-center text-lg-right'})
print(links)

import requests
from bs4 import BeautifulSoup

webLinksList = []

result = requests.get(
    "https://www.wer-zu-wem.de/dienstleister/filmstudios.html")
src = result.content
soup = BeautifulSoup(src, 'lxml')


website_Links = soup.find_all(
    'div', class_='col-md-4 col-lg-5 col-xl-4 text-center text-lg-right')


if website_Links != "":
    print("List is empty")
for website_Link in website_Links:
    try:
        realLink = website_Link.find(
            "a", attrs={"class": "btn btn-primary external-link"})
        webLinksList.append(featured_challenge.attrs['href'])
    except:
        continue

for link in webLinksList:
    print(link)

"List is empty" is printed right at the start, and I have not managed to add any data to the list.

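【Question comments】:

  • Which links on that site are you interested in?
  • You have if website_Links != "":. So whenever website_Links is anything other than an empty string, you will get List is empty. I don't think .find_all returns a string; I believe it returns a list. A list never compares equal to an empty string, so website_Links != "" will always evaluate to True.
  • You are right, but as a list it is empty too.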
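The point made in the comments can be checked without touching the network. The sketch below uses a plain list as a stand-in for the value returned by find_all (BeautifulSoup returns a ResultSet, which is a list subclass); the describe helper is a hypothetical name introduced only for this illustration.

```python
# Stand-in for what soup.find_all(...) returns when nothing matches.
# BeautifulSoup returns a ResultSet, which is a subclass of list.
no_matches = []


def describe(results):
    """Return a short message saying whether any elements were found."""
    if not results:  # an empty list is falsy, so this is the emptiness check
        return "List is empty"
    return f"Found {len(results)} element(s)"


# A list never equals a string, so this is True whether or not
# the list is empty. That is why the original check always fires.
always_true = no_matches != ""

print(always_true)
print(describe(no_matches))
print(describe(["<div>", "<div>"]))
```

Using `if not website_Links:` in place of `if website_Links != "":` would make the message appear only when nothing was actually found.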

Tags: python html css web-scraping beautifulsoup


【Solution 1】:

Try the following to get all the links that point to external websites:

import requests
from bs4 import BeautifulSoup

link = "https://www.wer-zu-wem.de/dienstleister/filmstudios.html"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

result = requests.get(link,headers=headers)
soup = BeautifulSoup(result.text,'lxml')
for links in soup.find_all('a',{'class':'external-link'}):
    print(links.get("href"))
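The class filter above works because BeautifulSoup matches a single class name against any element whose class attribute contains it, so 'external-link' also matches class="btn btn-primary external-link". The following is a minimal offline sketch of that behaviour. The markup and URLs are made up for illustration (modelled on the class names in the question, not copied from the real page), extract_external_links is a hypothetical helper, and the standard-library html.parser is used instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="col-md-4 col-lg-5 col-xl-4 text-center text-lg-right">
  <a class="btn btn-primary external-link" href="https://example.com/studio-a">Website</a>
</div>
<div class="col-md-4 col-lg-5 col-xl-4 text-center text-lg-right">
  <a class="btn btn-primary external-link" href="https://example.org/studio-b">Website</a>
</div>
<a class="internal" href="/about.html">About</a>
"""


def extract_external_links(html):
    """Collect the href of every <a> that carries the 'external-link' class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches one class among several; href=True skips anchors without href.
    anchors = soup.find_all("a", class_="external-link", href=True)
    return [a["href"] for a in anchors]


print(extract_external_links(SAMPLE_HTML))
```

Against the live site, the same function could be fed result.text from the request shown above.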

【Comments】:

    【Solution 2】:

    Try this:

    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0",
    }
    
    result = requests.get("https://www.wer-zu-wem.de/dienstleister/filmstudios.html", headers=headers)
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    links = soup.find('ul', {'class': 'wzwListeFirmen'}).find_all("a")
    print(links)
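One caveat with chaining a second call directly onto soup.find: find returns None when nothing matches, and the chained call then raises AttributeError. A minimal sketch of a guarded version follows. The helper name links_in_company_list and both HTML snippets are hypothetical; only the class name wzwListeFirmen comes from the answer above, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the list class named in the answer.
LIST_HTML = """
<ul class="wzwListeFirmen">
  <li><a href="https://example.com/studio-a">Studio A</a></li>
  <li><a href="https://example.org/studio-b">Studio B</a></li>
</ul>
"""

# Hypothetical response body that does not contain the list at all.
OTHER_HTML = "<html><body><p>Access denied</p></body></html>"


def links_in_company_list(html):
    """Return the hrefs inside <ul class="wzwListeFirmen">, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    company_list = soup.find("ul", class_="wzwListeFirmen")
    if company_list is None:  # find() returns None when there is no match
        return []
    return [a.get("href") for a in company_list.find_all("a")]


print(links_in_company_list(LIST_HTML))
print(links_in_company_list(OTHER_HTML))
```

An empty result here is a hint to inspect the response body itself before adjusting the selector.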
    

    【Comments】:

    • Thanks, I will try the external-link approach.