【Question Title】: web scraping from news articles
【Posted】: 2020-11-20 12:19:09
【Question Description】:

I have been trying to extract the links from a given news website. The code works well, except that it prints "javascript:void();" along with all the other links. What changes can I make so that "javascript:void();" no longer appears in the output? Here is the code:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

【Discussion】:

Tags: python-3.x web-scraping beautifulsoup


【Solution 1】:

If you don't want those links, simply filter them out.

Here's how:

    import requests
    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    
    resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")
    
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    
    soup = BeautifulSoup(resp.content, 'html.parser', from_encoding=encoding)
    
    for link in soup.find_all('a', href=True):
        if link["href"] != "javascript:void();":
            print(link['href'])
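Note that the exact literal can vary between pages ("javascript:void();", "javascript:void(0)", and so on), so a more defensive variant is to skip any href that uses the javascript: pseudo-URL scheme. A minimal sketch of that filter, using illustrative sample hrefs rather than a live request:

```python
# Sample hrefs as they might come back from soup.find_all('a', href=True);
# the actual values depend on the page being scraped.
hrefs = [
    "https://www.ndtv.com/coronavirus/article-1",
    "javascript:void();",
    "javascript:void(0)",
    "/world-news",
]

# Keep only links that are not javascript: pseudo-URLs,
# regardless of the exact void() spelling.
real_links = [h for h in hrefs if not h.lower().startswith("javascript:")]

for link in real_links:
    print(link)
```

This drops both "javascript:void();" and "javascript:void(0)" while keeping absolute and relative URLs intact.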
    
    

【Discussion】:

• IMO, it's cleaner to write it as a single line: if link["href"] != "javascript:void();": print(link['href'])