【Question title】: ERROR: 'NoneType' object has no attribute 'find_all'
【Posted】: 2022-01-12 20:47:12
【Question description】:

I am scraping a web page called CVE Trends:

import bs4
import requests
from pprint import pprint

LINK = "https://cvetrends.com/"
PRE_LINK = "https://nvd.nist.gov/"

response = requests.get(LINK)
response.raise_for_status()
soup = bs4.BeautifulSoup(response.text, 'html.parser')
div_tweets = soup.find('div', class_='tweet_text')

a_tweets = div_tweets.find_all('a')

link_tweets = []
for a_tweet in a_tweets:
    link_tweet = str(a_tweet.get('href'))
    if PRE_LINK in link_tweet:
        link_tweets.append(link_tweet)

pprint(link_tweets)

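This is the code I have written so far. I have tried many approaches, but it always raises the same error:

'NoneType' object has no attribute 'find_all'

Can someone help me? I really need this. Thanks in advance for any answers.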

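【Question comments】:

  • Evidently soup.find(...) returns None.
  • Thanks for your answer. I tried printing it and the output is "None". I also tried changing the tag and class, but the error is always the same.
  • Look at response.content; it does not seem to be the HTML you assume it is.

Tags: python web web-scraping nonetype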


【Solution 1】:

This happens because soup.find("div", class_="tweet_text") finds nothing, so it returns None. The site you are trying to scrape populates its content with JavaScript, so a GET request to it returns only the following:

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <title>
   CVE Trends - crowdsourced CVE intel
  </title>
  <meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="description"/>
  <meta content="trending CVEs, CVE intel, CVE trends" name="keywords"/>
  <meta content="CVE Trends - crowdsourced CVE intel" name="title" property="og:title">
   <meta content="Simon Bell" name="author"/>
   <meta content="website" property="og:type">
    <meta content="https://cvetrends.com/images/cve-trends.png" name="image" property="og:image">
     <meta content="https://cvetrends.com" property="og:url">
      <meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." property="og:description"/>
      <meta content="en_GB" property="og:locale"/>
      <meta content="en_US" property="og:locale:alternative"/>
      <meta content="CVE Trends" property="og:site_name"/>
      <meta content="summary_large_image" name="twitter:card"/>
      <meta content="@SimonByte" name="twitter:creator"/>
      <meta content="CVE Trends - crowdsourced CVE intel" name="twitter:title"/>
      <meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="twitter:description"/>
      <meta content="https://cvetrends.com/images/cve-trends.png" name="twitter:image"/>
      <link href="https://cvetrends.com/favicon.ico" id="favicon" rel="icon" sizes="32x32"/>
      <link href="https://cvetrends.com/apple-touch-icon.png" id="apple-touch-icon" rel="apple-touch-icon"/>
      <link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/5.1.0/css/bootstrap.min.css" rel="stylesheet"/>
     </meta>
    </meta>
   </meta>
  </meta>
 </head>
 <body>
  <div id="root">
  </div>
  <noscript>
   Please enable JavaScript to run this app.
  </noscript>
  <script src="https://cvetrends.com/js/main.d0aa7136854f54748577.bundle.js">
  </script>
 </body>
</html>

You can verify this with print(soup.prettify()).
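
A minimal sketch of guarding against the None result, assuming the same selector as in the question. SHELL_HTML is a shortened stand-in for the response shown above, and collect_links is a hypothetical helper name, not part of BeautifulSoup.

```python
import bs4

# Shortened stand-in for the JavaScript shell that the server returns.
SHELL_HTML = """
<html>
 <body>
  <div id="root"></div>
  <noscript>Please enable JavaScript to run this app.</noscript>
 </body>
</html>
"""


def collect_links(html, prefix="https://nvd.nist.gov/"):
    """Return hrefs inside the first div.tweet_text that contain prefix."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    div_tweets = soup.find("div", class_="tweet_text")
    if div_tweets is None:
        # find() returns None when nothing matches, so stop before find_all().
        return []
    hrefs = (a.get("href") for a in div_tweets.find_all("a"))
    return [href for href in hrefs if href and prefix in href]


print(collect_links(SHELL_HTML))  # prints []
```

The guard does not make the data appear; it only turns the AttributeError into an empty result, which makes it clearer that the static HTML simply lacks the element.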

To scrape this site, you will probably need something like Selenium.
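
One possible shape of the Selenium approach, as a sketch only. It assumes Selenium 4 with Chrome and a matching driver available locally, and it assumes that the rendered page wraps tweets in div.tweet_text as the question expects, which nothing above confirms. fetch_rendered_html and nvd_links are hypothetical helper names, and the sample markup uses a placeholder CVE id.

```python
import bs4

PRE_LINK = "https://nvd.nist.gov/"


def fetch_rendered_html(url, timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: tweets are rendered into div.tweet_text elements.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.tweet_text"))
        )
        return driver.page_source
    finally:
        driver.quit()


def nvd_links(html):
    """Collect NVD hrefs from every div.tweet_text in already rendered HTML."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="tweet_text"):
        for a in div.find_all("a"):
            href = a.get("href") or ""
            if PRE_LINK in href:
                links.append(href)
    return links


# Offline check with hand-written markup standing in for a rendered page.
SAMPLE = (
    '<div class="tweet_text">'
    '<a href="https://nvd.nist.gov/vuln/detail/CVE-0000-0000">NVD</a>'
    "</div>"
)
print(nvd_links(SAMPLE))
```

With a browser available, nvd_links(fetch_rendered_html("https://cvetrends.com/")) would run both steps together.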

【Comments】:

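    【Solution 2】:

    This happens because you are not getting the response you expect.

    https://cvetrends.com/

    This site loads its content with JavaScript, so the data is not in the response to a plain request.

    Instead of scraping the page, you can get the data from https://cvetrends.com/api/cves/24hrs.

    Here is one possible solution: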

    import requests
    from urlextract import URLExtract

    LINK = "https://cvetrends.com/api/cves/24hrs"
    PRE_LINK = "https://nvd.nist.gov/"
    link_tweets = []
    # URLExtract finds URLs embedded in free text
    extractor = URLExtract()
    # fetch the JSON response from the API endpoint
    response = requests.get(LINK, timeout=10)
    response.raise_for_status()
    # parse the JSON body into Python objects
    twitt_json = response.json()
    # fall back to an empty list if a key is missing or null
    for twitt_data in twitt_json.get('data') or []:
        for twitt in twitt_data.get('tweets') or []:
            # the text of a single tweet
            twitt_text = twitt.get('tweet_text') or ''
            # collect every URL in the text that points to NVD
            # (the links are what the question asks for, so store the URL, not the tweet text)
            for url in extractor.find_urls(twitt_text):
                if PRE_LINK in url:
                    link_tweets.append(url)
    print(link_tweets)
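
The same traversal can be checked offline by separating parsing from fetching. This is a sketch under stated assumptions: the payload shape (data, then tweets, then tweet_text) is taken from the code above, SAMPLE_PAYLOAD is made-up data with a placeholder CVE id, nvd_links_from_payload is a hypothetical helper name, and a simple regular expression stands in for URLExtract to avoid the extra dependency.

```python
import re

PRE_LINK = "https://nvd.nist.gov/"
# Simplistic URL pattern; a stand-in for URLExtract, not a full URL parser.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def nvd_links_from_payload(payload):
    """Return NVD URLs found in tweet texts, de-duplicated, in first-seen order."""
    links = []
    for entry in payload.get("data") or []:
        for tweet in entry.get("tweets") or []:
            text = tweet.get("tweet_text") or ""
            for url in URL_PATTERN.findall(text):
                # Drop punctuation that often trails a URL at the end of a sentence.
                url = url.rstrip(".,;)")
                if url.startswith(PRE_LINK) and url not in links:
                    links.append(url)
    return links


# Made-up payload that mimics the assumed API shape.
SAMPLE_PAYLOAD = {
    "data": [
        {
            "tweets": [
                {"tweet_text": "Details: https://nvd.nist.gov/vuln/detail/CVE-0000-0000."},
                {"tweet_text": "No link here"},
            ]
        },
        {"tweets": None},
    ]
}
print(nvd_links_from_payload(SAMPLE_PAYLOAD))
```

Against the live endpoint, the parsed JSON from requests.get(LINK, timeout=10).json() could be passed to the same function.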
    

    【Comments】:
