[Question title]: Getting the text from all p elements in a div with BeautifulSoup
[Posted]: 2015-12-30 14:37:02
[Question]:

I am trying to get the text (the content without the tags) of all p elements in a given div:

import requests
from bs4 import BeautifulSoup

def getArticle(url):
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c)

    article = []
    article = soup.find("div", {"class":"story-body__inner"}).findAll('p')
    for element in article:
        article = ''.join(element.findAll(text = True))
    return article

The problem is that it returns only the content of the last paragraph. But if I just use print, the code works perfectly:

    for element in article:
        print ''.join(element.findAll(text = True))
    return

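I want to call this function elsewhere, so I need it to return the text rather than just print it. I searched Stack Overflow and googled a lot, but found no answer, and I don't understand what the problem might be. I am using Python 2.7.9 and bs4. Thanks in advance!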

[Comments]:

    Tags: python-2.7 web-scraping beautifulsoup


    [Solution 1]:

    The following code should work -

    import requests
    from bs4 import BeautifulSoup
    
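    def getArticle(url):
        # hard-coded here for the example; this overrides the url argument
        url = 'http://www.bbc.com/news/business-34421804'
        result = requests.get(url)
        c = result.content
        # name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(c, 'html.parser')
    
        article_text = ''
        article = soup.find("div", {"class":"story-body__inner"}).findAll('p')
        for element in article:
            # += accumulates each paragraph instead of overwriting the previous one
            article_text += '\n' + ''.join(element.findAll(text = True))
        return article_text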
    

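    There are a couple of problems in your code -

    1. The same variable name, "article", is used to store both the elements and the text.
    2. The variable that should be returned is reassigned on each iteration instead of being appended to, so only the last value remains.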
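
Both points can be reproduced without any network access. The sketch below is only an illustration: `paragraphs` is a hypothetical list standing in for the text of each p element, and the two helper functions are made up for this example.

```python
# Hypothetical stand-in for the text of each <p> element in the div.
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]


def last_only(items):
    """Rebinds the result on every pass, so only the last item survives."""
    result = ""
    for item in items:
        result = item  # plain assignment discards the previous value
    return result


def all_joined(items):
    """Collects every item, then joins them into one string."""
    collected = []
    for item in items:
        collected.append(item)  # appending keeps the earlier values
    return "\n".join(collected)


print(last_only(paragraphs))
print(all_joined(paragraphs))
```

Collecting into a list and joining once at the end also avoids the leading newline that repeated `'\n' + ...` concatenation would produce.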

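    [Comments]:

      [Solution 2]:

      To get all the text in the article (see a CSS selectors reference, and check out the SelectorGadget extension, which gives you a CSS selector when you click the desired element in the browser):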

      for text in soup.select('.ssrcss-xalfp3-ArticleWrapper div > div > p'):
         article_text = text.text
         # other code
      

      Code:

      from bs4 import BeautifulSoup
      import requests, lxml
      
      headers = {
          'User-agent':
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
      }
      
      
      def get_news_article():
        html = requests.get("https://www.bbc.com/news/business-34421804", headers=headers)
        soup = BeautifulSoup(html.text, 'lxml')
      
        title = soup.select_one('#main-heading').text
        print(f'Title: {title}\n')
      
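        # collect every paragraph so the function returns the whole article,
        # not just the last paragraph
        paragraphs = []
        for text in soup.select('.ssrcss-xalfp3-ArticleWrapper div > div > p'):
          paragraphs.append(text.text)
          print(text.text)
          print()
      
        return '\n'.join(paragraphs)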
      
      get_news_article()
      
      ---------------
      '''
      Title: Amazon bars the sale of Apple and Google TV devices
      
      Amazon is to stop selling video-streaming TV devices from Google and Apple because they don't "interact well" with its own media service.
      
      The online retailer said it had made the decision to avoid "customer confusion" and the devices will be removed from sale by 29 October.
      
      ... other text
      '''
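
The selector-based approach can also be tried offline. This is a minimal sketch under stated assumptions: `SAMPLE_HTML` is made-up markup reusing the `story-body__inner` class from the question, `paragraph_texts` is a hypothetical helper name, and the selector would need adjusting for the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="story-body__inner">
  <p>First <b>paragraph</b>.</p>
  <p>Second paragraph.</p>
</div>
<div class="sidebar"><p>Not part of the article.</p></div>
"""


def paragraph_texts(html, selector="div.story-body__inner > p"):
    """Return the text of every element matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    # get_text() concatenates all text nodes inside the tag, dropping the markup
    return [p.get_text().strip() for p in soup.select(selector)]


print(paragraph_texts(SAMPLE_HTML))
```

Returning a list leaves it to the caller to decide whether to join the paragraphs into one string or process them one by one.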
      

      [Comments]:

        [Solution 3]:

        Try this:

            import requests
            from bs4 import BeautifulSoup
            
            def getArticle(url):
                url = 'http://www.bbc.com/news/business-34421804'
                result = requests.get(url)
                c = result.content
                soup = BeautifulSoup(c, 'html.parser')
            
                article_total = []
                # find() returns a single Tag, which cannot be indexed by position,
                # so iterate over every matching div with findAll() instead
                for div in soup.findAll("div", {"class":"story-body__inner"}):
                    for element in div.findAll('p'):
                        # append each paragraph's text rather than overwriting it
                        article_total.append(''.join(element.findAll(text = True)))
                return article_total
        

        [Comments]:
