【Question Title】: Python 3.5 BeautifulSoup4: get text from 'p' in div
【Posted】: 2017-10-14 22:02:49
【Question】:

I'm trying to extract all of the text from the div with class "caselawcontent searchable-content". This code only prints the HTML, without the page text. What am I missing to get the text?

The following link is in the 'filteredcasesdoc.txt' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html

import requests
from bs4 import BeautifulSoup

with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)

【Question Comments】:

    Tags: html python-3.x beautifulsoup python-requests


    【Solution 1】:
    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    

    I've added a more reliable .find method (key : value):

    whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})
    
    
    the_title = whole_section.center.h2
    #e.g. Missouri Court of Appeals,Southern District,Division Two.
    second_title = whole_section.center.h3.p
    #e.g. STATE of Missouri, Plaintiff-Appellant v....
    number_text = whole_section.center.h3.next_sibling.next_sibling
    #e.g.
    the_date = number_text.next_sibling.next_sibling
    #authors
    authors = whole_section.center.next_sibling
    para = whole_section.findAll('p')[1:]
    # Because we don't want the paragraph h3.p.
    # We could also do findAll('p', recursive=False), which doesn't pick up children.
    

    Basically, I've dissected the whole section. As for the paragraphs (e.g. the body text, the para variable), you have to loop over them.

    print(authors)
    # and you can add .text (e.g. print(authors.text)) to get the text without the tag.
    # or a simple function that returns only the text
    def rettext(something):
        return something.text
    # Usage: print(rettext(authors))
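    To actually print the body paragraphs, you loop over para. Here is a minimal offline sketch of that loop, using a small stand-in HTML snippet in place of the live FindLaw page (the snippet and its paragraph text are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Small stand-in for the div extracted from the live page (illustrative only)
html = """
<div class="caselawcontent searchable-content">
  <center><h3><p>STATE of Missouri, Plaintiff-Appellant v. ...</p></h3></center>
  <p>First paragraph of the opinion.</p>
  <p>Second paragraph of the opinion.</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})

# Skip the first <p> (the one nested inside h3), as above
para = whole_section.findAll('p')[1:]
for p in para:
    print(p.text)
```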
    

    【Discussion】:

    • You both helped a lot! Thank you.
    【Solution 2】:

    Try printing doctext.text. That will get rid of all the HTML tags for you.

    from bs4 import BeautifulSoup
    import requests

    cases = []

    with open('filteredcasesdoc.txt', 'r') as openfile1:
        for url in openfile1:
            # GET the HTML page as a string, with HTML tags
            # (strip the trailing newline that comes with each file line)
            rulingpage = requests.get(url.strip()).text

            soup = BeautifulSoup(rulingpage, 'html.parser')
            # find the part of the HTML page we want, as an HTML element
            doctext = soup.find('div', class_='caselawcontent searchable-content')
            print(doctext.text) # now we have the text content as a string
            cases.append(doctext.text) # do something useful with this !
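    If .text runs the pieces of text together, BeautifulSoup's get_text() lets you control the separator and strip whitespace. A minimal sketch on an inline snippet (the snippet is an illustrative assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the case-law div (illustrative only)
html = ('<div class="caselawcontent searchable-content">'
        '<p>One.</p><p>Two.</p></div>')
soup = BeautifulSoup(html, 'html.parser')
doctext = soup.find('div', class_='caselawcontent searchable-content')

# .text simply concatenates all text nodes; get_text() can join them
# with a separator and strip surrounding whitespace
print(doctext.text)                                  # One.Two.
print(doctext.get_text(separator='\n', strip=True))  # One.\nTwo.
```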
    

    【Discussion】:

    • Nice one, Theo; I've also added some other ways to look at this problem. You explained it in a simpler way, upvoted!