[Title]: BeautifulSoup / How to extract a specific paragraph of text?
[Posted]: 2021-12-30 03:08:38
[Question]:

I'm using BeautifulSoup to extract information from individual MP pages such as https://publications.parliament.uk/pa/cm/cmregmem/211115/cox_geoffrey.htm

I'd like to extract the text under each numbered bold heading (e.g. "1. Employment and earnings") and save it separately. The headings vary from MP to MP (e.g. some declare "3. Gifts, benefits and hospitality from UK sources" and some don't), so I want a script that works for any MP page.

At the moment I'm getting into a mess trying to do this with loops. I'm new to BS (and Python), so I suspect I'm missing a trick. Does anyone have any ideas?

import requests
from bs4 import BeautifulSoup

# URLs
home_url = "https://publications.parliament.uk/pa/cm/cmregmem/211101/"

# Extract the list of MP names and links, saved as tuples in mp_list
home_page = requests.get(home_url + 'contents.htm')
home_soup = BeautifulSoup(home_page.content, "html.parser")

mp_list = []
mp_elements = home_soup.find_all("p", attrs={'class': None, 'xmlns': 'http://www.w3.org/1999/xhtml'})

for mp_element in mp_elements:
    try:
        mp_name = list(mp_element.children)[1].text.strip()
        mp_url = list(mp_element.children)[1]['href']
        mp_list.append((mp_name, mp_url))
    except (IndexError, KeyError, TypeError):  # skip <p> tags that don't contain a link
        pass

# Extract text from an MP page
mp_url = home_url + mp_list[115][1]  # just picks out an example MP page to test with
print(mp_url)
mp_page = requests.get(mp_url)
mp_soup = BeautifulSoup(mp_page.content, "html.parser")
mp_text_all = mp_soup.find_all("p")

mp_text_list = []
for item in mp_text_all:
    mp_text_list.append(item.text)

Edit: I eventually came up with this. See below.

def compile_indv_mp_page_dict(text):

    ## save the constituency first (get_constituency() and headings_dict,
    ## a mapping of short keys to full heading text, are defined elsewhere)
    mp_constituency = get_constituency(text[0])

    mp_page_dict_v1 = {}
    ## mp_page_dict_v1 maps heading key -> line index, e.g. {'h1': 0, 'h8': 9, ...}
    ## (enumerate avoids text.index(), which returns the first occurrence
    ## and so mishandles duplicate lines)
    for i, line in enumerate(text):
        for h, full_heading in headings_dict.items():
            if full_heading == line:
                mp_page_dict_v1[h] = i

    ## expand each index into the range of lines up to the next heading,
    ## e.g. {'h1': [0, 1, 2, 3], 'h8': [4, 5, 6], ...}
    h_end = len(text)
    keys = list(mp_page_dict_v1.keys())
    for index, h_var1 in enumerate(keys):
        if index + 1 < len(keys):
            h_var2 = keys[index + 1]
            mp_page_dict_v1[h_var1] = list(range(mp_page_dict_v1[h_var1], mp_page_dict_v1[h_var2]))
        else:
            mp_page_dict_v1[h_var1] = list(range(mp_page_dict_v1[h_var1], h_end))

    mp_page_dict = {}
    ## mp_page_dict maps full heading -> joined body text,
    ## e.g. {'1. Employment and earnings': 'text\ntext\ntext', ...}
    for key, line_list in mp_page_dict_v1.items():
        text_list = [text[line] for line in line_list
                     if text[line] not in headings_dict.values()]
        full_heading = headings_dict[key]
        mp_page_dict[full_heading] = "\n".join(text_list)

    return mp_page_dict
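The same grouping can be done in a single pass, without the index bookkeeping, if we rely on the fact that every heading on these pages starts with a number and a dot. This is only a sketch: group_by_heading is an illustrative helper, not part of the script above.

```python
import re

def group_by_heading(lines):
    """Group a flat list of page lines under their numbered headings.

    Assumes headings look like "1. Employment and earnings": digits,
    a dot, then the title. Lines before the first heading (the MP's
    name and constituency) are collected under "preamble".
    """
    heading_re = re.compile(r"\d+\.")
    sections = {}
    current = "preamble"
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip the empty strings the scrape produces
        if heading_re.match(line):
            current = line
            sections.setdefault(current, [])  # keep headings with no entries
        else:
            sections.setdefault(current, []).append(line)
    return {h: "\n".join(body) for h, body in sections.items()}
```

Calling group_by_heading(mp_text_list) would then give one dict entry per numbered heading.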

[Comments]:

    Tags: python python-3.x web-scraping beautifulsoup scrapy


    [Solution 1]:

    So far, a solution for the desired output is as follows:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    data = []

    def get_data(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        h1 = [x.get_text(strip=True) for x in soup.select('p[xmlns="http://www.w3.org/1999/xhtml"]')]
        print(h1)


    url1 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/bridgen_andrew.htm'
    url2 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/robinson_mary.htm'

    print(' URL-1 '.center(50, '*'))
    get_data(url1)
    print(' URL-2 '.center(50, '*'))
    get_data(url2)

    cols = ["heading", "details"]

    # note: nothing is appended to data yet, so this DataFrame is empty
    df = pd.DataFrame(data, columns=cols)
    #print(df)
    #df.to_csv('info.csv', index=False)

    Output:

    ********************* URL-1 **********************
    ['Bridgen, Andrew (North West Leicestershire)', '1. Employment and earnings', 'From 6 May 2020 to 5 May 2022, Ady, building projects.', 'From 6 February 2017, AB Farms Ltd; potato production and storage. (Registered 21 March 2017)', '']
    ********************* URL-2 **********************
    ['Robinson, Mary (Cheadle)', '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'Name of donor: IX Wireless LtdAddress of donor: 4 Lockside Office Park, Lockside Road, Riversway, Preston PR2 2YSAmount of donation or nature and value if donation in kind: £2,000 to my local associationDonor status: company, registration 11008144(Registered 30 July 2021)', '7. (i) Shareholdings: over 15% of issued share capital', 'Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)', '8. Miscellaneous', 'From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)', 'From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)', '']
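    Since data is never populated above, the DataFrame at the end comes out empty. One hedged way to fill it, again assuming every heading line starts with digits and a dot (rows_from_lines is an illustrative helper, not part of the answer's code):

```python
import re

def rows_from_lines(lines):
    """Turn a flat list of page lines into (heading, details) rows.

    Assumes each numbered heading starts with digits and a dot,
    e.g. "1. Employment and earnings"; everything up to the next
    heading becomes that heading's details.
    """
    rows = []
    heading, details = None, []
    for line in lines:
        if re.match(r"\d+\.", line):
            if heading is not None:
                rows.append((heading, "\n".join(details)))
            heading, details = line, []
        elif heading is not None and line:
            details.append(line)
    if heading is not None:
        rows.append((heading, "\n".join(details)))
    return rows
```

    Calling data.extend(rows_from_lines(h1)) inside get_data() would then give the two-column DataFrame one row per heading.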
    

    [Comments]:

    • Thanks, it looks much better in a DataFrame. But what I'm trying to do is keep each block of text together //// e.g. for .../abbott_diane.htm //// h1 = ['1. Employment and earnings', 'Payments from the Guardian, Kings Place, 90 York Way, London N1 9GU, for articles:', '24 July 2020, received £100. Hours: 1 hr. (Registered 02 February 2021)', ...etc] //// h8 = ['8. Miscellaneous', 'From December 2015, trustee of the Diane Abbott Foundation, which works for excellence and improvement in education. (Registered 26 October 2016)'] //// That's what I'm struggling with.
    • See the solution above.
    [Solution 2]:

    You can do it like this.

    • The text you need is inside <p> tags with class="indent". Select all of those <p> tags using .find_all().
    • If you want the heading as well, you need the <p> immediately before each <p> selected above. I've used .findPreviousSibling() here to do that.

    Here is the complete code, which works for any MP page. You just need to call the function get_data(), passing in the MP's url.

    import requests
    from bs4 import BeautifulSoup
    
    def get_data(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        p = soup.find_all('p', class_='indent')
    
        for i in p:
            heading = i.findPreviousSibling('p').find('strong')
            if heading:
                heading = heading.text.strip()
                print(heading)
            print(f'{i.text.strip()}\n')
    
    
    url1 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/bridgen_andrew.htm'
    url2 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/robinson_mary.htm'
    
    print(' URL-1 '.center(50, '*'))
    get_data(url1)
    print(' URL-2 '.center(50, '*'))
    get_data(url2)
    

    This works for any MP's page. Here is the output for two different MP links.

    ********************* URL-1 **********************
    1. Employment and earnings
    From 6 May 2020 to 5 May 2022, Adviser to Mere Plantations Ltd of Unit 1 Cherry Tree Farm, Cherry Tree Lane, Rostherne WA14 3RZ; a company which grows teak in Ghana. I provide advice on business and international politics. I will be paid £12,000 a year for an expected monthly commitment of 8 hrs. (Registered 17 June 2020; updated 23 December 2020)
    
    Payments from Open Dialogus Ltd, 14 London Street, Andover SP11 6UA, for writing articles:
    
    7. (i) Shareholdings: over 15% of issued share capital
    AB Produce PLC; processing and distribution of fresh vegetables.
    
    AB Produce Trading Ltd; holding company.
    
    Bridgen Investments Ltd; investment company, investing in shares, property, building projects.
    
    From 6 February 2017, AB Farms Ltd; potato production and storage. (Registered 21 March 2017)
    
    ********************* URL-2 **********************
    2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation
    Name of donor: IX Wireless LtdAddress of donor: 4 Lockside Office Park, Lockside Road, Riversway, Preston PR2 2YSAmount of donation or nature and value if donation in kind: £2,000 to my local associationDonor status: company, registration 11008144(Registered 30 July 2021)
    
    7. (i) Shareholdings: over 15% of issued share capital
    Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)
    
    8. Miscellaneous
    From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)
    
    From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)
    
    

    [Comments]:

    • Thank you, this is very helpful. Is there a way to split the text under each numbered heading? i.e. all the text under '1. Employment and earnings' would be stored in a separate variable from all the text under '7. (i) Shareholdings: over 15% of issued share capital'. I want to do different things with each section of the text.
    • That's simple enough, right? You can do some string matching, extract the text and store it in variables.
    • Is it? (!) I'm definitely stuck on it.
    • Maybe there's an easier way (or my question wasn't clear) - but the solution I found is above.
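    The splitting asked about here can also be done structurally rather than by string matching: walk the <p> tags in document order, start a new section at each bold heading, and append each class="indent" paragraph to the current one. A sketch under the same layout assumptions as Solution 2 (sections_from_html is an illustrative helper; pass it the page HTML):

```python
from bs4 import BeautifulSoup

def sections_from_html(html):
    """Return {heading: [entries]} for one MP page.

    Assumes the layout described in Solution 2: each numbered heading
    is <strong> text inside a plain <p>, and every entry under it is
    a <p class="indent">.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None
    for p in soup.find_all("p"):
        classes = p.get("class") or []
        strong = p.find("strong")
        if strong is not None and "indent" not in classes:
            # a bold, non-indented <p> starts a new section
            current = strong.get_text(strip=True)
            sections.setdefault(current, [])
        elif current is not None and "indent" in classes:
            sections[current].append(p.get_text(strip=True))
    return sections
```

    For a live page, sections_from_html(requests.get(url).text) gives one dict entry per numbered heading, so each section can be handled separately.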