Dividing scraped text with Python and Beautiful Soup

【Question title】: Dividing scraped text with Python and Beautiful Soup
【Posted】: 2019-01-29 16:18:23
【Question】:

I have scraped the timetable from this website. The output I get is:

"ROUTE": "NAPOLI PORTA DI MASSA \u00bb ISCHIA"

But I would like:

"DEPARTURE PORT": "NAPOLI PORTA DI MASSA"
"ARRIVAL PORT": "ISCHIA"

How can I split the string? Here is the code:

medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            # departure_time.append(next_li.strong.text)
            medmar_live_departures_data.append({
                'ROUTE': li.text
            })

【Comments on the question】:

Split on "\u00bb"?
Yes, but how do you do that? Sorry, I have only just started learning Python...
li.text.split("\u00bb") ?

Tags: python web-scraping beautifulsoup

【Solution 1】:

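Two things:

1. Because "»" is a non-ASCII character, it can show up in the output in its escaped form, "\u00bb". You can therefore parse the string by splitting the text on that code point, like this: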

parse=li.get_text().split('\u00bb')
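
To make the escape less mysterious, here is a minimal sketch that uses a hard-coded sample string instead of the live page. It assumes the "\u00bb" in the question's output comes from JSON-style escaping (for example `json.dumps` with its default `ensure_ascii=True`); that is a guess about the asker's setup, not something the thread confirms. Inside the Python string itself the separator is the single character "»".

```python
import json

# Sample text copied from the question's output (not fetched from the site)
route = "NAPOLI PORTA DI MASSA \u00bb ISCHIA"

# "\u00bb" and "»" are two spellings of the same one-character string
same_char = "\u00bb" == "»"

# json.dumps escapes non-ASCII characters by default, which is one way the
# "\u00bb" form can appear in output (assumption about the asker's setup)
escaped = json.dumps({"ROUTE": route})

# Split on the separator and strip the spaces left around each part
departure, arrival = [part.strip() for part in route.split("\u00bb")]

print(same_char)
print(escaped)
print(departure)
print(arrival)
```

The `strip()` call matters because the raw parts keep the spaces that surrounded "»" in the page text.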

Alternatively, you can use the re library to split on non-ASCII characters (if you take this route, you need to import re):

import re

non_ascii = li.get_text()
parse = re.split('[^\x00-\x7f]', non_ascii)
#[^\x00-\x7f] will select non-ascii characters as pointed out by Moinuddin Quadri in https://stackoverflow.com/questions/40872126/python-replace-non-ascii-character-in-string

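Either way, the split produces a list of parts. However, not every piece of text inside the "li" HTML tags contains the "»" character (for example "POZZUOLI-PROCIDA" at the end of the table on the website), so we have to account for that case or we will run into problems.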
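
One way to handle entries without the separator is `str.partition`, which always returns three parts and leaves the second one empty when the separator is missing. The helper name `split_route` below is hypothetical and only illustrates the idea on sample strings taken from this thread.

```python
def split_route(text, sep="\u00bb"):
    """Return (departure, arrival); arrival is None when sep is absent."""
    departure, found, arrival = text.partition(sep)
    if not found:
        # No separator in this entry: keep the whole text as a single port
        return text.strip(), None
    return departure.strip(), arrival.strip()


with_sep = split_route("NAPOLI PORTA DI MASSA \u00bb ISCHIA")
without_sep = split_route("POZZUOLI-PROCIDA")

print(with_sep)
print(without_sep)
```

Returning `None` for the missing arrival keeps every result the same shape, so the calling code can decide what to do with single-port entries.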

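2. A dictionary may be a poor choice of data structure here, because the data you are parsing will contain repeated keys if you use the departure port as the key.

For example, take POZZUOLI » CASAMICCIOLA and POZZUOLI » PROCIDA. CASAMICCIOLA and PROCIDA would share the same key. Python would simply overwrite the value stored under the POZZUOLI key, so POZZUOLI: CASAMICCIOLA would become POZZUOLI: PROCIDA, instead of POZZUOLI: CASAMICCIOLA being kept as one dictionary entry and POZZUOLI: PROCIDA as another.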
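
The overwrite is easy to reproduce with two hard-coded pairs. This is only an illustration of the dictionary behaviour described above, with the port names from the example used as sample data.

```python
pairs = [("POZZUOLI", "CASAMICCIOLA"), ("POZZUOLI", "PROCIDA")]

# Keyed by departure port: the second assignment replaces the first
by_departure = {}
for departure, arrival in pairs:
    by_departure[departure] = arrival

# A list of tuples keeps both routes
routes = []
for departure, arrival in pairs:
    routes.append((departure, arrival))

print(by_departure)
print(routes)
```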

I suggest adding each parsed pair to a list as a tuple, like this:

import re

single_port = []
ports = []

medmar_live_departures_table = list(bs.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            # departure_time.append(next_li.strong.text)
            non_ascii = li.get_text()
            parse = re.split('[^\x00-\x7f]', non_ascii)

            # Handle table entries that don't contain the non-ASCII character "»"
            if len(parse) > 1:
                ports.append((parse[0], parse[1]))
            else:
                single_port.append(parse[0])

        # Move on to the next sibling so the while loop can terminate
        next_li = next_li.find_next_sibling("li")


# This will print out your data in your desired manner
for i in ports:
    print("DEPARTURE: "+i[0])
    print("ARRIVAL: "+i[1])

for i in single_port:
    print(i)

I also used the split method in a test script that I ran:

import requests
from bs4 import BeautifulSoup
import re

url="https://www.medmargroup.it/"
response=requests.get(url)
bs=BeautifulSoup(response.text, 'html.parser')


timeTable=bs.find('section', class_="primarystyle-timetable")

medmar_live_departures_table=timeTable.find('ul')
single_port= []
ports=[]


for li in medmar_live_departures_table.find_all('li', class_="tratta"):
    parse=li.get_text().split('\u00bb')

    if len(parse)>1:
        ports.append((parse[0],parse[1]))

    else:
        single_port.append(parse[0])


for i in ports:
    print("DEPARTURE: "+i[0])
    print("ARRIVAL: "+i[1])

for i in single_port:
    print(i)

I hope this helps!

【Comments】:

Thank you very much! Great explanation and reply. People like you make Stack Overflow a great community.

【Solution 2】:

Try this:

medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            # departure_time.append(next_li.strong.text)
            medmar_live_departures_data.append({
                'DEPARTURE PORT': li.text.split("\u00bb")[0],
                'ARRIVAL PORT': li.text.split("\u00bb")[1]
            })
        # Advance to the next sibling so the while loop can terminate
        next_li = next_li.find_next_sibling("li")
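
Note that indexing `[1]` after the split raises an `IndexError` for any entry that has no "»" (such as "POZZUOLI-PROCIDA"). The sketch below is one possible guarded variant that builds the dictionaries the question asked for. The function name `route_to_record` and the fallback `ROUTE` key for single-port entries are assumptions made for illustration, and the input is a hard-coded list of sample strings rather than the scraped page.

```python
import json

SEPARATOR = "\u00bb"  # the "»" character


def route_to_record(text):
    """Build one record per route, tolerating a missing separator."""
    # maxsplit=1 yields at most two parts even if "»" appears more than once
    parts = [part.strip() for part in text.split(SEPARATOR, 1)]
    if len(parts) == 2:
        return {"DEPARTURE PORT": parts[0], "ARRIVAL PORT": parts[1]}
    # Hypothetical fallback: keep the unsplit text under the original key
    return {"ROUTE": parts[0]}


sample_texts = [
    "NAPOLI PORTA DI MASSA \u00bb ISCHIA",
    "POZZUOLI \u00bb CASAMICCIOLA",
    "POZZUOLI-PROCIDA",
]

records = [route_to_record(text) for text in sample_texts]
print(json.dumps(records, indent=2))
```

In the scraping loop, `route_to_record(li.get_text())` could stand in for the two indexed `split` calls.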

【Comments】: