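【Question title】: Removing whitespaces/blankspaces/newlines from scraped data
【Posted】: 2021-10-06 06:09:42
【Question】:

I scraped data from a URL using Beautiful Soup. After cleaning, the text still contains many spaces, blank spaces, and newlines. I tried the .strip() function to remove them, but they are still there.

Code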

from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)

Output

   America the Beautiful: A Virtual Patriotic Salute   Flagstaff Symphony Orchestra                                                                                           Contact             Hit enter to search or ESC to close                                     About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets                  All Events   This event has passed. America the Beautiful: A Virtual Patriotic Salute  July 4, 2020         Violin Virtuoso Beethoven Virtual 5k             In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of  America the Beautiful  performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS   + Google Calendar+ iCal Export     Details    Date:    July 4, 2020   Event Category: Concerts and Events             Violin Virtuoso Beethoven Virtual 5k                   Concert InfoConcerts Concerts and Events FAQs     FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members     Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards  (Used by permission of the Association of Fundraising Professionals)     ResourcesCommunity & Education For Musicians For Board Members             2021 Flagstaff Symphony Orchestra. 
           Copyright 2019 Flagstaff Symphony Association                             About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets   Contact  

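In the code above, I replaced the Unicode characters with ' ' (a space). If I don't replace them with a space, several words get joined together. What I want is a single string without the unnecessary spaces and newlines.

Additional question

I tried strip(), re.sub(), and everything else I could think of to remove the spaces at the start of some lines in the text, but none of them work on the following data: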

Subscription Tickets
 All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
 Violin Virtuoso
Beethoven Virtual 5k 

How can we remove these spaces?

【Comments】:

  • FYI, it's "scraping", not "scrapping". Scrapping means throwing something away like garbage.

Tags: python python-re


【Solution 1】:

You can try:

print(re.sub(r'\s+', ' ', text).strip())
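To illustrate the one-liner above, here is a minimal sketch on a made-up sample string (the `sample` text is an assumption standing in for the scraped page, not the real page content). It collapses every run of whitespace, including tabs, newlines, and non-breaking spaces, into a single space, and shows an equivalent form that does not need a regex.

```python
import re

# Made-up sample standing in for the scraped text (an assumption).
sample = "  Subscription Tickets\n \xa0 All Events\n\n\tJuly 4, 2020  "

# Collapse every run of whitespace into one space, then trim both ends.
collapsed = re.sub(r"\s+", " ", sample).strip()

# Equivalent without a regex: split on any whitespace and re-join.
joined = " ".join(sample.split())

print(collapsed)
```

Both forms give a single-line string, so they fit when the line structure of the page does not need to be kept.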

【Comments】:

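    【Solution 2】:

    Try this:

    from bs4 import BeautifulSoup
    import requests
    import re


    URL = "https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
    html_content = requests.get(URL).text
    # .text returns only the text of the page, without the HTML tags
    cleantext = BeautifulSoup(html_content, "lxml").text
    # replace anything that still looks like a tag with a space
    cleanr = re.compile('<.*?>')
    clean_data = re.sub(cleanr, ' ', cleantext)
    # collapse every run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', clean_data).strip()
    print(text)
    with open('read.txt', 'w', encoding='utf-8') as file:
        file.write(text)


    Output: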

    America the Beautiful: A Virtual Patriotic Salute – Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets « All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 « Violin Virtuoso Beethoven Virtual 5k » In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of “America the Beautiful” performed by 60 of their professional musicians, coming together virtually, to celebrate our nation’s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events « Violin Virtuoso Beethoven Virtual 5k » Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members © 2021 Flagstaff Symphony Orchestra. © Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
    

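    【Comments】:

      【Solution 3】:

      It's not clear whether you want to keep some whitespace for readability. If you do, you could try this approach:

      Update: added code to keep only alphanumeric characters, plus the characters in an exception list.

      Code: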

      from bs4 import BeautifulSoup
      import requests
      
      
      def clean_scraped_text(raw_text):

          # strip whitespace from the start and end of the raw text
          stripped_text = raw_text.strip()

          processed_text = ''
          for i, char in enumerate(stripped_text):
              # add a single '\n' to processed_text for every run of '\n'
              if char == '\n':
                  if stripped_text[i - 1] != '\n':
                      processed_text += '\n'
              else:
                  # any other character is copied to processed_text as-is
                  processed_text += char

          # clean each line of processed_text
          cleaned_text = ''
          for line in processed_text.splitlines():
              # keep only alphanumeric characters and the characters listed here
              exclude_list = [' ', '\xa0', '-']
              line = ''.join(x for x in line if x.isalnum() or (x in exclude_list))
              cleaned_text += line.strip() + '\n'

          return cleaned_text
      
      URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
      html_content = requests.get(URL).text
      text = BeautifulSoup(html_content, "lxml").text
      print(clean_scraped_text(text))
      

      Output:

      America the Beautiful A Virtual Patriotic Salute  Flagstaff Symphony Orchestra
      
      Contact
      Hit enter to search or ESC to close
      
      
      About
      Our Team
      Our Conductor
      Orchestra Members
      Concerts  Events
      Season 72 Concerts
      Subscribe
      Venue Parking  Concerts FAQs
      Support The FSO
      Donate to FSO
      Sponsor a Chair
      Funding and Impact
      Videos
      Donate
      Subscription Tickets
      All Events
      This event has passed
      America the Beautiful A Virtual Patriotic Salute
      July 4 2020
      Violin Virtuoso
      Beethoven Virtual 5k
      In place of our traditional 4th of July concert at the Pepsi Amphitheater the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4 2020 at 11am The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians coming together virtually to celebrate our nations independence
      CLICK HERE FOR DETAILS
      Google Calendar iCal Export
      Details
      Date
      July 4 2020
      Event Category Concerts and Events
      
      Violin Virtuoso
      Beethoven Virtual 5k
      
      Concert InfoConcerts
      Concerts and Events FAQs
      
      FSO InfoAbout FSO Mission and History
      Our Team
      Our Conductor
      Orchestra Members
      Support FSOMake a Donation
      Underwriting a Concert
      Sponsor a Chair
      Advertise with FSO
      Volunteer
      Leave a Legacy
      Donor Bill of Rights
      Code of Ethical Standards  Used by permission of the Association of Fundraising Professionals
      ResourcesCommunity  Education
      For Musicians
      For Board Members
      2021 Flagstaff Symphony Orchestra
      Copyright 2019 Flagstaff Symphony Association
      
      
      About
      Our Team
      Our Conductor
      Orchestra Members
      Concerts  Events
      Season 72 Concerts
      Subscribe
      Venue Parking  Concerts FAQs
      Support The FSO
      Donate to FSO
      Sponsor a Chair
      Funding and Impact
      Videos
      Donate
      Subscription Tickets
      Contact
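The character-by-character loop above can get slow on large pages. A regex-based sketch of a similar idea is shown here; the function name `clean_lines` and the sample string are hypothetical, and unlike the code above it also drops empty lines and collapses repeated spaces inside a line.

```python
import re

# Anything that is not a word character, whitespace, or '-' is removed;
# '_' counts as a word character in regex, so it is removed explicitly.
UNWANTED = re.compile(r"[^\w\s-]|_")


def clean_lines(raw_text):
    """Hypothetical helper: clean each line, keep one item per line."""
    cleaned = []
    for line in raw_text.splitlines():
        line = UNWANTED.sub("", line)
        # collapse inner runs of whitespace and trim both ends of the line
        line = re.sub(r"\s+", " ", line).strip()
        if line:  # skip lines that end up empty
            cleaned.append(line)
    return "\n".join(cleaned)


# Made-up sample resembling the scraped text (an assumption).
sample = (
    "\n\n  Subscription Tickets\n \xab All Events\n\n\n"
    "This event has passed.\n  July 4, 2020  \n"
)
result = clean_lines(sample)
print(result)
```

Stripping each line separately is also what removes the spaces at the start of individual lines; calling strip() once on the whole string only trims its two ends.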
      

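      【Comments】:

      • Hi, thanks for your code. Is it possible to remove the special characters in the result above, such as >, +, the copyright symbol, etc., while keeping the document format the same?
      • You're welcome! :) Yes, I've added some code that handles this. For large amounts of text this solution will be very slow. You could look into using the re (regex) module, as in the other answers; that would likely be more efficient.
      • Perfect. Thank you very much.
      • One more question about this. With some URLs, such as alabamasymphony.org/event/griegs-holberg-suite, I don't get the same kind of output. Words are joined together and hard to process. Do you know what causes this?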
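On the joined words: one likely cause (an assumption, since the markup of that page is not shown in this thread) is that `.text` concatenates the text of adjacent tags with nothing in between, which is also how "Concert InfoConcerts" appears in the output above. A minimal sketch using Beautiful Soup's `get_text()` with its `separator` and `strip` arguments, on a made-up HTML fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment with adjacent tags and no whitespace between them.
html = "<div><h3>Concert Info</h3><a>Concerts</a><a>FAQs</a></div>"
soup = BeautifulSoup(html, "html.parser")

# .text joins the pieces directly, so the words run together.
glued = soup.text

# get_text() can put a separator between pieces and strip each one.
spaced = soup.get_text(separator=" ", strip=True)

print(glued)
print(spaced)
```

Passing `separator="\n"` instead would put each piece of text on its own line, which may suit the line-oriented output of Solution 3 better.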