【Question title】:Is there a way to erase or separate web scraping data in Python?
【Posted】:2016-03-11 15:42:05
【Question】:

Hi, I'm scraping the latest headlines from the ABC News website, and the markup I'm scraping looks like this:

 <a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&amp;lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a>

As you can see, there is a span tag inside the a tag, so when I scrape it with BeautifulSoup I get something like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement41 minutes ago

The time ends up glued to the end of my data. I'd like to separate the "41 minutes ago" so that it looks like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement 41 minutes ago

Or at least remove it!

My code looks like this:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

for x in range(1, 10):
    for link in soup.find_all("a", {"name": "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_" + str(x) + "]"}):
        print(link.text)  # headline with the span's time appended to it
        print(link.find_all("", {"class": "metaH_timeDay"})[0].text)  # the time on its own
        print("")

Can someone help me?

【Comments】:

    Tags: python html web-scraping beautifulsoup python-requests


    【Solution 1】:

    You can also use the decompose() function: run a while loop to remove all the span tags inside that div.

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://abcnews.go.com/"
    
    r = requests.get(url)
    
    soup = BeautifulSoup(r.content, "html.parser")
    
    # Every headline link sits inside a div with class "h"
    for link in soup.select("div.h a"):
        # Remove each nested <span> (such as the timestamp) from the link
        while link.span:
            link.span.decompose()
        print(link.text)
    

    Output:

     Obama Seeks to Remove Fear From ISIS Fight
    Kerry off to Paris Again for Climate Conference
    Huckabee Draws Cheers at Fundraiser for West Bank Settlement
    Sanders Unveils Plan to Address Climate Change
     FBI Looking Into Blatter's Role in Bribery Case
    Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House
    13 Injured in Attack on Government Office in Western China
    Police Arrest Mother of Newborn Baby Who Was Buried Alive
    Shooting Suspect's Neighbor Says He Became 'More Withdrawn'
     Justice Department to Investigate Chicago Police
    Hillary Clinton Corrects Flub, Thanks to Justice Breyer
     Dashcam Must Be Working
    Clinton Laughs Off Trump’s Claims That She Lacks ‘Stamina’
     Man Killed in Wisconsin Standoff Was a Hostage
     2 New York College Students Abducted, Held Hostage
    Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69
     Mood Dour Among Venezuelan Ruling Party Backers
    Hillary Clinton Says ‘We’re Not Winning’ Fight Against ISIS
    Jimmy Carter Says Latest Brain Scan Shows No Cancer
    One Direction Leads the Way on Twitter's List of 2015 Tweets
    Promises of Grocery Stores in Needy Areas Mostly Unfulfilled
    McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1
    Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
    Medical Examiner Shortage: Facts About Death Investigations
    Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10
    Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
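
The same idea can be tried without hitting the live site. The following is a minimal sketch, assuming the anchor markup quoted in the question (with the `name` attribute left out for brevity); the helper name `headline_without_spans` is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup based on the anchor quoted in the question; no network access needed.
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def headline_without_spans(markup):
    """Hypothetical helper: return the link text with every nested <span> removed."""
    soup = BeautifulSoup(markup, "html.parser")
    link = soup.a
    # decompose() destroys the tag, so link.span moves on to the next <span>, if any
    while link.span:
        link.span.decompose()
    return link.get_text(strip=True)


headline = headline_without_spans(html)
print(headline)
```

Note that decompose() throws the span away, so this variant fits the "at least remove it" case rather than the "separate it" case.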
    

    【Comments】:

      【Solution 2】:

      Let's pull it out with extract():

      >>> link.span.extract()     # remove the first `span` tag that we don't need
      >>> time = link.span.extract()
      >>> time
      <span class="metaH_timeDay">2 hours, 45 minutes ago</span>
      >>> link.text
      ' Obama Seeks to Remove Fear From ISIS Fight'
      >>> time.text
      '2 hours, 45 minutes ago'
      >>> 
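
For the "separate" case, here is a self-contained sketch of the same extract() approach, assuming the anchor markup quoted in the question, where the timestamp is the span with class `metaH_timeDay`. The function name `split_headline_and_time` is an illustrative assumption, not something from the original post or the library.

```python
from bs4 import BeautifulSoup

# Sample markup based on the anchor quoted in the question; no network access needed.
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def split_headline_and_time(markup):
    """Hypothetical helper: return (headline, time) taken from one headline link."""
    soup = BeautifulSoup(markup, "html.parser")
    link = soup.a
    time_tag = link.find("span", class_="metaH_timeDay")
    time_text = ""
    if time_tag is not None:
        # extract() detaches the tag from the tree and returns it,
        # so its text stays available after it has left the link
        time_text = time_tag.extract().get_text(strip=True)
    return link.get_text(strip=True), time_text


headline, posted = split_headline_and_time(html)
print(headline + " | " + posted)
```

Unlike decompose(), extract() hands the removed tag back, which is why the time can be printed separately from the headline.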
      

      【Comments】:
