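Is there a way to erase or separate web scraping data in Python? (Answers)

【Question title】: Is there a way to erase or separate web scraping data in Python?
【Posted】: 2016-03-11 15:42:05
【Question description】:

Hello, I am scraping the latest news from the ABC News website, and the markup I am scraping looks like this: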

 <a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&amp;lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a>

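As you can see, there is a span tag inside the a tag, so when I scrape it with BeautifulSoup I get this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement41 minutes ago

The time ends up right next to my data. I would like to separate the "41 minutes ago" so that it looks like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement 41 minutes ago

Or at least remove it!

My code looks like this: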

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

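# Headline links are named ...link[Headline_1] through ...link[Headline_9]
for x in range(1, 10):
    for link in soup.find_all("a", {"name": "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_" + str(x) + "]"}):
        # link.text includes the nested span, so the time is glued to the headline
        print(link.text)
        print(link.find_all("", {"class": "metaH_timeDay"})[0].text)
        print("")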

Can someone help me?

【Question discussion】:

Tags: python html web-scraping beautifulsoup python-requests

【Solution 1】:

You can also use the decompose() function: run a while loop to remove all the span tags from the links in that div:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

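# Select every <a> inside a <div class="h"> (the headline links)
for a in soup.select("div.h a"):
    # Re-parse the link so the original soup is left untouched
    f = BeautifulSoup(str(a), "html.parser")
    # Remove every <span> (the timestamps) from the copy
    while f.span:
        f.span.decompose()
    print(f.text)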

Output:

 Obama Seeks to Remove Fear From ISIS Fight
Kerry off to Paris Again for Climate Conference
Huckabee Draws Cheers at Fundraiser for West Bank Settlement
Sanders Unveils Plan to Address Climate Change
 FBI Looking Into Blatter's Role in Bribery Case
Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House
13 Injured in Attack on Government Office in Western China
Police Arrest Mother of Newborn Baby Who Was Buried Alive
Shooting Suspect's Neighbor Says He Became 'More Withdrawn'
 Justice Department to Investigate Chicago Police
Hillary Clinton Corrects Flub, Thanks to Justice Breyer
 Dashcam Must Be Working
Clinton Laughs Off Trump’s Claims That She Lacks ‘Stamina’
 Man Killed in Wisconsin Standoff Was a Hostage
 2 New York College Students Abducted, Held Hostage
Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69
 Mood Dour Among Venezuelan Ruling Party Backers
Hillary Clinton Says ‘We’re Not Winning’ Fight Against ISIS
Jimmy Carter Says Latest Brain Scan Shows No Cancer
One Direction Leads the Way on Twitter's List of 2015 Tweets
Promises of Grocery Stores in Needy Areas Mostly Unfulfilled
McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
Medical Examiner Shortage: Facts About Death Investigations
Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
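
Since the live page has probably changed since this was posted, here is a minimal offline sketch of the same decompose() idea. It assumes the headline link from the question sits inside a `div` with class `h` (inferred from the `div.h a` selector above), and the helper name `headline_without_spans` is made up for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup based on the question (name attribute omitted).
# The <div class="h"> wrapper is an assumption taken from the "div.h a" selector.
html = (
    '<div class="h">'
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
    '</div>'
)


def headline_without_spans(link):
    """Return the link text with every nested <span> removed (hypothetical helper)."""
    # Work on a re-parsed copy so the caller's tree is not modified.
    copy = BeautifulSoup(str(link), "html.parser")
    # copy.span is the first remaining <span>, or None when none are left.
    while copy.span:
        copy.span.decompose()
    return copy.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
headlines = [headline_without_spans(a) for a in soup.select("div.h a")]
print(headlines)
```

Because the helper works on a copy, the time is still available in the original `soup` if you need it later.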

【Discussion】:

【Solution 2】:

Let's extract it with extract():

>>> link.span.extract()     # remove the first `span` tag that we don't need
>>> time = link.span.extract()
>>> time
<span class="metaH_timeDay">2 hours, 45 minutes ago</span>
>>> link.text
' Obama Seeks to Remove Fear From ISIS Fight'
>>> time.text
'2 hours, 45 minutes ago'
>>>
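
To try the same idea without hitting the site, here is a rough self-contained sketch. It uses the markup from the question, which has only one span, so a single extract() is enough. Looking the span up by its `metaH_timeDay` class instead of by position is my own variation, not part of the session above:

```python
from bs4 import BeautifulSoup

# Sample markup based on the question (name attribute omitted).
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Find the timestamp span by class so any other <span> is left alone.
time_tag = link.find("span", class_="metaH_timeDay")

# extract() removes the tag from the tree and returns it;
# fall back to an empty string if the link has no timestamp.
time_text = time_tag.extract().get_text(strip=True) if time_tag else ""

# With the span gone, the link text is just the headline.
headline = link.get_text(strip=True)

print(headline)
print(time_text)
```

Unlike decompose(), extract() hands the removed tag back, so you keep both the headline and the time as separate values.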

【Discussion】: