【Question Title】: Can't clear pandas dataframe
【Posted】: 2021-07-19 01:09:29
【Question】:

This is my first time using pandas. I'm trying to write a for loop that pulls mp3 links from a website and puts them into a csv file. For each album link, it creates a new folder and a new csv file, then writes the mp3 links into the csv.

Everything works, but I have one major problem - the dataframe keeps appending data from the previous loop iterations to the current one, so my dataframe/list keeps growing.
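The symptom can be reproduced with a toy version of the same pattern - a list created once, before the per-album loop (the album and track names below are made up purely for illustration):

```python
albums = {
    "album-a": ["01.mp3", "02.mp3"],
    "album-b": ["01.mp3", "02.mp3"],
}

rows = []  # created once, BEFORE the per-album loop
snapshots = {}
for album, tracks in albums.items():
    for track in tracks:
        rows.append(f"{album}/{track}")
    # this is what would end up in this album's csv:
    snapshots[album] = list(rows)

# album-b's "csv" also contains all of album-a's rows
print(snapshots["album-b"])
# → ['album-a/01.mp3', 'album-a/02.mp3', 'album-b/01.mp3', 'album-b/02.mp3']
```

Because `rows` is never reset, every album's snapshot includes all earlier albums' rows - exactly the behavior described above.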

Here's the code:

from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import os
import csv

def mpull():
   albumlist()
   baseurl = "https://downloads.khinsider.com"
   alist = albumlist.albums_str
   llist = albumlist.link_str
   fullsoup = []
   for l, ab in zip(llist, alist):
      os.mkdir(ab)
      url = urllib.request.urlopen(l)
      content = url.read()
      soup = BeautifulSoup(content, features="html.parser")
      for a in soup.findAll('a',href=re.compile('/*\.mp3')):
         df = pd.DataFrame([])
         fullsoup.append(baseurl+a['href'])
         remove_dup(fullsoup)
         df = pd.DataFrame(fullsoup)
      df.to_csv(ab+"/"+ab+".csv", index=False, header=False) 
      print(fullsoup)
mpull()

What I want is this:

007 everything or nothing:

https://downloads.khinsider.com/game-soundtracks/album/007-everything-or-nothing/EON-01-James-Bond-Theme.mp3
https://downloads.khinsider.com/game-soundtracks/album/007-everything-or-nothing/EON-02-Russian-Liar.mp3
#MORE 007 everything or nothing songs

What I get is this:

007 everything or nothing:
#songs from the last loop appear first for some reason
https://downloads.khinsider.com/game-soundtracks/album/007-blood-stone/01-%2520James%2520Bond-Blood%2520Stone%2520Theme%2520Song.mp3
https://downloads.khinsider.com/game-soundtracks/album/007-blood-stone/02-%2520M%2520Puts%2520Her%2520Trust%2520in%2520Bond.mp3
#Then the right songs appear afterwards 
https://downloads.khinsider.com/game-soundtracks/album/007-everything-or-nothing/EON-01-James-Bond-Theme.mp3
https://downloads.khinsider.com/game-soundtracks/album/007-everything-or-nothing/EON-02-Russian-Liar.mp3
#MORE 007 everything or nothing songs

What I tried: I added `del df` at the end of the loop, like this:

from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import os
import csv

def mpull():
   albumlist()
   baseurl = "https://downloads.khinsider.com"
   alist = albumlist.albums_str
   llist = albumlist.link_str
   fullsoup = []
   for l, ab in zip(llist, alist):
      os.mkdir(ab)
      url = urllib.request.urlopen(l)
      content = url.read()
      soup = BeautifulSoup(content, features="html.parser")
      for a in soup.findAll('a',href=re.compile('/*\.mp3')):
         df = pd.DataFrame([])
         fullsoup.append(baseurl+a['href'])
         remove_dup(fullsoup)
         df = pd.DataFrame(fullsoup)
      df.to_csv(ab+"/"+ab+".csv", index=False, header=False) 
      del df
      print(fullsoup)
mpull()

But that doesn't seem to do anything - or rather, it's still appending the previous loop's dataframe to the current csv iteration.

Any ideas would be great. Thanks!

【Discussion】:

    Tags: python pandas dataframe csv


    【Solution 1】:

    I figured it out!!! Instead of trying to delete the dataframe df, I needed to reset the fullsoup list on each loop iteration so it doesn't carry the list's data over from one loop to the next.

    def mpull():
       albumlist()
       baseurl = "https://downloads.khinsider.com"
       alist = albumlist.albums_str
       llist = albumlist.link_str
       for l, ab in zip(llist, alist):
          fullsoup = []
          os.mkdir(ab)
          url = urllib.request.urlopen(l)
          content = url.read()
          soup = BeautifulSoup(content, features="html.parser")
          for a in soup.findAll('a',href=re.compile('/*\.mp3')):
             df = pd.DataFrame([])
             fullsoup.append(baseurl+a['href'])
             # remove_dup(fullsoup)
             df = pd.DataFrame(fullsoup)
          print(fullsoup)
          del fullsoup
          df.to_csv(ab+"/"+ab+".csv", index=False, header=False) 
    mpull()
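As a side note, the `df = pd.DataFrame([])` inside the inner loop is redundant: the frame is rebuilt from the whole list on every anchor tag anyway, so building it once per album after the inner loop finishes is enough. A minimal sketch of that shape with dummy links (no network access and no `albumlist` helper - the album names and example.com URLs here are hypothetical):

```python
import os

import pandas as pd

# stand-in for the links scraped per album page
albums = {
    "album-a": ["https://example.com/a1.mp3", "https://example.com/a2.mp3"],
    "album-b": ["https://example.com/b1.mp3"],
}

for ab, links in albums.items():
    os.makedirs(ab, exist_ok=True)
    fullsoup = []                # fresh list for every album
    for link in links:
        fullsoup.append(link)
    df = pd.DataFrame(fullsoup)  # build the frame once, after collecting
    df.to_csv(os.path.join(ab, ab + ".csv"), index=False, header=False)
```

Each csv then contains only its own album's links, and no `del` is needed because both `fullsoup` and `df` are rebound on every iteration.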
    

    【Discussion】:

    • If you set fullsoup = [] inside the for-loop, then you don't need del fullsoup. In the original version you set fullsoup = [] before the for-loop, which is why you'd need fullsoup.clear() to drop the previous values.
    • I tried that beforehand, but it didn't work as expected. Instead of putting all the mp3s from a webpage into one list, it would only grab one link and then move on to the next one. So I had to put fullsoup in the first for loop (rather than the second) to keep that from happening. Everything runs well now, apart from the occasional webpage error, but that's fine by me :)
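The two reset strategies discussed above are interchangeable for this use case, with one subtlety worth noting (toy data below, chosen just to show the difference):

```python
# Strategy 1: rebind a fresh list each iteration
results = []
for album in ["a", "b"]:
    fullsoup = []      # new list object every album
    fullsoup.append(album)
    results.append(fullsoup)

# Strategy 2: keep one list and empty it in place
fullsoup = []
results2 = []
for album in ["a", "b"]:
    fullsoup.clear()   # same list object, emptied
    fullsoup.append(album)
    results2.append(list(fullsoup))  # copy, since fullsoup is reused

print(results, results2)
# → [['a'], ['b']] [['a'], ['b']]
```

With `clear()`, anything that stored a reference to `fullsoup` itself (rather than a copy) would see the list emptied too, so the copy in strategy 2 matters; rebinding with `fullsoup = []` avoids that concern entirely.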