[Posted]: 2021-04-04 03:48:58
[Problem description]:
I am trying to write a script that extracts data from any files already present on an FTP server, and then keeps monitoring the FTP directory for new incoming files, extracting data from each newly appearing file as well and appending it to a data frame. In more detail:
- Log in to the FTP server in a given folder
- If any files exist in the folder, extract data from each of them with some functions and append it to a pandas dataframe
- Keep watching the directory for new files, extract data from each newly appearing file and append it to the pandas dataframe
- Wait for new files to appear, and exit if the wait exceeds a time limit
What I have written so far:
import pandas as pd
import numpy as np
import os
from ftplib import FTP
from time import sleep
import time
# Here I define my empty data frame to which I will append my extracted data
cols = [ 'Channel' , 'Voltage' , 'Amplitude', 'Time_(ms)', 'Bubble_period_(ms)']
all_results = pd.DataFrame(columns = cols)
# Here I define my empty list used to collect the extracted per-file dictionaries
data = []
#Function to monitor FTP and extract data
def extract_ftp_results(ftp_folder_path):
    global all_results
    ftp = FTP()
    ftp.connect('10.199.44.240', 21)
    ftp.login('display')
    ftp.cwd(str(ftp_folder_path))
    print("Connection Established {}".format(ftp.getwelcome()))
    # Local directory where I copy each file for extracting attributes/data
    direct = 'C:\\Users\\QC\\Desktop\\ftp_local\\'
    # One-element dummy list to compare with the contents of the FTP directory;
    # it will be used by the for loop later in the function
    old_files = ['1']
    while True:  # Start a while loop to monitor the FTP directory
        new_files = ftp.nlst()  # List the filenames of the FTP directory
        if len(old_files) != 0 and new_files != old_files:  # Check filenames against old_files
            changes = [i for i in new_files if i not in old_files]  # Keep the names that don't match
            for x in changes:  # For each filename that was not in old_files
                filename = str(direct + x)  # Local path the file will be written to
                localfile = open(filename, 'wb')  # Open that file in write mode
                ftp.retrbinary('RETR' + ' ' + x, localfile.write, 1024)  # Fetch from FTP and write
                localfile.close()  # Close the file
                print("updating data ***************************************************")
                print("found new file---> {}".format(str(filename).split('\\')[-1]))
                print("")
                print("Calculating Attributes")
                print("*****************************************************************")
                sensor_arr, nfh_arr, gcs, mask, chan = extract_data(filename)  # Extract data as np.array
                i = 0
                num_cluster = 18
                sequence = gcs[11:19]
                shot = gcs[25:29]
                while i < num_cluster:  # Loop through the numpy array extracted from the file
                    poa, pot, bp = bubble_attributes(apply_filter(nfh_arr[15:500, i]))  # Extract attributes
                    values = [shot, chan, poa, pot, bp]
                    zipped = zip(cols, values)  # Zip attributes with column names
                    a_dictionary = dict(zipped)  # Convert to a dictionary
                    data.append(a_dictionary)  # Append the dictionary to the data list
                    chan = chan + 1
                    i += 1
                os.remove(filename)  # Remove the file from the local machine
            all_results = all_results.append(data, True)  # Append the list of dictionaries to the dataframe
            old_files = new_files
        a = time.perf_counter()  # Start the time counter
        if time.perf_counter() > a + 100:
            print("Done Waiting")  # Break if the wait for a new file exceeds the limit
            break
The problem is that I end up with a dataframe full of duplicate values: the data from the files is appended to the dataframe again and again, as if the loop runs from the start for every element of the new_files list. Can someone help?
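The duplication can be reproduced without any FTP at all: because `data` is a module-level list that is never cleared, every pass appends the rows accumulated from *all* earlier files, not just the new one. A minimal sketch of the effect and of one possible fix (resetting the list per file); the helper names and stub rows are hypothetical, and the real extraction is replaced by a one-row stub:

```python
# Demonstration of the duplicate-rows bug: a 'data' list shared across
# iterations is re-appended in full on every pass.
def process_files_buggy(files):
    data = []          # shared across iterations, like the global in the question
    all_results = []
    for f in files:
        data.append({'file': f})   # rows "extracted" from this file (stub)
        all_results.extend(data)   # re-appends rows from earlier files too
    return all_results

def process_files_fixed(files):
    all_results = []
    for f in files:
        data = []                  # fresh list for each file
        data.append({'file': f})
        all_results.extend(data)
    return all_results

print(len(process_files_buggy(['a', 'b', 'c'])))  # 6 rows: 1 + 2 + 3
print(len(process_files_fixed(['a', 'b', 'c'])))  # 3 rows, one per file
```

In the question's code the equivalent fix would be clearing `data` (or rebinding it to a fresh list) right after the `all_results.append(...)` call.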
[Comments]:
-
Do the files in changes contain the data from the old_files files? -
I actually defined old_files as a hack to watch for any new files appearing in the directory. The contents of the FTP directory are stored again and again in the new_files (Python list) variable, which is then checked against the contents of old_files (Python list); any elements that do not match are passed down to the for loop. Once the for loop finishes, new_files becomes equal to old_files. I forgot to add that to the code; editing it in now.
-
values = [shot, chan, poa, pot, bp] - does each dictionary contain several data points? Is there a way to distinguish unique data points - how do you know whether something is a duplicate? Will each unique data point have a unique (shot, chan) value? You need to filter, before or after saving, but you need to know how to identify the unique points. -
Yes, each (shot, chan) value is unique.
-
I tried using
all_results = all_results.drop_duplicates(), but that is not worth it, because the values in the loop always start over from the first value of the changes (Python list) variable, which makes the loop take a very long time.
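Since each (shot, chan) pair is unique, deduplication can be restricted to the key columns instead of whole rows. A small sketch, assuming (per the zip of cols with values in the question) that shot ends up in the 'Channel' column and chan in 'Voltage'; the row values here are made up for illustration:

```python
import pandas as pd

# Hypothetical rows: (shot, chan) is the unique key; given how the question
# zips values with cols, shot lands in 'Channel' and chan in 'Voltage'.
rows = [
    {'Channel': 'S001', 'Voltage': 1, 'Amplitude': 0.5},
    {'Channel': 'S001', 'Voltage': 2, 'Amplitude': 0.7},
    {'Channel': 'S001', 'Voltage': 1, 'Amplitude': 0.5},  # re-appended duplicate
]
df = pd.DataFrame(rows)
# Drop rows whose key columns repeat, keeping the first occurrence
deduped = df.drop_duplicates(subset=['Channel', 'Voltage'], keep='first')
print(len(deduped))  # 2
```

This only masks the symptom, though: the cheaper fix is still to stop re-appending the same list, so the dataframe never holds duplicates in the first place.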
Tags: python python-3.x pandas loops ftp