【Question title】: How to get the nearest match in a csv file with Python
【Posted】: 2022-01-18 18:16:03
【Question】:

I want to find the closest match to a given row in a large .csv file in Python. My (shortened) .csv file is:

0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0

I have started writing a program, but it is not finished and I do not know how to complete it. Do I need to use a different program?

with open("<dir>", "r") as file:
    lines = file.readlines()
len_ = len(lines)

# The string whose nearest match I want to find in the .csv data.
string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"

list_ = []

# Starts at index 1, so the first line is skipped; item[2:] drops the leading "0," or "1,".
for i in range(1, len_):
    item = str(lines[i])
    item2 = item[2:]
    list_.append(item2)

for item in list_:
    ...  # unfinished: the comparison against `string` would go here

Algorithm: scan each row from left to right and find the row that has the most consecutive matches with the search data.
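One possible reading of the algorithm described above is "count how many fields match from the left before the first mismatch, and keep the row with the highest count". The sketch below assumes that reading; the helper names `leading_matches` and `best_prefix_match` are made up for illustration, and ties go to the first row seen. It uses three rows copied from the sample instead of the real file.

```python
import csv
import io


def leading_matches(row, target):
    """Count how many fields match from the left before the first mismatch."""
    count = 0
    for a, b in zip(row, target):
        if a != b:
            break
        count += 1
    return count


def best_prefix_match(lines, target_fields):
    """Return (label, score, fields) for the row whose fields after the
    first column share the longest run of leading matches with the target."""
    best = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        label, fields = row[0], row[1:]
        score = leading_matches(fields, target_fields)
        if best is None or score > best[1]:
            best = (label, score, fields)
    return best


# Three rows copied from the question's sample; a real run would pass
# an open file object instead of this in-memory buffer.
sample = io.StringIO(
    "0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0\n"
    "0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0\n"
    "0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0\n"
)
target = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0".split(",")

label, score, fields = best_prefix_match(sample, target)
print(label, score)  # prints: 0 4
```

Here the third row wins because its first four fields (4, 5, 0, 52) match the search data before the first mismatch. Fields are compared as strings, so "52" and "052" would count as different.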

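【Question comments】:

  • Closest match in what sense? Since this is a csv, do you want to match values across multiple columns, or match the rows as strings? What is the expected result here?
  • The expected value should be the first one in the csv file, in this case 0 or 1.
  • The output should be the first value. I would like to get the closest match across all the other values. @JCaesar
  • You need to define clearly what "nearest match" means.

Tags: python csv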


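【Solution 1】:

You appear to be dealing with a machine learning problem: you have a dataset and a point, and you want that point's nearest neighbor. I assume you want the point in the dataset with the smallest Euclidean distance (in 19 dimensions) to the given point.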
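To make the Euclidean-distance assumption concrete, here is a minimal brute-force sketch using only the standard library (`math.dist`, Python 3.8+). It is not the method used in this answer, just an illustration of the same distance; `nearest_row` is a hypothetical helper, and only three rows from the question's sample are included.

```python
import math

# Three rows copied from the question's sample; the first value is the label.
data = """\
0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
"""

# The 19 values to search for (the question's "string").
query = [4, 5, 0, 52, 32345, 0, 64, 6, 15005, 37221, 22,
         3598991799, 2019801315, 8, 0, 26, 691, 17176, 0]


def nearest_row(text, query):
    """Return (row_index, label, distance) of the row closest to query."""
    best = None
    for i, line in enumerate(text.splitlines()):
        values = [int(v) for v in line.split(",")]
        label, features = values[0], values[1:]
        d = math.dist(features, query)  # Euclidean distance over 19 values
        if best is None or d < best[2]:
            best = (i, label, d)
    return best


index, label, distance = nearest_row(data, query)
print(index, label)  # prints: 2 0
```

Within this three-row subset the third row is closest, mainly because its two very large columns are nearest to those of the query.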

I would use the pandas and scikit-learn packages with the NearestNeighbors algorithm. Import the packages:

from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

Load file.csv as a pandas DataFrame (with generic column names):

df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))

Since you want the first-column value as the result, I move it into a pandas Series named "first_column" and drop it from the "df" DataFrame:

first_column = df[0]
df.drop(columns=[0], inplace=True)

What you call "string" I call "y", and I define it as a NumPy array:

y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])

Now fit the NearestNeighbors model:

nnb = NearestNeighbors(n_neighbors=1).fit(df)

Now find which point in the dataset is closest to the given point y:

distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]

So the nearest point has index 13 in the DataFrame. Print the entry of first_column at index 13:

print(first_column.loc[13])
0
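One caveat about the Euclidean-distance assumption: on raw values, the distance is dominated by the columns with the largest magnitudes (in the sample, the two columns with values around 10^9), so small-valued columns barely affect which row is nearest. Whether that is desirable depends on what "nearest" should mean, which the question leaves open. The sketch below uses made-up toy data (not taken from the question's file) and a hypothetical helper `min_max_scale` to show how rescaling each column to a common range can change the result.

```python
import math


def min_max_scale(rows, query):
    """Scale each column to [0, 1] using the min and max seen in rows.

    Constant columns carry no information and are mapped to 0.
    Query values outside the observed range end up outside [0, 1].
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]

    def scale(vec):
        return [(v - lo) / sp if sp else 0.0
                for v, lo, sp in zip(vec, lows, spans)]

    return [scale(r) for r in rows], scale(query)


# Made-up toy data: one small-valued column and one huge-valued column.
rows = [
    [10, 1_000_000_000],  # small column matches the query exactly
    [90, 1_400_000_000],  # huge column is closest to the query
    [50, 3_000_000_000],
]
query = [10, 1_500_000_000]

# Nearest row on raw values: the huge column decides.
raw_nearest = min(range(len(rows)), key=lambda i: math.dist(rows[i], query))

# Nearest row after scaling: both columns contribute comparably.
scaled_rows, scaled_query = min_max_scale(rows, query)
scaled_nearest = min(range(len(rows)),
                     key=lambda i: math.dist(scaled_rows[i], scaled_query))

print(raw_nearest, scaled_nearest)  # prints: 1 0
```

If every column should count roughly equally, a scaling step like this could be applied before fitting; if the large columns really are the most important ones, the unscaled distance may be what you want.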

【Comments】:

  • Thanks a lot, this helped me very much! I appreciate it!