How to get rid of NaN values in a csv file? Python

【Question title】: How to get rid of NaN values in csv file? Python
【Posted】: 2020-02-26 17:52:58
【Question description】:

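First of all, I know there are existing answers on this topic, but none of them has worked for me so far. In any case, I would like to hear your answers, even though I have already tried that kind of solution.

I have a csv file named mbti_datasets.csv. The first column is labeled type and the second column is labeled description. Each row represents a new personality type (with its respective type and description).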

TYPE        | DESCRIPTION
 a          | This personality likes to eat apples...\nThey look like monkeys...\nIn fact, are strong people...
 b          | b.description
 c          | c.description
 d          | d.description
...16 types | ...

In the code below, I try to duplicate each personality type once for every \n found in its description.

Code:

import pandas as pd

# Reading the file: path_root is the folder, root_fn is the full input path
path_root = 'gdrive/My Drive/Colab Notebooks/MBTI/'
root_fn = path_root + 'mbti_datasets.csv'
df = pd.read_csv(root_fn, sep = ',', quotechar = '"', usecols = [0, 1])

# split the column where there are new lines and turn it into a series
serie = df['description'].str.split('\n').apply(pd.Series).stack()

# remove the second index for the DataFrame and the series to share indexes
serie.index = serie.index.droplevel(1)

# give it a name to join it to the DataFrame
serie.name = 'description'

# remove original column
del df['description']

# join the series with the DataFrame, based on the shared index
df = df.join(serie)

# New file name (in the same folder) and writing the new csv file
root_new_fn = path_root + 'mbti_new.csv'

df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)

print(new_df)

Expected output:

TYPE | DESCRIPTION
 a   | This personality likes to eat apples... 
 a   | They look like monkeys...
 a   | In fact, are strong people...
 b   | b.description
 b   | b.description
 c   | c.description
...  | ...

Current output:

TYPE | DESCRIPTION
 a   | This personality likes to eat apples...
 a   | They look like monkeys...NaN
 a   | NaN
 a   | In fact, are strong people...NaN
 b   | b.description...NaN
 b   | NaN
 b   | b.description
 c   | c.description
...  | ...

I'm not 100% sure, but I think the NaN values are \r characters.

Files uploaded to GitHub as requested: CSV FILES

Using @YOLO's solution: CSV YOLO FILE. Examples of where it fails:

2 INTJ  Existe soledad en la cima y-- siendo # adds -- in blank random blank spaces
3 INTJ  -- y las mujeres # adds -- in the beginning
3 INTJ  (...) el 0--8-- de la poblaci # doesn't finish the word 'población'
10 INTJ icos-- un conflicto que parecer--a imposible. # starts letters randomly
12 INTJ c #adds just 1 letter

Translation, for full understanding:

2 INTJ There is loneliness at the top and-- being # adds -- in blank spaces
3 INTJ -- and women # adds -- at the beginning
3 INTJ (...) on 0--8-- of the popula-- # doesn't finish the word 'population'
10 INTJ icos-- a conflict that seems--to impossible. # starts letters randomly
12 INTJ c #adds just 1 letter

When I print the type of one of the values and whether there are any NaN values:

print(type(new_df.iloc[7, 1]))
print(new_df['descripcion'].isnull())

<class 'float'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10    False
11     True
continue...

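【Question comments】:

How about removing the \r first with .replace('\r','')?
@MatthewSon I have already tried that and, as I said before, I'm not 100% sure whether this NaN value is a \r.
Then please provide a Minimal, Reproducible Example. Otherwise all we can do is guess where the NaN values come from. If we actually had the file, a sample set, or something similar, it would be much easier to help.
@LeoE I have just uploaded the files to GitHub and shared the link in the description.

Tags: python pandas csv dataframe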

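【Solution 1】:

Here is one way to do it. I had to find a workaround to replace the \n characters, because for some reason replacing them directly did not work: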

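# regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
df['DESCRIPTION'] = df['DESCRIPTION'].str.replace(r'[^a-zA-Z0-9\s.]', '--', regex=True).str.split('--n')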

df = df.explode('DESCRIPTION')

print(df)

           TYPE                               DESCRIPTION
0   a             This personality likes to eat apples...
0   a                           They look like monkeys...
0   a                      In fact-- are strong people...
1   b                                       b.description
2   c                                       c.description
3   d                                       d.description

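【Comments】:

It does get rid of the NaN values, but it breaks the syntax of the sentences, leaving words or sentences unfinished. For example: --This p--rsonalit is .... I think it is because of the accented letters (áéíóú); the descriptions are in Spanish and English.
Also, I don't fully understand how the string '[^a-zA-Z0-9\s.]' works; understanding that part might lead me to an exact solution.
The result of your code: github.com/GUNTERMAXIMUS/mbti/blob/master/mbti_new%20(1).csv
Could you update the question with more samples of where it fails? [^a-zA-Z0-9\s.] matches every character that is not an ASCII letter, a digit, whitespace, or a dot, so the code replaces each of those characters with '--'.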
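To make the behaviour of that character class concrete, here is a minimal sketch using the standard `re` module. The second sample sentence is a hypothetical reconstruction based on the failing output quoted in the question, not a line taken from the real file. It suggests why accented words get cut: `ó` is not in `a-zA-Z`, so it becomes `--`, and the following `n` then forms the `--n` token that the answer splits on.

```python
import re

# The same character class used in the answer above.
pattern = r'[^a-zA-Z0-9\s.]'

# Hypothetical sample sentences (assumptions for illustration only).
samples = [
    'In fact, are strong people...',
    'el 0,8% de la población',
]

# Replace every character outside the class with '--'.
replaced = [re.sub(pattern, '--', text) for text in samples]
for text in replaced:
    print(text)

# Splitting on '--n' now also cuts the word that contained 'ón'.
pieces = replaced[1].split('--n')
print(pieces)
```

Under these assumptions, the comma, the percent sign, and the accented letter are all treated the same way, which matches the stray `--` markers and the truncated word reported in the question.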

【Solution 2】:

The problem comes down to the description cells: some of them contain two consecutive newlines with nothing between them.
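The following sketch illustrates that cause with made-up data (the column names and the sample text are assumptions, not the real file). Splitting on `\n` yields an empty string between two consecutive newlines, and pandas reads an empty CSV field back as NaN by default. An in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Made-up sample: one description with two consecutive newlines.
df = pd.DataFrame({
    'type': ['a'],
    'description': ['First line\n\nSecond line'],
})

# One row per line; the blank line becomes an empty string.
df['description'] = df['description'].str.split('\n')
df = df.explode('description')

# Write to an in-memory CSV and read it back with the defaults.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip['description'].isnull().tolist())

# keep_default_na=False keeps the empty field as an empty string instead.
kept = pd.read_csv(io.StringIO(buffer.getvalue()), keep_default_na=False)
print(kept['description'].tolist())
```

So the NaN only appears after the file is read again; before writing, the value is an empty string rather than a missing value.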

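I simply read the newly created csv with .dropna() and rewrote it without the NaN values. I don't think repeating the write/read cycle is the best approach, but it is a straightforward solution.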

df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn).dropna()

new_df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)

print(type(new_df.iloc[7, 1]))  # this position used to hold a NaN value
print(new_df['descripcion'].isnull())

<class 'str'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
and continues...
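As a possible alternative to writing the file twice, the empty pieces could be dropped before the csv is written at all. This is only a sketch under assumptions: `split_description_rows` is a hypothetical helper name, the sample data is made up, and it assumes the cells contain real newline characters (optionally preceded by `\r`).

```python
import pandas as pd


def split_description_rows(df, column='description'):
    """Return one row per non-empty line of `column` (hypothetical helper)."""
    out = df.copy()
    # Remove carriage returns, then split each cell into a list of lines.
    out[column] = out[column].str.replace('\r', '', regex=False).str.split('\n')
    # One row per list element; the other columns are repeated.
    out = out.explode(column).reset_index(drop=True)
    out[column] = out[column].str.strip()
    # Keep only rows whose text is neither missing nor empty.
    keep = out[column].notna() & (out[column] != '')
    return out[keep].reset_index(drop=True)


# Made-up sample with a blank line and Windows-style line endings.
sample = pd.DataFrame({
    'type': ['a', 'b'],
    'description': ['one\r\n\r\ntwo\nthree', 'solo'],
})
result = split_description_rows(sample)
print(result)
```

The result can then be written once with `to_csv`, so no NaN values should appear when the file is read back.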
