【Question title】: Double-quoted elements in a CSV can't be read with pandas
【Posted】: 2014-12-23 02:51:25
【Question】:

I have an input file in which every value is stored as a string. It is a CSV file, and each entry is enclosed in double quotes.

Example file:

"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

There are only six columns. Which options do I need to pass to pandas read_csv to read this correctly?

I am currently trying:

import pandas as pd
df = pd.read_csv(file, quotechar='"')

But this gives me the error message: CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14

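This evidently means that it is ignoring the '"' and treating every comma as a field separator. For line 3, however, columns 3 through 6 should be strings that contain commas ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").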
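
The count of 14 can be reproduced with Python's standard csv module. This is only an illustration of the likely cause, on the assumption that the pandas tokenizer treats quotes in a similar way: a quote is honoured only when it is the first character of a field, so the space after each comma turns the following quote into ordinary data. The variable names are made up for this sketch.

```python
import csv

# Line 3 of the example file, copied from the question.
line = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"'

# Default dialect: the space after each comma starts the field, so the
# quote that follows is kept as data and the inner commas split the field.
default_fields = next(csv.reader([line]))

# skipinitialspace=True drops whitespace right after the delimiter, so the
# quote is again the first character of the field and quoting works.
trimmed_fields = next(csv.reader([line], skipinitialspace=True))

print(len(default_fields))
print(trimmed_fields)
```

With the default settings the line splits into 14 fields, matching the error message; with skipinitialspace=True it splits into the expected 6.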

How can I get pandas.read_csv to parse this correctly?

Thanks.

【Comments on the question】:

Tags: python csv pandas


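【Solution 1】:

This will work. It falls back to the python parser (because you have non-regular separators: they are commas, sometimes followed by a space). If you only had commas, it would use the c-parser and be much faster.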

In [1]: import csv; import pandas as pd

In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

In [3]: pd.read_csv('test.csv', sep=r',\s+', quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[3]: 
     "column1","column2" "column3"   "column4"   "column5"   "column6"
"AM"                "07"       "1"        "SD"        "SD"        "CR"
"AM"                "08"   "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                "01"       "2"        "SD"        "SD"        "SD"

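Note that in the output above the quote characters are still part of the values, and the first two header names end up merged because there is no space between them in the file. An alternative that is not part of this answer, sketched here as a suggestion, is to keep the plain comma separator and pass skipinitialspace=True, which is a documented read_csv option. The data is inlined through io.StringIO only to keep the sketch self-contained, and dtype=str is an assumption based on the question's statement that every value is a string.

```python
import io

import pandas as pd

# Sample data copied from the question (most commas are followed by a space).
data = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True skips the space after each comma, so every quote is
# seen at the start of its field and the default quote handling applies.
# dtype=str keeps values such as "07" as strings instead of converting to int.
df = pd.read_csv(io.StringIO(data), skipinitialspace=True, dtype=str)

print(df)
```

Because the separator stays a single character, this should not require the regex fallback to the python engine.
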
【Comments】:

  • It didn't work for me. My huge CSV, which is time-consuming to run through sed, contains rows like 4366201,"Erud","Facebook,Ado-Ekiti","2018-03-22 10:38:42","UR",0,0,\N ,\N,\N,\N,\N,\N and gives ParserError: ' ' expected after '"'. I even tried pd.read_csv("users.csv", sep=",", delimiter="\n", quoting=csv.QUOTE_ALL, engine="python", quotechar='"', encoding="utf-8")
  • What finally worked for me was pd.read_csv("users.csv", sep=",", encoding="utf-8", names=["id", "name"...])
  • Note: sep=',\s*' combined with quotechar='"', quoting=csv.QUOTE_ALL seems to break. On reading, it seems this would be equivalent. However, that is not what I found. Leaving this here for others.
  • This only works with the python engine. When you need low_memory=True, the solution will not work.
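
For rows like the one quoted in the first comment, where no spaces follow the commas, a plain comma separator with the default quote handling appears to be enough, which is consistent with what the commenter reports finally worked. The sketch below makes two assumptions that are not in the thread: the file has no header row (hence header=None), and \N is a null marker that should be read as missing (hence na_values).

```python
import io

import pandas as pd

# Row copied from the comment above. A raw string keeps the backslashes literal.
row = r'4366201,"Erud","Facebook,Ado-Ekiti","2018-03-22 10:38:42","UR",0,0,\N ,\N,\N,\N,\N,\N'

# Default separator and quoting; the comma inside "Facebook,Ado-Ekiti" is kept
# because the quote is the first character of its field.
# Assumption: \N marks a missing value. The first \N in the row is followed by
# a space, so it may not match the bare marker and is not relied on here.
df = pd.read_csv(io.StringIO(row), header=None, na_values=[r"\N"])

print(df.shape)
print(df.iloc[0, 2])
```

This uses the default C engine, so it is not affected by the python-engine limitation mentioned in the last comment.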