【Question title】: Reading a text file in Python with variable spacing
【Posted】: 2017-10-04 07:26:27
【Question】:

I want to load the following data, stored as a text file, into Python:

      pclass  survived                                               name  
0          1         1                      Allen, Miss. Elisabeth Walton   
1          1         1                     Allison, Master. Hudson Trevor   
2          1         0                       Allison, Miss. Helen Loraine   
3          1         0               Allison, Mr. Hudson Joshua Creighton   
4          1         0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
5          1         1                                Anderson, Mr. Harry   
6          1         1                  Andrews, Miss. Kornelia Theodosia   
7          1         0                             Andrews, Mr. Thomas Jr   
8          1         1      Appleton, Mrs. Edward Dale (Charlotte Lamson)   
9          1         0                            Artagaveytia, Mr. Ramon   
10         1         0                             Astor, Col. John Jacob   

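I am having trouble parsing it because the amount of whitespace between columns is not constant, and because the last field (name) itself contains spaces. I tried the following: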

pd.read_csv("test.csv",sep = "\s+", header=0, index_col=0)

But it raises an error:

CParserError: Error tokenizing data. C error: Expected 7 fields in line 5, saw 8

【Question comments】:

    Tags: python python-3.x csv pandas


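    【Solution 1】:

    '\s+' matches one or more whitespace characters, so it also splits the last column (name) at every single space inside it. Use a regular expression that requires two or more whitespace characters instead. Regex separators other than '\s+' are only supported by the Python parsing engine, hence engine='python'.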

    pd.read_csv("test.csv", sep=r"\s{2,}", header=0, index_col=0, engine='python')
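
To see why the separator matters, the following sketch splits two rows from the question with both patterns using only the standard `re` module. It illustrates the regex behaviour and is not how pandas tokenizes internally; the helper name `split_fields` and the choice to strip each line first are assumptions made for this example.

```python
import re

# Two sample rows from the question; columns are separated by runs of
# two or more spaces, while the name contains single spaces.
lines = [
    "0          1         1                      Allen, Miss. Elisabeth Walton   ",
    "3          1         0               Allison, Mr. Hudson Joshua Creighton   ",
]


def split_fields(line, pattern):
    """Split one line on the given whitespace regex (hypothetical helper).

    The line is stripped first so that leading or trailing whitespace
    does not produce empty fields.
    """
    return re.split(pattern, line.strip())


# One-or-more whitespace: every word of the name becomes its own field.
loose = [split_fields(line, r"\s+") for line in lines]

# Two-or-more whitespace: the name stays in a single field.
strict = [split_fields(line, r"\s{2,}") for line in lines]

print([len(fields) for fields in loose])   # [7, 8]
print([len(fields) for fields in strict])  # [4, 4]
print(strict[1][-1])                       # Allison, Mr. Hudson Joshua Creighton
```

The field counts 7 and 8 under `\s+` match the numbers in the error message from the question, while `\s{2,}` yields the same four fields for every row.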
    

    Complete working example

    from io import StringIO
    import pandas as pd
    
    txt = """     pclass  survived                                               name  
    0          1         1                      Allen, Miss. Elisabeth Walton   
    1          1         1                     Allison, Master. Hudson Trevor   
    2          1         0                       Allison, Miss. Helen Loraine   
    3          1         0               Allison, Mr. Hudson Joshua Creighton   
    4          1         0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
    5          1         1                                Anderson, Mr. Harry   
    6          1         1                  Andrews, Miss. Kornelia Theodosia   
    7          1         0                             Andrews, Mr. Thomas Jr   
    8          1         1      Appleton, Mrs. Edward Dale (Charlotte Lamson)   
    9          1         0                            Artagaveytia, Mr. Ramon   
    10         1         0                             Astor, Col. John Jacob   
    """
    
    pd.read_csv(StringIO(txt), sep=r"\s{2,}", header=0, index_col=0, engine='python')
    
        pclass  survived                                             name
    0        1         1                    Allen, Miss. Elisabeth Walton
    1        1         1                   Allison, Master. Hudson Trevor
    2        1         0                     Allison, Miss. Helen Loraine
    3        1         0             Allison, Mr. Hudson Joshua Creighton
    4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
    5        1         1                              Anderson, Mr. Harry
    6        1         1                Andrews, Miss. Kornelia Theodosia
    7        1         0                           Andrews, Mr. Thomas Jr
    8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
    9        1         0                          Artagaveytia, Mr. Ramon
    10       1         0                           Astor, Col. John Jacob
    

    【Comments】:

      【Solution 2】:

      You can use pandas.read_fwf (fixed-width format) for this:

      Code:

      df = pd.read_fwf(StringIO(data), header=1, index_col=0)
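
To make the idea of fixed-width parsing concrete, here is a minimal sketch that slices each line at fixed character offsets using plain Python. It is not how `read_fwf` is implemented; the column widths in `WIDTHS` and the helpers `format_row` and `parse_fixed_width` are hypothetical and chosen only for this sample, which is generated with matching format specs so the offsets are known.

```python
# Hypothetical column widths for this sample: index, pclass, survived, name.
WIDTHS = [4, 8, 10, 50]


def format_row(idx, pclass, survived, name):
    """Render one row using the fixed widths above (4 + 8 + 10 + 50 chars)."""
    return f"{idx:<4}{pclass:>8}{survived:>10}{name:>50}"


def parse_fixed_width(line, widths):
    """Cut a line into fields by position and strip the padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


rows = [
    ("0", "1", "1", "Allen, Miss. Elisabeth Walton"),
    ("4", "1", "0", "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"),
]

lines = [format_row(*row) for row in rows]
parsed = [parse_fixed_width(line, WIDTHS) for line in lines]

for fields in parsed:
    print(fields)
```

Because fields are cut by position rather than by separator, spaces inside the name cause no trouble. `read_fwf` infers the column boundaries by default; if inference goes wrong on other data, its `colspecs` or `widths` parameters can be used to state them explicitly.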
      

      Test code:

      from io import StringIO
      import pandas as pd
      
      data = u"""
            pclass  survived                                               name
      0          1         1                      Allen, Miss. Elisabeth Walton
      1          1         1                     Allison, Master. Hudson Trevor
      2          1         0                       Allison, Miss. Helen Loraine
      3          1         0               Allison, Mr. Hudson Joshua Creighton
      4          1         0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
      5          1         1                                Anderson, Mr. Harry
      6          1         1                  Andrews, Miss. Kornelia Theodosia
      7          1         0                             Andrews, Mr. Thomas Jr
      8          1         1      Appleton, Mrs. Edward Dale (Charlotte Lamson)
      9          1         0                            Artagaveytia, Mr. Ramon
      10         1         0                             Astor, Col. John Jacob"""
      
      df = pd.read_fwf(StringIO(data), header=1, index_col=0)
      print(df)
      

      Result:

          pclass  survived                                             name
      0        1         1                    Allen, Miss. Elisabeth Walton
      1        1         1                   Allison, Master. Hudson Trevor
      2        1         0                     Allison, Miss. Helen Loraine
      3        1         0             Allison, Mr. Hudson Joshua Creighton
      4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
      5        1         1                              Anderson, Mr. Harry
      6        1         1                Andrews, Miss. Kornelia Theodosia
      7        1         0                           Andrews, Mr. Thomas Jr
      8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
      9        1         0                          Artagaveytia, Mr. Ramon
      10       1         0                           Astor, Col. John Jacob
      

      【Comments】:
