【Question title】: string to pandas dataframe
【Posted】: 2017-02-24 00:23:45
【Question】:

After parsing a large PDF document, I ended up with a string in Python that looks like this:

Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May  2013 Resolution based 1;0Shares
May  2013 Resolution based 2;0Shares

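Is it possible to convert this into a pandas DataFrame like the one below, where the columns are delimited by ";"? Going by the excerpt of the string above, my df should look like this: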

Company Name    (Code) at End of Month    Reason for Alteration  ....
Value,etc       after Alteration          Remarks Shares .....

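A further problem: my rows do not always contain the same number of ";"-separated strings, so I need a way to inspect my columns (I don't mind setting up a DataFrame with, say, 15 columns and then dropping the ones I don't need). Thanks.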

【Comments】:

  • I'm voting this down! I can't work out what logic takes us from the text to the DataFrame you show.

Tags: string pandas delimiter


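【Solution 1】:

I would split the string you read in into a list of lists. You could use a regular expression to find the start of each record (or at least something whose position you know; the code field under "(Code) at End of Month" looks like it might work) and slice your way through. Something like this: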

import re
import pandas as pd

# `thestring` is assumed to hold the text parsed from the PDF
# (the string shown in the question).

# Start your list of lists off with your expected headers
mystringlist = [["Company Name",
                 "(Code) at End of Month",
                 "Reason for Alteration",
                 "Value,etc",
                 "after Alteration",
                 "Remarks Shares"]]

# This will be used to store the start index of each record
indexlist = []

# Find the location of every "1" in a string of 1s and 0s.
# A loop is used instead of recursion so that a large document
# cannot hit Python's recursion limit.
def find_start(flags, startloc=0):
    foundindex = flags.find("1", startloc)
    while foundindex != -1:
        indexlist.append(foundindex)
        foundindex = flags.find("1", foundindex + 1)

# Split on your delimiter and on line breaks, since every line starts a new row
mystring = re.split(r"[;\n]", thestring)

# Use a list comprehension to make your string of 1s and 0s,
# marking every field that starts with a code such as "(9743)"
stringloc = "".join(["1" if re.match(r"\(\d+\)", x) else "0" for x in mystring])

find_start(stringloc)

# Make your list of lists based on the found indexes.
# We subtract 1 from the index position because we want the element
# that immediately precedes the element we find (the company name);
# it's an easier regex to make when it's a consistent structure.
for pos, x in enumerate(indexlist):
    start = max(x - 1, 0)
    # A record ends where the next one starts; the last one runs to the end
    end = indexlist[pos + 1] - 1 if pos + 1 < len(indexlist) else len(mystring)
    mystringlist.append(mystring[start:end])

# Turn mystringlist into a data frame (shorter rows are padded with missing values)
mydf = pd.DataFrame(mystringlist)
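
If one row per line of text is good enough, the same list-of-lists idea can be sketched without the regular-expression scan. This is only a minimal illustration, not the answer author's code: the variable `text` is a shortened, made-up excerpt of the string from the question, and the approach assumes every line of the parsed text should become its own row.

```python
import pandas as pd

# Shortened sample of the parsed text (an assumption for illustration only)
text = """Company Name;(Code) at End of Month;Remarks
TANSEISHA CO.,LTD.;(9743)48,424,071;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0"""

# One inner list per non-empty line, one element per ";"-separated field
rows = [line.split(";") for line in text.splitlines() if line.strip()]

# Rows may have different lengths; pandas pads the shorter ones
# with missing values when it builds the frame
df = pd.DataFrame(rows)
print(df)
```

The header line ends up as the first data row here, so it would still need to be promoted to column names or dropped afterwards.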

【Comments】:

    【Solution 2】:

    This is a good opportunity to use StringIO to make your result look like an open file handle, so that you can use pd.read_csv.

    In [1]: import pandas as pd
    
    In [2]: from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
    
    In [3]: s = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
       ...: Shares;Shares
       ...: TANSEISHA CO.,LTD.;(9743)48,424,071;0
       ...: MEITEC CORPORATION;(9744)31,300,000;0
       ...: TKC Corporation;(9746)26,731,033;0
       ...: ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May  2013 Resolution based 1;0Shares
       ...: May  2013 Resolution based 2;0Shares"""
    
    In [4]: pd.read_csv(StringIO(s), sep=";")
    Out [4]:                    Company Name (Code) at End of Month Reason for Alteration No. of Shares                  Bond Symbol, etc.   Value, etc.              after Alteration  Remarks
    0                        Shares                 Shares                                 NaN                                NaN           NaN                           NaN      NaN
    1            TANSEISHA CO.,LTD.       (9743)48,424,071                                   0                                NaN           NaN                           NaN      NaN
    2            MEITEC CORPORATION       (9744)31,300,000                                   0                                NaN           NaN                           NaN      NaN
    3               TKC Corporation       (9746)26,731,033                                   0                                NaN           NaN                           NaN      NaN
    4                ASATSU-DK INC.                 (9747)                          42,155,400  Exercise of Subscription Warrants           0.0  May  2013 Resolution based 1  0Shares
    5  May  2013 Resolution based 2                0Shares                                 NaN                                NaN           NaN                           NaN      NaN
    

    Note that there do appear to be some obvious data-cleaning issues to sort out from here, but this should at least give you a start.
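
The question also mentions rows with varying numbers of fields and suggests a fixed-width frame of about 15 columns. One way to sketch that with `pd.read_csv` is to pass more column names than any row can have and then drop the columns that stay empty. The value `MAX_COLS = 15` is taken from the question's suggestion and is an assumption, as is treating the header line as ordinary data (`header=None`), so that a later row with more fields than the first line does not cause a parsing error.

```python
from io import StringIO

import pandas as pd

s = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May  2013 Resolution based 1;0Shares
May  2013 Resolution based 2;0Shares"""

# Upper bound on the number of fields per row (assumed, per the question)
MAX_COLS = 15

# Supplying MAX_COLS names makes every row fit; missing fields become NaN.
# dtype=str keeps values such as "48,424,071" as text for later cleaning.
df = pd.read_csv(
    StringIO(s),
    sep=";",
    header=None,
    names=list(range(MAX_COLS)),
    dtype=str,
)

# Drop the columns that no row ever filled
df = df.dropna(axis=1, how="all")
print(df.shape)
```

Columns are numbered rather than named in this sketch; once the layout is understood, they can be renamed with `df.columns = [...]`.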

    【Comments】:
