String to pandas DataFrame: answers

【Question title】: string to pandas dataframe
【Posted】: 2017-02-24 00:23:45
【Question description】:

After parsing a large PDF document, I ended up with a string in Python in the following format:

Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May  2013 Resolution based 1;0Shares
May  2013 Resolution based 2;0Shares

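Is it possible to convert this into a pandas DataFrame like the one below, where the columns are delimited by ";"? Going by the sample of the string above, my df should look like this: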

Company Name    (Code) at End of Month    Reason for Alteration  ....
Value,etc       after Alteration          Remarks Shares .....

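As a further problem, my rows do not always contain the same number of ";"-separated fields, so I need a way to work out my columns. (I don't mind setting up a DataFrame with, say, 15 columns and then dropping the ones I don't need.) Thanks.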

【Question comments】:

Downvoted! I can't figure out what logic takes us from the text to the DataFrame you show.

Tags: string pandas delimiter

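【Solution 1】:

I would split the string you read in into a list of lists. You could use a regular expression to find the start of each record (or at least something whose position you know; the "(Code) at End of Month" field looks like it might work) and slice your way through. Something like this: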

import re
import pandas as pd

# Sample of the text parsed from the PDF (taken from the question)
thestring = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May  2013 Resolution based 1;0Shares
May  2013 Resolution based 2;0Shares"""

# Start your list of lists off with your expected headers
mystringlist = [["Company Name",
                 "(Code) at End of Month",
                 "Reason for Alteration",
                 "Value,etc",
                 "after Alteration",
                 "Remarks Shares"]]

# Treat line breaks as delimiters too, then split on your delimiter
mystring = thestring.replace("\n", ";").split(";")

# Indexes of the fields that start with a code such as "(9743)".
# The header field "(Code) at End of Month" does not match, because
# the pattern requires digits inside the parentheses.
indexlist = [i for i, field in enumerate(mystring)
             if re.match(r"\(\d+\)", field)]

# Make your list of lists based on the found indexes.
# We subtract 1 from the index position because we want the element
# that immediately precedes the element we find (the company name).
# Each record runs up to the element before the next code; the last
# record runs to the end of the input.
for pos, x in enumerate(indexlist):
    end = indexlist[pos + 1] - 1 if pos + 1 < len(indexlist) else len(mystring)
    mystringlist.append(mystring[x - 1:end])

# Turn mystringlist into a data frame; shorter rows are padded with missing values
mydf = pd.DataFrame(mystringlist)
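
If each record can be assumed to sit on its own line, the same list-of-lists idea can be sketched without a regular expression. The snippet below is only an illustration: `text` is a hypothetical three-line sample, and continuation lines such as "May  2013 Resolution based 2;0Shares" simply become rows of their own and would still need to be merged by hand.

```python
import pandas as pd

# Hypothetical sample: three lines with different numbers of fields
text = """TANSEISHA CO.,LTD.;(9743)48,424,071;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0
May  2013 Resolution based 2;0Shares"""

# One inner list per line, one element per ";"-separated field
rows = [line.split(";") for line in text.splitlines()]

# pandas pads the shorter rows with missing values
df = pd.DataFrame(rows)
print(df.shape)
```

The resulting frame has as many columns as the longest line has fields, so no column count has to be chosen up front.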

【Discussion】:

【Solution 2】:

This is a good opportunity to use StringIO to make your result look like an open file handle, so that you can use pd.read_csv:

In [1]: import pandas as pd

In [2]: from io import StringIO  # on Python 2: from StringIO import StringIO

In [3]: s = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
   ...: Shares;Shares
   ...: TANSEISHA CO.,LTD.;(9743)48,424,071;0
   ...: MEITEC CORPORATION;(9744)31,300,000;0
   ...: TKC Corporation;(9746)26,731,033;0
   ...: ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May  2013 Resolution based 1;0Shares
   ...: May  2013 Resolution based 2;0Shares"""

In [4]: pd.read_csv(StringIO(s), sep=";")
Out[4]:
                   Company Name (Code) at End of Month Reason for Alteration No. of Shares                  Bond Symbol, etc.   Value, etc.              after Alteration  Remarks
0                        Shares                 Shares                                 NaN                                NaN           NaN                           NaN      NaN
1            TANSEISHA CO.,LTD.       (9743)48,424,071                                   0                                NaN           NaN                           NaN      NaN
2            MEITEC CORPORATION       (9744)31,300,000                                   0                                NaN           NaN                           NaN      NaN
3               TKC Corporation       (9746)26,731,033                                   0                                NaN           NaN                           NaN      NaN
4                ASATSU-DK INC.                 (9747)                          42,155,400  Exercise of Subscription Warrants           0.0  May  2013 Resolution based 1  0Shares
5  May  2013 Resolution based 2                0Shares                                 NaN                                NaN           NaN                           NaN      NaN

Note that there are clearly some data-cleaning issues still to be resolved from here, but this should at least give you a start.
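
As one possible next step, the sketch below covers two things raised in this thread: rows with differing numbers of fields, and the fused "(9743)48,424,071" values. It assumes the header lines have already been removed, uses the 15-column upper bound suggested in the question, and introduces the column names `code` and `shares` purely for illustration.

```python
from io import StringIO
import pandas as pd

# Sample data rows from the question, header lines left out
s = """TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0"""

# Supplying more column names than any row has fields lets read_csv
# accept rows of different lengths; 15 is the figure from the question
df = pd.read_csv(StringIO(s), sep=";", header=None,
                 names=list(range(15)), dtype=str)

# Drop the columns that ended up entirely empty
df = df.dropna(axis=1, how="all")

# Split values such as "(9743)48,424,071" into a code and a share count
parts = df[1].str.extract(r"\((\d+)\)([\d,]*)")
df["code"] = parts[0]
df["shares"] = parts[1].str.replace(",", "")
```

For the ASATSU-DK row the share count was already split into the next column by the delimiter, so `shares` is an empty string there and the value would have to be taken from column 2 instead.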

【Discussion】: