sub string python pandas

【Question title】: sub string python pandas
【Posted】: 2014-04-05 03:59:36
【Question】:

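I have a pandas DataFrame with a string column. The frame has more than 2 million rows, so looping over it to extract the elements I need is a poor choice. My current code looks like this: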

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

where "series_id" is a string that packs several information fields together. Here is a sample of the data:

Columns:

 [series_id, year, month, value, footnotes]

Data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

series_id is the column of interest, and it is the one I am struggling with. I have looked at the str.FUNCTION methods for Python, and specifically the ones in pandas.

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

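There is a section describing each of the string functions; in particular, get and slice are the ones I would like to use. Ideally, I could envision a solution like this: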

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

When I try any of these calls, I get an invalid syntax error at the ":".

I can't seem to figure out the proper way to perform a vectorized operation that takes a substring of a pandas DataFrame column.

Thanks

【Comments】:

I think what you want is table["state_code"] = table["series_id"].str[1:3]
Note: this is a really poor way to iterate over rows; use either iterrows or apply. Using range like this creates a huge Python list (in Python 2); xrange is slightly better.
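A minimal sketch of the vectorized slicing suggested in the comment above. The slice offsets are the ones from the loop in the question; the tiny two-row frame is only an illustration, not the asker's real data. Note that .str.slice takes start and stop as ordinary arguments, which is why writing 1:3 inside the parentheses is a syntax error.

```python
import pandas as pd

# Illustrative frame; only the series_id column from the question is included.
table = pd.DataFrame(
    {"series_id": ["SMS01000000000000001", "SMU78000009092000001"]}
)

# .str[start:stop] applies the slice to every element of the column.
table["state_code"] = table["series_id"].str[2:4]

# .str.slice does the same thing, with start/stop passed as arguments.
table["area_code"] = table["series_id"].str.slice(5, 9)
table["supersector_code"] = table["series_id"].str.slice(start=11, stop=12)

print(table)
```

Both forms work on the whole column at once, so no Python-level loop over the 2 million rows is needed.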

Tags: python string pandas substring

【Solution 1】:

I think I would use str.extract with a regular expression (which you can tweak for your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract('^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

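This reads as: start (^) with any two characters (which are ignored), the next three (any) characters are the state_code, followed by any single character (ignored), then four digits are the area_code, ...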
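As a rough sketch of how the pattern above could be applied to a whole frame, the extracted columns can be joined back onto the original table. The frame below is illustrative, and the variable names pattern and codes are assumptions made for this example. The pattern is written as a raw string so that \d reaches the regex engine unchanged.

```python
import pandas as pd

# Illustrative frame with two of the columns described in the question.
table = pd.DataFrame(
    {
        "series_id": ["SMU78000009092000001", "SMS01000000000000001"],
        "year": ["2006", "2006"],
    }
)

# Same pattern as in the answer, split over several lines for readability.
pattern = (
    r"^.{2}(?P<state_code>.{3})"
    r".{1}(?P<area_code>\d{4})"
    r".{2}(?P<supersector_code>.{2})"
)

# One column per named group; rows that do not match the pattern get NaN.
codes = table["series_id"].str.extract(pattern)

# Attach the extracted columns to the original frame (aligned on the index).
table = table.join(codes)
print(table)
```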

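【Comments】:

Just curious, does 'Out[12]' return a DataFrame?
@user3376660 Yes, it's a DataFrame, with the names of your extracted groups as the column names :)
@user3376660 You'll probably have to tweak the numbers a bit to fit your needs!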
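One way the numbers might be tweaked, sketched under the assumption that the offsets in the question's loop ([2:4], [5:9] and [11:12]) are the field boundaries actually wanted. The thread does not confirm that they are. The pattern below skips and captures exactly those positions, and the result is compared against plain slicing as a sanity check.

```python
import pandas as pd

s = pd.Series(["SMU78000009092000001", "SMS01000000000000001"])

# Skip 2 chars, capture 2 (positions 2-3), skip 1, capture 4 (positions 5-8),
# skip 2, capture 1 (position 11). Offsets assumed from the question's loop.
pattern = (
    r"^.{2}(?P<state_code>.{2})"
    r".(?P<area_code>.{4})"
    r".{2}(?P<supersector_code>.)"
)
codes = s.str.extract(pattern)

# The regex result should agree with direct slicing at the same offsets.
assert (codes["state_code"] == s.str[2:4]).all()
assert (codes["area_code"] == s.str[5:9]).all()
assert (codes["supersector_code"] == s.str[11:12]).all()

print(codes)
```

For fixed-width fields like these, plain .str slicing is simpler; the regex form is mainly useful when a field also needs validating, for example requiring digits with \d.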