【Question title】:Create a new column with the sum of occurrences from an existing column containing nested lists
【Posted】:2018-05-04 21:14:34
【Question】:

I have a relatively large DataFrame that looks like this:

(I have uploaded the CSV file here - ufile.io/526t4)

    value
0   [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
1   [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
2   [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
3   [[20,79,"D"]]
...
12352   [[25,36,"S"],[37,89,"C"],[90,115,"S"]]
12353   [[1,16,"D"],[17,407,"C"],[408,416,"D"]]
12354   [[16,21,"D"],[22,108,"C"],[109,123,"D"],[124,164,"C"],[165,421,"S"]]
12355 rows × 1 columns

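I want to create a new column containing, for each row, the sum of end minus start over all "D" entries.

Taking the first row as an example:

x = [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
new_colum_D = sum([y[1]-y[0] for y in x if y[2]=="D"]) # applied for all rows

For the first row, new_colum_D is 130, i.e. (92-1) + (393-354).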

I tried the following:

df['Column_D']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))

But I get the following error: IndexError: string index out of range

IndexError                                Traceback (most recent call last)
<ipython-input-7-f7f23d42d4e5> in <module>()
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if 
y[2]=="D"]))
~\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-7-f7f23d42d4e5> in <lambda>(x)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
<ipython-input-7-f7f23d42d4e5> in <listcomp>(.0)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))

IndexError: string index out of range

【Comments】:

  • In the first row, shouldn't [114,120,"C"][121,181,"S"],182,187,"C"], be [114,120,"C"],[121,181,"S"],[182,187,"C"],?
  • Yes! Thanks, I'll update the code.

Tags: python python-3.x pandas dataframe lambda


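【Solution 1】:

You're close. You can build the calculation in a list comprehension and then assign the resulting list to a series.

You may feel that you are vectorising the calculation by using pd.DataFrame.apply, but that is not the case: apply is just a thinly veiled loop with some extra overhead.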

import pandas as pd

df = pd.DataFrame({'value': [[[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"], [182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]],
                             [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]],
                             [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]]})

df['value'] = [sum([y[1]-y[0] for y in x if y[2]=="D"]) for x in df['value']]

print(df)

   value
0    130
1      5
2      5
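
To illustrate the point about apply being a loop, here is a minimal sketch that runs the same per-row function both ways and checks that the results agree. The helper name sum_d and the shortened sample rows are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Shortened sample rows in the same shape as the question's data (illustrative only).
df = pd.DataFrame({'value': [
    [[1, 92, "D"], [93, 93, "C"], [354, 393, "D"]],
    [[18, 23, "D"], [24, 27, "C"]],
]})

def sum_d(segments):
    """Sum end - start over the segments labelled "D" (hypothetical helper)."""
    return sum(y[1] - y[0] for y in segments if y[2] == "D")

# Both variants call sum_d once per row; neither is vectorised.
via_apply = df['value'].apply(sum_d)
via_listcomp = pd.Series([sum_d(x) for x in df['value']], index=df.index)

print(via_apply.tolist() == via_listcomp.tolist())
```

Either form is fine for correctness; the list comprehension simply skips the extra machinery that apply adds around the loop.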

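【Comments】:

  • I tried df['column_D'] = [sum([y[1]-y[0] for y in x if y[2]=="D"]) for x in df['value']] but I still get the string index out of range error.
  • @Julia, that's strange. Does the code above (copied and pasted exactly) work as expected? If so, there is something odd about your data.
  • @Julia, the error you included in the question uses df['sum']=df["value"].apply(.... Can you run the code from my solution and show the error you get?
  • It works perfectly for the first 3 rows, but it does not work when I try it on the whole dataset.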
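
One possible explanation, not confirmed anywhere in the thread, is that the column read from the CSV file holds strings rather than real lists. Iterating over a string yields single characters, so y[2] would raise exactly this IndexError. The sketch below assumes that situation: the sample strings and the helper name sum_d are hypothetical, and json.loads is used only because the cells happen to look like valid JSON.

```python
import json

import pandas as pd

# Hypothetical sample: same shape as the question's data, but every cell is a
# string, as it could be after a plain pd.read_csv (assumption).
df = pd.DataFrame({'value': [
    '[[1,92,"D"],[93,93,"C"],[354,393,"D"]]',
    '[[18,23,"D"],[24,27,"C"]]',
    '[[20,79,"D"]]',
]})

# Diagnostic: count the Python types stored in the column.
print(df['value'].map(lambda v: type(v).__name__).value_counts())

def sum_d(cell):
    """Sum end - start over "D" segments, parsing the cell first if it is a string."""
    segments = json.loads(cell) if isinstance(cell, str) else cell
    return sum(end - start for start, end, label in segments if label == "D")

df['Column_D'] = [sum_d(cell) for cell in df['value']]
print(df['Column_D'].tolist())
```

If the type check shows only list, the cause lies elsewhere, for example in a few malformed rows, and inspecting the rows that fail would be the next step.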