extend() not producing a list: answers

【Question title】: extend() not producing a list
【Posted】: 2020-09-04 02:22:32
【Question description】:

I'm working with a list of strings and a dataframe that contains strings. Imagine this scenario:

A = ['the', 'a', 'with', 'from', 'on']

and a dataframe:

df = {'col1':['string', 'string'], 'col2':['the man from a town', 'a person on a bus']}

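I'm now trying to create a new column in my data_frame that shows, for each row, the values from my list A that appear in column 2 of the data_frame (in this example: the, from, a)

This is what I wrote: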

def words_in_A(row):
     res=[]
     for item in A:
          if item in row:
              res.extend(item)
              return res

df['col3'] = df['col2'].apply(lambda x: words_in_A(x))

I want the output to be a list with multiple values:

col1    col2                 col3
string  the man from a town  'the', 'from', 'a'
string  a person on a bus    'the', 'on', 'a'

But the function only returns the last item ('a') rather than a list. I'm not sure why using extend() isn't producing a list for me. Please help!

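【Question discussion】:

Your "return" will return at the first item the "if" identifies. Did you mean to have a different indentation?
First, avoid using list as a variable name; it is a reserved word
What is your expected output? Note that col3 does contain a list, but it is just the list produced by .extend with the first item from A found in each row, or an empty list...
@RichieV (by the way, I've been told it's a builtin)
@RichieV Well, it's not reserved, otherwise by definition you couldn't use it, but it's good advice anyway

Tags: python pandas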

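【Solution 1】:

Your code just needs a small adjustment to the indentation, and append instead of extend. If you extend, the string 'the' is treated as a list and each letter is appended to the collecting list.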

def words_in_A(row): 
    lst = []
    for item in A:
        if item in row:
            lst.append(item) 
    return lst
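
To see the two problems in isolation, here is a minimal sketch without pandas. The name words_in_A_buggy and the sample sentence are only for illustration; the buggy version mirrors the function from the question.

```python
A = ['the', 'a', 'with', 'from', 'on']

def words_in_A_buggy(row):
    # Mirrors the question: extend() plus a return inside the if block
    res = []
    for item in A:
        if item in row:
            res.extend(item)   # adds the characters of item one by one
            return res         # leaves the function at the first match

def words_in_A(row):
    lst = []
    for item in A:
        if item in row:        # substring test on the whole sentence
            lst.append(item)   # adds item as a single element
    return lst                 # runs once, after the loop has finished

sentence = 'the man from a town'
buggy_result = words_in_A_buggy(sentence)
fixed_result = words_in_A(sentence)
print(buggy_result)   # ['t', 'h', 'e']
print(fixed_result)   # ['the', 'a', 'from']
```

Note that the buggy version also returns None when nothing in A matches, because the only return sits inside the if block.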

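Honestly, I thought a list comprehension or even Shubham's regex answer would be faster than apply, but I stand corrected. Here are the timings for your dataframe, but with 20,000 rows instead of 2.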

with apply 0.078s
with list comp 0.076s
with regex 0.168s
with regex, no join 0.141s

And the test code

from time import time

t0 = time()
df['col3'] = df['col2'].apply(words_in_A)
print('with apply', f'{time() - t0:.3f}s')

t0 = time()
df['col3'] = [[item for item in A if item in row] for row in df.col2]
print('with list comp', f'{time() - t0:.3f}s')

t0 = time()
pat = rf"(?i)\b(?:{'|'.join(A)})\b"
df['col3'] = df['col2'].str.findall(pat).str.join(', ')
print('with regex', f'{time() - t0:.3f}s')

t0 = time()
pat = rf"(?i)\b(?:{'|'.join(A)})\b"
df['col3'] = df['col2'].str.findall(pat)
print('with regex, no join', f'{time() - t0:.3f}s')

Output

         col1                 col2          col3
0      string  the man from a town  the, from, a
1      string    a person on a bus      a, on, a
2      string  the man from a town  the, from, a
3      string    a person on a bus      a, on, a
4      string  the man from a town  the, from, a
...       ...                  ...           ...
19995  string    a person on a bus      a, on, a
19996  string  the man from a town  the, from, a
19997  string    a person on a bus      a, on, a
19998  string  the man from a town  the, from, a
19999  string    a person on a bus      a, on, a

[20000 rows x 3 columns]
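
A single time() measurement can be noisy, so a comparison like this may be easier to repeat with timeit. The following is only a rough sketch under some assumptions: a plain Python list (rows) stands in for df['col2'], the function names are made up for illustration, and the numbers will differ from machine to machine.

```python
import re
from timeit import timeit

A = ['the', 'a', 'with', 'from', 'on']
# Stand-in for df['col2']: 2,000 strings in a plain list (assumption)
rows = ['the man from a town', 'a person on a bus'] * 1000

def with_list_comp():
    # Substring test, same logic as words_in_A
    return [[item for item in A if item in row] for row in rows]

pat = re.compile(rf"(?i)\b(?:{'|'.join(A)})\b")

def with_regex():
    # Whole-word, case-insensitive matches in order of appearance
    return [pat.findall(row) for row in rows]

for fn in (with_list_comp, with_regex):
    seconds = timeit(fn, number=10) / 10   # average over 10 runs
    print(f'{fn.__name__}: {seconds:.4f}s per run')
```

The two functions do not return identical lists (substring versus whole-word matching), so this compares cost only, not results.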

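【Discussion】:

Nice, I just wanted to point out that I'm using the extra .join step, so I think that takes time. Could you test without .str.join?
@ShubhamSharma It seems the findall call is the most expensive part. Who would have thought; regex is usually so fast
That's interesting, thanks for testing Richie :) +1
Thanks all. I hadn't realized how much the indentation of the return mattered. Thank you!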

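【Solution 2】:

extend(): iterates over its argument and adds each element to the list, extending the list.

So x.extend("one") results in ['o','n','e']. What you need is x.append, which appends "one" to the end of the list x.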
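
A quick way to check the difference is to try each call on an empty list. This is just a small illustration; the variable names x, y and z are arbitrary.

```python
x = []
x.extend("one")      # a string is iterable, so each character is added
print(x)             # ['o', 'n', 'e']

y = []
y.append("one")      # the whole string is added as one element
print(y)             # ['one']

z = []
z.extend(["one"])    # extending with a one-item list ends up like append
print(z)             # ['one']
```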

Also, you are populating a list named res, so you have to return it once the loop has finished.

import pandas as pd

A = ['the', 'a', 'with', 'from', 'on']
df = {'col1':['string', 'string'], 'col2':['the man from a town', 'a person on a bus']}
df = pd.DataFrame(df)

def words_in_A(row): 
  res=[]
  for item in A:
    if item in row:
      res.append(item) 
  return res

df['col3'] = df['col2'].apply(lambda x: words_in_A(x))
print (df)

Output:

     col1                 col2            col3
0  string  the man from a town  [the, a, from]
1  string    a person on a bus         [a, on]

Pythonic way:

df['col3'] = df['col2'].apply(lambda x: list(set(x.split()).intersection(A)))
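
One thing to keep in mind with the set version: it matches whole words (because of split()) rather than substrings, and a set does not guarantee any particular order. If the order of A matters, something along these lines might work. It is only a sketch, and words_in_A_whole is a made-up name.

```python
A = ['the', 'a', 'with', 'from', 'on']

def words_in_A_whole(row):
    # Whole-word match: split on whitespace, then keep the order of A
    tokens = set(row.split())
    return [item for item in A if item in tokens]

print(words_in_A_whole('the man from a town'))   # ['the', 'a', 'from']
print(words_in_A_whole('a person on a bus'))     # ['a', 'on']

# Substring test versus whole-word test
print('on' in 'a person')                # True, because of "person"
print(words_in_A_whole('a person'))      # ['a']
```

With pandas this could be passed to apply in the same way as the function above, e.g. df['col2'].apply(words_in_A_whole).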

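【Discussion】:

This is great. I had used extend([item]) to get around the problem. Is append preferable?
Yes, append is preferred here, because [item] creates a single-item list and extend then has to iterate over that list.

【Solution 3】:

Use Series.str.findall with a regex pattern to find all matching values from the list A, then use Series.str.join: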

pat = rf"(?i)\b(?:{'|'.join(A)})\b"
df['col3'] = df['col2'].str.findall(pat).str.join(', ')

Result:

     col1                 col2          col3
0  string  the man from a town  the, from, a
1  string     the man on a bus    the, on, a
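
To see what the pattern does on a single string, it can be tried with the standard re module. In this sketch re.escape is an addition that is not in the answer above; it is there on the assumption that the word list might one day contain regex metacharacters.

```python
import re

A = ['the', 'a', 'with', 'from', 'on']

# (?i) makes the match case-insensitive, \b restricts it to whole words
pat = re.compile(rf"(?i)\b(?:{'|'.join(map(re.escape, A))})\b")

matches = pat.findall('The man from a town')
print(matches)              # ['The', 'from', 'a']
print(', '.join(matches))   # The, from, a
```

Unlike the loop over A, findall returns the words in the order they appear in the text and keeps duplicates.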

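【Discussion】:

This works nicely, but just for my own sanity, what exactly was wrong with my function above?
There's no need to use apply here, but if you want to, I think something like this should work..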