【Question Title】: Search for strings in a for loop without referencing each string individually in the loop
【Posted】: 2016-10-17 23:29:15
【Question】:

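I have text data that I want to categorize. Using a for loop in which I spell out individual strings, I check whether a particular word or phrase is present in the rows of another column. If it is, the loop appends a specific value to a new list. The new list is then added to the DataFrame. However, this approach is too unwieldy for my actual data, because I would need to specify a large number of strings for a large number of tests.

Is there a way to group the individual strings into a single data structure that the loop can search through? That way each test in the loop would refer to just one data structure rather than to individual strings spelled out inside the loop. Is this possible?

Below is a reproducible example of what I am currently doing, which highlights the problem.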

import pandas as pd

data = {
    'opinion': ['He said it was too expensive',
                'She said it was too costly',
                'He thought it was not fast enough',
                'They thought is was not right and too much money',
                'Her view was that it was too small and too slow',
               ]}

df = pd.DataFrame(data, columns=['opinion'])
df

This produces:

    opinion
0   He said it was too expensive
1   She said it was too costly
2   He thought it was not fast enough
3   They thought is was not right and too much money
4   Her view was that it was too small and too slow

This for loop then performs the following categorization.

new_col=[]

for row in df['opinion']:
    if 'too expensive' in row or 'too costly' in row or 'too much money' in row:
        new_col.append('Too Expensive')
    elif 'not fast enough' in row or 'too slow' in row:
        new_col.append('Too Slow')

df['reason'] = new_col
df

    opinion                                           reason
0   He said it was too expensive                      Too Expensive
1   She said it was too costly                        Too Expensive
2   He thought it was not fast enough                 Too Slow
3   They thought is was not right and too much money  Too Expensive
4   Her view was that it was too small and too slow   Too Slow

In my actual data, though, I can't write out multiple individual strings inside the loop for every test; there are far too many of them.

【Discussion】:

    Tags: python string pandas search dataframe


    【Solution 1】:

    You can keep your terms in a list of dictionaries, where the keys are the replacement values and the values are lists of the strings to_replace.

    words = [{'Too Expensive': ['too expensive', 'too costly', 'too much money'],
          'Too Slow': ['not fast enough', 'too slow']}]
    

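    Then loop over words, use str.contains with a regex to check all of the to_replace strings at once, and use .loc[] to select the matching rows and assign the replacement.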

    for word in words:
        for replacement, to_replace in word.items():
            df.loc[df.opinion.str.contains('|'.join(to_replace)), 'reason'] = replacement
    

    This gives:

                                                opinion         reason
    0                      He said it was too expensive  Too Expensive
    1                        She said it was too costly  Too Expensive
    2                 He thought it was not fast enough       Too Slow
    3  They thought is was not right and too much money  Too Expensive
    4   Her view was that it was too small and too slow       Too Slow
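
Joining the phrases with '|' turns them into a regular expression, so a phrase containing characters such as '.', '+' or '(' would be read as regex syntax. A minimal sketch of the same idea with each phrase escaped first is shown here. The flat `categories` dict and the `None` placeholder for unmatched rows are assumptions made for illustration, not part of the answer above.

```python
import re

import pandas as pd

df = pd.DataFrame({'opinion': [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]})

# Hypothetical mapping: category label -> phrases that trigger it.
categories = {
    'Too Expensive': ['too expensive', 'too costly', 'too much money'],
    'Too Slow': ['not fast enough', 'too slow'],
}

# Rows that match no phrase keep this placeholder value.
df['reason'] = None

for label, phrases in categories.items():
    # re.escape keeps regex metacharacters in a phrase literal.
    pattern = '|'.join(re.escape(p) for p in phrases)
    mask = df['opinion'].str.contains(pattern, regex=True)
    df.loc[mask, 'reason'] = label
```

If a row matches phrases from more than one category, the label assigned last wins, because later iterations overwrite earlier ones.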
    

    【Discussion】:

      【Solution 2】:

      This should work:

      test_strings = ['too expensive', 'too costly', 'too much money']
      new_col = []  # collects one label per matching row
      for row in df['opinion']:
          for tester in test_strings:
              if tester in row:
                  new_col.append("Too Expensive")
                  break
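
The same nested-loop idea can be stretched to cover several categories by moving the phrase lists into one dictionary. The following is only a sketch in plain Python: the `classify` helper, the `categories` dict, and the `'Other'` fallback are hypothetical names chosen for illustration. The fallback keeps the result the same length as the input, which matters when the list is assigned back to a DataFrame column.

```python
# Hypothetical mapping: category label -> phrases that trigger it.
categories = {
    'Too Expensive': ['too expensive', 'too costly', 'too much money'],
    'Too Slow': ['not fast enough', 'too slow'],
}


def classify(text, categories, default='Other'):
    """Return the first label with a phrase found in text, else default."""
    for label, phrases in categories.items():
        if any(phrase in text for phrase in phrases):
            return label
    return default


opinions = [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]

# One label per input row, in the same order as the input.
new_col = [classify(row, categories) for row in opinions]
```

With a DataFrame, the list comprehension would iterate over `df['opinion']` instead of `opinions`, and `new_col` could then be assigned to `df['reason']`.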
      

      【Discussion】:

      • You forgot Too Slow.
      • I think an MWE is enough :)
      【Solution 3】:

      I think using a regex is more convenient in this case:

      df['reason'] = ''
      
      # .ix was removed in pandas 1.0; .loc handles this boolean-mask assignment
      df.loc[df.opinion.str.lower().str.contains(r'too\s+(?:expensive|costly|much money)'), 'reason'] = 'Too Expensive'
      
      df.loc[df.opinion.str.lower().str.contains(r'(?:not fast enough|too slow)'), 'reason'] = 'Too Slow'
      
      In [309]: df
      Out[309]:
                                                  opinion         reason
      0                      He said it was too expensive  Too Expensive
      1                        She said it was too costly  Too Expensive
      2                 He thought it was not fast enough       Too Slow
      3  They thought is was not right and too much money  Too Expensive
      4   Her view was that it was too small and too slow       Too Slow
      

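      【Discussion】:

        【Solution 4】:

        Pandas has a fast way to apply a function to rows, and .apply is designed for pretty much exactly this. Ideally a vectorized solution would be fastest, but I couldn't think of one. .apply comes next, and iterating over the rows is the slowest option, so avoid it if you can.

        Also, you may want to use a dictionary for the keyword list, as a convenient way to extend the list of potential keywords.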

        def categorizer(x):
            # Maps each keyword phrase to the category it should produce
            main_dict = {"too much money":"too expensive", "too expensive":"too expensive", "too costly":"too expensive", "too slow":"too slow", "not fast enough": "not fast enough"}
            for key in main_dict:
                if key in x:
                    return main_dict[key]

        df["Category"] = df["opinion"].apply(categorizer)
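
One way the dictionary lookup might be expressed with pandas string methods instead of a Python-level function is to extract the first matching phrase with a single regex and then map it to its label. This is a sketch under assumptions: the `phrase_to_label` dict and its label values are illustrative, and the thread does not measure whether this is actually faster than .apply.

```python
import re

import pandas as pd

df = pd.DataFrame({'opinion': [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]})

# Hypothetical lookup: phrase -> category label.
phrase_to_label = {
    'too much money': 'Too Expensive',
    'too expensive': 'Too Expensive',
    'too costly': 'Too Expensive',
    'too slow': 'Too Slow',
    'not fast enough': 'Too Slow',
}

# One capture group containing every phrase, escaped so it is matched literally.
pattern = '(' + '|'.join(re.escape(p) for p in phrase_to_label) + ')'

# extract returns the first phrase found in each row (NaN if none),
# and map translates that phrase into its label.
df['Category'] = df['opinion'].str.extract(pattern, expand=False).map(phrase_to_label)
```

Note the difference in behavior: the regex picks whichever phrase occurs earliest in the text, whereas the loop in `categorizer` picks the first key in dictionary order that is present.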
        

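        【Discussion】:

        • Doesn't .apply just do the iteration for you? I don't think it is any faster than iterating.
        • No, .apply works on a row/column basis. .applymap is what you are thinking of; it operates element by element. Using .apply instead of iterrows can give a fairly significant speedup on large datasets.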