【Question title】: Deleting elements in a list of lists using list comprehensions (Python)
【Posted】: 2020-05-07 14:10:22
【Question】:

I have the following data:

[['The',
  'Fulton',
  'County',
  'Grand',
  'Jury',
  'said',
  'Friday',
  'an',
  'investigation',
  'of',
  "Atlanta's",
  'recent',
  'primary',
  'election',
  'produced',
  '``',
  'no',
  'evidence',
  "''",
  'that',
  'any',
  'irregularities',
  'took',
  'place',
  '.'],
 ['The',
  'jury',
  'further',
  'said',
  'in',
  'term-end',
  'presentments',
  'that',
  'the',
  'City',
  'Executive',
  'Committee',
  ',',
  'which',
  'had',
  'over-all',
  'charge',
  'of',
  'the',
  'election',
  ',',
  '``',
  'deserves',
  'the',
  'praise',
  'and',
  'thanks',
  'of',
  'the',
  'City',
  'of',
  'Atlanta',
  "''",
  'for',
  'the',
  'manner',
  'in',
  'which',
  'the',
  'election',
  'was',
  'conducted',
  '.']]

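So I have a list that contains 2 other lists (in my real case, one big list holds 50000 lists). I want to remove all punctuation and stop words such as "the", "a", "of", and so on.

Here is the code I wrote: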

import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

punct = list(string.punctuation)
punct.append("``")
punct.append("''")
stops = set(stopwords.words("english")) 

res = [[word.lower() for word in sentence if word not in punct or word.lower() not in stops] for sentence in dataset] 

But it returns the same list I started with. What is wrong with my code?

【Question discussion】:

    Tags: python string list list-comprehension


    【Solution 1】:

    You should use and instead of or:

    res = [[word.lower() for word in sentence if word not in punct and word.lower() not in stops] for sentence in dataset]
    

    Otherwise you get every element back, because each word is missing from at least one of the two collections, stops or punct.
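
The difference between the two conditions can be checked without NLTK. The following is a minimal sketch: the small stops set and the sample sentence are stand-ins chosen for illustration, not the real NLTK stop word list.

```python
import string

# Hypothetical stand-ins for illustration; the real code uses NLTK's stop words.
punct = set(string.punctuation) | {"``", "''"}
stops = {"the", "of", "an", "that", "no", "any"}

sentence = ["The", "jury", "said", "``", "no", "evidence", "''", "."]

# With `or`, a word is kept if it is missing from EITHER collection.
kept_or = [w.lower() for w in sentence
           if w not in punct or w.lower() not in stops]

# With `and`, a word is kept only if it is missing from BOTH collections.
kept_and = [w.lower() for w in sentence
            if w not in punct and w.lower() not in stops]

print(kept_or)   # every word survives, only lower-cased
print(kept_and)  # ['jury', 'said', 'evidence']
```

With or, the only visible change is the lower-casing, which matches the behaviour described in the question.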


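      【Solution 2】:

      Since punct and stops do not overlap, every word is either not in one or not in the other (or possibly not in either), so the or condition is always true. What you want to test for is words that are in neither of them.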
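
This is De Morgan's law: "not in punct or stops" means "not in punct and not in stops". A small sketch that enumerates the four possible cases (the variable names are illustrative only):

```python
from itertools import product

rows = []
# Enumerate every combination of "word is in punct" / "word is in stops".
for in_punct, in_stops in product([False, True], repeat=2):
    keep_intended = not (in_punct or in_stops)         # drop if in either one
    keep_with_and = (not in_punct) and (not in_stops)  # the corrected test
    keep_with_or = (not in_punct) or (not in_stops)    # the test in the question
    assert keep_intended == keep_with_and
    rows.append((in_punct, in_stops, keep_with_and, keep_with_or))

# Columns: in_punct, in_stops, keep (and), keep (or)
for row in rows:
    print(row)
```

The or version drops a word only when it is in both collections at once, which cannot happen when the two do not overlap.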


        【Solution 3】:

        Assuming it is acceptable to update stops, here is an alternative that avoids the two-level comprehension:

        import string
        import nltk
        from nltk.corpus import stopwords
        
        
        dataset = [
          ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
           'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
           'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
           'took', 'place', '.'],
          ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments',
           'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had',
           'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves',
           'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta',
           "''", 'for', 'the', 'manner',
           'in', 'which', 'the', 'election', 'was', 'conducted', '.']
          ]
        
        nltk.download('stopwords')
        
        punct = list(string.punctuation)
        punct.append("``")
        punct.append("''")
        
        stops = set(stopwords.words("english"))
        
        # Union of punct and stops
        stops.update(punct)
        res1 = [[word for word in sentence if word.lower() not in stops]
                for sentence in dataset]
        
        # Alternative solution that avoids an explicit 2-level list comprehension
        def filter_the(sentence, stops):
            return [word for word in sentence if word.lower() not in stops]
        
        
        res2 = [filter_the(sentence, stops) for sentence in dataset]
        
        
        print(res1 == res2)
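
Note that res1 and res2 keep the original casing of the surviving words, whereas the code in the question lower-cases them. A possible sketch that makes this a switch is shown below; clean_sentences, its lowercase parameter, and the short stop list are hypothetical names and data used only for illustration. The removable tokens are held in a set, so each membership test is a hash lookup, which matters with 50000 sentences.

```python
import string


def clean_sentences(dataset, removable, lowercase=True):
    """Drop every token whose lower-cased form is in `removable`."""
    cleaned = []
    for sentence in dataset:
        kept = [w for w in sentence if w.lower() not in removable]
        cleaned.append([w.lower() for w in kept] if lowercase else kept)
    return cleaned


# Hypothetical short stop list; in the real code this would be the
# NLTK stop words combined with the punctuation tokens.
removable = frozenset(string.punctuation) | {"``", "''"} | {"the", "of", "an"}

sample = [["The", "Fulton", "County", "Grand", "Jury", "said", "."],
          ["``", "praise", "of", "the", "City", "''"]]

print(clean_sentences(sample, removable))
print(clean_sentences(sample, removable, lowercase=False))
```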
        
        
