【Question title】:How to enumerate combinations of words?
【Posted】:2019-07-01 07:23:01
【Question】:

Given text containing slashes, such as definitive/deterministic arithmetic/calculation, the goal is to enumerate the possible combinations of words. Expected output:

definitive arithmetic
deterministic arithmetic
definitive calculation
deterministic calculation

Another example: for the input voice/speech wave information processing method/technique, the expected output is:

voice wave information processing method
voice wave information processing technique
speech wave information processing method
speech wave information processing technique

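Sometimes there are parentheses, and the expected output should enumerate the phrases both with and without the terms inside the parentheses. For example, for the input bactericidal/microbidical (nature/properties), the expected output is: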

bactericidal
microbidical
bactericidal nature
bactericidal properties
microbidical nature
microbidical properties

I tried the following to handle text with a single slash, but it is too hacky. Is there a simpler way?

for english in inputs:
    if sum([1 for tok in english.split(' ') if '/' in tok]) == 1:
        x = [1 if '/' in tok else 0 for tok in english.split(' ') ]

        left = english.split(' ')[:x.index(1)]
        word = english.split(' ')[x.index(1)].split('/')
        right = english.split(' ')[x.index(1)+1:]

        for tok in word:
            print(' '.join([left + [tok] + right][0]))

How can I also handle the cases with multiple slashes?

Here is a list of possible inputs:

definitive/deterministic arithmetic/calculation
random/stochastic arithmetic/calculation
both ends/edges/terminals
to draw/attract/receive attention
strict/rigorous/exact solution
both ends/edges/terminals
easy to conduct/perform/carry out
easy to conduct/perform/carry out
between/among (equals/fellows)
reference/standard/nominal value
one kind/type/variety/species
primary cause/source/origin
to be disordered/disturbed/chaotic
same category/class/rank
while keeping/preserving/maintaining/holding
driving/operating in the reverse/opposite direction
only/just that portion/much
cannot doubt/question/suspect
does not reach/attain/match
tube/pipe/duct axis
recatangular/Cartesian/orthogonal coordinates
tube/pipe/duct wall
acoustic duct/conduit/channel
site of damage/failure/fault
voice/speech wave information processing method/technique
fundamental/basic theorem/proposition
single/individual item/product
one body/unit/entity
first stage/grade/step
time/era of great leaps/strides
one form/shape/figure
reserve/spare circuit/line
basic/base/backing material
set/collection/group of tables
in the form of a thin sheet/laminate/veneer
minute/microscopic pore/gap
forming/molding and working/machining
small amount/quantity/dose
liquid crystal form/state/shape
to rub/grind/chafe the surface
the phenomenon of fracture/failure/collapse
compound/composite/combined effect
molecular form/shape/structure
…st/…nd/….rd/…th group (periodic table)
the architectural/construction world/realm
to seal/consolidate a material by firing/baking
large block/clump/clod
bruned/baked/fired brick
unbruned/unbaked/unfired brick
kiln/furnance/oven surface
stationary/stator vane/blade
moving/rotor vane/blade
industrial kiln/furnance/oven
mean/average pore size/diameter
hardened/cured/set material
kiln/oven/furnance lining
piping (layout/arrangement/system)
metallic luster/brilliance/shine
mechanical treatment/working/processing
thin-sheet/laminate/veneer manufacture
thin sheet/laminate/veneer
vehicle (cars/trucks/trains) field
sheet/panel/plate thickness
corrosion prevention/resistance/protection
wriggling/squirming/slithering motion
method for forming/molding/shaping
object to be molded/formed/shaped
pressurized molding/forming/shaping equipment
premolded/preformed object/body
to seal/consolidate a material by firing/baking
furnance/kiln/oven wall
slipping/sliding/gliding mode
bactericidal/microbidical (nature/properties)
secondary/rechargeable cell/battery
new region/domain/area

【Comments】:

    Tags: python regex string combinations enumerate


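    【Solution 1】:

    It looks like you should just use itertools.product(). You can split on spaces and then on '/', which works for both single words and slash-separated groups. For example: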

    from itertools import product
    
    s = "definitive/deterministic arithmetic/calculation"
    l = [g.split('/') for g in s.split(' ')]
    [" ".join(words) for words in product(*l)]
    

    Result:

    ['definitive arithmetic',
     'definitive calculation',
     'deterministic arithmetic',
     'deterministic calculation']
    

    Or:

    s = "voice/speech wave information processing method/technique"
    l = [g.split('/') for g in s.split(' ')]
    [" ".join(words) for words in product(*l)]
    

    Result:

    ['voice wave information processing method',
     'voice wave information processing technique',
     'speech wave information processing method',
     'speech wave information processing technique']
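The same idea can be wrapped in a small helper and applied to a whole list of inputs. This is a minimal sketch rather than a definitive implementation: the function name `expand_slashes` and the sample `inputs` list are illustrative choices, and parentheses are not handled here.

```python
from itertools import product

def expand_slashes(text):
    """Return every phrase obtained by picking one alternative per slash group."""
    # Each whitespace-separated token becomes a list of its '/' alternatives;
    # a token without a slash becomes a one-element list.
    groups = [token.split('/') for token in text.split()]
    return [' '.join(words) for words in product(*groups)]

# Hypothetical sample taken from the inputs listed in the question.
inputs = [
    'definitive/deterministic arithmetic/calculation',
    'both ends/edges/terminals',
]

for line in inputs:
    for phrase in expand_slashes(line):
        print(phrase)
```

Because the text is split on whitespace first, an alternative that itself contains a space is not treated as one unit: for "easy to conduct/perform/carry out" this sketch produces "easy to conduct out" alongside "easy to carry out". Such inputs would need a different delimiter convention or manual handling.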
    

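    【Comments】:

    • What about inputs with parentheses?
    【Solution 2】:

    This respects the parentheses in the input. The idea is to replace a parenthesised group (...) with the same content prefixed by /, so (string1/string2) becomes /string1/string2. Then split('/') produces a list that starts with an empty string, ['', 'string1', 'string2'], and the empty string stands for "group omitted". After that you use itertools.product: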

    data = [
        'definitive/deterministic arithmetic/calculation',
        'vehicle (cars/trucks/trains) field',
    ]
    
    import re
    from itertools import product
    
    for d in data:
        l = [w.split('/') for w in re.sub(r'\(([^)]+)\)', r'/\1', d).split()]
        print([' '.join(i for i in p if i) for p in product(*l)])
    

    Prints:

    ['definitive arithmetic', 'definitive calculation', 'deterministic arithmetic', 'deterministic calculation']
    ['vehicle field', 'vehicle cars field', 'vehicle trucks field', 'vehicle trains field']
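The question lists the phrases without the parenthesised term first. One way to get that order, sketched here under the assumption that shorter phrases should come first, is to sort the results by word count. The name `expand_optional` is illustrative; the substitution and `product` step are the same as above.

```python
import re
from itertools import product

def expand_optional(text):
    """Expand slash alternatives, treating a parenthesised group as optional."""
    # Turn "(a/b)" into "/a/b" so that split('/') yields ['', 'a', 'b'].
    rewritten = re.sub(r'\(([^)]+)\)', r'/\1', text)
    groups = [token.split('/') for token in rewritten.split()]
    # Skip the empty placeholder when joining so no double spaces appear.
    phrases = [' '.join(w for w in combo if w) for combo in product(*groups)]
    # sorted() is stable: phrases with the same word count keep product() order.
    return sorted(phrases, key=lambda p: len(p.split()))

for phrase in expand_optional('bactericidal/microbidical (nature/properties)'):
    print(phrase)
```

Sorting by word count is only a heuristic. It reproduces the order shown in the question for this input, but it assumes every alternative is a single word and that the parenthesised group contains no spaces.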
    

    【Comments】:

      【Solution 3】:

      Based on your question, here is a regular expression you can use to parse your input as needed:

      \w+/\w+|\W+\w+|\W+\w+\W+

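      With this regex you can extract all the words you need. The alternatives in the pattern are there to allow for optional non-word characters (such as () or other punctuation) around a word.

      You can apply the regex with re.findall(), for example on this test string: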

      import re
      
      # Raw string, so the backslashes reach the regex engine unchanged
      expression = r"\w+/\w+|\W+\w+|\W+\w+\W+"
      
      test_string = "abcd/efgh(sabdhaksdaksdas)/ijkl/mnop1/qerst/(abcdef)"
      
      print(re.findall(expression, test_string))
      

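      This regex may have some issues, so treat it as a starting point.

      After this, you can use the itertools.product approach from the other answers to get all possible word combinations.

      【Comments】: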
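As a rough sketch of that two-step idea (tokenise with re.findall(), then combine with itertools.product), the example below uses a different, simpler pattern than the one in this answer: a token is either a whole parenthesised group or a run of non-space characters. The pattern and the name `expand` are assumptions made for illustration.

```python
import re
from itertools import product

def expand(text):
    """Tokenise with a regex, then enumerate all combinations of alternatives."""
    # Try the parenthesised group first so "(a b)" stays one token.
    tokens = re.findall(r'\([^)]*\)|\S+', text)
    options = []
    for tok in tokens:
        if tok.startswith('(') and tok.endswith(')'):
            # Optional group: the empty string means "leave the group out".
            options.append([''] + tok[1:-1].split('/'))
        else:
            options.append(tok.split('/'))
    # Drop empty placeholders when joining to avoid stray spaces.
    return [' '.join(w for w in combo if w) for combo in product(*options)]

print(expand('piping (layout/arrangement/system)'))
print(expand('between/among (equals/fellows)'))
```

Keeping the parenthesised group as a single token means a multi-word group such as "(periodic table)" is either kept whole or dropped whole, instead of being split at the space.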
