【Question title】: Regex for any number of words before new line
【Posted】: 2017-08-16 02:25:31
【Question description】:

I have parsed some text out of paragraphs, and I want to split it up so that I can insert it into a table.

The string looks like:

["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \n 123 some more text (50% and some more text) \n"]

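What I want to do is split off the first string of text before the newline exactly as it is, whatever it may contain. I started by trying [A-Za-z]*\s*[A-Za-z]*\s* but quickly realised that this was not going to cut it, because the text in this string is variable.

Then I want to take the number in the second string, and the following seems to do that:

\d+

Finally I want to get the percentage in the second string, and the following seems to work:

\d+(%)+

I plan to use these in a function, but I am struggling to write the regex for the first part. I would also like to know whether the regexes I am using for the latter two parts are the most efficient ones.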

Update: hopefully this makes it a bit clearer.

Input:

[' The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n']

Desired output:

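【Question comments】:

The parentheses in \d+(%)+ are completely redundant. Do you really intend to allow more than one percent sign?
Just out of curiosity, if you are using Python, why use a regex? Wouldn't yourstring.split('\n')[0] do the job?
You want anything before the \n, right? Just do .*\n. Or am I misunderstanding what you are asking?
@mkingsbu I want all the words in the even lines and the numbers in the odd lines, if that makes sense?
I see. Yes, in that case, as others have mentioned, I think regex is the wrong approach to this problem. The regex would have to contain all of the logic itself, which can be difficult depending on the size of the file. I don't know Python, but I know how to do this in Bash. I would write a C-style for loop and read the variables into it. If the count is even, I would append the line to the even array, and if it is odd, I would append it to the odd one. It might be better to ask a slightly broader question about what you want to do with this data?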
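
The even/odd idea from the comments above can also be written in Python without any regex. This is only a minimal sketch: the `raw` string is a made-up sample in the same shape as the question's input, and the names `text_lines` and `stat_lines` are illustrative, not from the original post.

```python
# Hypothetical sample in the same shape as the question's input.
raw = " First chunk \n 123 stats (25% pct) \n Second chunk \n 456 stats (50% pct) \n"

# Drop the surrounding whitespace, then split into individual lines.
lines = raw.strip().split("\n")

# Even-indexed lines hold the free text, odd-indexed lines hold the numbers.
text_lines = [line.strip() for line in lines[0::2]]
stat_lines = [line.strip() for line in lines[1::2]]

print(text_lines)
print(stat_lines)
```

Slicing with a step of 2 plays the role of the counter in the loop described above; the numbers and percentages would still have to be pulled out of each entry in `stat_lines` afterwards.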

Tags: python regex

【Solution 1】:

First two lines

You can use split to get the first two lines:

import re

data = ["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \n 123 some more text (50% and some more text) \n"]

first_line, second_line = data[0].split("\n")[:2]
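# Print the text that precedes the first newline.
print(first_line)
# Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string

# First run of digits that is not followed by another digit or a percent sign.
digit_match = re.search(r'\d+(?![\d%])', second_line)
if digit_match:
    print(digit_match.group())
    # 123

# First run of digits immediately followed by a percent sign.
percent_match = re.search(r'\d+%', second_line)
if percent_match:
    print(percent_match.group())
    # 50%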

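Note that if the percentage were written before the other number, a plain \d+ would match the percentage (without the %). I added a negative lookahead to make sure the matched number is not followed by a digit or a %.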
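
To make the effect of the lookahead concrete, here is a small sketch. The `line` value is a hypothetical example, not taken from the question, in which the percentage appears before the plain number.

```python
import re

# Hypothetical line where the percentage comes before the plain number.
line = " (50% of the total) 123 some more text "

# Without the lookahead, the first run of digits wins, which is the percentage.
plain = re.search(r'\d+', line).group()

# With the lookahead, digits followed by another digit or by % are rejected.
guarded = re.search(r'\d+(?![\d%])', line).group()

print(plain)    # 50
print(guarded)  # 123
```

The `\d` inside the lookahead matters: without it the engine could backtrack and accept the `5` of `50%`, since that `5` is followed by `0` rather than by `%`.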

Every pair

If you want to keep parsing pairs of lines:

data = [" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n"]

import re

# Drop leading/trailing whitespace, then split into individual lines.
lines = data[0].strip().split("\n")

# TODO: Make sure there's an even number of lines
for i in range(0, len(lines), 2):
    first_line, second_line = lines[i:i + 2]

    print(first_line)

    digit_match = re.search(r'\d+(?![\d%])', second_line)
    if digit_match:
        print(digit_match.group())

    percent_match = re.search(r'\d+%', second_line)
    if percent_match:
        print(percent_match.group())

It outputs:

The first chunk of text 
123
25%
 The Second chunk of text 
456
50%
 The third chunk of text 
789
75%
 The fourth chunk of text 
101
100%
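
Since the question mentions inserting the results into a table, one possible variation is to collect each pair of lines into a row instead of printing it. The sketch below is an assumption about what such a helper could look like: the function name `parse_pairs`, the shortened `sample` input, and the choice to silently skip a trailing unpaired line are all illustrative and not part of the original answer.

```python
import re

def parse_pairs(raw):
    """Return (text, number, percent) tuples, one per pair of lines."""
    lines = raw.strip().split("\n")
    rows = []
    # zip stops at the shorter sequence, so a trailing unpaired line is skipped.
    for text_line, stat_line in zip(lines[0::2], lines[1::2]):
        number = re.search(r'\d+(?![\d%])', stat_line)
        percent = re.search(r'\d+%', stat_line)
        rows.append((
            text_line.strip(),
            number.group() if number else None,
            percent.group() if percent else None,
        ))
    return rows

# Shortened version of the question's input, used only for illustration.
sample = (" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n"
          " The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n")

for row in parse_pairs(sample):
    print(row)
```

Pairing the lines with `zip` is one way to address the TODO about an odd number of lines; raising an error instead may be preferable if an unpaired line indicates bad input.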

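【Comments】:

Thank you, I really appreciate it! Is there a way to apply this to more than just the first instance? I.e. my actual data contains more than 2 strings, and I would like to do this for all of them.
@Maverick: Updated.
Thank you, I really appreciate your help!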