Regex: Extract specific value from URL

【Question title】: Regex: Extract specific value from URL
【Posted】: 2017-11-06 02:14:51
【Question】:

I'm having some trouble extracting a string from a URL with the re library.

Here is an example:

http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%xx123%63abcd

I have a DataFrame and want to add a column computed from the values in another column. In this example, df['URL_REG'] should contain '123'.

df['URL_REG'] = df['URL'].map(lambda x : re.findall(r'[REGEX]+', x)[0])

The structure of the URL can change, but the part I want is always between 'direction=vente.aspx%3pid%' and '%'.

【Comments】:

Tags: python regex pandas parsing dataframe

【Solution 1】:

Use the vectorized Series.str.extract() method:

In [50]: df['URL_REG'] = df.URL.str.extract(r'direction=vente.aspx\%3pid\%([^\%]+)\%*',
                                            expand=False)

In [51]: df
Out[51]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...   xx123

Update:

I only want the '123' part rather than 'xx123', where 'xx' is a hexadecimal number

In [53]: df['URL_REG'] = df.URL.str.extract(r'direction=vente.aspx\%3pid\%\w{2}(\d+)\%*', 
                                            expand=False)

In [54]: df
Out[54]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...     123
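
The update above skips the two leading characters with \w{2}, which accepts any word characters. As a minimal sketch, assuming the two characters really are hexadecimal digits, the class [0-9A-Fa-f]{2} is stricter. The second and third URLs, the `prefix` variable, and the column names ANY_TWO and HEX_TWO are hypothetical and only illustrate the difference; rows that do not match come back as missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "URL": [
        # URL from the question ('xx' is a placeholder, not real hex digits)
        "http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%xx123%63abcd",
        # Hypothetical variant where the two leading characters are a hex pair
        "http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%3d456%63abcd",
        # Hypothetical URL that does not contain the marker at all
        "http://www.example.it/index.aspx",
    ]
})

# Fixed text that precedes the value; the dot is escaped so it matches literally.
prefix = r"direction=vente\.aspx%3pid%"

# Skip any two word characters, then capture the digits up to the next '%'.
df["ANY_TWO"] = df["URL"].str.extract(prefix + r"\w{2}(\d+)%", expand=False)

# Skip exactly two hex digits, then capture the digits up to the next '%'.
df["HEX_TWO"] = df["URL"].str.extract(prefix + r"[0-9A-Fa-f]{2}(\d+)%", expand=False)

print(df[["ANY_TWO", "HEX_TWO"]])
```

With the placeholder 'xx' from the question, only the \w{2} variant matches, so the stricter class is worth using only if the real data always carries hex digits in that position.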

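【Comments】:

It works, but I forgot to mention that I only want the '123' part, not 'xx123'.
@Omar14, does your URL literally contain %xx, or do those xx stand for digits?
Most of the time I have just one digit and one letter: '3d123', where xx = 3d.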

【Solution 2】:

You can use this pattern:

import re

url='http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%xx123%63abcd'
output = re.findall('3pid%(.*?)%', url)

print(output)

Output:

['xx123']

Then apply the same pattern to your DataFrame.

For example:

import pandas as pd
import re

df = pd.DataFrame(['http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%xx123%63abcd'], columns = ['URL'])

output = df['URL'].apply(lambda x : re.findall('3pid%(.*?)%', x))

print(output)

# Or, maybe if you want to return the url and the data captured:
# output = df['URL'].apply(lambda x : (x, re.findall('3pid%(.*?)%', x)))
# output[0]
# >>> ('http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%xx123%63abcd', 
#   ['xx123'])

Output:

0    [xx123]
Name: URL, dtype: object
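
Note that re.findall returns a list for every row, and indexing it with [0] (as in the question's lambda) raises IndexError when a URL does not match. A minimal sketch of one way around that, assuming a plain string per row is wanted: re.search with a compiled pattern, returning None when nothing matches. The helper name extract_value, the constant PATTERN, and the second URL are hypothetical.

```python
import re

import pandas as pd

# Same idea as the pattern above, anchored on the full prefix from the question.
PATTERN = re.compile(r"direction=vente\.aspx%3pid%(.*?)%")


def extract_value(url):
    """Return the text between the prefix and the next '%', or None if absent."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


df = pd.DataFrame({
    "URL": [
        "http://www.example.it/remoteconnexion.aspx?u=xxxxx@abc.it&direction=vente.aspx%3pid%xx123%63abcd",
        "http://www.example.it/index.aspx",  # hypothetical URL without the marker
    ]
})

# One scalar per row instead of a list; non-matching rows hold a missing value.
df["URL_REG"] = df["URL"].apply(extract_value)

print(df["URL_REG"])
```

Anchoring on the longer prefix rather than just '3pid%' is a design choice here: it is less likely to match an unrelated part of the URL, at the cost of being more specific to this URL layout.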

【Comments】: