【Question title】: Python Domain Name Regular Expression Pattern
【Posted】: 2016-05-30 20:20:42
【Question】:

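I want to be able to match domains according to the following rules:

  • The domain name may contain only a-z, A-Z, 0-9, and the hyphen (-)
  • The domain name should be between 1 and 63 characters long
  • The last TLD must be at least two characters and at most 6 characters
  • The domain name must not start or end with a hyphen (-) (e.g. -google.com or google-.com)
  • The domain name can be a subdomain (e.g. mkyong.blogspot.com)

I already have the Java-style regex; I just need the Python-style equivalent: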

^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$

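I could not find any Python regex for this, because everyone recommends urlparse instead. I do not need to split a URL into domain, port, TLD and so on; I only need to do a simple domain replacement, so a regular expression should be my solution.

What I did: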

expectedstring = re.sub(r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$" , "XXX" , string)

Example string:

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."

expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

List of valid domain names

  • www.google.com
  • google.com
  • mkyong123.com
  • mkyong-info.com
  • sub.mkyong.com
  • sub.mkyong-info.com
  • mkyong.com.au
  • g.co
  • mkyong.t.t.co

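List of invalid domain names, with reasons.

  • mkyong.t.t.c - the TLD must be between 2 and 6 characters long
  • mkyong,com - commas are not allowed
  • mkyong - no TLD
  • mkyong.123 - digits are not allowed in the TLD
  • .com - must start with [A-Za-z0-9]
  • mkyong.com/users - no TLD
  • -mkyong.com - cannot start with a hyphen (-)
  • mkyong-.com - cannot end with a hyphen (-)
  • sub.-mkyong.com - cannot start with a hyphen (-)
  • sub.mkyong-.com - cannot end with a hyphen (-)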

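【Question discussion】:

  • What happened when you tried this "Java-style regex" in Python? To me it looks like perfectly normal, standard regex syntax.
  • I was doing: string = re.sub(r"^((((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$" , "XXX" , string) and nothing changed
  • Well, that is a different regex from the one in your question. Also, what is string?
  • I messed up. I have updated my question to match the correct regex that I am using
  • Is this a good domain? mkyong.t.t.t.co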
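
The first comment's question can be checked directly. The following is a minimal sketch, assuming Python 3 and the standard `re` module; the variable names `java_style` and `python_style` are made up for illustration. It suggests two reasons the attempt in the question replaces nothing: inside a Python raw string `\\.` means a literal backslash followed by any character, and the `^`/`$` anchors require the whole input to be a domain.

```python
import re

# In Java source, "\\." is how the regex token \. is written inside a string literal.
# Inside a Python raw string the same text is the regex \\. (backslash, then any char).
java_style = r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$"
python_style = r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$"

print(re.match(java_style, "example.com"))          # None
print(bool(re.match(python_style, "example.com")))  # True

# With ^ and $ the pattern only matches when the entire string is a domain,
# so a domain embedded in a sentence is never replaced.
sentence = "This domain example.com will never be the same."
print(re.sub(python_style, "XXX", sentence) == sentence)  # True
```

So the single-backslash form validates a standalone name, but replacing domains inside running text needs an unanchored pattern.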

Tags: java python regex


【Solution 1】:

I ran a test against the given lists of domain names (Python 2.7.x):

import re
valid_domains = """
www.google.com
google.com
mkyong123.com
mkyong-info.com
sub.mkyong.com
sub.mkyong-info.com
mkyong.com.au
g.co
mkyong.t.t.co
"""

invalid_domains = """
mkyong.t.t.c
mkyong,com
mkyong
mkyong.123
.com
mkyong.com/users
-mkyong.com
mkyong-.com
sub.-mkyong.com
sub.mkyong-.com
"""

valid_names = valid_domains.split()
invalid_names = invalid_domains.split()

# each label is either a single character, or 2 to 63 characters with no leading or trailing hyphen
pattern = r'^([A-Za-z0-9]\.|[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.){1,3}[A-Za-z]{2,6}$'

print 'checking valid domain names ============'
for name in valid_names:
    print name.ljust(50), ('True' if re.match(pattern, name) else 'False').rjust(5)

print '\nchecking invalid domain names ============'
for name in invalid_names:
    print name.ljust(50), ('True' if re.match(pattern, name) else 'False').rjust(5)

Output:

checking valid domain names ============
www.google.com                                      True
google.com                                          True
mkyong123.com                                       True
mkyong-info.com                                     True
sub.mkyong.com                                      True
sub.mkyong-info.com                                 True
mkyong.com.au                                       True
g.co                                                True
mkyong.t.t.co                                       True

checking invalid domain names ============
mkyong.t.t.c                                       False
mkyong,com                                         False
mkyong                                             False
mkyong.123                                         False
.com                                               False
mkyong.com/users                                   False
-mkyong.com                                        False
mkyong-.com                                        False
sub.-mkyong.com                                    False
sub.mkyong-.com                                    False
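
For readers on Python 3, the same check can be sketched as below. This is an illustration under assumptions, not the answer's own code: the names `ANSWER_PATTERN`, `QUESTION_PATTERN` and `is_valid` are hypothetical, `re.fullmatch` stands in for the `^`/`$` anchors, and the second pattern is the question's lookaround regex written with a single backslash. It also tries the domain raised in the question comments, mkyong.t.t.t.co.

```python
import re

VALID = ["www.google.com", "google.com", "mkyong123.com", "mkyong-info.com",
         "sub.mkyong.com", "sub.mkyong-info.com", "mkyong.com.au", "g.co",
         "mkyong.t.t.co"]
INVALID = ["mkyong.t.t.c", "mkyong,com", "mkyong", "mkyong.123", ".com",
           "mkyong.com/users", "-mkyong.com", "mkyong-.com",
           "sub.-mkyong.com", "sub.mkyong-.com"]

# The answer's pattern: at most three labels before the TLD.
ANSWER_PATTERN = re.compile(
    r"([A-Za-z0-9]\.|[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.){1,3}[A-Za-z]{2,6}"
)
# The question's lookaround pattern: any number of labels before the TLD.
QUESTION_PATTERN = re.compile(r"((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}")

def is_valid(pattern, name):
    """True when the whole name, not just a prefix, matches the pattern."""
    return pattern.fullmatch(name) is not None

for pattern in (ANSWER_PATTERN, QUESTION_PATTERN):
    print(all(is_valid(pattern, n) for n in VALID),        # True
          not any(is_valid(pattern, n) for n in INVALID))  # True

# The two patterns disagree once there are more than three labels before the TLD.
print(is_valid(ANSWER_PATTERN, "mkyong.t.t.t.co"))    # False
print(is_valid(QUESTION_PATTERN, "mkyong.t.t.t.co"))  # True
```

Both patterns agree on the two lists from the question; they differ only in whether more than three labels before the TLD are accepted, which the `{1,3}` quantifier rules out.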

[Edit] To get the same result as the expected string provided, I came up with the following approach (it does not check for "http(s)"):

import re

# group 1 keeps the leading "//", whitespace, or start of string;
# each label is either a single character or 2 to 63 characters
pattern = r'(//|\s+|^)(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}'

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."
expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

resultstring = re.sub(pattern, r"\g<1>XXX", string)

print 'resultstring: \n', resultstring
print '\nare they equal? ', expectedstring == resultstring

The output is:

resultstring: 
This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know.

are they equal?  True
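
As a possible variation (not the answer's method), the leading context can be asserted with lookbehinds instead of being captured and re-inserted through a group reference. The sketch below assumes Python 3; the name `DOMAIN_IN_TEXT` and the trailing lookahead, which keeps a match from stopping in the middle of a longer word, are additions made for illustration.

```python
import re

# Hypothetical alternative: lookbehinds check the context without consuming it,
# so the replacement is just "XXX" and no group reference is needed.
DOMAIN_IN_TEXT = re.compile(
    r"(?:(?<=//)|(?<=\s)|^)"                 # preceded by "//", whitespace, or start
    r"(?:(?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+"  # labels without a leading/trailing hyphen
    r"[A-Za-z]{2,6}"                         # TLD of 2 to 6 letters
    r"(?![A-Za-z0-9-])"                      # assumption: the TLD must end here
)

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."
expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

resultstring = DOMAIN_IN_TEXT.sub("XXX", string)
print(resultstring == expectedstring)  # True
```

Requiring "//" rather than any "/" before a match is what leaves print.doc in the URL path untouched, the same effect the `(//|\s+|^)` prefix has in the pattern above.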

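【Discussion】:

  • I tried your regex on my string: string = re.sub(r'^([A-Za-z0-9]\.|[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.){1,3}[A-Za-z]{2,6}$' , "XXX" , string) and it still does not replace anything. I even tested your regex here: regexr.com/3cr2h and it still does not match
  • With the online tool at regexr.com, try a single-line string (for example, www.demo.com) and you will find a match.
  • @faceoff: I just updated my approach to get the expected string.
  • Why split the string on "http"? How about this: string = re.sub(r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]" , "XXX" , string) - it does the same job and, even though I am new to Python, it looks simpler.
  • @faceoff: I tried your regex on regexr.com, see: i.stack.imgur.com/o8QKp.jpg. It matches mkyong.123, -mkyong.com, sub.-mkyong.com, sub.mkyong-.com, 3.141, foo@demo.net and mkyong.t.t.t.co, but it cannot match www.GOOGLE.com. That is completely wrong. Please try your re.sub and see whether you can solve your own problem. I know my regex is far from the simplest, but it can match domain names or replace them with "XXX", right?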
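
The claims in the last comment can be reproduced with a short sketch, assuming Python 3 and assuming the pattern from the comment is meant to contain no whitespace. `COMMENT_PATTERN` and `first_match` are hypothetical names; `search` is used because the pattern is unanchored, which is roughly how an online tester highlights matches.

```python
import re

# The lowercase-only, unanchored pattern proposed in the comment.
COMMENT_PATTERN = re.compile(
    r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]"
)

samples = ["mkyong.123", "-mkyong.com", "sub.-mkyong.com", "sub.mkyong-.com",
           "3.141", "foo@demo.net", "mkyong.t.t.t.co", "www.GOOGLE.com"]

def first_match(text):
    """Return the first substring the pattern finds in text, or None."""
    found = COMMENT_PATTERN.search(text)
    return found.group(0) if found else None

for sample in samples:
    print(sample.ljust(20), first_match(sample))
```

Because the pattern is unanchored and accepts digits in the final part, it finds a substring inside most of the invalid names (for example `mkyong.com` inside `-mkyong.com`), while the lowercase-only character classes leave `www.GOOGLE.com` unmatched unless `re.IGNORECASE` is passed.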