Regex to extract subdomain from URL? Answers

【Question title】: Regex to extract subdomain from URL?
【Posted】: 2009-07-27 16:16:13
【Question】:

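I have a bunch of domain names coming in like this:

http://subdomain.example.com (example.com is always example.com, but the subdomain varies).

I need "subdomain".

Could some kind person with the patience to learn regex help me out?

【Discussion】:

Yes, you can have string.string.domain.gtld

Tags: regex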

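【Solution 1】:

The problem with the other regexes in this thread is that if you don't know what the protocol is, or what the domain suffix is, you get some unexpected results. Here is a small regex that accounts for those situations. :D

/(?:http[s]*\:\/\/)*(.*?)\.(?=[^\/]*\..{2,5})/i  //javascript

This should always return your subdomain (if there is one) in group 1. It is shown here in a JavaScript example, but it should also work in any other engine that supports positive lookahead assertions: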

// Example of use
const regex = /(?:http[s]*\:\/\/)*(.*?)\.(?=[^\/]*\..{2,5})/i;
const whoKnowsWhatItCouldBe = [
  "www.mydomain.com/whatever/my-site",        // matches: www
  "mydomain.com",                             // does not match
  "http://mydomain.com",                      // does not match
  "https://mydomain.com",                     // does not match
  "banana.com/somethingelse",                 // does not match
  "https://banana.com/somethingelse.org",     // does not match
  "http://what-ever.mydomain.mu",             // matches: what-ever
  "dev-www.thisdomain.com/whatever",          // matches: dev-www
  "hot-MamaSitas.SomE_doma-in.au.xxx",        // matches: hot-MamaSitas
  "http://hot-MamaSitas.SomE_doma-in.au.xxx", // matches: hot-MamaSitas
  "пуст.пустыня.ru",                          // even non-English chars! Woohoo! matches: пуст
  "пустыня.ru"                                // does not match
];

// Loop over the samples and print group 1 whenever the regex matches.
for (const candidate of whoKnowsWhatItCouldBe) {
  const result = candidate.match(regex);
  if (result !== null) {
    console.log(candidate + " -> " + result[1]);        // YAY! We have a match!
  } else {
    console.log(candidate + " -> no subdomain found");  // Boo... No subdomain was found
  }
}

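【Discussion】:

This is clearly the best answer, since it accounts for the protocol, handles zero or multiple subdomains, and is independent of the domain.
I wonder about the desired output for multiple subdomains... Do you want it to return one.two or just one? I think we could tweak the regex to capture all the (.\.) groups before the domain... maybe later
Nice work, +1. (file:\/\/|http:\/\/|https:\/\/|\/\/)*(.*?)\.(?=[^\/]*\..{2,5}) if you want to allow other protocols
This works in Google Analytics for filtering by subdomain. I had to remove the leading / and the trailing /i: (?:http[s]*\:\/\/)*(.*?)\.(?=[^\/]*\..{2,5})
@WebandFlow, in your example the result SomE_doma-in is the subdomain, isn't it? It's not clear to me what you expected versus what you got. Personally I would expect SomE_doma-in to match...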
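
On the multi-level question raised in the comments above (one.two versus just one), here is a rough sketch of one way to return every subdomain label. It is not from the original answer: allSubdomains is a hypothetical helper, and it assumes the registrable domain is always exactly the last two labels, which is wrong for suffixes such as co.uk.

```javascript
// Hypothetical helper (not from the thread): returns every label to the left
// of the last two, ASSUMING the registrable domain is exactly two labels long.
function allSubdomains(input) {
  // Skip an optional scheme such as "http://", then capture the host up to
  // the first "/", ":", "?" or "#".
  const hostMatch = input.match(/^(?:[a-z][a-z0-9+.-]*:\/\/)?([^\/:?#]+)/i);
  if (hostMatch === null) {
    return "";
  }
  const labels = hostMatch[1].split(".");
  // Drop the last two labels (assumed domain + suffix) and rejoin the rest.
  return labels.slice(0, -2).join(".");
}

console.log(allSubdomains("http://one.two.example.com/path")); // one.two
console.log(allSubdomains("www.mydomain.com/whatever"));       // www
console.log(allSubdomains("mydomain.com") === "");             // true
console.log(allSubdomains("www.example.co.uk"));               // www.example (the two-label assumption fails here)
```

Handling every suffix correctly would need a public-suffix list, which this sketch does not attempt.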

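【Solution 2】:

/(http:\/\/)?(([^.]+)\.)?domain\.com/

If a subdomain is present, $3 (or \3) will contain "subdomain".

If you want the subdomain in the first group and your regex engine supports non-capturing groups (shy groups), use this, as suggested by palindrom:

/(?:http:\/\/)?(?:([^.]+)\.)?domain\.com/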
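
To make the group numbering concrete, here is a small JavaScript check of both patterns. The sample URLs are made up for illustration. When the subdomain is absent, the optional group takes no part in the match, so JavaScript reports it as undefined.

```javascript
// Both patterns are copied from the answer above.
const capturing = /(http:\/\/)?(([^.]+)\.)?domain\.com/;
const nonCapturing = /(?:http:\/\/)?(?:([^.]+)\.)?domain\.com/;

// Illustrative inputs (assumed, not from the thread).
const withSub = "http://sub.domain.com";
const withoutSub = "http://domain.com";

console.log(withSub.match(capturing)[3]);       // sub
console.log(withSub.match(nonCapturing)[1]);    // sub
console.log(withoutSub.match(capturing)[3]);    // undefined
console.log(withoutSub.match(nonCapturing)[1]); // undefined
```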

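【Discussion】:

Yes. He didn't mention a language or library, so I wanted to keep the regex as portable as possible; I'm not sure whether all implementations allow non-capturing groups.
What if you don't know what the domain is?
@DallasClark In that case I would recommend my own answer to this question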

【Solution 3】:

Subdomain string only (result in $1):

^http://([^.]+)\.domain\.com

Making http:// optional (result in $2):

^(http://)?([^.]+)\.domain\.com

Making both http:// and the subdomain optional (result in $3):

(http://)?(([^.]+)\.)?domain\.com
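
If the fixed domain has to come from a variable, the same idea can be sketched in JavaScript as below. The helper names escapeRegExp and subdomainFor are hypothetical, the pattern is anchored at both ends as an extra safety choice, and https is allowed as an assumption; none of that is in the answer itself. Only a single subdomain label is handled.

```javascript
// Hypothetical helper: escape regex metacharacters so that the "." in
// "example.com" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the single-label subdomain, "" when the URL
// is the bare domain, or null when the URL does not match the domain at all.
function subdomainFor(url, domain) {
  const pattern = new RegExp(
    "^(?:https?://)?(?:([^./]+)\\.)?" + escapeRegExp(domain) + "(?:[/:?#]|$)",
    "i"
  );
  const match = url.match(pattern);
  if (match === null) {
    return null;
  }
  return match[1] === undefined ? "" : match[1];
}

console.log(subdomainFor("http://subdomain.example.com", "example.com")); // subdomain
console.log(subdomainFor("example.com/path", "example.com") === "");      // true
console.log(subdomainFor("http://sub.exampleXcom", "example.com"));       // null
```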

【Discussion】:

【Solution 4】:

It should be

\Qhttp://\E(\w+)\.domain\.com

The subdomain will be the first group.
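
One thing to watch with \w+: it does not match hyphens, so a subdomain such as dev-www would be missed. A quick JavaScript comparison follows. JavaScript has no \Q...\E quoting, so the literal prefix is escaped by hand, the patterns are anchored with ^, and the sample URLs are assumptions for illustration.

```javascript
// \w covers letters, digits and underscore only.
const wordOnly = /^http:\/\/(\w+)\.domain\.com/;
// [^.]+ accepts anything up to the first dot, hyphens included.
const anyButDot = /^http:\/\/([^.]+)\.domain\.com/;

const hyphenated = "http://dev-www.domain.com";
const plain = "http://subdomain.domain.com";

console.log(plain.match(wordOnly)[1]);       // subdomain
console.log(hyphenated.match(wordOnly));     // null
console.log(hyphenated.match(anyButDot)[1]); // dev-www
```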

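【Discussion】:

【Solution 5】:

#!/usr/bin/perl

use strict;
use warnings;

my $s = 'http://subdomain.example.com';
# Split on "//" or "."; the pieces are "http:", "subdomain", "example", "com",
# so index 1 is the subdomain.
my $subdomain = (split qr{/{2}|\.}, $s)[1];

print "'$subdomain'\n";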
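
For comparison, roughly the same split-based idea in JavaScript. This is a sketch rather than something the author supplied, and secondPiece is a hypothetical name. It simply takes the second piece, so a URL without a subdomain yields the domain name instead.

```javascript
// Hypothetical helper: split on "//" or "." and take the second piece.
function secondPiece(url) {
  return url.split(/\/{2}|\./)[1];
}

console.log(secondPiece("http://subdomain.example.com")); // subdomain
console.log(secondPiece("http://example.com"));           // example (no subdomain present)
```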

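【Discussion】:

【Solution 6】:

To match subdomains that contain dot characters, I used this:

https?:\/\/?(?:([^*]+)\.)?domain\.com

It captures every character after the protocol, up to the domain.

https://sub.domain.com (sub)

https://sub.sub.domain.com (sub.sub) ...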
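
A short JavaScript check of the pattern above. Because [^*]+ accepts any character except an asterisk, the capture can also run across a slash; the tighter variant shown alongside is a hypothetical alternative, not part of the answer, and the last sample URL is made up to show the difference.

```javascript
// Pattern from the answer, unchanged.
const original = /https?:\/\/?(?:([^*]+)\.)?domain\.com/;
// Hypothetical tighter variant: anchored, and the capture cannot cross a "/".
const tighter = /^https?:\/\/(?:([^\/]+)\.)?domain\.com/;

console.log("https://sub.domain.com".match(original)[1]);         // sub
console.log("https://sub.sub.domain.com".match(original)[1]);     // sub.sub
console.log("https://domain.com".match(original)[1]);             // undefined
console.log("https://other.com/x.domain.com".match(original)[1]); // other.com/x
console.log("https://other.com/x.domain.com".match(tighter));     // null
console.log("https://sub.sub.domain.com".match(tighter)[1]);      // sub.sub
```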

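【Discussion】:

【Solution 7】:

The first group of

http://(.*).example.com

【Discussion】:

Sure, if you forget that .* will also match an empty string and, more importantly, that an unescaped period stands for any character.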
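
Both points from that comment can be seen in a few lines of JavaScript. The inputs are contrived for illustration, and the stricter pattern is one possible fix (an assumption, not something posted in the thread).

```javascript
// Pattern from Solution 7: the dots are unescaped and .* may be empty.
const loose = /http:\/\/(.*).example.com/;
// One possible stricter form: escaped dots and a non-empty, dot-free label.
const strict = /^http:\/\/([^.\/]+)\.example\.com/;

console.log(JSON.stringify("http://.example.com".match(loose)[1])); // "" (empty subdomain accepted)
console.log("http://fooXexampleYcom".match(loose)[1]);              // foo (dots matched X and Y)
console.log("http://.example.com".match(strict));                   // null
console.log("http://fooXexampleYcom".match(strict));                // null
console.log("http://subdomain.example.com".match(strict)[1]);       // subdomain
```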