【Question title】: Determine protocol of the link using Python alternatives
【Posted】: 2016-11-27 04:40:16
【Question】:

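I need to find the best way to determine the protocol used to access a given link. Input: the link address as a string (starting with protocol://...)

This is the most convenient way I have found to implement it: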

def detectProtocol(url):
    ind = url.find("://")
    return url[0:ind] if (ind != -1) else 'default_prot'

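But I am interested in what the best approach is from a performance standpoint. Would matching with re be better? (It is less readable, though.)

Thanks in advance!

P.S. If you have an alternative of your own, feel free to share it.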

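【Comments】:

Rather than inventing your own URL parsing scheme, why not use Python's standard library? Depending on the Python version you are using, you will want either the urllib.parse module (Python 3) or the urlparse module (Python 2).
@Blckknght Thanks for your answer. Actually, I have to use something that works on both 2.7.x and 3.x Python versions. I will look into your suggestion. Update: using the built-ins is a good option, but the problem is that I could not find a way to use them in both Python major versions ;(
Duplicate? stackoverflow.com/questions/3883871/…
@MichaelHoff This question is about comparing the approaches, not about how to actually use regular expressions.

Tags: python regex parsing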

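【Answer 1】:

Performance comparison

This comparison ignores other aspects, such as the robustness of the functions used and synergy effects. For example, urlparse provides more information than just the scheme, so it can also supply data for other needs.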
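To make the point about robustness and extra information concrete, here is a minimal sketch. The example URLs are made up for illustration and are not part of the benchmark. It shows which additional fields urlparse exposes, and one input where the plain find("://") approach and urlparse disagree.

```python
try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

# Hypothetical URL, chosen only to populate every field.
parts = urlparse('https://user@example.com:8080/path/page?q=1#top')
print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /path/page
print(parts.query)     # q=1

# A scheme-less link that carries "://" inside its query string.
tricky = 'www.example.com/?next=http://other.example'
ind = tricky.find('://')
naive = tricky[:ind] if ind != -1 else 'default_prot'
print(naive)                         # www.example.com/?next=http
print(repr(urlparse(tricky).scheme)) # ''
```

Whether this edge case matters depends on the input: if every link is known to start with protocol://, the faster find() variant may be good enough.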

Python 2.7.11+

Testing detect_protocol_by_index
1.56482505798
Testing detect_protocol_by_urlparse
9.13317012787
Testing detect_protocol_by_regex
3.11044311523

Python 3.5.1+

Testing detect_protocol_by_index
1.5673476169999958
Testing detect_protocol_by_urlparse
15.466406801000176
Testing detect_protocol_by_regex
3.0660895540004276

Source

import sys 
import timeit
import re

if sys.version_info >= (3, 0): 
    from urllib.parse import urlparse
else:
    from urlparse import urlparse


def detect_protocol_by_index(url):
    ind = url.find("://")
    return url[0:ind] if (ind != -1) else 'default_prot'

def detect_protocol_by_urlparse(url):
    scheme = urlparse(url).scheme
    return scheme if scheme else 'default_prot'

# Raw string; "/" needs no escaping, and "\/" in a normal string is an invalid escape sequence.
regex = re.compile(r'^[^:]+(?=://)')
def detect_protocol_by_regex(url):
    match = regex.match(url)
    return match.group(0) if match else 'default_prot'

### TEST SETUP ###

test_urls = ['www.example.com', 'http://example.com', 'https://example.com', 'ftp://example.com']

def run_test(func):
    for url in test_urls:
        func(url)

def run_tests():
    funcs = [detect_protocol_by_index, detect_protocol_by_urlparse, detect_protocol_by_regex]
    for func in funcs:
        print("Testing {}".format(func.__name__))
        print(timeit.timeit('run_test({})'.format(func.__name__), setup="from __main__ import run_test, {}".format(func.__name__)))

if __name__ == '__main__':
    run_tests()
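The timings above come from a single timeit.timeit call per function, which uses the default of 1,000,000 iterations, each running all four test URLs. A single run can be noisy, so one possible refinement (a sketch, not part of the original benchmark) is to use timeit.repeat and keep the minimum. The repeat count and the reduced iteration count below are arbitrary choices to keep the example quick.

```python
import timeit


def detect_protocol_by_index(url):
    ind = url.find("://")
    return url[0:ind] if (ind != -1) else 'default_prot'


# Run the measurement 5 times with 100000 calls each; values are illustrative.
timings = timeit.repeat(
    lambda: detect_protocol_by_index('https://example.com'),
    repeat=5,
    number=100000,
)

# The minimum is usually the least disturbed by other activity on the machine.
best = min(timings)
print("best of 5: {:.4f} s per 100000 calls".format(best))
```

The lambda adds a small constant call overhead to every measurement, so this is better suited to comparing functions against each other than to reading absolute numbers.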

【Comments】:

Thank you very much for your answer! This is exactly what I was looking for!

【Answer 2】:

You can use a regular expression for this (r'^[a-zA-Z]+://'), compiling it before you check whether it matches.
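A minimal sketch of that regex approach might look like the following. The names SCHEME_RE and detect_protocol are hypothetical, the capture group is an addition so that the scheme can be returned without the trailing "://", and the 'default_prot' fallback is borrowed from the question.

```python
import re

# Compiled once at import time, so each call only pays for the match itself.
SCHEME_RE = re.compile(r'^([a-zA-Z]+)://')


def detect_protocol(url, default='default_prot'):
    """Return the scheme before '://', or the default if there is none."""
    match = SCHEME_RE.match(url)
    return match.group(1) if match else default


print(detect_protocol('https://example.com'))  # https
print(detect_protocol('www.example.com'))      # default_prot
```

Note that a letters-only pattern does not match schemes containing digits, "+", "-" or ".", such as git+ssh, so the character class may need widening depending on the links you expect.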

But there is a built-in function for this:

import urlparse  # Python 2 module; in Python 3 it lives at urllib.parse
url = urlparse.urlparse('https://www.wwww.com')
print(url.scheme)

Output:

>>> https

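【Comments】:

The built-ins are nice, but there is no way to make them work in both Python 2 and Python 3 (at least I have not found one). What I want to know is which is better: find() or a regex? Performance matters.

【Answer 3】:

If you are looking for a solution that works across Python versions: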

try:
    import urlparse
except ImportError:
    import urllib.parse as urlparse

url = urlparse.urlparse('https://www.example.com')

print(url.scheme)

If you want print to behave the same way in both versions, you can add from __future__ import print_function to the top of the script.
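Putting the cross-version import together with the fallback from the question, one possible helper is sketched below. The function name detect_protocol and its default parameter are assumptions for illustration; 'default_prot' is the placeholder value used in the question.

```python
try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3


def detect_protocol(url, default='default_prot'):
    """Return the URL scheme, or the default when urlparse finds none."""
    scheme = urlparse(url).scheme
    return scheme if scheme else default


# str.format keeps the output identical on Python 2 and 3
# without needing the print_function import.
for link in ('https://www.example.com', 'ftp://example.com', 'www.example.com'):
    print('{} -> {}'.format(link, detect_protocol(link)))
```

The same try/except import works in both 2.7.x and 3.x, which addresses the concern raised in the comments about finding a single form that runs on both major versions.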

【Comments】:

Thanks. A nice way to make things compatible across versions!