【Posted】: 2015-12-16 20:07:37
【Question】:
According to https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2011, the C++11 regular expression engine should be complete in GCC. So can anyone explain why this simple example
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main ()
{
    string string_array[] = {"http://www.cplusplus.com/reference/regex/regex_match/",
                             "tcp://192.168.2.1:1234/hello/how/are/you",
                             "https://mail.google.com/mail/u/0/?tab=wm#inbox/15178022db56df29?projector=1"};
    regex e("^(?:([A-Za-z]+):)?(\\/{0,3})([0-9.\\-A-Za-z]+)(?::(\\d+))?(?:\\/([^?#]*))?(?:\\?([^#]*))?(?:#(.*))?$");
    for(int i=0; i<3; i++)
    {
        smatch sm;
        regex_match (string_array[i],sm,e);
        for (unsigned i=0; i<sm.size(); ++i)
        {
            cout << "[" << sm[i] << "] ";
        }
        cout << endl;
    }
    return 0;
}
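produces this output (note, for example, that the port number on the second line is not parsed correctly, but there seem to be many other errors as well)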
[http://www.cplusplus.com/reference/regex/regex_match/] [http] [//] [www.cplusplus.com/reference/regex] [] [regex_match/] [] []
[tcp://192.168.2.1:1234/hello/how/are/you] [tcp] [//] [192.168.2.1:1234/hello/how/are/you] [] [] [] []
[https://mail.google.com/mail/u/0/?tab=wm#inbox/15178022db56df29?projector=1] [https] [//] [mail.google.com/mail/u/0/?tab=wm] [] [] [] [inbox/15178022db56df29?projector=1]
while its Python counterpart
import re
string_array = ["http://www.cplusplus.com/reference/regex/regex_match/",
                "tcp://192.168.2.1:1234/hello/how/are/you",
                "https://mail.google.com/mail/u/0/?tab=wm#inbox/15178022db56df29?projector=1"]
e = re.compile("^(?:([A-Za-z]+):)?(\\/{0,3})([0-9.\\-A-Za-z]+)(?::(\\d+))?(?:\\/([^?#]*))?(?:\\?([^#]*))?(?:#(.*))?$")
for i in range(len(string_array)):
    m = e.match(string_array[i])
    print(m.groups())
prints the correct result:
('http', '//', 'www.cplusplus.com', None, 'reference/regex/regex_match/', None, None)
('tcp', '//', '192.168.2.1', '1234', 'hello/how/are/you', None, None)
('https', '//', 'mail.google.com', None, 'mail/u/0/', 'tab=wm', 'inbox/15178022db56df29?projector=1')
I am using gcc 5.3.0 on Arch Linux.
Edit:
I changed the program to the following and checked the regex syntax_option_type flags:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main ()
{
    string string_array[] = {"http://www.cplusplus.com/reference/regex/regex_match/",
                             "tcp://192.168.2.1:1234/hello/how/are/you",
                             "https://mail.google.com/mail/u/0/?tab=wm#inbox/15178022db56df29?projector=1"};
    regex e("^(?:([A-Za-z]+):)?(\\/{0,3})([0-9.\\-A-Za-z]+)(?::(\\d+))?(?:\\/([^?#]*))?(?:\\?([^#]*))?(?:#(.*))?$");
    for(int i=0; i<3; i++)
    {
        smatch sm;
        cout << "match: " << regex_match (string_array[i],sm,e) << endl;
        for (unsigned i=0; i<sm.size(); ++i)
        {
            cout << "[" << sm[i].str() << "] ";
        }
    }
    cout << endl;
    switch(e.flags())
    {
    case regex_constants::basic:
        cout << "POSIX syntax was used" << endl;
        break;
    case regex_constants::awk:
        cout << "POSIX awk syntax was used" << endl;
        break;
    case regex_constants::ECMAScript:
        cout << "ECMA syntax was used" << endl;
        break;
    case regex_constants::egrep:
        cout << "POSIX egrep syntax was used" << endl;
        break;
    }
    return 0;
}
To my surprise, this is what I got:
match: 1
[http://www.cplusplus.com/reference/regex/regex_match/] [http] [//] [www.cplusplus.com/reference/regex] [] [regex_match/] [] [] match: 1
[tcp://192.168.2.1:1234/hello/how/are/you] [tcp] [//] [192.168.2.1:1234/hello/how/are/you] [] [] [] [] match: 1
[https://mail.google.com/mail/u/0/?tab=wm#inbox/15178022db56df29?projector=1] [https] [//] [mail.google.com/mail/u/0/?tab=wm] [] [] [] [inbox/15178022db56df29?projector=1]
ECMA syntax was used
This really does look like a compiler bug...
【Discussion】:
-
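regex_match requires the entire string to match, whereas re.match only requires a match at the start of the string. -
@stribizhev OK, but since regex_match returns true, what is the difference compared with re.match (which returns a match object)?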
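
To make the distinction in this exchange concrete, here is a small Python sketch. The `scheme` pattern and the `URL` constant are illustrative and not part of the question. `re.fullmatch` is the closest Python analogue of `std::regex_match`, while `re.match` only anchors at the start of the string. Because the question's pattern is wrapped in `^...$`, both calls have to consume the whole string, so for that pattern they should agree.

```python
import re

URL = "tcp://192.168.2.1:1234/hello/how/are/you"

# Illustrative unanchored pattern: matches only the scheme.
scheme = re.compile(r"[A-Za-z]+")

prefix_hit = scheme.match(URL)     # succeeds: "tcp" is a prefix of URL
whole_hit = scheme.fullmatch(URL)  # None: the pattern does not cover the whole string

# The question's pattern is anchored with ^ and $, so match() and
# fullmatch() must both consume the entire string.
question_re = re.compile(
    r"^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?"
    r"(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$"
)
same_groups = question_re.match(URL).groups() == question_re.fullmatch(URL).groups()

print(prefix_hit.group(0), whole_hit, same_groups)
```

So the full-match versus prefix-match difference cannot by itself explain the differing capture groups here.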
-
You have another problem: the hyphen in your C++ regex
[0-9.\-A-Za-z]+ is properly escaped. -
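
One way to probe the hyphen question is sketched below in Python. It compares the escaped class from the question with a variant in which the hyphen is left unescaped, so that `.-A` is read as a character range (a range that happens to include `/`, `:`, `=` and `?`). The unescaped variant yields the same capture groups as the GCC 5.3.0 output in the question. This only shows that the output is consistent with the escape being ignored; it is a hypothesis, not a confirmed diagnosis of libstdc++. The names `HEAD`, `TAIL`, `escaped` and `as_range` are illustrative.

```python
import re

URLS = [
    "http://www.cplusplus.com/reference/regex/regex_match/",
    "tcp://192.168.2.1:1234/hello/how/are/you",
    "https://mail.google.com/mail/u/0/?tab=wm#inbox/15178022db56df29?projector=1",
]

# Shared parts of the question's pattern, split around the host class.
HEAD = r"^(?:([A-Za-z]+):)?(\/{0,3})"
TAIL = r"(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$"

# Hyphen escaped, as written in the question: a literal '-'.
escaped = re.compile(HEAD + r"([0-9.\-A-Za-z]+)" + TAIL)
# Hyphen unescaped (assumption for the experiment): '.-A' becomes a range.
as_range = re.compile(HEAD + r"([0-9.-A-Za-z]+)" + TAIL)

for url in URLS:
    print("escaped :", escaped.match(url).groups())
    print("as range:", as_range.match(url).groups())
```

If this hypothesis holds, moving the hyphen to the end of the class (`[0-9.A-Za-z-]`) would be one workaround to try, since a trailing hyphen needs no escape.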
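Well, I don't think there is technically anything wrong with the C++ regex. Some of its character classes can match across lines, but otherwise you only use one string at a time, so it should be fine. If some groups are not populated, it is because those optional groups happened not to match while the regex as a whole succeeded.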
"^(?:([A-Za-z]+):)?(/{0,3})([0-9.\\-A-Za-z]+)(?::(\\d+))?(?:/([^?#\\r\\n]*))?(?:\\?([^#\\r\\n]*))?(?:\\#(.*))?$"