How to extract a headline from a URL?

【Question title】: How to extract a headline from a URL?
【Posted】: 2016-10-24 18:27:19
【Question】:

I have a dataset of headline URLs, for example:

http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb

http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto

http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite

http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj

http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html

http://www.stack.com/2013/11/13/tech/the-good-one.html

http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14

I need to extract the correct headline from each of these links, namely:

this-is-a-very-nice-headline-my-friend
another-very-nice
hello-another-one-here
hello-one-here-that-is-cool
the-real-one
the-good-one
hello-world-here-is-a-weird-character

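So the rule seems to be: find the longest string of the form word1-word2-word3- that has a / on its left and right border, ignoring words that contain more than 3 digits (such as acjhrjk-2e1-1krjke4-9el8c-2eheje in the first link, or 54216 in the third one) and excluding things like .html.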

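How can I do this with regex in Python? Unfortunately, I believe regex is the only workable solution here. Packages such as yurl or urlparse can capture the path of the URL, but after that I am back to using regex to get the headline.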

Many thanks!

【Comments】:

Tags: python regex string url-parameters urlparse

【Solution 1】:

Regex may not be your best option after all.
However, given the specification you provided, you could do the following:

import re

urls = ['http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb',
'http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto',
'http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite',
'http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj',
'http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html',
'http://www.stack.com/2013/11/13/tech/the-good-one.html',
'http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14']

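# Candidate: word characters and hyphens preceded by "/" and followed by
# one of . ? / # or by the end of the string.
regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')
# Runs of three or more digits, together with any adjacent hyphens.
digits = re.compile(r'-?\d{3,}-?')

for url in urls:
    substrings = regex.findall(url)
    longest = max(substrings, key=len)  # on a tie, max() keeps the first candidate
    headline = re.sub(digits, '', longest)
    print(headline)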

This prints:

 this-is-a-very-nice-headline-my-friend
 another-very-nice
 hello-another-one-here
 hello-one-here-that-is-coollMyQjAxMTAHFJELMDgxWj
 the-real-one
 the-good-one
 hello-world-here-is-a-weird-character

See a demo on ideone.com.

Explanation

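Here, the regex uses lookarounds to require a / immediately before the match and one of .?/# (or the end of the string) immediately after it. Any word characters and dashes in between are captured.
This is not very specific, but if you are looking for the longest substring and then strip runs of three or more consecutive digits, it may be a good starting point. Note that the digit pattern also removes the hyphens on both sides of the run, which is why the fourth result comes out as coollMyQjAxMTAHFJELMDgxWj rather than ending at cool.
As already said in the comments, you might be better off using linguistic tools.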
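As a possible refinement of the digit-stripping step, here is a minimal sketch that filters hyphen-separated tokens one by one instead of deleting digit runs from the joined string. It is not part of the answer above. The function name extract_headline and the rule "a headline word is made of lowercase ASCII letters only" are assumptions chosen to fit the seven sample URLs; the path is taken with urlsplit from the standard library, so the query string and fragment never reach the matching step.

```python
import re
from urllib.parse import urlsplit

# Assumption: a headline word consists of lowercase ASCII letters only.
# Tokens with digits, uppercase letters or underscores are treated as IDs.
WORD = re.compile(r"[a-z]+")

def extract_headline(url):
    """Return the longest run of plain words found in any path segment."""
    path = urlsplit(url).path  # scheme, host, query and fragment are dropped
    best = ""
    for segment in path.split("/"):
        stem = segment.split(".", 1)[0]  # strip an extension such as .html
        words = [t for t in stem.split("-") if WORD.fullmatch(t)]
        candidate = "-".join(words)
        if len(candidate) > len(best):
            best = candidate
    return best

print(extract_headline(
    "http://www.stack.com/article_email/"
    "hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj"
))
# prints: hello-one-here-that-is-cool
```

The trade-off is that a token dropped from the middle of a segment silently joins the words on either side of it, and a genuine headline word containing a digit or a capital letter would be lost, so the word rule would need adjusting for real data.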

【Discussion】:

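Thanks!! So it doesn't capture the second headline?
Also, why do you say regex might not be my best option?
Something like nltk, for example?
@Noobie: Yes, although I have never used it before. Probably a combination of both. To come up with a better answer, please provide some more URLs.
@Noobie: No, it looks for any one of them (it is a character class, after all).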
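To illustrate the character-class point, the snippet below runs the answer's pattern on the second sample URL and lists every candidate it finds. The variable name candidates is only illustrative. Each candidate is followed by a single one of . ? / # (or by the end of the string), and the headline is picked up because it is followed by the dot of .html.

```python
import re

# Same pattern as in the answer: "/" behind, one of . ? / # (or end) ahead.
regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')

url = ('http://www.stackoverflow.com/2015/07/15/sports/baseball/'
       'another-very-nice.html?smid=tw-somedia&seid=auto')

candidates = regex.findall(url)
print(candidates)
# prints: ['www', '2015', '07', '15', 'sports', 'baseball', 'another-very-nice']

print(max(candidates, key=len))
# prints: another-very-nice
```

Nothing from the query string appears in the list because none of its parts is directly preceded by a /.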