【Title】: Extract the informative part of a URL in Python
【Posted】: 2019-08-27 03:45:11
【Question】:

I have a list of 200k URLs, with the general format:

http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....

The number of / separators before and after the-headline-of-the-article varies.

Here is some sample data:

'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
 'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
 'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
 'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
 'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
 'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
 'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
 'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
 'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
 'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
 'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
 'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
 'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',

I only want to extract the-headline-of-the-article,

i.e.

call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story

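I am sure this is possible, but I am relatively new to using regex in Python.

In pseudocode, I was thinking:

  • split everything on /

  • keep only the chunks that contain -

  • replace every - with \s

Is this doable in Python (I am a Python n00b)?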

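【Comments】:

  • Your algorithm is fine. Why don't you go ahead and implement it?
  • I don't think your algorithm is enough to return the right segment for the tucson sample. You may need to extract the words from each path segment and return the segment with the most words that can be found in an English dictionary.
  • The first and third URLs are inconsistent with the rest.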
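
The dictionary idea from the second comment could be sketched roughly as follows. This is only an illustration: KNOWN_WORDS is a hypothetical, hand-picked word set standing in for a real English dictionary, and dictionary_score and best_segment are made-up helper names.

```python
# Hypothetical mini word list; a real run would load a full English dictionary.
KNOWN_WORDS = {"correction", "investigations", "lawsuit", "story", "news", "national"}


def dictionary_score(segment):
    """Count the hyphen-separated tokens of a path segment found in KNOWN_WORDS."""
    return sum(token.lower() in KNOWN_WORDS for token in segment.split("-"))


def best_segment(url):
    """Return the non-empty '/'-separated piece with the most known words."""
    segments = [s for s in url.split("/") if s]
    return max(segments, key=dictionary_score)


url = ("https://tucson.com/news/national/"
       "correction-trump-investigations-sater-lawsuit-story/"
       "article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html")
print(best_segment(url))
```

With this toy word set the headline segment wins over the article_… segment, whose tokens match nothing. How well the idea holds up on 200k URLs would depend on the dictionary actually used.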

Tags: python regex url


【Solution 1】:
urls = [...]
for url in urls:
    bits = url.split('/') # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit] # [1]
    print(bits_with_hyphens)

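[1] Note that your algorithm assumes that only one piece will contain a hyphen once the URL is split, which is not true for your examples. So at [1] I keep every bit that does.

Output: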

['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']

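PS. I think your algorithm could use some more thought. The problems I see:

  • More than one bit may contain hyphens, where:
    • both contain only dictionary words (see the first and fourth outputs)
    • one of them is "obviously" not the headline (see the second and third from the bottom)
  • Spurious string fragments at the end of the real headline, e.g. "13721842.php", "revenues.asp", "210002719.html"
  • Characters other than "-" also need to be replaced with spaces (see "General+News" in the fourth output)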
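
Two of these points (file extensions and the "+" separator) could be handled per segment along the following lines. This is a minimal sketch, not part of the answer's code: clean_segment and EXTENSION are hypothetical names, and it assumes that a trailing dot followed only by letters is a file extension.

```python
import re

# Assumption: a trailing ".ext" made only of letters is a file extension.
EXTENSION = re.compile(r"\.[a-z]+$", re.IGNORECASE)


def clean_segment(segment):
    """Drop a trailing file extension, then turn '-' and '+' into spaces."""
    without_ext = EXTENSION.sub("", segment)
    return re.sub(r"[-+]", " ", without_ext)


print(clean_segment("BP-General+News"))
print(clean_segment("indian-gaming-industry-grows-in-revenues.asp"))
```

Trailing numeric IDs such as "13721842" are left alone here, so they would still need separate treatment.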

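【Comments】:

  • national news is not a headline!
  • As the answer states, that is deliberate. In the general case you cannot know, right? Though coming up with heuristics may be feasible, e.g. if more than one extracted fragment remains, discard any fragment with only one hyphen.
  • Is there a way to keep the bit with the most hyphens, i.e. take the max count? I assume Python has a count function for strings.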
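
For the last comment: str.count does exist, and combined with max() it gives a short way to keep the bit with the most hyphens. A minimal sketch (most_hyphenated is a made-up name):

```python
def most_hyphenated(url):
    """Return the '/'-separated piece of url that contains the most hyphens."""
    return max(url.split("/"), key=lambda bit: bit.count("-"))


url = ("http://catholicphilly.com/2019/03/news/national-news/"
       "call-to-end-affordable-care-act-is-immoral-says-cha-president/")
print(most_hyphenated(url).replace("-", " "))
```

On the sample data this picks the headline each time, but the margin can be thin: in the tucson URL the headline has 5 hyphens and the article_… segment has 4, so a shorter headline would lose to the ID.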
【Solution 2】:

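Here is a slightly different variant which seems to produce good results on the samples you provided.

In the parts that contain a dash, we trim off any trailing hex strings and the file extension; then we pick the part with the most dashes from each URL, and finally replace the remaining dashes with spaces.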

import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))

Output:

call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision

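My original attempt only cleaned the digits off the end after extracting the longest segment, and it worked for your samples; but trimming trailing digits right away when splitting is probably more robust against variations in these patterns.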
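
To see what the trimming step does on its own, the same pattern can be applied to single segments. The wrapper name trim and the constant TRAILING_JUNK are only for illustration; the third call uses a made-up headline to show a side effect of the hex character class.

```python
import re

# Same pattern as in the answer above.
TRAILING_JUNK = re.compile(r"(-[0-9a-f]+)*(\.[a-z]+)?$", re.IGNORECASE)


def trim(segment):
    """Strip trailing '-<hex>' groups and a trailing file extension."""
    return TRAILING_JUNK.sub("", segment)


print(trim("why-mirza-international-limited-nse-233259149.html"))
print(trim("article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html"))
print(trim("man-found-dead"))  # hypothetical headline
```

Because [0-9a-f] also matches ordinary words spelled only with the letters a to f, a headline ending in a word like "dead" loses that word. The same mechanism is why "456-69429" and "2018-2025" disappear from the outputs above, even though the question lists "…hits-456-69429" as the wanted result.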

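【Comments】:

  • Closest solution. I had thought along the same lines, but in some examples the longest string after splitting is not the headline bit. I was thinking of counting the - in each chunk and taking the max, assuming headlines will have more than 3 -, whereas non-headline chunks sometimes have 2 -.
  • Judging from your samples, taking the last segment after removing long hexadecimal or numeric sequences might also be correct. Of course, without more samples this is speculative.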
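
The "take the last one" variant from the second comment might look like this. It is a sketch under assumptions: it reuses the answer's regex, the function name last_hyphenated is made up, and it goes through urllib.parse.urlsplit so that a hyphen in the host name is never mistaken for a headline.

```python
import re
from urllib.parse import urlsplit

TRAILING_JUNK = re.compile(r"(-[0-9a-f]+)*(\.[a-z]+)?$", re.IGNORECASE)


def last_hyphenated(url):
    """Return the last path segment that still has a hyphen after trimming, or None."""
    path = urlsplit(url).path  # ignore scheme and host name
    trimmed = [TRAILING_JUNK.sub("", seg) for seg in path.split("/")]
    candidates = [seg for seg in trimmed if "-" in seg]
    return candidates[-1].replace("-", " ") if candidates else None


print(last_hyphenated(
    "https://www.valleymorningstar.com/news/elections/"
    "top-firms-decry-religious-exemption-bills-proposed-in-texas/"
    "article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html"))
```

After trimming, the article_… segment has no hyphen left, so the headline is the last candidate. Whether "last" beats "most dashes" on the full list is, as the comment says, speculative.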
【Solution 3】:

The URL patterns are inconsistent: the first and third URLs follow a different pattern from the rest.

Using rsplit():

s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
 'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
 'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
 'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
 'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
 'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
 'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
 'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
 'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
 'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
 'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
 'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
 'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']



for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':  # trailing slash: 1st and 3rd URLs
        if url.rsplit('/', 2)[1].isdigit():  # 3rd URL: numeric ID after the headline
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])  # all URLs except the 1st and 3rd

Output:

call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision

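【Comments】:

  • Hard-coding the special cases is probably not a good idea, since this has to cope with other variations not shown in the samples the OP provided.
  • @tripleee Indeed, I won't extend the current approach. Will add an alternative.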
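
One way to avoid the hard-coded cases while keeping the same idea is to walk the pieces from the right and skip anything empty or purely numeric. This is only a sketch with a made-up function name, not the alternative the answerer promised:

```python
def last_non_numeric_segment(url):
    """Return the last '/'-separated piece that is neither empty nor all digits."""
    for piece in reversed(url.split("/")):
        if piece and not piece.isdigit():
            return piece.replace("-", " ")
    return None


print(last_non_numeric_segment(
    "https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-"
    "investors-radar-as-key-momentum-reading-hits-456-69429/149601/"))
```

It handles any number of trailing slashes or numeric IDs, but it shares the limitation visible in the output above: for the valleymorningstar and tucson URLs it still returns the article_… piece rather than the headline.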