【Question title】: Exclude href links in search results using regular expressions
【Posted】: 2018-08-22 15:14:33
【Question description】:

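I am trying to exclude certain links from my Google API search results. I build a regular expression from each entry in the links_to_exclude list and test every returned link against it. This approach still outputs the links I do not want.

Some of the links returned: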

https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html

https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn

https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news

How can I exclude these links using regular expressions?

links_to_exclude = ['cnn.com', 'nytimes.com']

for item in search_terms:
    results = google_search(item, api_key, cse_id, num=1)
    for result in results:
        rtn_link = result.get('link')
        for link in links_to_exclude:
            regex = '((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
            if re.search(regex, rtn_link):
                continue
            else:
                pprint.pprint(result.get('link'))

【Question discussion】:

    Tags: python regex python-3.x list google-api


    【Solution 1】:

    Your regular expression appears to be correct. I think you are just missing import re in your script.

    See here: https://ideone.com/Uzcf1K

    import re

    links_to_exclude = ['cnn.com', 'nytimes.com']
    results = [
        'https://foo.bar',
        'https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html',
        'https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn',
        'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news',
    ]

    # Test every URL against the pattern built for each excluded domain.
    for result in results:
        print("URL: " + result)
        for link in links_to_exclude:
            # Raw string so that \/ and \s reach the regex engine unchanged.
            regex = r'((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
            if re.search(regex, result):
                print('  Matches: ' + link)
            else:
                print('  Does not match: ' + link)
    

    Output:

    URL: https://foo.bar
      Does not match: cnn.com
      Does not match: nytimes.com
    URL: https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html
      Matches: cnn.com
      Does not match: nytimes.com
    URL: https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn
      Matches: cnn.com
      Does not match: nytimes.com
    URL: https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news
      Does not match: cnn.com
      Matches: nytimes.com
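
The per-domain check above reports a match or non-match for every (URL, domain) pair. To turn it into a filter, a URL has to be dropped as soon as any excluded domain matches. With an if/else inside the inner loop, a URL is reported once for every domain it does not match. The following is a minimal sketch of one way to do this, not the original poster's code. The names exclude_patterns, is_excluded and filter_links are hypothetical, and the pattern is a simplified stand-in that anchors on the host part of the URL and uses re.escape so the dot in each domain is matched literally.

```python
import re

links_to_exclude = ['cnn.com', 'nytimes.com']

# One compiled pattern per excluded domain (hypothetical helper list).
# The optional group allows subdomains such as "money." or "www.".
exclude_patterns = [
    re.compile(
        r'^(?:https?|ftp)://(?:[^/\s:]+\.)?'
        + re.escape(domain)
        + r'(?:[:/?#]|$)'
    )
    for domain in links_to_exclude
]


def is_excluded(url):
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    return any(pattern.search(url) for pattern in exclude_patterns)


def filter_links(urls):
    """Keep only the URLs that match none of the excluded domains."""
    return [url for url in urls if not is_excluded(url)]


results = [
    'https://foo.bar',
    'https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html',
    'https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn',
    'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news',
]

print(filter_links(results))  # prints ['https://foo.bar']
```

In the original loop, the same idea would amount to calling something like is_excluded(rtn_link) once per result and printing the link only when it returns False.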
    

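    【Discussion】:

    • You are right, my code is correct. The problem was caused by the search results I was getting back from the Google API. Thanks for the sanity check on my regex.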