【Question title】: URL redirection problem
【Posted】: 2011-01-16 17:41:45
【Question description】:

I have the following URL:

http://bit.ly/cDdh1c

When you enter the URL above in a browser and press Enter, it redirects to the following URL: http://www.kennystopproducts.info/Top/?hop=arnishad

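But when I try to find the base URL of the same address, http://bit.ly/cDdh1c (that is, the URL left after following all redirects), with a Python program (the code is shown below), I get http://www.cbtrends.com/ as the base URL. See the log file below.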

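Why does the same URL behave differently in a browser and in the Python program? What should I change in the Python program so that it ends up at the correct URL? I would like to understand how this strange behavior comes about.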

Another URL for which I observed similar behavior:

  1. http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via the browser)
  2. http://bit.ly/bEKyOx ====> http://www.ebay.com (via the Python program)

          maxattempts = 5
          turl = url
          while maxattempts > 0:
              host, path = urlparse.urlsplit(turl)[1:3]
              if len(host.strip()) == 0:
                  return None

              try:
                  connection = httplib.HTTPConnection(host, timeout=10)
                  connection.request("HEAD", path)
                  resp = connection.getresponse()
              except:
                  return None
              maxattempts = maxattempts - 1
              if (resp.status >= 300) and (resp.status <= 399):
                  self.logger.debug("The present %s is a redirection one" % turl)
                  turl = resp.getheader('location')
              elif (resp.status >= 200) and (resp.status <= 299):
                  self.logger.debug("The present url %s is a proper one" % turl)
                  return turl
              else:
                  # some problem with this url
                  return None
          return None
    

Log file for your reference:

2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/

【Comments】:

    Tags: python url redirect bit.ly


    【Solution 1】:

    Your problem is that when you call urlsplit, your path variable contains only the path; the query string is missing.

    So try this instead:

    import httplib
    import urlparse

    def getUrl(url):
        """Follow redirects from url; return the final URL, or None on failure."""
        maxattempts = 10
        turl = url
        while maxattempts > 0:
            host, path, query = urlparse.urlsplit(turl)[1:4]
            if len(host.strip()) == 0:
                return None
            # Keep the query string: dropping it changes what the server returns.
            target = path or '/'
            if query:
                target += '?' + query
            try:
                connection = httplib.HTTPConnection(host, timeout=10)
                connection.request("GET", target)
                resp = connection.getresponse()
            except Exception:
                # connection or protocol error
                return None
            maxattempts = maxattempts - 1
            if 300 <= resp.status <= 399:
                turl = resp.getheader('location')
            elif 200 <= resp.status <= 299:
                return turl
            else:
                # some problem with this url
                return None
        return None

    print getUrl('http://bit.ly/cDdh1c')
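
The effect of dropping the query can also be reproduced without touching the network. The sketch below is an illustration, not the code from either answer: `follow`, `fake_head`, and the `REDIRECTS` table are hypothetical names, and the table is a hand-written stand-in that only mirrors the outcome described in the question (the `productid` value is shortened to `X`, and the real chain may contain more hops). It assumes Python 3's `urllib.parse`, falling back to the Python 2 `urlparse` module.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2

# Hypothetical stand-in for the real servers: (host, request target) -> Location.
# A key that is absent means "200 OK, no further redirect".
REDIRECTS = {
    ("bit.ly", "/cDdh1c"):
        "http://www.cbtrends.com/get-product.html?productid=X&affid=arnishad",
    ("www.cbtrends.com", "/get-product.html?productid=X&affid=arnishad"):
        "http://www.kennystopproducts.info/Top/?hop=arnishad",
    ("www.cbtrends.com", "/get-product.html"):
        "http://www.cbtrends.com/",
}


def fake_head(host, target):
    """Return (status, location) the way a HEAD request would, without any I/O."""
    location = REDIRECTS.get((host, target))
    return (301, location) if location else (200, None)


def follow(url, head, keep_query=True, max_hops=5):
    """Follow redirects via head(host, target); return the final URL or None."""
    for _ in range(max_hops):
        parts = urlsplit(url)
        if not parts.netloc:
            return None
        # Build the request target; an empty path is requested as "/".
        target = parts.path or "/"
        if keep_query and parts.query:
            target += "?" + parts.query
        status, location = head(parts.netloc, target)
        if 300 <= status <= 399 and location:
            url = location
        elif 200 <= status <= 299:
            return url
        else:
            return None
    return None


print(follow("http://bit.ly/cDdh1c", fake_head, keep_query=False))
print(follow("http://bit.ly/cDdh1c", fake_head, keep_query=True))
```

With the query dropped, the stand-in ends at http://www.cbtrends.com/, as in the log; with the query kept, it ends at http://www.kennystopproducts.info/Top/?hop=arnishad, as in the browser. Passing the request function in as `head` is only a convenience so the loop can be exercised offline.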
    

    【Discussion】:

      【Solution 2】:

      Your problem comes from this line:

      host,path = urlparse.urlsplit(turl)[1:3]
      

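      You are ignoring the query string. So, in the sample log you provided, the second HEAD request you make goes to http://www.cbtrends.com/get-product.html, without the GET parameters. Open that URL in a browser and you will see that it redirects to http://www.cbtrends.com/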
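
To make this concrete, here is a small sketch of what `urlsplit` returns for the second URL in the log. It assumes Python 3's `urllib.parse`, with a fallback to the Python 2 `urlparse` module used in the question; the variable names are illustrative only.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as in the question

url = ("http://www.cbtrends.com/get-product.html"
       "?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D"
       "&affid=arnishad&tid=arnishad"
       "&utm_source=twitterfeed&utm_medium=twitter")

parts = urlsplit(url)

# Indexes 1 and 2 are the host and the path; the query lives at index 3.
host, path = parts[1:3]
query = parts[3]

print(host)   # www.cbtrends.com
print(path)   # /get-product.html
print(query)  # the full query string, starting with productid=
```

The slice `[1:3]` stops before the query, so a request built from `path` alone asks the server for `/get-product.html` with no parameters.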

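      You have to include the query component of the tuple returned by urlsplit when computing the path. The fragment is not sent to the server, so it can be left out.

      parts = urlparse.urlsplit(turl)
      host = parts[1]
      path = parts[2]
      if parts[3]:
          # append the query string only when there is one
          path += "?" + parts[3]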
      

      【Discussion】:
