【Question Title】: Detecting honest web crawlers
【Posted】: 2010-10-07 08:45:03
【Description】:

I would like to detect (on the server side) which requests come from bots. I don't care about malicious bots at this point, just the ones that play nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like "bot", but that seems awkward, incomplete, and unmaintainable. So does anyone have a more solid approach? If not, do you have any resources you use to keep up to date with all the friendly user agents?

In case you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we would always give them the same version so that the index is consistent.

Also, I'm using Java, but I would imagine the approach would be similar for any server-side technology.
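For the Java case specifically, a minimal sketch of the keyword approach combined with the variant-pinning described above might look like this. The class name, keyword list, and variant logic are all illustrative assumptions, not a complete catalogue of crawler names:

```java
import java.util.Random;
import java.util.regex.Pattern;

// Hypothetical helper: keyword-based crawler detection so crawlers can be
// pinned to a single page variant while humans get a random one.
class CrawlerCheck {
    // Small, maintainable keyword list; extend it as new crawlers appear.
    private static final Pattern CRAWLER = Pattern.compile(
            "bot|crawler|spider|slurp|curl|wget|baiduspider|ia_archiver",
            Pattern.CASE_INSENSITIVE);
    private static final Random RANDOM = new Random();

    static boolean isCrawler(String userAgent) {
        return userAgent != null && CRAWLER.matcher(userAgent).find();
    }

    // Crawlers always see variant 0 so the index stays consistent;
    // humans are assigned a random variant.
    static int pickVariant(String userAgent, int variantCount) {
        return isCrawler(userAgent) ? 0 : RANDOM.nextInt(variantCount);
    }
}
```

In a servlet you would feed it `request.getHeader("User-Agent")`; keeping the compiled `Pattern` in a static final avoids recompiling it on every request.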

【Question Discussion】:

    Tags: c# web-crawler bots


    【Solution 1】:
    void CheckBrowserCaps()
        {
            String labelText = "";
            System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
            if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler)
            {
                labelText = "Browser is a search engine.";
            }
            else
            {
                labelText = "Browser is not a search engine.";
            }
    
            Label1.Text = labelText;
        }
    

    HttpCapabilitiesBase.Crawler Property

    【Discussion】:

      【Solution 2】:

      You say matching the user agent on "bot" may be awkward, but we've found it to be an excellent match. Our studies have shown that it will cover about 98% of the hits you receive, and we haven't come across any false positive matches with it yet either. If you want to raise that to 99.9%, you can include a few other well-known matches such as "crawler", "baiduspider", "ia_archiver", "curl", etc. We've tested this on our production systems over millions of hits.

      Here are a few C# solutions for you:

      1) Simplest

      Fastest on a miss, i.e. traffic from non-bots (normal users). Catches 99+% of crawlers.

      bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
      

      2) Medium

      Fastest on a hit, i.e. traffic from bots, and still fast on a miss. Catches close to 100% of crawlers. Matches "bot", "crawler" and "spider" up front; you can add any other known crawlers to the list.

      List<string> Crawlers3 = new List<string>()
      {
          "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
          "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
          "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
          "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
          "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
          "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
          "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
          "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
          "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
          "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
          "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
          "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
          "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
          "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
          "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
          "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
          "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
          "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
          "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
      };
      string ua = Request.UserAgent.ToLower();
      bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
      

      3) Paranoid

      Fairly fast, but a little slower than options 1 and 2. It's the most accurate, and lets you maintain the lists as needed. You can keep a separate list of names containing "bot" if you're worried about false positives in the future. If we get a short match, we log it and check it for a false positive.

      // crawlers that have 'bot' in their useragent
      List<string> Crawlers1 = new List<string>()
      {
          "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
          "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
          "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
          "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
          "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
          "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
          "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
          "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
          "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
      };
      
      // crawlers that don't have 'bot' in their useragent
      List<string> Crawlers2 = new List<string>()
      {
          "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
          "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
          "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
          "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
          "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
          "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
          "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
          "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
          "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
          "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
          "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
          "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
          "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
          "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
          "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
          "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
          "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
          "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
          "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
          "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
          "legs","curl","webs","wget","sift","cmc"
      };
      
      string ua = Request.UserAgent.ToLower();
      string match = null;
      
      if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
      else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
      
      if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);
      
      bool iscrawler = match != null;
      

      Notes:

      • It's tempting to keep adding names to regex option 1, but if you do, it gets slower. If you want a more complete list, then linq with a lambda is faster.
      • Make sure .ToLower() is outside your linq method - remember the method is a loop, and you would be re-lowercasing the string on every iteration.
      • Always put the heaviest-hitting bots at the start of the lists so they match sooner.
      • Put the lists into a static class so that they are not rebuilt on every pageview.
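      Since the question itself is Java-based, the split-list idea above can be sketched in Java too, with the lists held as static finals so they are built once per classload. The lists here are abbreviated placeholders for the full lists above, and the class name is hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical Java port of the two-list lookup: one check decides which
// list to scan, and the static lists are built only once.
class CrawlerLists {
    // Abbreviated for illustration; populate from the full lists above.
    private static final List<String> WITH_BOT = Arrays.asList(
            "googlebot", "bingbot", "yandexbot", "msnbot", "bot");
    private static final List<String> WITHOUT_BOT = Arrays.asList(
            "baiduspider", "yahoo! slurp", "ia_archiver", "crawler", "curl", "wget");

    static boolean isCrawler(String userAgent) {
        if (userAgent == null) return false;
        String ua = userAgent.toLowerCase(); // lower-case once, outside the loop
        List<String> candidates = ua.contains("bot") ? WITH_BOT : WITHOUT_BOT;
        return candidates.stream().anyMatch(ua::contains);
    }
}
```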

      Honeypots

      The only real alternative to this is to create a "honeypot" link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database and use those logged strings to classify crawlers.

      Positives: It will match some unknown crawlers that aren't declaring themselves.

      Negatives: Not all crawlers dig deep enough to hit every link on your site, so they may never reach your honeypot.

      【Discussion】:

      • Do we have some C# NuGet package which includes a list of headers of the most popular web spiders? I mean, it would be nice to update it from time to time, since some spiders stop working and some change their headers
      • Hmm, none that I know of. Another problem is that some spiders aren't "registered" anywhere and don't even set a user agent string. I could create a spider right now and run it from my PC..
      • Since your list Crawlers1 ends with the entry "bot", the lookup in that list will always succeed whenever ua.Contains("bot") does... so you don't even need to check the list in that case. Either change the list to remove "bot", or, if it is a valid entry, skip the Contains code and just assume it's a bot.
      • Hi Andy, you're right. As per my answer, I left the word "bot" in there as a catch-all, but some may want to remove it if they don't want false positives. If they do keep it, they don't need to do the sub-lookup, as you suggest. I use it to harvest new matches and log them.
      • "It will match some unknown crawlers that aren't declaring themselves." - this could be dangerous if the crawler uses a normal user agent (i.e. pretends to be a regular user).
      【Solution 3】:

      I'm pretty sure a large proportion of bots don't use robots.txt, although that was my first thought.

      It seems to me that the best way to detect a bot is the time between requests: if the time between requests is consistently fast, then it's a bot.
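      The interval idea can be sketched as a per-client sliding window: flag a client once it exceeds some number of requests inside a time window. The class name, threshold, and window below are assumptions to be tuned against real traffic:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical rate check: flag a client as a likely bot when it makes
// more than maxRequests requests within windowMillis.
class RateCheck {
    private final long windowMillis;
    private final int maxRequests;
    private final Map<String, Deque<Long>> timesByClient = new HashMap<>();

    RateCheck(long windowMillis, int maxRequests) {
        this.windowMillis = windowMillis;
        this.maxRequests = maxRequests;
    }

    // Record a request at nowMillis for clientKey (e.g. IP or session id)
    // and report whether the client exceeded the threshold in the window.
    synchronized boolean looksLikeBot(String clientKey, long nowMillis) {
        Deque<Long> times = timesByClient.computeIfAbsent(clientKey, k -> new ArrayDeque<>());
        times.addLast(nowMillis);
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        return times.size() > maxRequests;
    }
}
```

Note this only catches fast crawlers; a polite crawler throttled to human-like speeds would slip through, which is why it is best combined with the user-agent checks above.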

      【Discussion】:

        【Solution 4】:

        Something quick and dirty like this might be a good start:

        return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
        

        Note: Rails code, but the regex is generally applicable.

        【Discussion】:

        • It's all about quick and dirty.. One caveat though: I've found it useful to revisit these kinds of solutions at least yearly and grow the "dirty" list, since it tends to expand. For me, this works well for numbers that only need to be 90%+ accurate..
        【Solution 5】:

        One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users would never see the link, leaving spiders and bots to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs...

        <a href="dontfollowme.aspx"></a>
        

        Many people use this method while running a honeypot to trap malicious bots that don't follow the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
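        Server-side, the handler behind the hidden anchor only needs to record the user agent and feed a blocklist. A hypothetical Java sketch of that capture-and-block flow (real code would persist to a database rather than memory):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical honeypot blocklist: the endpoint behind the invisible
// anchor calls trip(); ordinary requests are then checked with isBlocked().
class Honeypot {
    private final Set<String> trippedAgents =
            Collections.synchronizedSet(new HashSet<>());

    // Called by the handler mapped to the hidden link.
    void trip(String userAgent) {
        if (userAgent != null) trippedAgents.add(userAgent.toLowerCase());
    }

    // Called on ordinary requests to decide whether to block.
    boolean isBlocked(String userAgent) {
        return userAgent != null && trippedAgents.contains(userAgent.toLowerCase());
    }
}
```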

        【Discussion】:

        • Just out of curiosity, this made me wonder whether it could hurt accessibility. Like, could someone accidentally select that anchor with the Tab key and then hit Enter and end up clicking it. Well, apparently not (see jsbin.com/efipa for a quick test), but of course I only tested with a regular browser.
        • You need to be a little careful with techniques like this so that your site doesn't get blacklisted for using black-hat SEO techniques.
        • Also, what if the bot uses a normal user agent, just like any other visitor?
        【Solution 6】:

        You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using this data would be far more effective than just matching bot in the user agent.

        【Discussion】:

          【Solution 7】:

          Any visitor whose entry page is /robots.txt is probably a bot.

          【Discussion】:

          • Or, to be less strict: a visitor who requests robots.txt at all is probably a bot - although there are some Firefox add-ons that fetch it while a human browses.
          • Any bot that goes there is probably a well-behaved, respectful bot, the kind you might actually want visiting your site :-)