Disable RobotServer in crawler4j

【Question title】: Disable RobotServer in crawler4j
【Posted】: 2025-12-15 08:45:02
【Question description】:

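I need to crawl a website periodically to check whether its URLs are still available. I am using crawler4j for this.

My problem comes from some pages that disable robots with <meta name="robots" content="noindex,nofollow" />. That makes sense: given their content, those pages should not be indexed by search engines.

crawler4j does not follow these links even though I disabled the RobotServer configuration. With robotstxtConfig.setEnabled(false); this should be easy: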

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...

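But the pages described above are still not explored. I have read the code, and this should be enough to disable the robots directives, yet it does not work as expected. Maybe I am missing something? I tested with versions 3.5 and 3.6-SNAPSHOT, with the same result.

【Question discussion】:

Tags: crawler4j

【Solution 1】:

I am using the newer version: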

    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>4.1</version>
    </dependency>

With RobotstxtConfig set up like this, it works:

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);

Test results and the Crawler4J source code confirm this:

public boolean allows(WebURL webURL) {
  if (config.isEnabled()) {
    try {
      URL url = new URL(webURL.getURL());
      String host = getHost(url);
      String path = url.getPath();

      HostDirectives directives = host2directivesCache.get(host);

      if ((directives != null) && directives.needsRefetch()) {
        synchronized (host2directivesCache) {
          host2directivesCache.remove(host);
          directives = null;
        }
      }

      if (directives == null) {
        directives = fetchDirectives(url);
      }

      return directives.allows(path);
    } catch (MalformedURLException e) {
      logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
    }
  }

  return true;
}

When enabled is set to false, the whole check is skipped and allows() returns true for every URL.
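To make the short-circuit easier to see in isolation, here is a minimal, self-contained sketch of the same pattern. It is not crawler4j code: RobotsGate, its disallowedPrefixes list, and the example.com URL are hypothetical stand-ins, and the prefix matching is a simplification of real robots.txt handling.

```java
import java.net.URI;
import java.util.List;

// Hypothetical stand-in for a robots.txt gate; NOT part of crawler4j.
final class RobotsGate {
    private final boolean enabled;
    private final List<String> disallowedPrefixes;

    RobotsGate(boolean enabled, List<String> disallowedPrefixes) {
        this.enabled = enabled;
        this.disallowedPrefixes = disallowedPrefixes;
    }

    // Same shape as allows() above: rules are consulted only when enabled,
    // otherwise control falls through to "return true".
    boolean allows(String url) {
        if (enabled) {
            String path = URI.create(url).getPath();
            for (String prefix : disallowedPrefixes) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
        }
        return true;
    }
}

class RobotsGateDemo {
    public static void main(String[] args) {
        List<String> rules = List.of("/private");
        String url = "http://example.com/private/page.html";
        System.out.println(new RobotsGate(true, rules).allows(url));  // prints "false"
        System.out.println(new RobotsGate(false, rules).allows(url)); // prints "true"
    }
}
```

With the flag off, the rules are never read, which matches what the source above suggests for setEnabled(false).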

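【Discussion】:

Thanks for your answer. I will test it with this new version!

【Solution 2】:

Why not simply strip everything related to Robotstxt out of crawler4j? I needed to crawl a site and ignore robots, and this worked for me.

I changed CrawlController and WebCrawler in the .crawler package like this: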

WebCrawler.java:

Delete

private RobotstxtServer robotstxtServer;

Delete

this.robotstxtServer = crawlController.getRobotstxtServer();

Edit

 if ((shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
 -->
 if ((shouldVisit(webURL)))

Edit

if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) && 
              (shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) && 
              (shouldVisit(webURL)))

CrawlController.java:

Delete

import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

Delete

 protected RobotstxtServer robotstxtServer;

Edit

public CrawlController(CrawlConfig config, PageFetcher pageFetcher, RobotstxtServer robotstxtServer) throws Exception
-->
public CrawlController(CrawlConfig config, PageFetcher pageFetcher) throws Exception

Delete

this.robotstxtServer = robotstxtServer;

Edit

if (!this.robotstxtServer.allows(webUrl)) 
{
  logger.info("Robots.txt does not allow this seed: " + pageUrl);
} 
else 
{
  this.frontier.schedule(webUrl);
}
-->
this.frontier.schedule(webUrl);

Delete

public RobotstxtServer getRobotstxtServer()
{
  return this.robotstxtServer;
}
public void setRobotstxtServer(RobotstxtServer robotstxtServer)
{
  this.robotstxtServer = robotstxtServer;
}

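Hope this is what you are looking for.

【Discussion】:

Thanks for your answer. You are modifying the code of the crawler4j library itself. I would rather not modify the library, so that I can keep taking updates. In theory, we should be able to get this behavior without changing its code.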
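One way this might be done without patching the library is to override the check in application code. The source quoted in Solution 1 shows allows(WebURL) as a public method, so, assuming RobotstxtServer is not final, a subclass could presumably override it and be passed to the controller instead. The sketch below only illustrates the pattern with stand-in classes: LibraryRobotsServer, AllowAllRobotsServer and Crawl are hypothetical names, not crawler4j types.

```java
// Stand-in for a library class we do not want to edit (hypothetical, NOT crawler4j's real class).
class LibraryRobotsServer {
    public boolean allows(String url) {
        // Pretend the library refuses anything under /private.
        return !url.contains("/private");
    }
}

// Lives in application code, so the library jar stays untouched and upgradable.
class AllowAllRobotsServer extends LibraryRobotsServer {
    @Override
    public boolean allows(String url) {
        return true; // ignore robots rules entirely
    }
}

// Stand-in for the code that consults the server before fetching a page.
class Crawl {
    static boolean shouldFetch(LibraryRobotsServer server, String url) {
        return server.allows(url);
    }
}
```

Because the calling code only depends on the base type, swapping in the subclass changes the behavior without touching any library source.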