【Question Title】: Connection was refused by other side: 111: Connection refused
【Posted】: 2025-12-02 08:25:01
【Question Description】:

I have a spider for LinkedIn. It runs fine on my local machine, but when I deploy it on Scrapinghub I get the error:

Error downloading <GET https://www.linkedin.com/>: Connection was refused by other side: 111: Connection refused.

The full Scrapinghub log is:

0:  2018-08-30 12:58:34 INFO    Log opened.
1:  2018-08-30 12:58:34 INFO    [scrapy.log] Scrapy 1.0.5 started
2:  2018-08-30 12:58:34 INFO    [scrapy.utils.log] Scrapy 1.0.5 started (bot: facebook_stats)
3:  2018-08-30 12:58:34 INFO    [scrapy.utils.log] Optional features available: ssl, http11, boto
4:  2018-08-30 12:58:34 INFO    [scrapy.utils.log] Overridden settings: {'NEWSPIDER_MODULE': 'facebook_stats.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['facebook_stats.spiders'], 'RETRY_TIMES': 10, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408], 'BOT_NAME': 'facebook_stats', 'MEMUSAGE_LIMIT_MB': 950, 'DOWNLOAD_DELAY': 1, 'TELNETCONSOLE_HOST': '0.0.0.0', 'LOG_FILE': 'scrapy.log', 'MEMUSAGE_ENABLED': True, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'}
5:  2018-08-30 12:58:34 INFO    [scrapy.log] HubStorage: writing items to https://storage.scrapinghub.com/items/341545/3/9
6:  2018-08-30 12:58:34 INFO    [scrapy.middleware] Enabled extensions: CoreStats, TelnetConsole, MemoryUsage, LogStats, StackTraceDump, CloseSpider, SpiderState, AutoThrottle, HubstorageExtension
7:  2018-08-30 12:58:35 INFO    [scrapy.middleware] Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
8:  2018-08-30 12:58:35 INFO    [scrapy.middleware] Enabled spider middlewares: HubstorageMiddleware, HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
9:  2018-08-30 12:58:35 INFO    [scrapy.middleware] Enabled item pipelines: CreditCardsPipeline
10: 2018-08-30 12:58:35 INFO    [scrapy.core.engine] Spider opened
11: 2018-08-30 12:58:36 INFO    [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
12: 2018-08-30 12:58:36 INFO    TelnetConsole starting on 6023
13: 2018-08-30 12:59:32 ERROR   [scrapy.core.scraper] Error downloading <GET https://www.linkedin.com/>: Connection was refused by other side: 111: Connection refused.
14: 2018-08-30 12:59:32 INFO    [scrapy.core.engine] Closing spider (finished)
15: 2018-08-30 12:59:33 INFO    [scrapy.statscollectors] Dumping Scrapy stats: More
16: 2018-08-30 12:59:34 INFO    [scrapy.core.engine] Spider closed (finished)
17: 2018-08-30 12:59:34 INFO    Main loop terminated.

How can I fix this?
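For readability, the "Overridden settings" entry in the log above corresponds roughly to the project configuration sketched below. This is a reconstruction from the log only, not the actual settings.py (which is not shown in the question); the values injected by Scrapinghub itself (STATS_CLASS, LOG_FILE, TELNETCONSOLE_HOST and the MEMUSAGE_* settings) are omitted.

    # Reconstructed from the "Overridden settings" log entry above -- not the
    # project's actual settings.py.
    BOT_NAME = "facebook_stats"
    SPIDER_MODULES = ["facebook_stats.spiders"]
    NEWSPIDER_MODULE = "facebook_stats.spiders"

    LOG_LEVEL = "INFO"

    # Retry up to 10 times on the listed HTTP response codes.
    RETRY_TIMES = 10
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    # Throttle requests and present a browser-like User-Agent.
    DOWNLOAD_DELAY = 1
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7"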

【Question Discussion】:

    Tags: python scrapy scrapinghub


    【Solution 1】:

    LinkedIn prohibits scraping:

    Prohibited Software and Extensions

    LinkedIn is committed to keeping its members' data secure and its site free from fraud and abuse. To protect our members' data and our website, we don't permit the use of any third-party software, including "crawlers", bots, browser plug-ins, or browser extensions (also known as "add-ons"), that scrapes, modifies the appearance of, or automates activity on LinkedIn's website. Such tools violate the User Agreement, including, but not limited to, many of the "Dos and Don'ts" listed in Section 8.2…

    It stands to reason that they may actively block connections coming from Scrapinghub and similar services.
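    If you want to confirm that the refusal really happens at the TCP level (rather than, say, a 4xx response), one way is to attach an errback to the request and check the failure type. The sketch below is illustrative only: the spider name and structure are assumptions, since the original spider code is not shown, and it will not get you past LinkedIn's blocking; it only makes the failure easier to inspect in the Scrapinghub log.

        import scrapy
        from twisted.internet.error import ConnectionRefusedError


        class LinkedinProbeSpider(scrapy.Spider):
            # Hypothetical name; the original spider's code is not shown in the question.
            name = "linkedin_probe"

            def start_requests(self):
                # Attach an errback so the low-level failure can be inspected
                # instead of only the generic "Error downloading" log line.
                yield scrapy.Request(
                    "https://www.linkedin.com/",
                    callback=self.parse,
                    errback=self.handle_error,
                    dont_filter=True,
                )

            def parse(self, response):
                self.logger.info("Got HTTP %s from %s", response.status, response.url)

            def handle_error(self, failure):
                # failure.check() returns the matching exception class (or None),
                # so a refused TCP connection can be told apart from timeouts,
                # DNS errors and HTTP-level failures.
                if failure.check(ConnectionRefusedError):
                    self.logger.error("TCP connection refused by %s", failure.request.url)
                else:
                    self.logger.error("Download failed: %r", failure)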

    【Discussion】:

    • So there is no way to scrape the page any further on Scrapinghub??
    • @OmarRiaz, given that this violates the LinkedIn User Agreement, I strongly advise against doing it on Scrapinghub or anywhere else. If you decide to try anyway, technical obstacles like this one are what you will have to overcome. Whatever you do, you are unlikely to be able to scrape LinkedIn from Scrapinghub.