Scrapy爬虫被阻止并得到404答案

【问题标题】：Scrapy crawler is being blocked and gets 404Scrapy爬虫被阻止并得到404
【发布时间】：2020-07-01 22:03:53
【问题描述】：

我正在尝试使用 Scrapy 抓取页面“https://zhuanlan.zhihu.com/wangzhenotes”，配置为in the post，并在本文结尾。

这个命令

scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'

了解我

2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://zhuanlan.zhihu.com/robots.txt> (referer: None)
2020-07-02 05:50:04 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on.
...
2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://zhuanlan.zhihu.com/wangzhenotes> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10ac98790>
[s]   item       {}
[s]   request    <GET https://zhuanlan.zhihu.com/wangzhenotes>
[s]   response   <200 https://zhuanlan.zhihu.com/wangzhenotes>
...

我猜爬虫被阻止了，因为这个命令只有 3，

len(response.xpath('//span'))

虽然在 Chrome 浏览器的源代码中搜索“span”超过 80，

而response.css("h2.ContentItem-title") 得到一个空列表[]。

我如何获得这些跨度？

这是我正在使用的配置，与the referred post中的配置相同。

class CustomMiddleware(object):
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"


DOWNLOADER_MIDDLEWARES = {
    'projectname.middlewares.CustomMiddleware': 543,
}

【问题讨论】：

ROBOTSTXT_OBEY = False 应该可以解决 404 错误..

标签： scrapy

【解决方案1】：

问题在于页面源中不存在 span 和此类 h2.ContentItem-title 元素。它们来自单独的请求。

这是一个如何使用requests 模块获取信息的示例，但您也可以使用相同的方法使用scrapy：

import requests

headers = {
    'authority': 'www.zhihu.com',
    'x-requested-with': 'fetch',
    'x-ab-param': 'top_quality=0;li_topics_search=0;qap_question_visitor= 0;se_click_v_v=0;se_topicfeed=0;se_aa_base=0;se_video200=0;tsp_hotlist_ui=1;zr_expslotpaid=1;pf_fuceng=1;qap_labeltype=1;zr_km_answer=open_cvr;zr_zr_search_sims=0;top_v_album=1;pf_adjust=0;li_svip_cardshow=1;tp_dingyue_video=0;top_test_4_liguangyi=1;li_se_section=1;zr_rel_search=base;zr_training_first=false;tp_discover=0;tp_move_scorecard=0;li_salt_hot=1;zr_intervene=0;zr_slotpaidexp=1;se_college=default;se_colorfultab=1;se_entity22=0;tp_m_intro_re_topic=1;top_universalebook=1;zr_training_boost=false;zr_ans_rec=gbrank;se_whitelist=0;se_searchvideo=3;li_video_section=0;li_vip_verti_search=0;zr_topic_rpc=0;zr_rec_answer_cp=open;se_cla_v2=1;se_col_boost=0;se_v_v005=0;top_ebook=0;zr_search_topic=0;tp_sft=a;tsp_ad_cardredesign=0;li_paid_answer_exp=0;tsp_ios_cardredesign=0;pf_newguide_vertical=0;ls_video_commercial=0;tp_header_style=1;se_v040=0;zw_sameq_sorce=999;zr_art_rec=base;se_web0answer=0;se_bsi=0;ls_videoad=2;li_svip_tab_search=1;zr_test_aa1=0;se_multi_images=0;tp_club_qa_entrance=1;li_yxzl_new_style_a=1;se_sim_bst=1;se_bert_eng=0;tp_club_fdv4=0;soc_notification=1;ug_follow_topic_1=2;li_car_meta=0;li_panswer_topic=0;zr_search_sim2=0;tp_club_entrance=1;ug_newtag=1;li_answer_card=0;tp_movie_ux=0;se_sug_term=0;pf_noti_entry_num=0;se_mobilecard=0;se_cardrank_3=0;se_oneboxtopic=1;se_v045=0;tp_club_feed=0;tp_topic_tab_new=0-0-0;se_adsrank=4;tp_contents=2;se_v_rate=0;se_major=0;tp_meta_card=0;tp_topic_style=0;top_hotcommerce=1;li_catalog_card=1;se_searchwiki=0;se_ffzx_jushen1=0;se_v040_2=0;pf_creator_card=1;pf_profile2_tab=0;li_ebook_gen_search=0;tp_club_top=0;pf_foltopic_usernum=0;li_viptab_name=0;tp_topic_tab=0;tp_club_bt=0;se_wil_act=0;se_content0=1;se_vdnn_4=0;se_v_v006=0;se_videobox=0;soc_feed_intelligent=0;top_root=0;ls_fmp4=0;zr_slot_training=1;qap_question_author=0;zr_search_paid=1;zr_search_sims=0;se_hi_trunc=0;tp_club__entrance2=1;ls_recommend_test=0',
    'x-zse-86': '1.0_a720UDu0k8YxUC28Zw2qUJuqHU2Ygu28B0xqFruBoH2p',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'x-zse-83': '3_2.0',
    'accept': '*/*',
    'origin': 'https://zhuanlan.zhihu.com',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://zhuanlan.zhihu.com/wangzhenotes',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}

response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)

print(response.json())

在scrapy中解析对json的响应，可以使用如下代码：

j_obj = json.loads(response.body_as_unicode())

【讨论】：

@Roman，我同意你的回答，即使我认为网站使用这个 api 与数据库通信-(zhihu.com/api/v4/columns/wangzhenotes/items) ..so python-requests 单独可以完成工作，scrapy 会矫枉过正。