[Posted]: 2021-10-19 00:58:03
[Question]:
Please forgive any mistakes, and leave a comment if anything is unclear.
I am trying to use a regex to scrape the data from h2 and bold tags that start with a number on various blogs, but with this regex I only get the first word of each heading instead of the complete heading:
response.css('h2::text').re(r'\d+\.\s*\w+')
I don't know where I am going wrong. The expected output should look like this:
The desired output is: [1. Golgappa at Chawla's and Nand's, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 4. Best of Indian Street Food at Masala Chowk, ... and so on]
and [1. Keema Baati, 2. Pyaaz Kachori, 3. Dal Baati Churma, ... and so on]
What I get instead is:
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
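The truncation visible in the log can be reproduced with plain `re` (the sample heading is taken from the desired output above): `\w+` matches only word characters (letters, digits, underscores), so the match stops at the first space after the word following the number.

```python
import re

# A heading like the ones on the pages being scraped.
heading = "1. Golgappa at Chawla's and Nand's"

# \w+ cannot cross a space, so the match ends right after "Golgappa".
print(re.findall(r"\d+\.\s*\w+", heading))  # ['1. Golgappa']
```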
It would be very helpful if you could suggest a regex.
In case you want to visit them, these are the websites I am scraping:
https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/ and https://www.holidify.com/pages/street-food-in-jaipur-1483.html
Here is my code, in case you want to look at it:
import scrapy
import re

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com','www.lih.travel','www.crazymasalafood.com','www.holidify.com','www.jaipurcityblog.com','www.trip101.com','www.adequatetravel.com']
    start_urls = ['https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html'
                  ]

    def parse(self, response):
        if response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').re(r'\d+\.\s*\w+'))
        elif response.css('b::text').re(r'\d+\.\s*\w+'):
            print(response.css('b::text').re(r'\d+\.\s*\w+'))
[Discussion]:
- Does this approach help with scraping several different websites at the same time?
- Do you want it to run on multiple threads simultaneously?
- Sorry if this is a silly question, but I am looking for a way to scrape lists from multiple websites (like holidify.com/pages/street-food-in-jaipur-1483.html) at the same time.
- Regex isn't suited to HTML parsing, IMO. Why not use BeautifulSoup?
Tags: python html web-scraping scrapy