[Posted]: 2021-10-19 00:58:03
[Question]:
Please forgive any mistakes, and leave a comment if anything is unclear.
I am trying to use a regex to scrape the data from h2 and bold tags that start with a number on various blogs, but with this regex I only get the first word of each heading instead of the complete heading:
response.css('h2::text').re(r'\d+\.\s*\w+')
I don't know where I am going wrong. The expected output should look like this:
The desired output is: [1. Golgappa at Chawla's and Nand's, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 4. Best of Indian Street Food at Masala Chowk, ... and so on]
and [1. Keema Baati, 2. Pyaaz Kachori, 3. Dal Baati Churma, ... and so on]
What I get instead is:
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
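The truncation visible in the log can be reproduced with plain `re` (the sample heading is taken from the desired output above): `\w+` matches only word characters (letters, digits, underscores), so the match stops at the first space after the word following the number.

```python
import re

# A heading like the ones on the pages being scraped.
heading = "1. Golgappa at Chawla's and Nand's"

# \w+ cannot cross a space, so the match ends right after "Golgappa".
print(re.findall(r"\d+\.\s*\w+", heading))  # ['1. Golgappa']
```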
It would be very helpful if you could suggest a regex.
In case you want to visit them, these are the websites I am scraping:
https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/ and https://www.holidify.com/pages/street-food-in-jaipur-1483.html
Here is my code, in case you want to look at it:
import scrapy
import re

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com','www.lih.travel','www.crazymasalafood.com','www.holidify.com','www.jaipurcityblog.com','www.trip101.com','www.adequatetravel.com']
    start_urls = ['https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html'
                  ]

    def parse(self, response):
        if response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').re(r'\d+\.\s*\w+'))
        elif response.css('b::text').re(r'\d+\.\s*\w+'):
            print(response.css('b::text').re(r'\d+\.\s*\w+'))
[Discussion]:
- Does this approach help with scraping several different websites at the same time?
- Do you want it to run on multiple threads simultaneously?
- Sorry if this is a silly question, but I am looking for a way to scrape lists from multiple websites (like holidify.com/pages/street-food-in-jaipur-1483.html) at the same time.
- Regex isn't suited to HTML parsing, IMO. Why not use BeautifulSoup?
Tags: python html web-scraping scrapy