Find a word on a website and get its page link: answers

【Question title】: Find a word on a website and get its page link
【Posted】: 2021-09-12 12:37:18
【Question】:

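I want to crawl some websites and check whether the word "katalog" appears there. If it does, I want to retrieve the links of all tabs/subpages where that word occurs. Is that possible?

I tried following this tutorial, but the wordlist.csv I get at the end is empty, even though the word catalog does exist on the website.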

https://www.phooky.com/blog/find-specific-words-on-web-pages-with-scrapy/

        wordlist = [
            "katalog",
            "downloads",
            "download"
            ]

def find_all_substrings(string, sub):
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    allowed_domains = ["www.reichelt.com/"]
    start_urls = ["https://www.reichelt.com/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0                                 

    def check_buzzwords(self, response):

        self.__class__.crawl_count += 1

        crawl_count = self.__class__.crawl_count

        url = response.url
        contenttype = response.headers.get("content-type", "").decode('utf-8').lower()
        data = response.body.decode('utf-8')

        for word in wordlist:
                substrings = find_all_substrings(data, word)
                print("substrings", substrings)
                for pos in substrings:
                        ok = False
                        if not ok:
                                self.__class__.words_found += 1
                                print(word + ";" + url + ";")
        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) != None:
                return CrawlSpider._requests_to_follow(self, response)
        else:
                return []

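How can I find all instances of a word on a website and get the links of the pages where the word occurs?

【Comments】:

You return an empty item with return Item(), so you get an empty file. You should at least yield a dictionary with the data inside the for-loop, for example yield {"word": word, "url": url}.
I don't understand why you use __class__. You can create wordlist once at the start, even outside the class; there is no need to create the same list again and again. You can also put import re at the start instead of importing it again and again. When all imports are at the top, other people can see which modules are needed to run this code.
But first you should turn off JavaScript in your web browser and load the page. You will see what Scrapy can get from the page, because Scrapy can't run JavaScript. If the page uses JavaScript to add items, you will need Selenium or Splash to control a web browser that can run JavaScript. See Scrapy-Selenium and Scrapy-Splash.
This page shows me English text, which has catalog rather than katalog. I have to use https://www.reichelt.com/?LANGUAGE=PL to get the Polish page with katalog.
I went through scrapy-selenium, but I really don't know how to use it in my case. Can I add a step to my existing code that first turns off JavaScript, and then we look for the words? Also, I tried "https://www.reichelt.com/?LANGUAGE=PL" in my existing code, but I don't see any print statements for substrings. @furas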

Tags: python python-3.x web-scraping scrapy web-crawler

【Solution 1】:

The main problem is the wrong allowed_domains: the entry must be a bare host name, without the path /.

    allowed_domains = ["www.reichelt.com"]
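
To see why the trailing slash matters, here is a simplified sketch of the kind of host check an offsite filter performs. The function is_allowed is hypothetical and only approximates what Scrapy does internally; it compares the host name of each URL with the entries in allowed_domains, so an entry that contains / can never match.

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed domain
    or is a subdomain of one. A simplified approximation, not Scrapy's code."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )

url = "https://www.reichelt.com/index.html?ACTION=72"
print(is_allowed(url, ["www.reichelt.com/"]))  # entry with a path
print(is_allowed(url, ["www.reichelt.com"]))   # bare host name
```

With the path in the entry, every link the spider extracts would be treated as offsite, which fits the symptom of a crawl that never gets past the start page.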

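Another problem may be that the tutorial is 3 years old (it links to the Scrapy 1.5 documentation, but the newest version is 2.5.0).

It also has some useless lines of code.

It gets contenttype but never uses it to decode response.body. Your URL uses iso8859-1 for the original language and utf-8 for ?LANGUAGE=PL, but you can simply use response.text, which decodes automatically.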
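
As an illustration of why hard-coding utf-8 is fragile, the sketch below decodes a body using the charset declared in a Content-Type header. The helper decode_body and the sample header values are assumptions made for this example; in a spider, response.text already does this work.

```python
from email.message import Message

def decode_body(body, content_type, default="utf-8"):
    """Hypothetical helper: decode bytes using the charset from a
    Content-Type header value, falling back to `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset)

# Sample data (made up): a German word encoded as ISO-8859-1.
body = "Zubeh\u00f6r Katalog".encode("iso8859-1")

print(decode_body(body, "text/html; charset=iso8859-1"))

try:
    body.decode("utf-8")  # the hard-coded decoding used in the question's code
except UnicodeDecodeError as exc:
    print("utf-8 decoding failed:", exc.reason)
```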

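It also sets ok = False and checks it later, but that is completely useless.

Minimal working code: you can copy it into a single file and run it as python script.py without creating a project.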

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    return [match.start() for match in re.finditer(re.escape(sub), string)]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    #start_urls = ["https://www.reichelt.com/?LANGUAGE=PL"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    #crawl_count = 0
    #words_found = 0                                 

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        #self.crawl_count += 1

        #content_type = response.headers.get("content-type", "").decode('utf-8').lower()
        #print('content_type:', content_type)
        #data = response.body.decode('utf-8')
        
        data = response.text

        for word in wordlist:
            print('[check_buzzwords] check word:', word)
            substrings = find_all_substrings(data, word)
            print('[check_buzzwords] substrings:', substrings)
            
            for pos in substrings:
                #self.words_found += 1
                # text around the match; max() keeps the start index from going
                # negative when the word is within the first 20 characters
                sub = data[max(pos-20, 0):pos+20]
                # only display
                print('[check_buzzwords] word: {} | pos: {} | sub: {} | url: {}'.format(word, pos, sub, response.url))
                # send to file
                yield {'word': word, 'pos': pos, 'sub': sub, 'url': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start()

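Edit:

I added the text around each match (roughly data[pos-20:pos+20]) to the yielded data to see where the substring is. Sometimes it is in a URL like .../elements/adw_2018/catalog/... or in other places like <img alt="catalog", so using regex is not necessarily a good idea. It may be better to use xpath or a css selector to search for text only in certain places or in links.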
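
One way to picture the difference is to separate visible text from attribute values before searching. The sketch below uses the standard library's html.parser as a stand-in for an XPath or CSS selector; the class name TextAndLinks and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Hypothetical helper: collects visible text and <a href> values separately."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.hrefs = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

# Made-up sample page: "catalog" appears in a link target, an image path,
# an alt text, and once as visible text.
html = '''
<a href="/elements/adw_2018/catalog/index.html"><img src="/catalog/x.png" alt="catalog"></a>
<p>Order our printed catalog today.</p>
'''

parser = TextAndLinks()
parser.feed(html)

text = " ".join(parser.text_parts)
print("in visible text:", "catalog" in text)
print("links with word:", [h for h in parser.hrefs if "catalog" in h])
```

Searching text and hrefs separately avoids counting the same word several times just because it also shows up in image paths or alt attributes.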

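Edit:

A version that searches links for the words from the list. It uses response.xpath to find all links and then checks whether the href contains a word, so it doesn't need regex.

One problem may be that it treats a link with -downloads- (with s) as a link with both download and downloads, so a more complex check (for example, with regex) would be needed to treat it only as a link with the word downloads.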

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        links = response.xpath('//a[@href]')
        
        for word in wordlist:
            
            for link in links:
                url = link.attrib.get('href')
                if word in url:
                    print('[check_buzzwords] word: {} | url: {} | page: {}'.format(word, url, response.url))
                    # send to file
                    yield {'word': word, 'url': url, 'page': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start()
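
For the -downloads- ambiguity mentioned above, one possible approach is a regex that accepts a word only when it is not directly surrounded by other letters. The helper words_in_url and the sample URLs are assumptions for illustration, not part of the code above.

```python
import re

wordlist = ["katalog", "catalog", "downloads", "download"]

# One pattern per word; the lookarounds reject matches that have a letter
# directly before or after, so "download" does not match inside "downloads".
patterns = {
    word: re.compile(r"(?<![a-z])" + re.escape(word) + r"(?![a-z])", re.IGNORECASE)
    for word in wordlist
}

def words_in_url(url):
    """Hypothetical helper: list the words that occur in `url` as whole words."""
    return [word for word, pattern in patterns.items() if pattern.search(url)]

print(words_in_url("/shop/-downloads-/index.html"))
print(words_in_url("/download/file.pdf"))
print(words_in_url("/katalog/06-2021"))
```

In the spider, the test if word in url could then be replaced by a loop over words_in_url(url), at the cost of bringing regex back in.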

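【Comments】:

So, for example, these two links on the domain contain a catalog: cdn-reichelt.de/katalog/06-2021 and reichelt.com/?ACTION=72, but neither of them shows up in the csv. Why is that?
Also, if I try to crawl eppendorf.com, in the terminal I only see links to the different language pages being crawled. Doesn't it go deeper into all the links? I get no output in the csv.
allowed_domains = ["www.reichelt.com"] doesn't allow cdn-reichelt.de or reichelt.com. It only visits URLs that start with www.reichelt.com, and if katalog is on the page it adds the URL of the visited page, not the links to the next pages. You have to change it to allowed_domains = ["www.reichelt.com", "reichelt.com", "cdn-reichelt.de"]. Or you should search for links on the page, check whether they have the word katalog in the URL, and add those URLs to the CSV instead of the URL of the current page.
To visit the Polish page you should start with eppendorf.com/PL-pl. This URL is in a <select> (and it probably uses JavaScript to reload the page), but LinkExtractor() only gets URLs from <a>.
I added code that first gets all links on the page and then checks whether the href contains a word from the list.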
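
A small sketch of the second option from the comments: record the links that contain the word, resolved to absolute URLs, so that links to other domains such as cdn-reichelt.de are still written out even though they are not crawled. The helper matching_links and the href katalog.html are made up for illustration.

```python
from urllib.parse import urljoin, urlparse

def matching_links(page_url, hrefs, word):
    """Hypothetical helper: absolute URLs of the hrefs that contain `word`."""
    absolute = (urljoin(page_url, href) for href in hrefs)
    return [url for url in absolute if word in url.lower()]

page_url = "https://www.reichelt.com/?LANGUAGE=PL"
hrefs = [
    "https://cdn-reichelt.de/katalog/06-2021",  # other domain, still recorded
    "/index.html?ACTION=72",                    # link without the word
    "katalog.html",                             # made-up relative link
]

for url in matching_links(page_url, hrefs, "katalog"):
    print(urlparse(url).hostname, url)
```

Inside a Scrapy callback, response.urljoin(href) plays the role of urljoin(page_url, href).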

【Solution 2】:

You can use requests-html and render the page:

from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.reichelt.com/"

r = session.get(url)
r.html.render(sleep=2)

if "your_word" in r.html.text: #or r.html.html if you want it in raw html
    print([link for link in r.html.absolute_links if "your_word" in link])

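【Comments】:

I replaced "your word" with "katalog" but I don't see any results in the printout. Did you find anything for this URL?
I tried https://www.reichelt.com/ and https://www.reichelt.com
There are no results because the word "katalog" isn't on the page. But if you search for "usb", for example, you will get all links to USB products.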