Nokogiri and Mechanize help (clicking links found by Nokogiri via Mechanize): answers

【Question title】: Nokogiri and Mechanize help (clicking links found by Nokogiri via Mechanize)
【Posted】: 2013-08-14 02:54:27
【Question】:

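I search for links with a CSS selector on page = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198'). After that, the page variable holds a lot of links, but I don't know how to use them or how to click them via Mechanize. I found this method on Stack Overflow: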

page = agent.get "http://google.com"
node = page.search ".//p[@class='posted']"
Mechanize::Page::Link.new(node, agent, page).click

But it only works for one link, so how can I use this method for multiple links?

If I should post more information, please say so.

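【Comments】:

If possible, we need the HTML code for 2-3 of the links.
Which HTML code do you mean?
OK, do one thing: can you tell me what puts node.size prints?
What are you trying to accomplish? Spidering a site? Automating some interaction with a site? Mechanize is great for the latter, but not so good for the former because it has too much overhead. Either way, we need more information about what you are trying to do.

Tags: ruby-on-rails ruby parsing mechanize-ruby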

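【Solution 1】:

If your goal is simply to get to the next page and then scrape some information from it, then all you really care about is:

The page content (for scraping your data)
The URL of the next page you need to visit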

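You can get the page content with Mechanize or with something else, such as OpenURI (which is part of the Ruby standard library). As a side note, Mechanize uses Nokogiri behind the scenes; when you start digging into the elements on a page, you will see that they come back as Nokogiri objects.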

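Anyway, if this were my project, I would probably use OpenURI to get the page content and then Nokogiri to search it. I like the idea of using the Ruby standard library rather than pulling in an extra dependency.

Here is an example using OpenURI: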

require 'nokogiri'
require 'open-uri'

# URI.open is used because plain open() no longer fetches URLs on Ruby 3.0+
printing_page = Nokogiri::HTML(URI.open("http://www.print-index.ru/default.aspx?p=81&gr=198"))

# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...

# Find the next page to visit.  Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.css('a.graymenu')[4] # This is an overly simple finder. Nokogiri can do xpath searches too.
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu['href']}" # Build the absolute URL of that page

about_project_page = Nokogiri::HTML(URI.open(about_project_link_in_navbar_menu_url)) # Get the About page's content

# ....
# Do something...
# ....
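The string interpolation above only works when the href starts with a slash. As a hedged alternative, the sketch below resolves hrefs against the page URL with URI.join from the standard library, which also copes with relative and already absolute links. The helper name absolute_url and the sample href values are hypothetical; only the base URL comes from the question.

```ruby
require 'uri'

BASE_URL = 'http://www.print-index.ru/default.aspx?p=81&gr=198'

# Resolve an href found on the page against the URL the page was loaded from.
def absolute_url(base, href)
  URI.join(base, href).to_s
end

# Made-up href values of the three common kinds.
sample_hrefs = [
  '/default.aspx?p=1',       # root-relative
  'default.aspx?p=2',        # relative to the current path
  'http://example.com/page'  # already absolute
]

sample_hrefs.each do |href|
  puts absolute_url(BASE_URL, href)
end
```

The same helper can be mapped over every href collected from a node set, which is one way to handle many links at once rather than picking a single index.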

Here is an example that uses Mechanize to get the page content (the two are very similar):

require 'mechanize'

agent = Mechanize.new
printing_page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")

# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...

# Find the next page to visit.  Example: You want to visit the "About the project" page next
about_project_link_in_navbar_menu = printing_page.search('a.graymenu')[4] # This is an overly simple finder. Nokogiri can do xpath searches too.
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu['href']}" # Build the absolute URL of that page

about_project_page = agent.get(about_project_link_in_navbar_menu_url)

# ....
# Do something...
# ....

P.S. I used Google to translate the Russian into English. Sorry if the variable names are wrong! :X

【Comments】:

Glad to hear it!