Web scraping with Python [closed]

【Question title】: Web scraping with Python [closed]
【Posted】: 2011-01-06 02:21:36
【Question】:

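I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? Which modules are used? Is there a tutorial available?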

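【Comments】:

Python has several options for web scraping. I listed some of them here in answer to a similar question.
Why not just use the HTML parser built into the Python standard library? For a task this simple and infrequent (just once a day), I see no reason to look for any other tool. docs.python.org/2.7/library/htmlparser.html
Hopefully this post is useful to someone. A good tutorial for beginners: samranga.blogspot.com/2015/08/web-scraping-beginner-python.html It uses the Beautiful Soup Python library for web scraping.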
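
As a rough illustration of the standard-library suggestion in the comments above, here is a minimal sketch. It assumes Python 3 (where the module is named html.parser) and a hypothetical table with class "spad" whose rows hold a date, a sunrise and a sunset cell; the class name TableRowParser and the sample markup are made up for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr>, inside one table class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and self.table_class in (attrs.get("class") or "").split():
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only text that sits inside a <td> of the target table is kept.
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical sample markup; a real page would be downloaded first.
SAMPLE_HTML = """
<table class="spad">
  <tbody>
    <tr><td>1 Jan</td><td>08:39</td><td>16:08</td></tr>
    <tr><td>2 Jan</td><td>08:39</td><td>16:09</td></tr>
  </tbody>
</table>
"""

parser = TableRowParser("spad")
parser.feed(SAMPLE_HTML)
for day, sunrise, sunset in parser.rows:
    print(day, sunrise, sunset)
```

This needs noticeably more bookkeeping than the parser libraries suggested in the answers, which is the usual trade-off for avoiding a third-party dependency.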

【Solution 1】:

You can use urllib2 to make the HTTP request, which gives you the web page content.

You can get it like this:

import urllib2
response = urllib2.urlopen('http://example.com')
html = response.read()

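Beautiful Soup is a Python HTML parser that is supposed to be good for screen scraping.

In particular, here is their tutorial on parsing an HTML document.

Good luck!

【Comments】:

It might be a good idea to set a maximum number of bytes to read, e.g. response.read(100000000), so that URLs pointing to ISO images don't fill up your RAM. Happy mining.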
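
A minimal sketch of that capped read, assuming Python 3 (where urlopen lives in urllib.request). The helper name fetch_capped, the 1 MB default and the 30-second timeout are assumptions chosen for the example, not values from the answer.

```python
from urllib.request import urlopen

MAX_BYTES = 1000000  # assumed cap of about 1 MB; tune for the pages you expect


def fetch_capped(url, max_bytes=MAX_BYTES):
    """Return at most max_bytes of the response body, decoded as text."""
    with urlopen(url, timeout=30) as response:
        body = response.read(max_bytes)
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return body.decode(charset, errors="replace")


# Example call (requires network access):
# html = fetch_capped("http://example.com")
```

Anything beyond the cap is simply never read, so a truncated page may end mid-tag; lenient parsers generally cope with that.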

【Solution 2】:

Use urllib2 in combination with the brilliant BeautifulSoup library:

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise

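【Comments】:

Small comment: this can be slightly simplified using the requests package by replacing line 6 with: soup = BeautifulSoup(requests.get('example.com').text)
Thanks for the tip. The requests package did not exist yet when I wrote the snippet above ;-)
@DerrickCoetzee - your simplification raises a MissingSchema error (at least on my installation). This works: soup = BeautifulSoup(requests.get('http://example.com').text)
@kmote: that is what I typed, but I forgot the backticks around the code and it got converted into a link. Thanks!
Note that urllib2 does not exist in Python 3. See another post.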
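
For code that has to run on both interpreters, one common pattern is a guarded import. This is only a sketch: in Python 3 the functionality of urllib2 was split across urllib.request and urllib.error, and read() returns bytes there rather than a str.

```python
try:
    # Python 3: urllib2 was split into urllib.request and urllib.error
    from urllib.request import urlopen
    from urllib.error import URLError
except ImportError:
    # Python 2 fallback
    from urllib2 import urlopen, URLError

# Example call (requires network access):
# html = urlopen('http://example.com').read()
```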

【Solution 3】:

I collected the scripts from my web scraping work into this bit-bucket library.

Example script for your case:

from webscraping import download, xpath
D = download.Download()

html = D.get('http://example.com')
for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
    cols = xpath.search(row, '/td')
    print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])

Output:

Sunrise: 08:39, Sunset: 16:08
Sunrise: 08:39, Sunset: 16:09
Sunrise: 08:39, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:11
Sunrise: 08:40, Sunset: 16:12
Sunrise: 08:40, Sunset: 16:13

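【Comments】:

【Solution 4】:

I would really recommend Scrapy.

Quoting from a deleted answer:

Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).

Scrapy has better and faster support for parsing (x)html on top of libxml2.

Scrapy is a mature framework with full unicode support; it handles redirections, gzipped responses and odd encodings, and has an integrated http cache, etc.

Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails and exports the extracted data directly to csv or json.

【Comments】:

I didn't notice this question was already 2 years old, but I still feel Scrapy should be named here in case someone else has the same question.
Scrapy is a framework, and therefore is horrible and thinks it is more important than your project. It is a framework because of the horrible (unnecessary) limitations of Twisted.
@user1244215: It is a framework because frameworks are nice. If you don't want to use it as a framework, nothing stops you from jamming all your code into one file.
But it does not support Python 3.x.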

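【Solution 5】:

I use a combination of Scrapemark (finding URLs, py2) and httplib2 (downloading images, py2+3). scrapemark.py is 500 lines of code but uses regular expressions, so it may not be that fast; I have not tested it.

Example for scraping your website: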

import sys
from pprint import pprint
from scrapemark import scrape

pprint(scrape("""
    <table class="spad">
        <tbody>
            {*
                <tr>
                    <td>{{[].day}}</td>
                    <td>{{[].sunrise}}</td>
                    <td>{{[].sunset}}</td>
                    {# ... #}
                </tr>
            *}
        </tbody>
    </table>
""", url=sys.argv[1] ))

Usage:

python2 sunscraper.py http://www.example.com/

Result:

[{'day': u'1. Dez 2012', 'sunrise': u'08:18', 'sunset': u'16:10'},
 {'day': u'2. Dez 2012', 'sunrise': u'08:19', 'sunset': u'16:10'},
 {'day': u'3. Dez 2012', 'sunrise': u'08:21', 'sunset': u'16:09'},
 {'day': u'4. Dez 2012', 'sunrise': u'08:22', 'sunset': u'16:09'},
 {'day': u'5. Dez 2012', 'sunrise': u'08:23', 'sunset': u'16:08'},
 {'day': u'6. Dez 2012', 'sunrise': u'08:25', 'sunset': u'16:08'},
 {'day': u'7. Dez 2012', 'sunrise': u'08:26', 'sunset': u'16:07'}]

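【Comments】:

【Solution 6】:

I strongly suggest checking out pyquery. It uses jquery-like (aka css-like) syntax, which makes things really easy for those coming from that background.

For your case, it would look something like this:

from pyquery import PyQuery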

html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')

for tr in trs:
  tds = tr.getchildren()
  print tds[1].text, tds[2].text

Output:

5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM

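【Comments】:

【Solution 7】:

Make your life easier by using CSS Selectors

I know I have come late to the party, but I have a nice suggestion for you.

Using BeautifulSoup has already been suggested; I would rather use CSS Selectors to scrape data from the HTML.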

import random
import time

import requests
from bs4 import BeautifulSoup

main_url = "http://www.example.com"

# Assumed values: the original snippet used these names without defining them.
header = [{"User-Agent": "Mozilla/5.0"}]  # list of header dicts to choose from
timeout_time = 30  # request timeout in seconds

# This method scrapes the URL; if that fails, it waits 20 seconds and then tries again.
# (I use it because my internet connection sometimes disconnects.)
def tryAgain(passed_url):
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while 1:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue

# The helper must be defined before it is called.
main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Scrape all TDs from TRs inside the table
for tr in main_page_soup.select("table.class_of_table"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside the TD
        print(td.select("a")[0].text)
        # Value of the href attribute
        print(td.select("a")[0]["href"])
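
The retry helper above loops forever if the site never comes back. A bounded variant with exponential backoff might look like the sketch below. It swaps in a different technique from the answer's fixed 20-second wait, and the names fetch_with_retry and flaky_fetch, the attempt limit and the delays are all assumptions for illustration. The fetch function is passed in so the logic can be exercised without a network connection.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt seconds and retry.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in fetcher for demonstration: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated connection drop")
    return "<html>ok</html>"


# Record the requested delays instead of actually sleeping.
delays = []
page = fetch_with_retry(flaky_fetch, "http://www.example.com", sleep=delays.append)
print(page, delays)
```

With requests installed, the same helper could presumably wrap a real download, for example fetch_with_retry(lambda u: requests.get(u, timeout=30).text, main_url).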

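【Comments】:

【Solution 8】:

Here is a simple web crawler. I used BeautifulSoup, and we will search for all links (anchors) whose class name is _3NFO0d. I used Flipkart.com, which is an online retail store.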

import requests
from bs4 import BeautifulSoup
def crawl_flipkart():
    url = 'https://www.flipkart.com/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': '_3NFO0d'}):
        href = link.get('href')
        print(href)

crawl_flipkart()

【Comments】:

【Solution 9】:

If we want to get the names of items from any specific category, we can do it by specifying that category's class name with a CSS selector:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
for link in soup.select('div._2kSfQ4'):
    print(link.text)

Here are partial search results:

Puma, USPA, Adidas & moreUp to 70% OffMen's Shoes
Shirts, T-Shirts...Under ₹599For Men
Nike, UCB, Adidas & moreUnder ₹999Men's Sandals, Slippers
Philips & moreStarting ₹99LED Bulbs & Emergency Lights

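【Comments】:

【Solution 10】:

Python has good options for scraping the web. The best one with a framework is scrapy. It can be a little tricky for beginners, so here is a little help.
1. Install Python 3.5 or later (earlier versions down to 2.7 will also work).
2. Create an environment in conda (I did this).
3. Install scrapy at a location and run it from there.
4. scrapy shell gives you an interactive interface to test your code.
5. scrapy startproject projectname creates a project skeleton.
6. scrapy genspider spidername creates a spider. You can create as many spiders as you want. Make sure you are inside the project directory when doing this.

The easier way is to use requests and beautiful soup. Before starting, give yourself an hour to go through the documentation; it will resolve most of your doubts. BS4 offers a wide range of parsers to choose from. Use a user-agent and sleep to make scraping easier. BS4 searches return Tag objects (usually a list of them), so index the result, e.g. variable[0]. If there is js running on the page, you won't be able to scrape it directly with requests and bs4. You can get the api link and then parse the JSON to get the information you need, or try selenium.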
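
The user-agent, sleep and JSON advice above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the USER_AGENT string, the one-second delay, the helper names build_request and fetch_json, and the shape of the sample payload, which merely imitates what a page's backing API might return.

```python
import json
import time
from urllib.request import Request, urlopen

USER_AGENT = "my-scraper/0.1 (contact: you@example.com)"  # hypothetical identifier


def build_request(url):
    """Attach a User-Agent header so the server can identify the client."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_json(url, delay=1.0):
    """Pause briefly, then fetch the URL and decode the JSON body."""
    time.sleep(delay)
    with urlopen(build_request(url), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payload; a real one would come from fetch_json(api_url).
sample = json.loads(
    '{"days": [{"date": "2012-12-01", "sunrise": "08:18", "sunset": "16:10"}]}'
)
for day in sample["days"]:
    print(day["date"], day["sunrise"], day["sunset"])
```

Reading the JSON endpoint directly, when one exists, avoids both HTML parsing and the need to run the page's JavaScript.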

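【Comments】:

Whether Anaconda is used here is completely irrelevant. Creating a virtual environment is basically always a good idea, but you don't need conda for that.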