array(2) { ["docs"]=> array(10) { [0]=> array(10) { ["id"]=> string(3) "428" ["text"]=> string(77) "Visual Studio 2017 单独启动MSDN帮助(Microsoft Help Viewer)的方法" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(8) "DonetRen" ["tagsname"]=> string(55) "Visual Studio 2017|MSDN帮助|C#程序|.NET|Help Viewer" ["tagsid"]=> string(23) "[401,402,403,"300",404]" ["catesname"]=> string(0) "" ["catesid"]=> string(2) "[]" ["createtime"]=> string(10) "1511400964" ["_id"]=> string(3) "428" } [1]=> array(10) { ["id"]=> string(3) "427" ["text"]=> string(42) "npm -v;报错 cannot find module "wrapp"" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(4) "zzty" ["tagsname"]=> string(50) "node.js|npm|cannot find module "wrapp“|node" ["tagsid"]=> string(19) "[398,"239",399,400]" ["catesname"]=> string(0) "" ["catesid"]=> string(2) "[]" ["createtime"]=> string(10) "1511400760" ["_id"]=> string(3) "427" } [2]=> array(10) { ["id"]=> string(3) "426" ["text"]=> string(54) "说说css中pt、px、em、rem都扮演了什么角色" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(12) "zhengqiaoyin" ["tagsname"]=> string(0) "" ["tagsid"]=> string(2) "[]" ["catesname"]=> string(0) "" ["catesid"]=> string(2) "[]" ["createtime"]=> string(10) "1511400640" ["_id"]=> string(3) "426" } [3]=> array(10) { ["id"]=> string(3) "425" ["text"]=> string(83) "深入学习JS执行--创建执行上下文(变量对象,作用域链,this)" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(7) "Ry-yuan" ["tagsname"]=> string(33) "Javascript|Javascript执行过程" ["tagsid"]=> string(13) "["169","191"]" ["catesname"]=> string(0) "" ["catesid"]=> string(2) "[]" ["createtime"]=> string(10) "1511399901" ["_id"]=> string(3) "425" } [4]=> array(10) { ["id"]=> string(3) "424" ["text"]=> string(30) "C# 排序技术研究与对比" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(9) "vveiliang" ["tagsname"]=> string(0) "" ["tagsid"]=> string(2) "[]" ["catesname"]=> string(8) ".Net Dev" ["catesid"]=> string(5) "[199]" ["createtime"]=> string(10) "1511399150" ["_id"]=> string(3) "424" } [5]=> array(10) { ["id"]=> string(3) "423" ["text"]=> string(72) "【算法】小白的算法笔记:快速排序算法的编码和优化" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(9) "penghuwan" ["tagsname"]=> string(6) "算法" ["tagsid"]=> string(7) "["344"]" ["catesname"]=> string(0) "" ["catesid"]=> string(2) "[]" ["createtime"]=> string(10) "1511398109" ["_id"]=> string(3) "423" } [6]=> array(10) { ["id"]=> string(3) "422" ["text"]=> string(64) "JavaScript数据可视化编程学习(二)Flotr2,雷达图" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(7) "chengxs" ["tagsname"]=> string(28) "数据可视化|前端学习" ["tagsid"]=> string(9) "[396,397]" ["catesname"]=> string(18) "前端基本知识" ["catesid"]=> string(5) "[198]" ["createtime"]=> string(10) "1511397800" ["_id"]=> string(3) "422" } [7]=> array(10) { ["id"]=> string(3) "421" ["text"]=> string(36) "C#表达式目录树(Expression)" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(4) "wwym" ["tagsname"]=> string(0) "" ["tagsid"]=> string(2) "[]" ["catesname"]=> string(4) ".NET" ["catesid"]=> string(7) "["119"]" ["createtime"]=> string(10) "1511397474" ["_id"]=> string(3) "421" } [8]=> array(10) { ["id"]=> string(3) "420" ["text"]=> string(47) "数据结构 队列_队列实例:事件处理" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(7) "idreamo" ["tagsname"]=> string(40) "C语言|数据结构|队列|事件处理" ["tagsid"]=> string(23) "["246","247","248",395]" ["catesname"]=> string(12) "数据结构" ["catesid"]=> string(7) "["133"]" ["createtime"]=> string(10) "1511397279" ["_id"]=> string(3) "420" } [9]=> array(10) { ["id"]=> string(3) "419" ["text"]=> string(47) "久等了,博客园官方Android客户端发布" ["intro"]=> string(288) "目录 ECharts 异步加载 ECharts 数据可视化在过去几年中取得了巨大进展。开发人员对可视化产品的期望不再是简单的图表创建工具,而是在交互、性能、数据处理等方面有更高的要求。 chart.setOption({ color: [ " ["username"]=> string(3) "cmt" ["tagsname"]=> string(0) "" ["tagsid"]=> string(2) "[]" ["catesname"]=> string(0) "" ["catesid"]=> string(2) "[]" ["createtime"]=> string(10) "1511396549" ["_id"]=> string(3) "419" } } ["count"]=> int(200) } 222 Python抓取微博评论(二) - `Elaine - 爱码网
chenyang920

Python抓取微博评论(二)

对于新浪微博评论的抓取,首篇做的时候有些考虑不周,然后现在改正了一些地方,因为有人问,抓取评论的时候“爬前50页的热评,或者最新评论里的前100页“,这样的数据看了看,好像每条微博的评论都只能抓取到前100页,当page=101时,xhr数据就成空,然后没有内容,所以现在是抓取每条微博最近的100页的评论,即1000条评论,

代码有些改动,但是思路都是一样

# -*- coding: utf-8 -*-
import re
import urllib
import urllib2
import os
import stat
import itertools
import re
import sys
import requests
import json
import time
import socket
import urlparse
import csv
import random
from datetime import datetime, timedelta
import lxml.html
from wordcloud import WordCloud
import jieba
import PIL
import matplotlib.pyplot as plt
import numpy as np

from zipfile import ZipFile
from StringIO import StringIO
from downloader import Downloader
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser
from itertools import product
import sys
reload(sys)
sys.setdefaultencoding(\'utf8\')
import json,urllib2
def download(url, headers, num_try=2):
    while num_try >0:
        num_try -= 1
        try:
            content = requests.get(url, headers=headers)
            return content.text

        except urllib2.URLError as e:
            print \'Download error\', e.reason

    return None
header_dict = {
                \'Content-Type\':\'application/json; charset=utf-8\',
                \'Accept\':\'application/json, text/plain, */*\',
                \'Accept-Encoding\':\'gzip, deflate, br\',
                \'Accept-Language\':\'zh-CN,zh;q=0.9\',
                \'Connection\':\'keep-alive\',
                \'Cookie\':\'...\',
                \'Host\':\'m.weibo.cn\',
                \'Referer\':\'https://m.weibo.cn/u/1241148864?display=0&retcode=6102\',
                \'User-Agent\':\'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36\',
                \'X-Requested-With\':\'XMLHttpRequest\'
               }

def wordcloudplot(txt):
    path = \'/Users/cy/Downloads/msyh.ttf\'
    path = unicode(path, \'utf8\').encode(\'gb18030\')
    alice_mask = np.array(PIL.Image.open(\'/Users/cy/Desktop/1.jpg\'))
    wordcloud = WordCloud(font_path=path,
                          background_color="white",
                          margin=5, width=1800, height=800, mask=alice_mask, max_words=2000, max_font_size=60,
                          random_state=42)
    wordcloud = wordcloud.generate(txt)
    wordcloud.to_file(\'/Users/cy/Desktop/2.jpg\')
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


def main():
    a = []
    f = open(r\'/Users/cy/Downloads/a.json\', \'r\').read()
    words = list(jieba.cut(f))
    for word in words:
        if len(word) > 1:
            a.append(word)
    txt = r\' \'.join(a)
    wordcloudplot(txt)

def get_comment(que):
    f = open(\'/Users/cy/Downloads/a.json\', \'w\')
    total_number = 10
    for each in que:
        for i in range(1,total_number):
            textmood = {"id": each,
                        "page": i}
            textmood = json.dumps(textmood)
            uu = \'https://m.weibo.cn/status/\' + str(each)
            header = {\'Connection\': \'keep-alive\',
                      \'Cookie\': \'.......\',
                      \'Accept-Language\': \'zh-CN,zh;q=0.8\',
                      \'Host\': \'m.weibo.cn\',
                      \'Referer\':uu,
                      \'User-Agent\': \'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36\',
                      \'X-Requested-With\': \'XMLHttpRequest\'
                      }
            url = \'https://m.weibo.cn/api/comments/show?id=%s&page=%s\'%(str(each),str(i))
            print url

            req = urllib2.Request(url=url, data=textmood, headers=header)
            res = urllib2.urlopen(req)
            res = res.read()
            contents = res
            d = json.loads(contents, encoding="utf-8")
            total_numbers = d[\'total_number\']
            print total_numbers
            tto = total_numbers / 10 + 1
            if total_number > tto:
                 total_number = min(tto,10)
            if \'data\' in d:
                data = d[\'data\']
                if data != "":
                    for each_one in data:
                        if each_one != "":
                            if each_one[\'text\'] != "":
                                mm = each_one[\'text\'].split(\'<\')
                                if  r\'回复\' not in mm[0]:
                                    index = mm[0]#filter(lambda x: x not in \'0123456789\', mm[0])
                                    print index
                                    f.write(index.encode("u8"))

def get_identified():

    que = []
    url = \'https://m.weibo.cn/api/container/getIndex?uid=1241148864&luicode=10000011&lfid=100103type%3D3%26q%3D%E5%BC%A0%E6%9D%B0&featurecode=20000180&type=uid&value=1241148864&containerid=1076031241148864\'
    for i in range(1,3):
        if i > 1:
            url = \'https://m.weibo.cn/api/container/getIndex?uid=1241148864&luicode=10000011&lfid=100103type%3D3%26q%3D%E5%BC%A0%E6%9D%B0&featurecode=20000180&type=uid&value=1241148864&containerid=1076031241148864&page=\'+str(i)
        print url

        req = download(url, header_dict,2)
        print req
        d = json.loads(req,encoding="utf-8")
        print d

        try:
            data = d[\'data\'][\'cards\']
            print data
        except KeyError,e:
            print e.message

        if data != "":
            for each in data:
                print each[\'itemid\']
                mm = each[\'itemid\']
                if mm != "":
                    identity = mm.split(\'-\')
                    num = identity[1][1:]
                    que.append(num)
                    print num

    get_comment(que)

if __name__ == \'__main__\':
    get_identified()
    main()

 

 

分类:

技术点:

相关文章:

  • 2021-12-25
  • 2022-01-02
  • 2019-08-02
  • 2021-12-16
  • 2018-12-29
  • 2021-11-27
  • 2021-08-07
  • 2020-03-11
猜你喜欢
  • 2020-06-17
  • 2017-12-08
  • 2019-02-13
  • 2021-12-14
  • 2019-02-14
  • 2021-12-20
  • 2020-01-16
  • 2021-10-16
相关资源
相似解决方案