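Foreword: this is my first blog post on CSDN, so here is a quick introduction to this small demo.
It is a homework assignment for a data mining course. The tools involved: Python, Scrapy, PyMySQL, NumPy, and Matplotlib.
The steps: crawl the housing listings with Scrapy, store them in a MySQL database, export the data from the database, then process it with NumPy and plot it with Matplotlib.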
1. Crawling Ganji.com with Scrapy:
① In the spider, parse the tag contents with XPath:
# Listing title
house_name = response.xpath("//dd[@class='dd-item title']/a/text()").extract()
# Layout (number of rooms)
house_type = response.xpath("//dd[@class='dd-item size']/span[1]/text()").extract()
# Floor area
house_area = response.xpath("//dd[@class='dd-item size']/span[3]/text()").extract()
# Total price
house_cost = response.xpath("//dd[@class='dd-item info']/div[@class='price']/span[@class='num']/text()").extract()
# Unit price (per square metre)
house_price = response.xpath("//dd[@class='dd-item info']/div[@class='time']/text()").extract()
# County-level city or district the house is in
house_add = response.xpath("//dd[@class='dd-item address']//a[1]/text()").extract()
# Street or neighbourhood the house is in
house_add_area = response.xpath("//dd[@class='dd-item address']//a[2]/span/text()").extract()
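Each `extract()` call above returns a list with one entry per listing on the page, so the seven lists have to be lined up row by row before they can be passed on. The following is only a minimal sketch of that step, assuming the lists are the same length and in the same order; `build_items` and `FIELDS` are hypothetical names, not part of the original spider.

```python
# Field names match the keys the pipeline reads from each item.
FIELDS = ("house_name", "house_type", "house_area", "house_cost",
          "house_price", "house_add", "house_add_area")


def build_items(*columns):
    """Zip parallel column lists into one dict per listing (hypothetical helper)."""
    if len(columns) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(columns)))
    items = []
    # zip stops at the shortest list, so a missing field silently drops rows.
    for row in zip(*columns):
        # Strip whitespace that may surround the extracted text nodes.
        items.append({field: value.strip() for field, value in zip(FIELDS, row)})
    return items
```

Inside the spider's callback, each dict returned by `build_items(house_name, house_type, ...)` could then be yielded one at a time.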
② Pass the scraped items to the pipeline, which writes them to the database:
def process_item(self, item, spider):
    # The leading 0 fills the id column of the ershoufang table.
    insert_sql = "insert into ershoufang values (0, %s, %s, %s, %s, %s, %s, %s);"
    print('Writing an item to the database...')
    self.cur.execute(insert_sql, (
        item['house_name'], item['house_type'], item['house_area'], item['house_cost'], item['house_price'],
        item['house_add'], item['house_add_area']))
    self.conn.commit()
    print('success ! ')
    # Return the item so that any later pipeline stage still receives it.
    return item
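The snippet above uses `self.conn` and `self.cur` without showing where they come from. Scrapy calls a pipeline's `open_spider` and `close_spider` methods once at the start and end of a crawl, which is a natural place to manage the connection. The sketch below illustrates that lifecycle under assumptions: the class name and column types are made up, and the standard-library `sqlite3` module stands in for PyMySQL so the example runs without a MySQL server.

```python
import sqlite3


class ErshoufangPipeline:
    """Hypothetical pipeline; sqlite3 is a stand-in for pymysql in this sketch."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection and cursor.
        self.conn = sqlite3.connect(":memory:")
        self.cur = self.conn.cursor()
        self.cur.execute(
            "create table ershoufang (id integer primary key autoincrement, "
            "house_name text, house_type text, house_area text, house_cost text, "
            "house_price text, house_add text, house_add_area text)"
        )

    def process_item(self, item, spider):
        # sqlite3 uses ? placeholders; pymysql uses %s as in the original.
        self.cur.execute(
            "insert into ershoufang values (null, ?, ?, ?, ?, ?, ?, ?)",
            (item["house_name"], item["house_type"], item["house_area"],
             item["house_cost"], item["house_price"],
             item["house_add"], item["house_add_area"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.cur.close()
        self.conn.close()
```

With PyMySQL, the same structure would apply, with `pymysql.connect(...)` in `open_spider` and the `%s` placeholders from the original query.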
2. Plotting second-hand housing prices as line charts
① Fetch the data from the database
def get_house_info(name):  # Fetch the rows whose house_add matches name
    query = "select * from ershoufang where house_add like %s ;"
    cursor.execute(query, (name,))  # pass the query parameters as a tuple
    house_info = tuple_to_list(cursor.fetchall())
    return house_info


def get_nt_house_info(name):  # name is unused here: every row is returned
    cursor.execute("select * from ershoufang;")
    nt_house_info = tuple_to_list(cursor.fetchall())
    # sleep(1)
    print('Successfully fetched the second-hand housing data for Nantong from the database!')
    # sleep(1)
    return nt_house_info
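Both functions above rely on a `tuple_to_list` helper that the post does not show. A plausible sketch follows, assuming it simply turns the tuple of row tuples returned by `fetchall()` into a list of lists; `like_pattern` is a hypothetical extra that shows one way the `name` argument could be wrapped for a `LIKE` match.

```python
def tuple_to_list(rows):
    """Convert a sequence of row tuples into a list of lists (assumed behaviour)."""
    return [list(row) for row in rows]


def like_pattern(name):
    """Wrap a district name in % wildcards for a SQL LIKE match (hypothetical)."""
    return "%" + name + "%"
```

Under these assumptions, `get_house_info(like_pattern("崇川"))` would match every row whose `house_add` contains that district name.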
② Clean the data with NumPy and fit a simple (one-variable) linear regression model
def numpy_deal(info):
    """Clean the data with numpy and fit the simple linear regression models."""
    info = numpy.array(info)
    info_s = info[:, [3]]  # floor area of every house, in square metres
    info_c = info[:, [4]]  # total price of every house, in units of 10,000 yuan
    info_p = info[:, [5]]  # price per square metre of every house, in yuan
    info_s = info_s.reshape(1, len(info)).astype(float)[0]  # flatten to a 1-D array and convert the strings to float
    info_c = info_c.reshape(1, len(info)).astype(float)[0]
    info_p = info_p.reshape(1, len(info)).astype(float)[0]
    # Fit two simple linear regressions: equation 1 is area vs. total price, equation 2 is area vs. unit price
    mean_s = numpy.mean(info_s)
    mean_c = numpy.mean(info_c)
    mean_p = numpy.mean(info_p)
    # Equation 1: total price as a function of area
    w1 = (numpy.sum((info_s - mean_s) * (info_c - mean_c))) / (numpy.sum(numpy.power((info_s - mean_s), 2)))
    b1 = mean_c - w1 * mean_s
    # Equation 2: unit price as a function of area
    w2 = (numpy.sum((info_s - mean_s) * (info_p - mean_p))) / (numpy.sum(numpy.power((info_s - mean_s), 2)))
    b2 = mean_p - w2 * mean_s
    return info_s, info_c, info_p, w1, b1, w2, b2
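The slope and intercept expressions in `numpy_deal` are the closed-form least-squares solution for a line y = wx + b. As a short note on where they come from, write x_i for the area of house i and y_i for the target (total price in equation 1, unit price in equation 2), with sample means x̄ and ȳ. The symbols x_i, y_i and n are notation introduced here, not names from the code.

```latex
w = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - w\,\bar{x}
```

In the code, w corresponds to `w1` or `w2`, b to `b1` or `b2`, x̄ to `mean_s`, and ȳ to `mean_c` or `mean_p`.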
③ Plot the data
# One fitted line per area; the Chinese labels are place names, drawn with my_font.
plt.plot(s_list, nantong_pre, label='南通', color='black', linewidth=3)  # Nantong (whole city)
plt.plot(s_list, chongchuan_pre, label='崇川', color='red')  # Chongchuan
plt.plot(s_list, gangzha_pre, label='港闸', color='orange')  # Gangzha
plt.plot(s_list, tongzhou_pre, label='通州', color='yellow')  # Tongzhou
plt.plot(s_list, haian_pre, label='海安', color='green')  # Hai'an
plt.plot(s_list, rugao_pre, label='如皋', color='blue')  # Rugao
plt.plot(s_list, rudong_pre, label='如东', color='indigo')  # Rudong
plt.plot(s_list, haimen_pre, label='海门', color='violet')  # Haimen
plt.plot(s_list, qidong_pre, label='启东', color='pink')  # Qidong
plt.title('南通市县级市房价曲线', fontproperties=my_font)  # "Housing price curves for Nantong's county-level cities"
plt.xlabel('面积/m²', fontproperties=my_font)  # "Area / m²"
plt.ylabel('总价/万元', fontproperties=my_font)  # "Total price / 10,000 yuan"
plt.yticks(range(0, 501, 25))  # a y-axis tick every 25 from 0 to 500
plt.grid(alpha=0.4)
plt.legend(prop=my_font)
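The plotting code uses `s_list` and the `*_pre` series without showing how they are built. Presumably each `*_pre` list is the fitted line evaluated over a common grid of areas. The sketch below shows that idea only; `predict`, the grid, and the coefficients are all made up for illustration and are not the post's actual values.

```python
def predict(w, b, areas):
    """Evaluate the fitted line y = w * x + b at each area (hypothetical helper)."""
    return [w * s + b for s in areas]


# Assumed x-axis grid: areas from 50 to 200 square metres in steps of 50.
s_list = list(range(50, 201, 50))
# Illustrative coefficients only; real values would come from numpy_deal.
example_pre = predict(0.5, 10.0, s_list)
```

With the real data, the `w1` and `b1` returned by `numpy_deal` for each district would take the place of the example coefficients, giving one predicted series per plotted line.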
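A closing note: I am still a beginner myself, and I hope to become an expert one day. :)
If anything is unclear or could be improved, feel free to get in touch and discuss. :)
Please credit the source if you repost. :)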