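[Project 10] Mining the factors that influence house prices
1. Project requirements
1. Clean and consolidate the data
2. Compute the "sale-to-rent ratio" and make a preliminary judgment on whether buying property in Shanghai pays off as a rental investment
3. Are population density, road-network density, and dining prices in Shanghai related to the average house price per square meter?
4. For every 10 km of distance from the city center, re-evaluate how strongly population density, road-network density, and dining prices correlate with the average house price per square meter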
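2. Approach
1. Clean the data as usual, handling missing values and similar issues
2. Compute the rental price per square meter and the sale price per square meter, then compute the sale-to-rent ratio
3. Use QGIS to inspect the spatial distribution of rental prices, sale prices, and the sale-to-rent ratio, then use scatter plots to see how each dimension affects house prices
4. Extract the data beyond 10 km and draw line charts to see how each indicator affects house prices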
3. Implementation steps
1.1 Import modules and read the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore')
# Suppress warnings
os.chdir('C:\\Users\\Administrator\\Desktop\\项目资料\\项目10房价影响因素挖掘')
df01 = pd.read_csv('house_rent.csv',engine = 'python')
df02 = pd.read_csv('house_sell.csv',engine = 'python')
1.2 Clean and merge the data
# Drop rows with missing values
df01.dropna(inplace = True)
df02.dropna(inplace = True)
# Rental price per square meter
df01['rent_dj'] = df01['price']/df01['area']
# Average per community (rentals) and per property (sales)
df1_rent = df01[['community','rent_dj','lng','lat']].groupby(by = 'community').mean()
df1_sell = df02[['property_name','average_price','lng','lat']].groupby(by = 'property_name').mean()
df1_rent.reset_index(inplace = True)
df1_sell.reset_index(inplace = True)
# Inner join on the community name; keep the coordinates from the rental table
df1_jg = pd.merge(df1_rent,df1_sell,left_on ='community',right_on='property_name')
df1_jg = df1_jg[['community','rent_dj','average_price','lng_x','lat_x']]
df1_jg.columns = ['community','rent_area','sell_area','lng','lat']
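To make the merge step concrete, here is a minimal sketch on made-up data. The community names and numbers are invented for illustration and are not taken from house_rent.csv or house_sell.csv. It shows that the default inner join keeps only the communities present in both tables, so anything listed only for rent or only for sale silently drops out.

```python
import pandas as pd

# Hypothetical toy data standing in for the rental and sale listings
rent = pd.DataFrame({
    'community': ['A', 'A', 'B', 'C'],
    'price': [6000, 8000, 4500, 3000],   # monthly rent
    'area': [60, 80, 50, 40],            # square meters
})
sell = pd.DataFrame({
    'property_name': ['A', 'B', 'D'],
    'average_price': [70000.0, 55000.0, 40000.0],  # sale price per square meter
})

# Rent per square meter, averaged per community
rent['rent_dj'] = rent['price'] / rent['area']
rent_mean = rent.groupby('community', as_index=False)['rent_dj'].mean()
sell_mean = sell.groupby('property_name', as_index=False)['average_price'].mean()

# Inner join (the pd.merge default): 'C' and 'D' are dropped because
# each appears in only one of the two tables
merged = pd.merge(rent_mean, sell_mean,
                  left_on='community', right_on='property_name')
print(merged[['community', 'rent_dj', 'average_price']])
```

If the number of dropped communities matters, passing how='outer' together with indicator=True to pd.merge is one way to count them before deciding on the join type.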
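2.1 Compute the sale-to-rent ratio, plot it, and export the data
df2 = df1_jg.copy()
# Sale-to-rent ratio: sale price per square meter / monthly rent per square meter
df2['szb'] = df2['sell_area']/df2['rent_area']
print(df2['szb'].median())
# Histogram of the ratio
plt.figure(figsize=(15,6))
df2['szb'].plot.hist(bins = 100,grid = True,color = 'g',edgecolor = 'k')
# Box plot of the ratio
plt.figure(figsize=(15,6))
df2['szb'].plot.box(vert=False, grid = True,sym = '+')
# Drop rows whose ratio is zero, then export for use in QGIS
df2 = df2[df2['szb']!=0]
df2.to_csv('data.csv',encoding ='gbk')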
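Conclusion: taking the median of 725, the sale-to-rent ratio is alarmingly high, which means the rental return is very low. As a rental investment it would take 725 months (about 60 years) of rent to recover the principal, ignoring currency depreciation. So people buy property in Shanghai not for the rental income but for the price appreciation.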
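The arithmetic behind that conclusion can be spelled out in a few lines. This is a rough sketch that assumes the rent is monthly and ignores vacancy, taxes, maintenance, and financing costs, so the yield is a gross figure only.

```python
# Sale-to-rent ratio = sale price per m² / monthly rent per m²,
# so its unit is "months of rent needed to equal the purchase price".
median_ratio = 725  # median reported in the post

payback_years = median_ratio / 12        # months -> years
gross_annual_yield = 12 / median_ratio   # yearly rent / purchase price

print(f'payback: {payback_years:.1f} years')
print(f'gross annual yield: {gross_annual_yield:.2%}')
```

Read the other way around, a ratio of 725 corresponds to a gross rental yield of well under 2% per year.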
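3.1 Using the data exported in step 2, build a grid map with 1 km² cells in QGIS
[Figure: distribution of sale prices]
[Figure: distribution of rental prices]
[Figure: distribution of the sale-to-rent ratio]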
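3.2 Read the data processed in QGIS and plot the relationship for each dimension
# NOTE: the original post omits the line that loads the QGIS export into df3;
# read that table into df3 (for example with pd.read_csv) before running this block.
df3.fillna(0,inplace = True)
# Min-max normalization of column col to the range [0, 1]
def f1(x,col):
    return ((x[col]-x[col].min())/(x[col].max()-x[col].min()))
df3['人口密度指标'] = f1(df3,'Z')          # population density index
df3['路网密度指标'] = f1(df3,'长度')       # road-network density index
df3['餐饮价格指标'] = f1(df3,'人均消费_')  # dining price index
# Straight-line distance from each grid cell to the city center
df3['距离市中心距离'] = ((df3['lng'] - 353508.848122)**2 + (df3['lat'] - 3456140.926976)**2)**0.5
df3 = df3[['人口密度指标','路网密度指标','餐饮价格指标','距离市中心距离','sell_area_']]
# Keep only grid cells that have a sale price
df3 = df3[df3['sell_area_']>0]
df3.reset_index(drop = True,inplace = True)
# Scatter plots of each indicator against the sale price
plt.figure(figsize = (15,6))
plt.scatter(df3['路网密度指标'],df3['sell_area_'],s = 2,alpha = 0.5)
plt.figure(figsize = (15,6))
plt.scatter(df3['人口密度指标'],df3['sell_area_'],s = 2,alpha = 0.5)
plt.figure(figsize = (15,6))
plt.scatter(df3['餐饮价格指标'],df3['sell_area_'],s = 2,alpha = 0.5)
plt.figure(figsize = (15,6))
plt.scatter(df3['距离市中心距离'],df3['sell_area_'],s = 2,alpha = 0.5,color = 'r')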
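[Figure: road-network density vs. house price]
[Figure: population density vs. house price]
[Figure: dining price vs. house price]
[Figure: distance to the city center vs. house price]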
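The two transformations used in this step, min-max normalization and the straight-line distance to the center, can be checked on a tiny example. The grid cells and the center below are invented for illustration. The sketch also assumes the coordinates are projected values in meters, which is what the center values in the post appear to be; with longitude and latitude in degrees the same formula would not yield meters.

```python
import pandas as pd

def min_max(series):
    """Rescale a Series linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# Hypothetical grid cells; x and y are projected coordinates in meters
cells = pd.DataFrame({
    'x': [0.0, 3000.0, 6000.0],
    'y': [0.0, 4000.0, 8000.0],
    'population': [200.0, 1000.0, 600.0],
})
center_x, center_y = 0.0, 0.0  # hypothetical center, not the one in the post

cells['population_index'] = min_max(cells['population'])
cells['dist_m'] = ((cells['x'] - center_x) ** 2
                   + (cells['y'] - center_y) ** 2) ** 0.5
print(cells)
```

Min-max scaling changes only the axis range, not the shape of a scatter plot or the Pearson correlation, so it mainly makes the three indicators comparable on a common 0 to 1 scale.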
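Conclusion: house prices in Shanghai are strongly related to distance from the city center. The closer to the center, the higher the price, and beyond about 30 km prices level off. Road-network density and population density show a moderate relationship with price, while dining prices show no clear positive correlation with house prices.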
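Labels such as "strong" and "moderate" are easier to defend with a number than with a scatter plot alone. The sketch below uses synthetic data, invented here and unrelated to the Shanghai dataset, to show one way of putting a correlation coefficient next to each indicator. Spearman is included as an assumption on my part: it is rank-based, so it is less affected by a non-linear flattening like the one seen in the distance plot.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data, for illustration only
distance = rng.uniform(0, 60000, n)                       # meters from center
price = 90000 - 1.0 * distance + rng.normal(0, 8000, n)   # falls with distance
dining = rng.uniform(0, 1, n)                             # unrelated to price

demo = pd.DataFrame({'distance': distance, 'price': price, 'dining': dining})

# Correlation of every indicator with price, by two methods
pearson = demo.corr(method='pearson')['price'].drop('price')
spearman = demo.corr(method='spearman')['price'].drop('price')
print(pd.DataFrame({'pearson': pearson, 'spearman': spearman}).round(2))
```

On the real df3, the same two calls to corr would give one row per indicator, which can then be compared directly with the visual impression from the scatter plots.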
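4.1 Build a new DataFrame from the data filtered at distances of 10 km to 70 km
jlqj = []           # distance thresholds
cyjg_pearson = []   # dining price index vs. price
lwjl_pearson = []   # road-network density index vs. price
rkmd_pearson = []   # population density index vs. price
zxjl_pearson = []   # distance to center vs. price
# Thresholds 10 km, 20 km, ..., 60 km (range excludes the stop value 70000)
for i in range(10000,70000,10000):
    # Cumulative filter: every grid cell within i meters of the center
    data = df3[df3['距离市中心距离']<=i]
    # Pearson correlation of each column with the sale price
    value = data.corr().loc['sell_area_']
    jlqj.append(i)
    cyjg_pearson.append(value.loc['餐饮价格指标'])
    lwjl_pearson.append(value.loc['路网密度指标'])
    rkmd_pearson.append(value.loc['人口密度指标'])
    zxjl_pearson.append(value.loc['距离市中心距离'])
df4 = pd.DataFrame({'cyjg_pearson':cyjg_pearson,
                    'lwjl_pearson':lwjl_pearson,
                    'rkmd_pearson':rkmd_pearson,
                    'zxjl_pearson':zxjl_pearson},
                   index = jlqj)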
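The loop above filters cumulatively (distance <= i), so each threshold also contains all the closer cells. If the aim is to compare separate 10 km rings, as the requirement "every 10 km" can be read, pd.cut offers one way to form non-overlapping bands. The following is a sketch on synthetic data with hypothetical column names, not a replacement for the post's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 3000

# Synthetic stand-in for df3; column names here are hypothetical
demo = pd.DataFrame({'dist_m': rng.uniform(0, 60000, n)})
demo['road_index'] = rng.uniform(0, 1, n)
demo['price'] = (80000 - demo['dist_m'] + 20000 * demo['road_index']
                 + rng.normal(0, 5000, n))

# Non-overlapping 10 km rings: [0, 10 km), [10 km, 20 km), ..., [50 km, 60 km)
edges = np.arange(0, 70000, 10000)
demo['ring'] = pd.cut(demo['dist_m'], bins=edges, right=False)

# Pearson r between road_index and price inside each ring
ring_corr = pd.Series({
    str(ring): group['road_index'].corr(group['price'])
    for ring, group in demo.groupby('ring', observed=True)
})
print(ring_corr.round(2))
```

Cumulative thresholds answer "how does the relationship look within i km of the center", while rings answer "how does it look at roughly i km"; the two can give quite different curves, so it is worth stating which one a chart shows.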
4.2 Plot line charts to inspect the correlations
from bokeh.plotting import figure,show
from bokeh.io import output_file
from bokeh.models import ColumnDataSource
from bokeh.models import HoverTool
output_file('项目10.html')
source = ColumnDataSource(data = df4)
hover4 = HoverTool(tooltips = [('Distance threshold (m)','@index'),
                               ('Dining price index','@cyjg_pearson'),
                               ('Road-network density index','@lwjl_pearson'),
                               ('Population density index','@rkmd_pearson'),
                               ('Distance to center','@zxjl_pearson')])
# width/height and legend_label follow recent Bokeh releases;
# older releases used plot_width/plot_height and legend instead
p4 = figure(width = 900,height = 400,title = 'Correlation of each indicator with house price',
            tools = [hover4,'box_select,reset,xwheel_zoom,pan,crosshair'])
p4.line(x = 'index', y = 'cyjg_pearson',line_color = 'red',line_alpha = 0.7,line_dash = [16,4],source = source,legend_label = 'Dining price index')
p4.circle(x = 'index', y = 'cyjg_pearson',source = source,size = 8,color ='red')
p4.line(x = 'index', y = 'lwjl_pearson',line_color = 'black',line_alpha = 0.7,line_dash = [16,4],source = source,legend_label = 'Road-network density index')
p4.circle(x = 'index', y = 'lwjl_pearson',source = source,size = 8,color ='black')
p4.line(x = 'index', y = 'rkmd_pearson',line_color = 'green',line_alpha = 0.7,line_dash = [16,4],source = source,legend_label = 'Population density index')
p4.circle(x = 'index', y = 'rkmd_pearson',source = source,size = 8,color ='green')
p4.line(x = 'index', y = 'zxjl_pearson',line_color = 'blue',line_alpha = 0.7,line_dash = [16,4],source = source,legend_label = 'Distance to center')
p4.circle(x = 'index', y = 'zxjl_pearson',source = source,size = 8,color ='blue')
p4.legend.location ='center_right'
show(p4)
print('finished')
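Conclusion: from 10 km out to about 30 km from the city center, house prices correlate fairly strongly with road-network density, population density, and distance to the center. The 20 to 30 km range happens to be the boundary between Shanghai's urban area and its suburbs. Dining prices show no positive correlation with house prices.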