前些天看到csdn上的一篇文章,利用python写爬虫来分析上海房价的。感觉挺有意思的。正好最最近写snake后台也写到了文章采集,我也来用php的爬虫来分析一下南京的房价。说干就开始吧。
本次爬虫的依赖文件: 首先是ares333大神的CURL类。我用的是初期的版本,这是[url]https://github.com/ares333/php-curlmulti[/url]大神的github项目地址,他写的curl确实很牛!
采集用的是phpQuery,不知道这个类的朋友,可以自行百度吧。
至于数据的来源,我选择安居客,数据量还是可以的,打开安居客选到南京的频道。开始分析他们的页面结构,至于怎么用phpQuery分析页面结构采集的方法,这里就不做详细的介绍了。分析好结构,好,开始建立数据表。首先建立区域表,房屋交易都是分版块的,版块表结构如下:
1 CREATE TABLE `area` ( 2 `id` int(11) NOT NULL AUTO_INCREMENT, 3 `name` varchar(155) NOT NULL COMMENT \'南京市区\', 4 `url` varchar(155) NOT NULL COMMENT \'房源区域连接\', 5 `pid` int(2) NOT NULL COMMENT \'分类\', 6 PRIMARY KEY (`id`) 7 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
我是首先自己添加的一些区服的数据,其实可以采集这些,因为就那几个区,地址有限,就直接添加了。添加了是14条数据:
初始数据准备好了,就可以开始采集所有的区域版块入口地址了。贴上代码
area.php
1 <?php 2 // +---------------------------------------------------------------------- 3 // | 采集区域脚本 4 // +---------------------------------------------------------------------- 5 // | Author: NickBai <1902822973@qq.com> 6 // +---------------------------------------------------------------------- 7 set_time_limit(0); 8 require \'init.php\'; 9 10 //根据大区信息前往抓取 11 $sql = "select * from `area`"; 12 $area = $db->query( $sql )->fetchAll( PDO::FETCH_ASSOC ); 13 14 foreach($area as $key=>$vo){ 15 16 $url = $vo[\'url\']; 17 $result = $curl->read($url); 18 19 $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result[\'content\'], $temp) ? strtolower( $temp[1] ) : ""; 20 phpQuery::$defaultCharset = $charset; //设置默认编码 21 22 $html = phpQuery::newDocumentHTML( $result[\'content\'] ); 23 24 $span = $html[\'.items .sub-items a\']; 25 26 $st = $db->prepare("insert into area(name,url,pid) values(?,?,?)"); 27 foreach($span as $v){ 28 $v = pq( $v ); 29 30 //为方便分页抓取,先加入分页规则 31 $href = trim( $v->attr(\'href\') ) . \'p*/#filtersort\'; 32 $st->execute([ trim( $v->text() ), $href, $vo[\'id\']]); 33 } 34 }
采集出的单条数据如下:
15 百家湖 http://nanjing.anjuke.com/sale/baijiahu/p*/#filtersort 1
数据地址都有了,而且页面地址我加了*,这样就可以替换了,打开程序就能开始采集每个模块下的其他分页的书据了。最重要的主程序就要开始了;
新建hdetail表来记录采集来的房屋数信息:
1 CREATE TABLE `hdetail` ( 2 `id` int(11) NOT NULL AUTO_INCREMENT, 3 `pid` int(5) NOT NULL COMMENT \'区域id\', 4 `square` int(10) DEFAULT NULL COMMENT \'面积\', 5 `housetype` varchar(55) DEFAULT \'\' COMMENT \'房屋类型\', 6 `price` int(10) DEFAULT \'0\' COMMENT \'单价\', 7 `allprice` int(10) DEFAULT \'0\' COMMENT \'总价\', 8 `name` varchar(155) DEFAULT \'\' COMMENT \'小区名称\', 9 `addr` varchar(155) DEFAULT \'\' COMMENT \'小区地址\', 10 PRIMARY KEY (`id`) 11 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
数据库有了,那么主程序奉上。house.php
1 <?php 2 // +---------------------------------------------------------------------- 3 // | 采集各区具体房源信息 4 // +---------------------------------------------------------------------- 5 // | Author: NickBai <1902822973@qq.com> 6 // +---------------------------------------------------------------------- 7 set_time_limit(0); 8 require \'init.php\'; 9 10 //查询各板块数据 11 $sql = "select * from `area` where id > 14"; 12 $allarea = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 13 //http://nanjing.anjuke.com/sale/页面不存在时,会跳转到首页 14 15 foreach($allarea as $key=>$vo){ 16 17 $url = $vo[\'url\']; 18 $i = 1; 19 20 while ( true ){ 21 22 $urls = str_replace( "*" , $i, $url); 23 $result = $curl->read( $urls ); 24 if( "http://nanjing.anjuke.com/sale/" == $result[\'info\'][\'url\'] ){ 25 break; 26 } 27 28 $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result[\'content\'], $temp) ? strtolower( $temp[1] ) : ""; 29 phpQuery::$defaultCharset = $charset; //设置默认编码 30 31 $html = phpQuery::newDocumentHTML( $result[\'content\'] ); 32 33 $div = $html[\'#houselist-mod li .house-details\']; 34 $isGet = count( $div->elements ); //未采集到内容跳出,视为结束 35 if( 0 == $isGet ){ 36 break; 37 } 38 39 foreach($div as $v){ 40 41 $sql = "insert into hdetail(pid,square,housetype,price,allprice,name,addr) "; 42 $pid = $vo[\'id\']; 43 $square = rtrim( trim( pq($v)->find("div:eq(1) span:eq(0)")->text() ), "平方米"); 44 $htype = trim( pq($v)->find("div:eq(1) span:eq(1)")->text() ); 45 $price = rtrim ( trim( pq($v)->find("div:eq(1) span:eq(2)")->text() ), "元/m²"); 46 $area = explode(" ", trim( pq($v)->find("div:eq(2) span")->text() ) ); 47 48 $name = str_replace( chr(194) . chr(160), "", array_shift($area) ); //utf-8中的空格无法用trim去除,所以采用此方法 49 $addr = rtrim( ltrim (trim( array_pop($area) ) , "["), "]" ); 50 $allprice = trim( pq($v)->siblings(".pro-price")->find("span strong")->text() ); 51 52 $sql .= " value( ". $pid .",". $square .", \'". $htype ."\' ,". $price .",". $allprice .", \'". $name ."\' ,\'". $addr ."\' )"; 53 $db->query($sql); 54 } 55 56 echo mb_convert_encoding($vo[\'name\'], "gbk", "utf-8") . " PAGE : ". $i . PHP_EOL; 57 $i++; 58 59 } 60 61 } 62 [/code] 63 跳过前面的大区,逐个采集。建议用cmd模式运行这个脚本。因为时间较长,所以用浏览器会导致卡死现象。至于不知道怎么用cmd命令执行php的,自己百度吧。 64 [img]/Uploads/editor/2016-08-17/57b3d927dd54e.jpg[/img] 65 如果觉得慢的话,你们可以复制几分house.php文件,修改 66 [code] 67 $sql = "select * from `area` where id > 14"; 68 [/code] 69 根据id进行截取,多打开几个cmd执行,就变成多进程模式了。 70 [img]/Uploads/editor/2016-08-17/57b3d978afae8.jpg[/img] 71 下面就是等待了,我是8.16号采集的,一共采集了311226条数据。好了现在数有了,就可以开始分析了。我分析的代码如下: 72 [code] 73 <?php 74 require "init.php"; 75 76 $data = unserialize( file_get_contents(\'./data/nj.data\') ); 77 if( empty( $data ) ){ 78 79 //全南京 80 $sql = "select avg(price) price from hdetail"; 81 $nanjing = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )[\'price\'] ); 82 83 //其余数据 84 $data = [ 85 $nanjing, 86 getOtherPrice(\'2,3,4,5,6,7,8,10\'), 87 getOtherPrice(\'1\'), 88 getOtherPrice(\'2\'), 89 getOtherPrice(\'3\'), 90 getOtherPrice(\'4\'), 91 getOtherPrice(\'5\'), 92 getOtherPrice(\'6\'), 93 getOtherPrice(\'7\'), 94 getOtherPrice(\'8\'), 95 getOtherPrice(\'9\'), 96 getOtherPrice(\'10\'), 97 getOtherPrice(\'11\'), 98 getOtherPrice(\'12\'), 99 getOtherPrice(\'13\') 100 ]; 101 102 //添加缓存 103 file_put_contents(\'./data/nj.data\', serialize( $data )); 104 } 105 106 //均价最高TOP10 107 $sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price desc limit 10"; 108 $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 109 $x = ""; 110 $y = ""; 111 foreach($res as $vo){ 112 $x .= "\'" . $vo[\'name\'] . "\',"; 113 $y .= intval( $vo[\'price\'] ). ","; 114 } 115 116 //均价最低TOP10 117 $sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price asc limit 10"; 118 $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 119 $xl = ""; 120 $yl = ""; 121 foreach($res as $vo){ 122 $xl .= "\'" . $vo[\'name\'] . "\',"; 123 $yl .= intval( $vo[\'price\'] ). ","; 124 } 125 126 //交易房型数据 127 $sql = "select count(0) allnum, housetype from hdetail GROUP BY housetype order by allnum desc"; 128 $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 129 $htype = ""; 130 foreach($res as $vo){ 131 $htype .= "[ \'" . $vo[\'housetype\'] . "\', " .$vo[\'allnum\']. "],"; 132 } 133 134 $htype = rtrim($htype, \',\'); 135 136 //交易的房屋面积数据 137 $square = [\'50平米以下\', \'50-70平米\', \'70-90平米\', \'90-120平米\', \'120-150平米\', \'150-200平米\', \'200-300平米\', \'300平米以上\']; 138 $sql = "select count(0) allnum, square from hdetail GROUP BY square"; 139 $squ = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 140 141 $p50 = 0; 142 $p70 = 0; 143 $p90 = 0; 144 $p120 = 0; 145 $p150 = 0; 146 $p200 = 0; 147 $p250 = 0; 148 $p300 = 0; 149 150 foreach($squ as $key=>$vo){ 151 if( $vo[\'square\'] < 50 ){ 152 $p50 += $vo[\'allnum\']; 153 } 154 if( $vo[\'square\'] >= 50 && $vo[\'square\'] < 70 ){ 155 $p70 += $vo[\'allnum\']; 156 } 157 if( $vo[\'square\'] >= 70 && $vo[\'square\'] < 90 ){ 158 $p90 += $vo[\'allnum\']; 159 } 160 if( $vo[\'square\'] >= 90 && $vo[\'square\'] < 120 ){ 161 $p120 += $vo[\'allnum\']; 162 } 163 if( $vo[\'square\'] >= 120 && $vo[\'square\'] < 150 ){ 164 $p150 += $vo[\'allnum\']; 165 } 166 if( $vo[\'square\'] >= 150 && $vo[\'square\'] < 200 ){ 167 $p200 += $vo[\'allnum\']; 168 } 169 if( $vo[\'square\'] >= 200 && $vo[\'square\'] < 300 ){ 170 $p250 += $vo[\'allnum\']; 171 } 172 if( $vo[\'square\'] >= 300 ){ 173 $p300 += $vo[\'allnum\']; 174 } 175 } 176 177 $num = [ $p50, $p70, $p90, $p120, $p150, $p200, $p250, $p300 ]; 178 179 $sqStr = ""; 180 foreach($square as $key=>$vo){ 181 $sqStr .= "[ \'" . $vo . "\', " .$num[$key]. "],"; 182 } 183 184 185 //根据获取ids字符串获取对应的均价信息 186 function getOtherPrice($str){ 187 global $db; 188 189 $sql = "select id from area where pid in(" . $str . ")"; 190 $city = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 191 $ids = ""; 192 foreach($city as $v){ 193 $ids .= $v[\'id\'] . ","; 194 } 195 $sql = "select avg(price) price from hdetail where pid in (".rtrim($ids, ",").")"; 196 $price = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )[\'price\'] ); 197 198 return $price; 199 } 200 201 ?> 202 <!DOCTYPE html> 203 <html> 204 <head> 205 <meta charset="utf-8"> 206 <meta name="viewport" content="width=device-width, initial-scale=1.0"> 207 <title>南京房价分析</title> 208 <link rel="shortcut icon" href="favicon.ico"> <link href="css/bootstrap.min.css?v=3.3.6" rel="stylesheet"> 209 <link href="css/font-awesome.min.css?v=4.4.0" rel="stylesheet"> 210 <link href="css/animate.min.css" rel="stylesheet"> 211 <link href="css/style.min.css?v=4.1.0" rel="stylesheet"> 212 </head> 213 <body class="gray-bg"> 214 <div class="wrapper wrapper-content"> 215 <div class="row"> 216 <div class="col-sm-12"> 217 <div class="row"> 218 <div class="col-sm-12"> 219 <div class="ibox float-e-margins"> 220 <div class="ibox-title"> 221 <h5>全南京以及各区二手房均价</h5> 222 <div class="ibox-tools"> 223 <a class="collapse-link"> 224 <i class="fa fa-chevron-up"></i> 225 </a> 226 <a class="close-link"> 227 <i class="fa fa-times"></i> 228 </a> 229 </div> 230 </div> 231 <div class="ibox-content"> 232 <div id="container"></div> 233 </div> 234 </div> 235 </div> 236 </div> 237 </div> 238 </div> 239 <div class="row"> 240 <div class="col-sm-6"> 241 <div class="row"> 242 <div class="col-sm-12"> 243 <div class="ibox float-e-margins"> 244 <div class="ibox-title"> 245 <h5>均价最高的小区TOP10</h5> 246 <div class="ibox-tools"> 247 <a class="collapse-link"> 248 <i class="fa fa-chevron-up"></i> 249 </a> 250 <a class="close-link"> 251 <i class="fa fa-times"></i> 252 </a> 253 </div> 254 </div> 255 <div class="ibox-content"> 256 <div id="avgpriceh"></div> 257 </div> 258 </div> 259 </div> 260 </div> 261 </div> 262 <div class="col-sm-6"> 263 <div class="row"> 264 <div class="col-sm-12"> 265 <div class="ibox float-e-margins"> 266 <div class="ibox-title"> 267 <h5>均价最低的小区TOP10</h5> 268 <div class="ibox-tools"> 269 <a class="collapse-link"> 270 <i class="fa fa-chevron-up"></i> 271 </a> 272 <a class="close-link"> 273 <i class="fa fa-times"></i> 274 </a> 275 </div> 276 </div> 277 <div class="ibox-content"> 278 <div id="avgpricel"></div> 279 </div> 280 </div> 281 </div> 282 </div> 283 </div> 284 </div> 285 286 <div class="row"> 287 <div class="col-sm-6"> 288 <div class="row"> 289 <div class="col-sm-12"> 290 <div class="ibox float-e-margins"> 291 <div class="ibox-title"> 292 <h5>交易房型比例</h5> 293 <div class="ibox-tools"> 294 <a class="collapse-link"> 295 <i class="fa fa-chevron-up"></i> 296 </a> 297 <a class="close-link"> 298 <i class="fa fa-times"></i> 299 </a> 300 </div> 301 </div> 302 <div class="ibox-content"> 303 <div id="htype"></div> 304 </div> 305 </div> 306 </div> 307 </div> 308 </div> 309 <div class="col-sm-6"> 310 <div class="row"> 311 <div class="col-sm-12"> 312 <div class="ibox float-e-margins"> 313 <div class="ibox-title"> 314 <h5>交易房屋面积比例</h5> 315 <div class="ibox-tools"> 316 <a class="collapse-link"> 317 <i class="fa fa-chevron-up"></i> 318 </a> 319 <a class="close-link"> 320 <i class="fa fa-times"></i> 321 </a> 322 </div> 323 </div> 324 <div class="ibox-content"> 325 <div id="square"></div> 326 </div> 327 </div> 328 </div> 329 </div> 330 </div> 331 </div> 332 333 </div> 334 <script type="text/javascript" src="js/jquery.min.js?v=2.1.4"></script> 335 <script type="text/javascript" src="js/bootstrap.min.js?v=3.3.6"></script> 336 <script type="text/javascript" src="http://cdn.hcharts.cn/highcharts/highcharts.js"></script> 337 <script type="text/javascript"> 338 $(function () { 339 $(\'#container\').highcharts({ 340 chart: { 341 type: \'column\' 342 }, 343 title: { 344 text: \'全南京以及各区二手房均价\' 345 }, 346 subtitle: { 347 text: \'来源于安居客8.16的数据\' 348 }, 349 xAxis: { 350 categories: [\'全南京\',\'江南八区\',\'江宁区\',\'鼓楼区\',\'白下区\',\'玄武区\',\'建邺区\',\'秦淮区\',\'下关区\',\'雨花台区\',\'浦口区\',\'栖霞区\',\'六合区\', 351 \'溧水区\',\'高淳区\',\'大厂\'], 352 crosshair: true 353 }, 354 yAxis: { 355 min: 0, 356 title: { 357 text: \'元/m²\' 358 } 359 }, 360 tooltip: { 361 headerFormat: \'<span style="font-size:10px">{point.key}</span><table>\', 362 pointFormat: \'<tr><td style="color:{series.color};padding:0">{series.name}: </td>\' + 363 \'<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>\', 364 footerFormat: \'</table>\', 365 shared: true, 366 useHTML: true 367 }, 368 plotOptions: { 369 column: { 370 pointPadding: 0.2, 371 borderWidth: 0, 372 dataLabels:{ 373 enabled:true// dataLabels设为true 374 } 375 } 376 }, 377 series: [{ 378 name: \'平均房价\', 379 data: [<?php echo implode(\',\', $data); ?>] 380 }] 381 }); 382 383 //均价最高top10 384 $(\'#avgpriceh\').highcharts({ 385 chart: { 386 type: \'column\' 387 }, 388 title: { 389 text: \'均价最高的小区TOP10\' 390 }, 391 subtitle: { 392 text: \'来源于安居客8.16的数据\' 393 }, 394 xAxis: { 395 categories: [<?=$x; ?>], 396 crosshair: true 397 }, 398 yAxis: { 399 min: 0, 400 title: { 401 text: \'元/m²\' 402 } 403 }, 404 tooltip: { 405 headerFormat: \'<span style="font-size:10px">{point.key}</span><table>\', 406 pointFormat: \'<tr><td style="color:{series.color};padding:0">{series.name}: </td>\' + 407 \'<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>\', 408 footerFormat: \'</table>\', 409 shared: true, 410 useHTML: true 411 }, 412 plotOptions: { 413 column: { 414 pointPadding: 0.2, 415 borderWidth: 0, 416 dataLabels:{ 417 enabled:true// dataLabels设为true 418 } 419 } 420 }, 421 series: [{ 422 name: \'平均房价\', 423 data: [<?=$y; ?>] 424 }] 425 }); 426 427 //均价最低top10 428 $(\'#avgpricel\').highcharts({ 429 chart: { 430 type: \'column\' 431 }, 432 title: { 433 text: \'均价最低的小区TOP10\' 434 }, 435 subtitle: { 436 text: \'来源于安居客8.16的数据\' 437 }, 438 xAxis: { 439 categories: [<?=$xl; ?>], 440 crosshair: true 441 }, 442 yAxis: { 443 min: 0, 444 title: { 445 text: \'元/m²\' 446 } 447 }, 448 tooltip: { 449 headerFormat: \'<span style="font-size:10px">{point.key}</span><table>\', 450 pointFormat: \'<tr><td style="color:{series.color};padding:0">{series.name}: </td>\' + 451 \'<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>\', 452 footerFormat: \'</table>\', 453 shared: true, 454 useHTML: true 455 }, 456 plotOptions: { 457 column: { 458 pointPadding: 0.2, 459 borderWidth: 0, 460 dataLabels:{ 461 enabled:true// dataLabels设为true 462 } 463 } 464 }, 465 series: [{ 466 name: \'平均房价\', 467 data: [<?=$yl; ?>] 468 }] 469 }); 470 471 // Radialize the colors 472 Highcharts.getOptions().colors = Highcharts.map(Highcharts.getOptions().colors, function (color) { 473 return { 474 radialGradient: { cx: 0.5, cy: 0.3, r: 0.7 }, 475 stops: [ 476 [0, color], 477 [1, Highcharts.Color(color).brighten(-0.3).get(\'rgb\')] // darken 478 ] 479 }; 480 }); 481 //房型类型 482 $(\'#htype\').highcharts({ 483 chart: { 484 plotBackgroundColor: null, 485 plotBorderWidth: null, 486 plotShadow: false 487 }, 488 title: { 489 text: \'交易的二手房型比例\' 490 }, 491 tooltip: { 492 pointFormat: \'{series.name}: <b>{point.percentage:.1f}%</b>\' 493 }, 494 plotOptions: { 495 pie: { 496 allowPointSelect: true, 497 cursor: \'pointer\', 498 dataLabels: { 499 enabled: true, 500 format: \'<b>{point.name}</b>: {point.percentage:.1f} %\', 501 style: { 502 color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || \'black\' 503 }, 504 connectorColor: \'silver\' 505 } 506 } 507 }, 508 series: [{ 509 type: \'pie\', 510 name: \'Browser share\', 511 data: [ 512 <?=$htype; ?> 513 ] 514 }] 515 }); 516 517 //房型面积类型 518 $(\'#square\').highcharts({ 519 chart: { 520 plotBackgroundColor: null, 521 plotBorderWidth: null, 522 plotShadow: false 523 }, 524 title: { 525 text: \'交易的二手房面积比例\' 526 }, 527 tooltip: { 528 pointFormat: \'{series.name}: <b>{point.percentage:.1f}%</b>\' 529 }, 530 plotOptions: { 531 pie: { 532 allowPointSelect: true, 533 cursor: \'pointer\', 534 dataLabels: { 535 enabled: true, 536 format: \'<b>{point.name}</b>: {point.percentage:.1f} %\', 537 style: { 538 color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || \'black\' 539 }, 540 connectorColor: \'silver\' 541 } 542 } 543 }, 544 series: [{ 545 type: \'pie\', 546 name: \'Browser share\', 547 data: [ 548 <?=$sqStr; ?> 549 ] 550 }] 551 }); 552 553 }); 554 </script> 555 </body> 556 </html>
具体的截图效果,大家可以参考我的http://www.thinkphp.cn/code/2239.html这篇文章,源码在那边页游呈现