Scraping source code <script> tag from a web page

【Question title】: Scraping source code <script> tag from a web page
【Posted】: 2017-09-12 03:15:08
【Question description】:

I am looking for a way to scrape part of a page's source code. The information I need is inside a tag similar to this one.

<script>
.......
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>

I am trying to get the contents of the playerIdMap entries, for example 4 and 614, or the whole line.

【Question discussion】:

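Have you tried reading the HTML file, checking whether each line contains "playerIdMap", and saving the lines that do? What about a regular expression that captures the playerIdMap keys and values? You could even explode() on playerIdMap (though that is inefficient). There are many ways.
I can't see any of this on the front end. I have scraped images and the like before, but never anything that is part of the page source and not visible. How would I write that? Thanks
Try this thread: stackoverflow.com/questions/584826/… It describes a number of ways to achieve this.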
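
The line-by-line idea from the first comment can be sketched roughly as follows. This is only an illustration in Python, with a shortened copy of the question's snippet standing in for the fetched page; `page_source` and `lines_mentioning` are hypothetical names, not part of any library.

```python
# Hypothetical page source; in practice this would be the HTML fetched from the site.
page_source = """<script>
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
getPlayerIdMap = function() { return playerIdMap; };   // global
</script>"""


def lines_mentioning(source, needle):
    """Return the stripped lines of `source` that contain `needle`."""
    return [line.strip() for line in source.splitlines() if needle in line]


# Searching for "playerIdMap[" keeps the assignment lines and skips the
# declaration and the getter, which mention the name without indexing it.
assignments = lines_mentioning(page_source, "playerIdMap[")
print(assignments)
```

A plain substring test like this returns whole lines; pulling the key and value apart still needs a regular expression or further string splitting.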

Tags: php screen-scraping

【Solution 1】:

Edit-2

Complete PHP code, inspired by the code from How to get data from API - php - curl

<?php
/**
 * Handles making a cURL request
 *
 * @param string $url         URL to call out to for information.
 * @param bool   $callDetails Optional condition to allow for extended
 *   information return including error and getinfo details.
 *
 * @return array $returnGroup cURL response and optional details.
 */
function makeRequest($url, $callDetails = false)
{
  // Set handle
  $ch = curl_init($url);

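  // Set options
  // Note: disabling peer verification skips TLS certificate checks;
  // leave it enabled when fetching over HTTPS from untrusted networks.
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  // Execute the cURL handle and add the result to the return array.
  // curl_exec() returns false on failure; see 'errno' and 'error' below.
  $result = curl_exec($ch);
  $returnGroup = ['curlResult' => $result,];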

  // If details of curl execution are asked for add them to return group.
  if ($callDetails) {
    $returnGroup['info'] = curl_getinfo($ch);
    $returnGroup['errno'] = curl_errno($ch);
    $returnGroup['error'] = curl_error($ch);
  }

  // Close cURL and return response.
  curl_close($ch);
  return $returnGroup;
}

$url = "http://www.bullshooterlive.com/my-stats/999/";
$response = makeRequest($url, true);

$re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';

preg_match_all($re, $response['curlResult'], $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

//var_dump($response);

Edit-1

Sorry, I did not realize you had asked a PHP question. Not sure why I assumed Scrapy here. Anyway, the PHP code below should help.

$re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';
$str = '<script>
.......
var playerIdMap = {};
playerIdMap[\'4\'] = \'614\';
playerIdMap[\'5\'] = \'84\';
playerIdMap[\'6\'] = \'65\';
playerIdMap[\'7\'] = \'701\';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

Previous answer

You can use something like the following

>>> data = """
... <script>
... .......
... var playerIdMap = {};
... playerIdMap['4'] = '614';
... playerIdMap['5'] = '84';
... playerIdMap['6'] = '65';
... playerIdMap['7'] = '701';
... getPlayerIdMap = function() { return playerIdMap; };   // global
... }
... enclosePlayerMap();
... </script>
... """
>>> import re
>>>
>>> regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
>>> re.findall(regex, data)
[('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
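
If a lookup table is more convenient than a list of tuples, the same pattern can feed a dictionary through its named groups. A minimal sketch: `PATTERN` and `parse_player_id_map` are names made up for this example, and the pattern loosens `\s+` to `\s*`, which is an assumption on my part so that assignments written without spaces around `=` still match.

```python
import re

# Sample input copied from the question's snippet (shortened).
data = """
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
"""

# Same named groups as above; \s* instead of \s+ is an assumption (see lead-in).
PATTERN = re.compile(r"playerIdMap\['(?P<id>\d+)'\]\s*=\s*'(?P<value>\d+)'")


def parse_player_id_map(text):
    """Build a dict from every playerIdMap['k'] = 'v' assignment in `text`."""
    return {m.group("id"): m.group("value") for m in PATTERN.finditer(text)}


player_id_map = parse_player_id_map(data)
print(player_id_map)
```

Keys and values stay strings here; convert them with `int()` if numeric IDs are needed.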

You need to get the script tag first, using the following (here `response` is a Scrapy response object)

data = response.xpath("//script[contains(text(),'getPlayerIdMap')]").extract_first() 

import re
regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
print(re.findall(regex, data))
[('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
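
Without Scrapy, the step of isolating the right script tag can be approximated with the standard library. This is a sketch under assumptions, not the approach from the answer itself: `ScriptCollector` is a hypothetical helper built on `html.parser.HTMLParser`, and the `html` string is an invented stand-in for the real page.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_script:
            self.scripts[-1] += data


# Invented sample page with two script tags.
html = """<html><body>
<script>var unrelated = 1;</script>
<script>
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
getPlayerIdMap = function() { return playerIdMap; };
</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Pick the script that mentions getPlayerIdMap, mirroring the XPath contains() test.
target = next(s for s in collector.scripts if "getPlayerIdMap" in s)

regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
pairs = re.findall(regex, target)
print(pairs)
```

Narrowing the search to one script body before applying the regex avoids accidental matches elsewhere in the page.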

【Discussion】:

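Ahh dang, sorry, I haven't used that in a long time. I wouldn't even know how to implement it in my PHP code, if that is possible at all.
@KJThaDon, please check the updated answer. My bad, I somehow thought this was a Scrapy question, which is why I posted Python code
Thanks, though I'm a bit lost in places. It seems to do what I want; does it create an array? I tried adding my URL to fetch the code, but I get the error "preg_match_all() expects at most 5 parameters, 6 given". Here is the code: paste.ee/p/MCjKN
Wow, I think this is perfect, thank you so much! Very readable
Let us continue this discussion in chat.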