【Question title】: Need help pulling content from a PHP page with DOM / Regex
【Posted】: 2018-01-23 20:16:21
【Question】:

Here is my code so far:

<?php
$start = date("d/m/y", strtotime('today'));
$end = date("d/m/y", strtotime('tomorrow'));

$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0"
));
$context = stream_context_create($opts);
$url = "http://www.hot.net.il/PageHandlers/LineUpAdvanceSearch.aspx?text=&channel=506&genre=-1&ageRating=-1&publishYear=-1&productionCountry=-1&startDate=$start&endDate=$end&pageSize=1";
$data = file_get_contents($url, false, $context);

$re = '/LineUpId=(.+\d)/';
preg_match($re, $data, $matches);

$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0"
));
$context = stream_context_create($opts);
$url = "http://www.hot.net.il/PageHandlers//LineUpDetails.aspx?lcid=1037&luid=$matches[1]";
$data = file_get_contents($url, false, $context);
echo $data;
?> 

I'm trying to build a TV guide for a single channel that shows the current program.

Part of the HTML page:

<div class="GuideLineUpDetailsCenter">
    <a class="LineUpbold">Name of the Show</a>
    <br>
    <div class="LineUpDetailsTime">2018 22:45 - 23:30</div>
    <br>
    <div class="show">Information about the program</div>
    <br>
    <div class="LineUpbold">+14</div>
    <br>
</div>

I want to extract the content and do the following:

echo $LineUpbold;

echo $LineUpDetailsTime;

echo $show;

echo $LineUpbold;

【Comments】:

    Tags: php regex dom


    【Solution 1】:

    Use a DOM parser and appropriate XPath queries instead:

    <?php
    
    $data = <<<DATA
    <div class="GuideLineUpDetailsCenter">
        <a class="LineUpbold">Name of the Show</a>
        <br>
        <div class="LineUpDetailsTime">2018 22:45 - 23:30</div>
        <br>
        <div class="show">Information about the program</div>
        <br>
        <div class="LineUpbold">+14</div>
        <br>
    </div>
    DATA;
    
    # set up the dom
    $dom = new DOMDocument();
    $dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
    # set up the xpath
    $xpath = new DOMXPath($dom);
    
    foreach ($xpath->query("//div[@class = 'GuideLineUpDetailsCenter']") as $container) {
        $name = $xpath->query("a[@class = 'LineUpbold']/text()", $container)->item(0);
        echo $name->nodeValue;
    
        $details = $xpath->query("div[@class = 'LineUpDetailsTime']/text()", $container)->item(0);
        echo $details->nodeValue;
    
        # and so on...
    
    }
    

    The code loads your string, searches for divs with the class GuideLineUpDetailsCenter, loops over them, and looks for the appropriate child elements inside each div.
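The "and so on..." part could be filled in roughly as follows. This is a minimal sketch that reuses the sample markup from the question; the helper name `firstText` and the array keys `name`, `time`, `show`, and `rating` are made up for illustration and are not part of the original answer. Because the class `LineUpbold` appears on both an `a` and a `div`, the element name in each query is what tells the two apart.

```php
<?php
// Sample markup copied from the question; on the live site this string
// would come from file_get_contents() instead.
$data = <<<DATA
<div class="GuideLineUpDetailsCenter">
    <a class="LineUpbold">Name of the Show</a>
    <br>
    <div class="LineUpDetailsTime">2018 22:45 - 23:30</div>
    <br>
    <div class="show">Information about the program</div>
    <br>
    <div class="LineUpbold">+14</div>
    <br>
</div>
DATA;

// Hypothetical helper: returns the trimmed text of the first node matching
// $query relative to $context, or an empty string when nothing matches.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim($nodes->item(0)->textContent);
}

// Collect parser warnings instead of printing them; real pages are rarely valid HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$programs = [];
foreach ($xpath->query("//div[@class = 'GuideLineUpDetailsCenter']") as $container) {
    $programs[] = [
        'name'   => firstText($xpath, "a[@class = 'LineUpbold']", $container),
        'time'   => firstText($xpath, "div[@class = 'LineUpDetailsTime']", $container),
        'show'   => firstText($xpath, "div[@class = 'show']", $container),
        'rating' => firstText($xpath, "div[@class = 'LineUpbold']", $container),
    ];
}

// Print one field per line for every program that was found.
foreach ($programs as $program) {
    echo implode(PHP_EOL, $program), PHP_EOL;
}
```

Returning an empty string for a missing node keeps the loop from failing with a null dereference when the page omits one of the fields, which the `->item(0)->nodeValue` chain would otherwise do.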

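    【Comments】:

    • Thanks, it works :) But the text I get back comes out as symbols. What's wrong? "ר××××× ×××××ס 7 11. ××××¤× ×××ש××ש×, 23 ×× ××ר, 2018 22:45 - 23 :30 " The content is generally not in English; should I add UTF-8 to the request?
    • @dizzy: Can you provide a link? I'm pretty sure these are Hebrew characters.
    • Yes, it's Hebrew: hot.net.il/PageHandlers//…
    • It works with: $dom->loadHTML(mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8'));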
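The fix in the last comment works because `loadHTML()` does not assume UTF-8 when the markup carries no charset declaration, so multi-byte Hebrew characters get decoded as single-byte characters. As far as I know, the `HTML-ENTITIES` pseudo-encoding has been deprecated since PHP 8.2, so on newer versions a variant using `mb_encode_numericentity()` may be preferable. The sketch below assumes the response body is UTF-8 (which the working comment implies) and uses a short made-up Hebrew string in place of the real page.

```php
<?php
// Assumption: $data holds UTF-8 bytes. The Hebrew text here is a stand-in,
// not content taken from the actual site.
$data = '<div class="show">שלום עולם</div>';

// Rewrite every code point from U+0080 upward as a numeric character
// reference, so the parser only ever sees plain ASCII.
$ascii = mb_encode_numericentity($data, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($ascii, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

// DOM strings are always returned as UTF-8, so the original text comes back intact.
$text = $xpath->query("//div[@class = 'show']")->item(0)->textContent;
echo $text, PHP_EOL;
```

Both approaches need the mbstring extension, and neither requires changing the HTTP request itself.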