【Question title】: Python regex ignore new line
【Posted】: 2016-03-08 05:13:10
【Question】:

My web page looks like this:

<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>
<td valign="top">

    <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
        <tr>
            <td colspan="2">
                <div align="center">
                <a href="/title/name.php" target="_blank">
                <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                </a>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
        </tr>
        <tr>
            <td><span class="style10">Cat1 :</span></td>
            <td>1st name</td>
        </tr>
        <tr>
            <td width="32%"><span class="style10">Cat2 :</span></td>
            <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
        </tr>
        <tr>
            <td><span class="style10">cat4 :</span></td>
            <td>Bla bla</td>
        </tr>
        <tr>
            <td><span class="style10">Cat3 :</span></td>
            <td>thirdName2</td>
        </tr>
    </table>

</td>

I want to use a Python regular expression to extract certain values from this page. After <div align="center">, I would like to get the href value "/title/name.php" and the img src "./movie/image.jpg", and from <h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1> I would like to get the text "Title - secondname".

I tried this: regex = 'class="main_tb3"*\n<a href="(.+?)" target="_blank">\n<img src="(.+?)"'

Please help.

【Comments】:

Tags: php python html regex beautifulsoup


【Solution 1】:

You can use the following regular expressions.

For the href value: <a href="(.*?)"

For the image source: <img src="(.*?)"

For the title: titleid=12">(.*?)<
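A minimal sketch of how these patterns might be applied with Python's `re` module. The `html` string is a shortened copy of the question's markup, and the variable names (`hrefs`, `srcs`, `titles`, `combined`, `pairs`) are assumptions made for illustration. The combined pattern uses `\s*` to step over the newline and indentation between tags, which the literal `\n` in the question's attempt could not do because each line of the page is indented.

```python
import re

# Shortened copy of the markup from the question (one table only).
html = '''
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
    <tr>
        <td colspan="2">
            <div align="center">
            <a href="/title/name.php" target="_blank">
            <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
            </a>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
    </tr>
</table>
'''

# The three patterns from this answer, applied one at a time.
hrefs = re.findall(r'<a href="(.*?)"', html)
srcs = re.findall(r'<img src="(.*?)"', html)
titles = re.findall(r'titleid=12">(.*?)<', html)

# One combined pattern: \s* matches the newlines and spaces between tags,
# so the href and src that follow <div align="center"> are captured together.
combined = re.compile(
    r'<div align="center">\s*'
    r'<a href="(.+?)" target="_blank">\s*'
    r'<img src="(.+?)"'
)
pairs = combined.findall(html)

print(hrefs)   # every <a href="..."> on the page, including the title link
print(srcs)
print(titles)
print(pairs)
```

Note that the first pattern matches every link in the markup, not only the one after `<div align="center">`, so the combined pattern is the closer fit when only that link is wanted.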

【Comments】:

    【Solution 2】:

    You will find it much simpler to install something like BeautifulSoup:

    from bs4 import BeautifulSoup
    
    html = """
    <td valign="top">
    
        <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
            <tr>
                <td colspan="2">
                    <div align="center">
                    <a href="/title/name.php" target="_blank">
                    <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                    </a>
                    </div>
                </td>
            </tr>
            <tr>
                <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
            </tr>
            <tr>
                <td><span class="style10">Cat1 :</span></td>
                <td>1st name</td>
            </tr>
            <tr>
                <td width="32%"><span class="style10">Cat2 :</span></td>
                <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
            </tr>
            <tr>
                <td><span class="style10">cat4 :</span></td>
                <td>Bla bla</td>
            </tr>
            <tr>
                <td><span class="style10">Cat3 :</span></td>
                <td>thirdName2</td>
            </tr>
        </table>
    
    </td>
    <td valign="top">
    
        <table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
            <tr>
                <td colspan="2">
                    <div align="center">
                    <a href="/title/name.php" target="_blank">
                    <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
                    </a>
                    </div>
                </td>
            </tr>
            <tr>
                <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
            </tr>
            <tr>
                <td><span class="style10">Cat1 :</span></td>
                <td>1st name</td>
            </tr>
            <tr>
                <td width="32%"><span class="style10">Cat2 :</span></td>
                <td width="68%"><b><i><a href="./secondname.php" target="_blank">secondname</a></i></b></td>
            </tr>
            <tr>
                <td><span class="style10">cat4 :</span></td>
                <td>Bla bla</td>
            </tr>
            <tr>
                <td><span class="style10">Cat3 :</span></td>
                <td>thirdName2</td>
            </tr>
        </table>
    
    </td>"""
    
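    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, "html.parser")
    
    for table in soup.find_all("table", class_="main_tb3"):
        print(table.find('a').get('href'))  # href of the first link in the table
        print(table.find('h1').text)        # text of the heading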
    

    For the HTML you provided, this prints the following:

    /title/name.php
    Title - secondname
    /title/name.php
    Title - secondname
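The question also asks for the image source, which the loop above does not print. A minimal sketch of one way to collect all three values per table, assuming the `bs4` package is installed; the shortened `html` string and the names `results`, `link`, `image`, and `heading` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (one table only).
html = """
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
    <tr>
        <td colspan="2">
            <div align="center">
            <a href="/title/name.php" target="_blank">
            <img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
            </a>
            </div>
        </td>
    </tr>
    <tr>
        <td colspan="2"><h1 align="center"><a href="./title.php?titleid=12">Title - secondname</a></h1></td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for table in soup.find_all("table", class_="main_tb3"):
    link = table.find("a")        # first link in the table; it wraps the image
    image = link.find("img")      # the <img> nested inside that link
    heading = table.find("h1")    # heading that holds the title text
    results.append(
        (link.get("href"), image.get("src"), heading.get_text(strip=True))
    )

for row in results:
    print(row)
```

This sketch assumes every matching table contains the link, image, and heading; on real pages it may be worth checking each `find` result for `None` before using it.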
    

    【Comments】:
