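【Question title】: HTML scraping in either batch or PowerShell [closed]
【Posted】: 2018-11-11 19:13:56
【Question】:

I need to scrape a website's HTML using the URL stored in a .url file, then find a certain line and grab every line below it up to a certain point. A sample of the HTML is shown below: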

</p><ul><li>(None)</li></ul><h2><span style="font-size:18px;">Authorized Administrators and Users</span></h2><pre><b>Authorized Administrators&#58;</b>
jim (you)
    password&#58; (blank/none)
bob
    password&#58; Littl3@birD
batman
    password&#58; 3ndur4N(e&amp;home
dab
    password&#58; captain

<b>Authorized Users&#58;</b>
bag
crab
oliver
james
scott
john
apple
</pre><h2><span style="font-size:18px;">Competition Guidelines</span></h2>

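I need to put all the authorized administrators into one txt file, the authorized users into another txt file, and both together into a third txt file. Can this be done with batch or PowerShell?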

【Comments】:

    Tags: html powershell batch-file web-scraping


    【Solution 1】:

    Here is my attempt at getting you what you want.

    $url        = '<THE URL TAKEN FROM THE .URL SHORTCUT FILE>'
    $outputPath = '<THE PATH WHERE YOU WANT THE CSV FILES TO BE CREATED>'
    
    # get the content of the web page
    $html = (Invoke-WebRequest -Uri $url).Content
    
    # load the assembly to de-entify the HTML content
    Add-Type -AssemblyName System.Web
    $html = [System.Web.HttpUtility]::HtmlDecode($html)
    
    # get the Authorized Admins block
    if ($html -match '(?s)<b>Authorized Administrators:</b>(.+?)<b>') {
        $adminblock = $matches[1].Trim()
        # inside this text block, get the admin usernames and passwords
        $admins = @()
        $regex = [regex] '(?m)^(?<name>.+)\s*password:\s+(?<password>.+)'
        $match = $regex.Match($adminblock)
        while ($match.Success) {
            $admins += [PSCustomObject]@{
                'Name'     = $($match.Groups['name'].Value -replace '\(you\)', '').Trim()
                'Type'     = 'Admin'
                # comment out this next property if you don't want passwords in the output
                'Password' = $match.Groups['password'].Value.Trim()    
            }
            $match = $match.NextMatch()
        } 
    
    } else {
        Write-Warning "Could not find 'Authorized Administrators' text block."
    }
    
    # get the Authorized Users block
    if ($html -match '(?s)<b>Authorized Users:</b>(.+?)</pre>') {
        $userblock = $matches[1].Trim()
        # inside this text block, get the authorized usernames
        $users = @()
        $regex = [regex] '(?m)(?<name>.+)'
        $match = $regex.Match($userblock)
        while ($match.Success) {
            $users += [PSCustomObject]@{
                'Name' = $match.Groups['name'].Value.Trim()
                'Type' = 'User'
            }
            $match = $match.NextMatch()
        } 
    } else {
        Write-Warning "Could not find 'Authorized Users' text block."
    }
    
    # write the csv files
    $admins | Export-Csv -Path $(Join-Path -Path $outputPath -ChildPath 'admins.csv') -NoTypeInformation -Force
    $users | Export-Csv -Path $(Join-Path -Path $outputPath -ChildPath 'users.csv') -NoTypeInformation -Force
    ($admins + $users) | Export-Csv -Path $(Join-Path -Path $outputPath -ChildPath 'adminsandusers.csv') -NoTypeInformation -Force
    

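The `$url` placeholder in the script above has to be filled in by hand. As a minimal sketch of automating that step, and assuming the shortcut is a standard Internet Shortcut file (an INI-style text file containing a `URL=` line), the address could be read like this. The function name `Get-UrlFromShortcutText` and the sample lines are hypothetical and not part of the original answer:

```powershell
# Hypothetical helper: return the value of the first URL= line in .url shortcut text.
function Get-UrlFromShortcutText {
    param([string[]] $Lines)
    foreach ($line in $Lines) {
        # Anchored at line start so a BASEURL= line is not picked up by mistake.
        if ($line -match '^\s*URL\s*=\s*(.+?)\s*$') { return $Matches[1] }
    }
}

# Made-up shortcut content; with a real file, read the lines with Get-Content instead.
$sample = '[InternetShortcut]', 'URL=https://example.com/readme', 'IconIndex=0'
$url = Get-UrlFromShortcutText -Lines $sample
$url
```

With a real shortcut, the output of `Get-Content -LiteralPath` on the .url file can be passed as `-Lines`.
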
    When it finishes, you will have three CSV files (their contents are shown below as tables):

    admins.csv

    Name   Type  Password      
    ----   ----  --------      
    jim    Admin (blank/none)  
    bob    Admin Littl3@birD   
    batman Admin 3ndur4N(e&home
    dab    Admin captain 
    

    users.csv

    Name   Type
    ----   ----
    bag    User
    crab   User
    oliver User
    james  User
    scott  User
    john   User
    apple  User
    

    adminsandusers.csv

    Name   Type  Password      
    ----   ----  --------      
    jim    Admin (blank/none)  
    bob    Admin Littl3@birD   
    batman Admin 3ndur4N(e&home
    dab    Admin captain       
    bag    User                
    crab   User                
    oliver User                
    james  User                
    scott  User                
    john   User                
    apple  User 
    

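The question asked for plain .txt files rather than CSV. A rough sketch of that variant, assuming objects shaped like the `$admins` and `$users` built above, writes one name per line with `Set-Content`. The sample values, the output folder and the file names are assumptions made for this example:

```powershell
# Stand-in data shaped like the $admins / $users objects built above (assumed sample values).
$admins = @(
    [pscustomobject]@{ Name = 'jim'; Type = 'Admin' }
    [pscustomobject]@{ Name = 'bob'; Type = 'Admin' }
)
$users = @(
    [pscustomobject]@{ Name = 'bag';  Type = 'User' }
    [pscustomobject]@{ Name = 'crab'; Type = 'User' }
)

# Hypothetical output folder under the temp directory.
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'scrape-demo'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

# One name per line in each file; the third file holds admins followed by users.
$admins.Name | Set-Content -Path (Join-Path $outDir 'admins.txt')
$users.Name  | Set-Content -Path (Join-Path $outDir 'users.txt')
@($admins.Name) + @($users.Name) | Set-Content -Path (Join-Path $outDir 'adminsandusers.txt')
```
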
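    【Comments】:

      【Solution 2】:

      I believe this answer demonstrates useful techniques, and I have verified that it works with the sample input within the stated constraints. If you disagree, please tell us (in words) so that the answer can be improved.

      Generally, as noted, it is better to use a dedicated HTML parser, but given the easily identifiable enclosing tags in the input (assuming they won't vary), you can get away with a regex-based solution.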

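To illustrate why the enclosing tags make a regex workable here, the sketch below captures everything between a known opening marker and the first `</pre>` that follows it. The input string is made up for the example. The lazy `(.+?)` stops at the nearest closing tag, whereas a greedy `(.+)` would run on to the last `</pre>` in the text:

```powershell
# Made-up input with a second <pre> block after the section of interest.
$html = "<pre><b>Authorized Users&#58;</b>`nbag`ncrab`n</pre><h2>Next</h2><pre>other</pre>"

# (?s) lets . match newlines; (.+?) is lazy, so the match ends at the first </pre>.
$m = [regex]::Match($html, '(?s)<b>Authorized Users&#58;</b>(.+?)</pre>')

# Split the captured text into lines and drop the empty ones.
$names = ($m.Groups[1].Value -split '\r?\n') -ne ''
$names
```
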
      Here is a regex-based PSv4+ solution, but note that it relies on the input containing whitespace (newlines, leading spaces) exactly as shown in your question:

      # $html is assumed to contain the input HTML text (can be a full document).
      $admins, $users = (
        # Split the HTML text into the sections of interest.
        $html -split
          '(?s)\A.*<b>Authorized Administrators&#58;</b>|<b>Authorized Users&#58;</b>' `
          -ne '' `
          -replace '(?s)<.*'
      ).ForEach({
        # Extract admin lines and user lines each, as an array.
        , ($_ -split '\r?\n' -ne '')
      })
      
      # Clean up the $admins array and transform the username-password pairs
      # into custom objects with .username and .password properties.
      $admins = $admins -split '\s+password&#58;\s+' -ne ''
      $i = 0;
      $admins = $admins.ForEach({
        if ($i++ % 2 -eq 0) { $co = [pscustomobject] @{ username = $_; password = '' } } 
        else { $co.password = $_; $co } 
      })
      
      # Create custom objects with the same structure for the users.
      $users = $users.ForEach({
        [pscustomobject] @{ username = $_; password = '' }
      })
      
      # Output to CSV files.
      $admins | Export-Csv admins.csv -NoTypeInformation
      $users | Export-Csv users.csv -NoTypeInformation
      @($admins) + @($users) | Export-Csv all.csv -NoTypeInformation
      

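      Since your question doesn't spell out the requirements, assumptions were made about the desired output format (and HTML entities such as &amp; are not decoded).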

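If decoded values are wanted, one option is to run the text through .NET's `[System.Net.WebUtility]::HtmlDecode()`. This is a small sketch using strings taken from the sample input; it is not part of the original answer:

```powershell
# Encoded strings as they appear in the sample HTML.
$rawLabel    = 'password&#58;'
$rawPassword = '3ndur4N(e&amp;home'

# HtmlDecode turns numeric and named entities back into plain characters.
$label    = [System.Net.WebUtility]::HtmlDecode($rawLabel)
$password = [System.Net.WebUtility]::HtmlDecode($rawPassword)

$label      # password:
$password   # 3ndur4N(e&home
```

If the whole `$html` string is decoded before splitting, the `&#58;` in the split patterns would have to become a plain `:`.
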
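      【Comments】:

        【Solution 3】:

        This is really ugly and very fragile. A good HTML parser would be a much better way to do this.

        However, presuming you don't have the resources for that, here is one way to get the data. If you really want to generate two more files [Admin & User], you can do that from this object...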

        # fake reading in a text file
        #    in real life, use Get-Content
        $InStuff = @'
        </p><ul><li>(None)</li></ul><h2><span style="font-size:18px;">Authorized Administrators and Users</span></h2><pre><b>Authorized Administrators&#58;</b>
        jim (you)
            password&#58; (blank/none)
        bob
            password&#58; Littl3@birD
        batman
            password&#58; 3ndur4N(e&amp;home
        dab
            password&#58; captain
        
        <b>Authorized Users&#58;</b>
        bag
        crab
        oliver
        james
        scott
        john
        apple
        </pre><h2><span style="font-size:18px;">Competition Guidelines</span></h2>
        '@ -split '\r?\n'
        
        $CleanedInStuff = $InStuff.
            Where({
                $_ -notmatch '^</' -and
                $_ -notmatch '^ ' -and
                $_
                })
        
        $UserType = 'Administrator'
        $UserInfo = foreach ($CIS_Item in $CleanedInStuff)
            {
            if ($CIS_Item.StartsWith('<b>'))
                {
                $UserType = 'User'
                continue
                }
            [PSCustomObject]@{
                Name = $CIS_Item.Trim()
                UserType = $UserType
                }
            }
        
        # on screen
        $UserInfo
        
        # to CSV    
        $UserInfo |
            Export-Csv -LiteralPath "$env:TEMP\LandonBB.csv" -NoTypeInformation
        

        On-screen output...

        Name      UserType     
        ----      --------     
        jim (you) Administrator
        bob       Administrator
        batman    Administrator
        dab       Administrator
        bag       User         
        crab      User         
        oliver    User         
        james     User         
        scott     User         
        john      User         
        apple     User
        

        CSV file content...

        "Name","UserType"
        "jim (you)","Administrator"
        "bob","Administrator"
        "batman","Administrator"
        "dab","Administrator"
        "bag","User"
        "crab","User"
        "oliver","User"
        "james","User"
        "scott","User"
        "john","User"
        "apple","User"
        

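For the two extra files mentioned earlier, the `$UserInfo` objects can be grouped by their `UserType` property and each group exported separately. The sketch below uses a small stand-in for `$UserInfo`; the output folder and file names are assumptions made for this example:

```powershell
# Stand-in for the $UserInfo collection built above (a subset of the sample data).
$UserInfo = @(
    [PSCustomObject]@{ Name = 'jim (you)'; UserType = 'Administrator' }
    [PSCustomObject]@{ Name = 'bob';       UserType = 'Administrator' }
    [PSCustomObject]@{ Name = 'bag';       UserType = 'User' }
)

# Hypothetical output folder under the temp directory.
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'userinfo-demo'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

# Write one CSV per distinct UserType value (Administrator.csv, User.csv).
foreach ($group in ($UserInfo | Group-Object -Property UserType)) {
    $path = Join-Path $outDir ('{0}.csv' -f $group.Name)
    $group.Group | Export-Csv -LiteralPath $path -NoTypeInformation
}
```
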
        【Comments】:
