Trying to pull information from webpage (answers)

【Question title】: Trying to pull information from webpage
【Posted】: 2018-10-16 06:33:56
【Question description】:

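I am trying to pull data from a website. In my example, I am searching Armorgames.com for the search term idle. From the results I want to extract the name of each game and put it into a CSV file for later use. My code: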

$SearchResult = Invoke-WebRequest 'http://armorgames.com/search?type=games&q=idle' 
($SearchResult.ParsedHtml.getElementsByTagName('H5') | Where { $_.pathname -like '/play*'})

Unfortunately, this outputs nothing. I can view the property names using:

$SearchResult.ParsedHtml.getElementsByTagName('H5')

Using the tag 'a', I can find the games whose pathname contains 'play'. However, I cannot filter the results and then write them to a file.

【Question discussion】:

Tags: powershell html-object powershell-v6.0

【Solution 1】:

$SearchResult.ParsedHtml.getElementsByTagName('a') | Where-Object -Property pathname -Like 'play/*'

# select property pathname
$SearchResult.ParsedHtml.getElementsByTagName('a') | 
    Where-Object -Property pathname -Like 'play/*' |
        Select-Object -Property pathname

# select property title
$SearchResult.ParsedHtml.getElementsByTagName('a') | 
    Where-Object -Property pathname -Like 'play/*' |
        Select-Object -Property title -Unique
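
The question also asks for the filtered result to be written to a file. As a rough sketch, the same filter can be ended with Export-Csv. So that it runs without the network and without ParsedHtml (which is only available in Windows PowerShell), the sample below uses hand-made stand-in objects in place of the real anchor elements. The sample paths, the variable names and the output file name are assumptions for illustration, not data from the site.

```powershell
# Stand-in objects mimicking the anchor elements that
# $SearchResult.ParsedHtml.getElementsByTagName('a') would return.
# The pathname and title values are made-up sample data.
$anchors = @(
    [pscustomobject]@{ pathname = 'play/1001/idle-sword'; title = 'Idle Sword' }
    [pscustomobject]@{ pathname = 'play/1002/zombidle';   title = 'Zombidle' }
    [pscustomobject]@{ pathname = 'play/1001/idle-sword'; title = 'Idle Sword' }  # duplicate link
    [pscustomobject]@{ pathname = 'category/idle-games';  title = 'Idle Games' }  # not a game page
)

# Same filter as above, keeping each title only once.
$games = $anchors |
    Where-Object -Property pathname -Like 'play/*' |
    Select-Object -Property title -Unique

# Hypothetical output location; -NoTypeInformation suppresses the
# "#TYPE ..." header line that Windows PowerShell would otherwise write.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'idle-games.csv'
$games | Export-Csv -Path $csvPath -NoTypeInformation
```

With real data, `$anchors` would be replaced by the `getElementsByTagName('a')` call from the answer.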

【Discussion】:

【Solution 2】:

Web-scraping code compatible with PowerShell Core (v6.0), which should also work in Windows PowerShell. It relies on a regex with the -match operator, because the ParsedHtml property is not available on Core:

$SearchResult = Invoke-WebRequest 'http://armorgames.com/search?type=games&q=idle'
$GameNames = ($SearchResult.Content.split('<') | 
    where {$_ -match '^a href.*play.*\ title=.*>[A-Z].*'}) -replace '.*>'
$GameNames

The output is as follows:

Artist Idle
Hero Simulator: Idle Adventures
Idle Farmer
Idle Online Universe
Idle Sword
Idle Web Tycoon
Legendary Journey Idle
NGU IDLE
Religious Idle
Zombidle

Now that you have an array of the names you want, you should be able to create a CSV with whatever other information you need.
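
One possible way to turn such an array of strings into a CSV is sketched below. It assumes a single column named Name and a hypothetical file in the temp directory. Neither comes from the original answer, and the three names are simply copied from the output above so the snippet runs on its own.

```powershell
# Sample names taken from the output above; in practice this would be
# the $GameNames array produced by the scraping code.
$GameNames = 'Artist Idle', 'Idle Farmer', 'Zombidle'

# Wrap each string in an object so Export-Csv writes a "Name" column.
# Piping bare strings would export their Length property instead.
$rows = $GameNames | ForEach-Object { [pscustomobject]@{ Name = $_ } }

# Hypothetical output location.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'game-names.csv'
$rows | Export-Csv -Path $csvPath -NoTypeInformation

# Show the resulting file.
Get-Content $csvPath
```

Further columns can be added by putting more keys into the hashtable passed to `[pscustomobject]`.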

【Discussion】: