【Question title】: Extract long attribute value with multiple lines
【Posted】: 2020-06-18 04:24:30
【Question】:

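I want to scrape the text of a tooltip from a website. Its HTML is embedded as the value of the oldtitle attribute. I tried both CSS and XPath selectors, e.g. //span[@class="spanTip underLine"]/@oldtitle, but without success. Is this because the attribute value is split across multiple lines? Is there a way to do this with selectors?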

<tr><td>
<span class="spanTip underLine" style="cursor:default;" data-hasqtip="true" oldtitle="<table class='qtipTable'>
<tbody>
<tr>
<td><span class='label'>A:</span>1</td>
<td><span class='label'>B:</span>2</td>
</tr>
<tr>
<td><span class='label'>C:</span>3</td>
<td><span class='label'>D:</span>4</td>
</tr>
<tr>
<td><span class='label'>E:</span></td>

</tr>
<tr>
<td><span class='label'>F:</span>5</td>
<td><span class='label'>G:</span>6</td>
</tr>
</tbody>
</table>" title="" aria-describedby="qtip-5">Item with Tooltip</span>
</td>
</tr>

【Comments】:

  • What software are you using to scrape?
  • I'm using Scrapy. :)

Tags: xpath web-scraping css-selectors


【Solution 1】:

Here is a solution using Python's BeautifulSoup. To get the HTML stored in the oldtitle= attribute:

from bs4 import BeautifulSoup

txt = '''<tr><td>
<span class="spanTip underLine" style="cursor:default;" data-hasqtip="true" oldtitle="<table class='qtipTable'>
<tbody>
<tr>
<td><span class='label'>A:</span>1</td>
<td><span class='label'>B:</span>2</td>
</tr>
<tr>
<td><span class='label'>C:</span>3</td>
<td><span class='label'>D:</span>4</td>
</tr>
<tr>
<td><span class='label'>E:</span></td>

</tr>
<tr>
<td><span class='label'>F:</span>5</td>
<td><span class='label'>G:</span>6</td>
</tr>
</tbody>
</table>" title="" aria-describedby="qtip-5">Item with Tooltip</span>
</td>
</tr>'''

# Parse the outer document.
soup = BeautifulSoup(txt, 'html.parser')

# The attribute value is itself HTML, so parse it a second time.
inner_soup = BeautifulSoup(soup.select_one('span.spanTip[oldtitle]')['oldtitle'], 'html.parser')

print(inner_soup)

Prints:

<table class="qtipTable">
<tbody>
<tr>
<td><span class="label">A:</span>1</td>
<td><span class="label">B:</span>2</td>
</tr>
<tr>
<td><span class="label">C:</span>3</td>
<td><span class="label">D:</span>4</td>
</tr>
<tr>
<td><span class="label">E:</span></td>
</tr>
<tr>
<td><span class="label">F:</span>5</td>
<td><span class="label">G:</span>6</td>
</tr>
</tbody>
</table>

You can then use inner_soup as usual, for example:

for tr in inner_soup.select('table tr'):
    print([td.get_text(strip=True) for td in tr.select('td')])

Prints:

['A:1', 'B:2']
['C:3', 'D:4']
['E:']
['F:5', 'G:6']

【Comments】:

  • I should have thought of using BeautifulSoup! It worked!