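【Posted】:2026-02-15 03:45:02
【Question】:
The data sample contains words (orth) and categories (prop key="sense:ukb:unitsstr"). I want to extract pairs of orth and prop key="sense:ukb:unitsstr", with each pair as one row of a data frame. However, some words may have no prop data at all, like the last two records. I want those to be treated as NA.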
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
<chunk id="ch1" type="p">
<sentence id="s1">
<tok>
<orth>ktoś</orth>
<lex disamb="1"><base>ktoś</base><ctag>subst:sg:nom:m1</ctag></lex>
<prop key="polarity">0</prop>
<prop key="sense:ukb:syns_id">11511</prop>
<prop key="sense:ukb:syns_rank">11511/128.6156573170 243094/95.1234745165</prop>
<prop key="sense:ukb:unitsstr">ktoś.2(15:os)</prop>
</tok>
<tok>
<orth>go</orth>
<lex disamb="1"><base>go</base><ctag>subst:sg:nom:n</ctag></lex>
<prop key="polarity">0</prop>
<prop key="sense:ukb:syns_id">47620</prop>
<prop key="sense:ukb:syns_rank">47620/108.9010709884 234524/90.4766173102</prop>
<prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
</tok>
<tok>
<orth>krokodyl</orth>
<lex disamb="1"><base>krokodyl</base><ctag>subst:sg:nom:m2</ctag></lex>
<prop key="polarity">0</prop>
<prop key="sense:ukb:syns_id">12879</prop>
<prop key="sense:ukb:syns_rank">12879/40.5162836207 254796/35.9915058408 7063215/33.3657479890 7063214/26.6770712118 7063217/25.5775738130 7063213/23.6851347572 7063212/23.6300037076</prop>
<prop key="sense:ukb:unitsstr">krokodyl.1(21:zw) krokodyl_właściwy.1(21:zw)</prop>
</tok>
<tok>
<orth>się</orth>
<lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
</tok>
<tok>
<orth>ja</orth>
<lex disamb="1"><base>ja</base><ctag>ppron12:sg:nom:m1:pri</ctag></lex>
</tok>
I thought I could get this with a few XPath lines, but I'm stuck:
library(XML)
# Parse into internal (C-level) nodes so that XPath queries work
doc <- xmlTreeParse("statsUCZESTxfreqkeyword xml.txt", useInternalNodes = TRUE)
top <- xmlRoot(doc)
# Explore the structure: chunkList > chunk > sentence > tok
xmlName(top)
names(top)
names(top[[1]])
sent <- top[[1]][["sentence"]]
names(sent)
names(sent[[1]])
# Values of every child of the first token, then of every token
xmlSApply(sent[[1]], xmlValue)
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))
# This selects only the matching <prop> nodes, so tokens without one are lost
nodes <- getNodeSet(top, "//prop[@key='sense:ukb:unitsstr']")
lapply(nodes, function(x) xmlSApply(x, xmlValue)) # 152 words have prop
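One possible direction, offered as a minimal sketch rather than a definitive answer: iterate over the tok nodes instead of the prop nodes, and query each token with a relative XPath, so that a token without a matching prop still produces a row. The helper name first_or_na, the variable names, and the inline two-token sample (a trimmed, closed copy of the data above) are assumptions made for illustration; the real file would be parsed with xmlParse() on its path instead.

```r
library(XML)

# Trimmed-down, well-formed copy of the sample (illustrative only):
# "go" has a sense:ukb:unitsstr prop, "ja" has none.
xml_text <- '<chunkList><chunk id="ch1" type="p"><sentence id="s1">
<tok><orth>go</orth><prop key="polarity">0</prop>
<prop key="sense:ukb:unitsstr">go.1(2:czy)</prop></tok>
<tok><orth>ja</orth></tok>
</sentence></chunk></chunkList>'

doc <- xmlParse(xml_text, asText = TRUE)

# One node per token, so the row count equals the token count
toks <- getNodeSet(doc, "//tok")

# Hypothetical helper: first match of a relative XPath under `node`,
# or NA when the token has no such child
first_or_na <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

result <- data.frame(
  orth     = vapply(toks, first_or_na, character(1), path = "./orth"),
  unitsstr = vapply(toks, first_or_na, character(1),
                    path = "./prop[@key='sense:ukb:unitsstr']"),
  stringsAsFactors = FALSE
)
print(result)
```

The key design choice is the leading "./" in both paths: it keeps the search inside the current tok, whereas a path starting with "//" would search the whole document again and return the same matches for every token.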
【Discussion】:
-
Unfortunately there isn't one, or I don't know how to use it.