【Title】: Extracting hashtags from content in RapidMiner
【Posted】: 2016-03-16 00:00:17
【Description】:
【Discussion】:
Tags:
regex
hashtag
rapidminer
【Solution 1】:
Both of your files are XML, but not in the standard RapidMiner format; one of them looks a bit like MS Word output, right?
Either way, feel free to repost the data in another format, but I think the following should help.
First, make sure you have the Text Processing extension for RapidMiner installed.
Next, use Process Documents from Data and place these three operators inside it: Transform Cases, Cut Document, and Combine Documents. For each example in your CSV, they lowercase the text, extract just the hashtags from it, and then combine the extracted pieces into a new document (in case a single piece of text contains more than one hashtag).
The regular expression I used is (?i)#[0-9a-z_]*. It is just a quick one, but it should cover all the cases I can think of.
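If you want to sanity-check that regex outside RapidMiner, here is a minimal Python sketch of what the Transform Cases + Cut Document pair does (the sample text is made up for illustration):

```python
import re

# The pattern from the answer; the inline (?i) flag makes it
# case-insensitive, so it matches #Tag as well as #tag.
HASHTAG = re.compile(r"(?i)#[0-9a-z_]*")

text = "Loving #RapidMiner for #Text_Mining today"
# Transform Cases equivalent: lowercase the document first.
tags = HASHTAG.findall(text.lower())
print(tags)  # ['#rapidminer', '#text_mining']
```

Note that the trailing `*` quantifier also lets the pattern match a bare `#` with no word after it; switching to `+` would exclude that edge case if it matters for your data.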
The output of this process is a word-list count over the whole corpus, telling you how many times each hashtag occurs across the documents. That should get you started.
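The word list produced with the Term Occurrences vector setting is conceptually a per-term count across all documents. A rough Python equivalent (sample documents invented for illustration) would be:

```python
import re
from collections import Counter

HASHTAG = re.compile(r"(?i)#[0-9a-z_]*")

# Stand-ins for the rows of the CSV's text column.
docs = [
    "Try #rapidminer #regex",
    "More #RapidMiner tips",
]

# Lowercase each document, extract hashtags, and tally
# occurrences over the whole corpus.
counts = Counter(tag for d in docs for tag in HASHTAG.findall(d.lower()))
print(counts)  # Counter({'#rapidminer': 2, '#regex': 1})
```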
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.0.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="136">
<parameter key="csv_file" value="myCSV"/>
<parameter key="column_separators" value=","/>
<list key="annotations"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="myTextColum.true.text.regular"/>
<parameter key="1" value="anotherColumn.true.nominal.regular"/>
</list>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="179" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:transform_cases" compatibility="7.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34">
<description align="center" color="transparent" colored="false" width="126">Makes everything lowercase</description>
</operator>
<operator activated="true" class="text:cut_document" compatibility="7.0.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries">
<parameter key="hashtags" value="(?i)#[0-9a-z_]*"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Gets rid of everything but the hashtags</description>
</operator>
<operator activated="true" class="text:combine_documents" compatibility="7.0.000" expanded="true" height="82" name="Combine Documents" width="90" x="313" y="34"/>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Cut Document" to_port="document"/>
<connect from_op="Cut Document" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
<connect from_op="Combine Documents" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="21"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>