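[Posted]: 2017-06-29 15:05:33
[Question]:
How can I load query results into a DataFrame whose columns reflect the document hierarchy? For example, columns like these: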
type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|
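One way to get columns like the ones listed above is to flatten each hit's nested `_source` so that only the leaf field names remain. The following is a minimal sketch, not a definitive implementation: the function names `flatten_hit` and `hits_to_dataframe` and the sample hit are made up for illustration, and it assumes leaf names are unique across groups (true for the mapping in this question) and that pandas is installed when the DataFrame helper is called.

```python
def flatten_hit(hit):
    """Turn one Elasticsearch hit into a flat row keyed by leaf field name."""
    row = {"type": hit.get("_type")}

    def walk(node):
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value)       # descend into CourseInfo, SentimentInfo, ...
            else:
                row[key] = value  # keep only the leaf name as the column

    walk(hit.get("_source", {}))
    return row


def hits_to_dataframe(hits):
    """Build a DataFrame from an iterable of hits (requires pandas)."""
    import pandas as pd  # imported lazily so flatten_hit works without pandas

    return pd.DataFrame([flatten_hit(h) for h in hits])


# Made-up sample hit, shaped like the mapping in the question.
sample_hit = {
    "_type": "atarnotes",
    "_id": "1",
    "_source": {
        "postDate": "2017-01-01",
        "discussionTitle": "Nursing offers",
        "CourseInfo": {"courses": ["Nursing"], "subjectKeywords": ["nursing"]},
        "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.25},
        "UniversityInfo": {"universities": ["Monash"], "universityKeywords": ["monash"]},
    },
}

print(flatten_hit(sample_hit))
```

If two groups ever shared a leaf name, the later one would overwrite the earlier one, so in that case it would be safer to keep the full path (for example `CourseInfo.courses`) as the column name.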
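I have an Elasticsearch instance containing about 1,000,000 JSON documents.
I want to use this dataset for natural language processing (NLP) in Python.
Can someone help me understand how to get the data from Elasticsearch into Python, and how to write data from Python back to Elasticsearch?
Any help is much appreciated: I cannot run any NLP on this dataset until I can connect it to Python.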
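A rough sketch of the round trip, assuming the official `elasticsearch` Python client (elasticsearch-py) rather than pyelasticsearch: `helpers.scan` streams every document through the scroll API, and `helpers.bulk` sends partial updates back. The host URL, the helper function names, and the `ProcessInfo` field used in the usage comment are assumptions for illustration. The `_type` key only applies to older clusters that still have mapping types, which this one appears to be since its mapping uses `string` and `not_analyzed`.

```python
def iter_hits(es, index, doc_type=None, page_size=1000):
    """Stream all documents of an index (optionally one type) via the scroll API."""
    from elasticsearch import helpers  # lazy import: only needed when talking to a cluster

    kwargs = {"index": index, "size": page_size}
    if doc_type is not None:
        kwargs["doc_type"] = doc_type  # only meaningful on clusters with mapping types
    return helpers.scan(es, query={"query": {"match_all": {}}}, **kwargs)


def build_update_actions(index, doc_type, updates):
    """Yield bulk 'update' actions from (doc_id, partial_document) pairs."""
    for doc_id, fields in updates:
        yield {
            "_op_type": "update",
            "_index": index,
            "_type": doc_type,
            "_id": doc_id,
            "doc": fields,  # merged into the existing document
        }


def write_back(es, index, doc_type, updates):
    """Send the partial updates to Elasticsearch in bulk."""
    from elasticsearch import helpers

    return helpers.bulk(es, build_update_actions(index, doc_type, updates))


# Usage sketch (needs a running cluster; the URL is an assumption):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# for hit in iter_hits(es, "educationforumsenriched2", "atarnotes"):
#     print(hit["_id"], hit["_source"].get("postText"))
# write_back(es, "educationforumsenriched2", "atarnotes",
#            [("some-id", {"ProcessInfo": {"processes": ["Offers"]}})])
```

With around a million documents, streaming with `scan` and writing with `bulk` avoids holding everything in one request; the hits can be processed one by one or collected in batches before building a DataFrame.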
The structure (mapping) of the Elasticsearch index is shown below.
I want to add a new entry to the hierarchy, just like "University Info", called "Process Info".
This new entry will index the dataset against a set of keywords that I supply. Just as with "universityKeywords", each JSON document should store the set of keywords that triggered its tags.
I want to tag the dataset with "Process Info": each JSON document gets up to four tags or categories (Applications, Offers, Enrollment, Requirements), based on keywords found in the document's post title and post text.
{
  "educationforumsenriched2": {
    "mappings": {
      "whirlpool": {
        "properties": {
          "CourseInfo": {
            "properties": {
              "courses": {
                "type": "string",
                "index": "not_analyzed"
              },
              "subjectKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "SentimentInfo": {
            "properties": {
              "SentiStrength": {
                "type": "float"
              },
              "SentiWordNet": {
                "type": "float"
              }
            }
          },
          "UniversityInfo": {
            "properties": {
              "universities": {
                "type": "string",
                "index": "not_analyzed"
              },
              "universityKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "postDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "postID": {
            "type": "integer"
          },
          "postText": {
            "type": "string"
          },
          "references": {
            "type": "string"
          },
          "threadID": {
            "type": "integer"
          },
          "threadTitle": {
            "type": "string"
          }
        }
      },
      "atarnotes": {
        "properties": {
          "CourseInfo": {
            "properties": {
              "courses": {
                "type": "string",
                "index": "not_analyzed"
              },
              "subjectKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "SentimentInfo": {
            "properties": {
              "SentiStrength": {
                "type": "float"
              },
              "SentiWordNet": {
                "type": "float"
              }
            }
          },
          "UniversityInfo": {
            "properties": {
              "universities": {
                "type": "string",
                "index": "not_analyzed"
              },
              "universityKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "discussionTitle": {
            "type": "string"
          },
          "postDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "postID": {
            "type": "integer"
          },
          "postText": {
            "type": "string"
          },
          "query": {
            "properties": {
              "match_all": {
                "type": "object"
              }
            }
          },
          "threadID": {
            "type": "integer"
          },
          "threadTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}
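If the new "Process Info" group should sit next to `UniversityInfo` in this mapping, one possible approach is to add it to both types with the put-mapping API before writing any tags. This is only a sketch: the field names `ProcessInfo`, `processes`, and `processKeywords` are hypothetical (chosen to mirror `UniversityInfo`), and the `string` / `not_analyzed` settings simply copy the existing mapping, so they would need to become `keyword` on newer Elasticsearch versions.

```python
# Hypothetical mapping for the new group, mirroring UniversityInfo.
PROCESS_INFO_MAPPING = {
    "properties": {
        "ProcessInfo": {
            "properties": {
                "processes": {"type": "string", "index": "not_analyzed"},
                "processKeywords": {"type": "string", "index": "not_analyzed"},
            }
        }
    }
}


def add_process_info_mapping(es, index, doc_types):
    """Add the ProcessInfo fields to each type of the index (needs a live client)."""
    for doc_type in doc_types:
        es.indices.put_mapping(index=index, doc_type=doc_type, body=PROCESS_INFO_MAPPING)


# Usage sketch, assuming `es` is an elasticsearch-py client for this cluster:
# add_process_info_mapping(es, "educationforumsenriched2", ["whirlpool", "atarnotes"])
```

Defining the fields explicitly keeps the tags as exact, unanalyzed values, the same way `universities` and `universityKeywords` are stored.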
This is the code I used to create the Process Info tags in Java; I want to do the same thing in Python:
// Requires: import java.util.*;
// Maps each process category to the keywords that trigger it.
Map<String, List<String>> processMap = new HashMap<>();
processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement", "requirements", "require")));
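A possible Python counterpart of the Java map above, as a minimal sketch. The keyword lists are copied from the question; the function name `tag_process_info`, the output keys `processes` and `processKeywords`, and the choice of case-insensitive whole-word matching are assumptions, since the question does not show how the Java code matched keywords. The title argument would be `threadTitle` or `discussionTitle` and the text argument `postText`.

```python
import re

# Same categories and keywords as the Java processMap.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}

_WORD_RE = re.compile(r"[a-z]+")


def tag_process_info(title, text):
    """Return the matching categories and the keywords that triggered them."""
    combined = "{} {}".format(title or "", text or "").lower()
    words = set(_WORD_RE.findall(combined))  # whole-word, case-insensitive matching
    processes, keywords = [], []
    for label, label_keywords in PROCESS_KEYWORDS.items():
        matched = [kw for kw in label_keywords if kw in words]
        if matched:
            processes.append(label)
            keywords.extend(matched)
    return {"processes": processes, "processKeywords": keywords}


example = tag_process_info("Applying for nursing", "I got an offer but have not enrolled yet.")
print(example)
```

The returned dictionary could be stored under a new group in each document, so that the matched keywords are kept alongside the tags in the same way `universityKeywords` accompanies `universities`.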
[Comments]:
-
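pyelasticsearch? I have installed the package, but I can't figure out how to get this dataset into Python. A small example would be very helpful. Here is the mapping structure of my Elasticsearch index: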
-
"educationforumsenriched2": { "mappings": { "whirlpool": { "properties": { "CourseInfo": {..
-
What about an ingest pipeline?
Tags: python elasticsearch