Getting data into Python from ElasticSearch-JSON files

【Question title】: Getting data into Python from ElasticSearch-JSON files
【Posted】: 2017-06-29 15:05:33
【Question description】:

How can I load the query results into a data frame whose columns preserve the hierarchy? Columns like these:

type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|

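I have an Elasticsearch instance containing about 1,000,000 JSON documents. I want to use this dataset for natural language processing (NLP) in Python. Can someone help me understand how to get the data from Elasticsearch into Python, and how to write data from Python back to Elasticsearch? Any help is much appreciated: I cannot perform any NLP on the dataset because I cannot get it connected to Python.

I want to add a new entry to the hierarchy, alongside "UniversityInfo", called "ProcessInfo". This new entry would index the dataset according to a set of keywords I supply: just like "universityKeywords", each JSON document should store the set of keywords used for tagging. I want to tag the dataset with "ProcessInfo" by putting four labels or categories on the JSON documents (Applications, Offers, Enrollment, Requirements), based on keywords found in the post title and post text.

The index structure in Elasticsearch looks like this: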

    {
       "educationforumsenriched2": {
          "mappings": {
             "whirlpool": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "references": {
                      "type": "string"
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             },
             "atarnotes": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "discussionTitle": {
                      "type": "string"
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "query": {
                      "properties": {
                         "match_all": {
                            "type": "object"
                         }
                      }
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             }
          }
       }
    }

This is the code I used in Java to create the process-info labels; I want to do the same thing in Python:

        processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
        processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
        processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling","enroled","enroll", "enrolment", "enrollment","enrol","enrolled")));
        processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement","requirements", "require")));
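
A minimal Python sketch of the same keyword map, assuming the labels are derived from the `threadTitle` and `postText` fields of the mapping above. The function name `tag_process_info` and the whole-word, case-insensitive matching rule are illustrative assumptions; the Java snippet does not say how the keywords are matched.

```python
import re

# Same keyword lists as the Java processMap above
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}

_WORD_RE = re.compile(r"[a-z]+")


def tag_process_info(doc):
    """Return the sorted process labels whose keywords occur in the post title or text."""
    # Assumption: title and text live in "threadTitle" and "postText"; missing fields count as empty
    text = " ".join(str(doc.get(field) or "") for field in ("threadTitle", "postText"))
    words = set(_WORD_RE.findall(text.lower()))
    return sorted(
        label for label, keywords in PROCESS_KEYWORDS.items()
        if words.intersection(keywords)
    )


# Made-up post, for illustration only
post = {"threadTitle": "Offer received", "postText": "I applied last month and will enrol soon."}
print(tag_process_info(post))
```

Matching on whole words avoids tagging a post because a keyword happens to be a substring of a longer, unrelated word; a stemmer would be an alternative if the keyword lists grow.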

【Comments】:

Python Elasticsearch Client?
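pyelasticsearch? I have installed the packages, but I can't figure out how to get this dataset into Python. A small example would be very helpful. Here is the mapping structure of my Elasticsearch index: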
"educationforumsenriched2": { "mappings": { "whirlpool": { "properties": { "CourseInfo": {..
What about an ingest pipeline?

Tags: python elasticsearch

【Solution 1】:

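Using the elasticsearch python client: once you have established a successful connection, you only need to provide the DSL query and the index you want to search in order to retrieve the required information. For example, if you have a query: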

GET educationforumsenriched2/_search
{
    "query": {
        "match" : {
            "CourseInfo.subjectKeywords" : "foo"
        }
    }
}

The Python equivalent is:

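from elasticsearch import Elasticsearch

# Connect to the cluster; hosts must be a URL or a list, not a bare dict
# (many other settings are available if using https and so on)
es = Elasticsearch(["http://localhost:9200"])

# Same query DSL as above, expressed as a Python dict
query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)
hits = res["hits"]["hits"]  # each hit keeps the original document under "_source"

# do some processing; new_doc_after_processing is a placeholder for the resulting dict

# create new document in ES
# create() also needs a doc_type and/or an id depending on the client version, so index()
# is used here and ES generates the id; "whirlpool" is one of the types in the mapping
# above (drop doc_type on clusters without mapping types)
es.index(index="educationforumsenriched2", doc_type="whirlpool", body=new_doc_after_processing)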
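
For roughly a million documents, a single `search` call will not return everything, and writing documents back one at a time is slow. A sketch of one way to stream the whole index and write a new field back in bulk, using `helpers.scan` and `helpers.bulk` from the same client library. The names `build_update_actions` and `tag_all`, the `label_fn` callback, and the target field `ProcessInfo.processKeywords` are assumptions made for illustration, not part of the original answer.

```python
def build_update_actions(hits, label_fn, field="ProcessInfo"):
    """Yield one bulk 'update' action per hit, storing label_fn(source) under `field`."""
    for hit in hits:
        action = {
            "_op_type": "update",  # partial update: existing fields are left untouched
            "_index": hit["_index"],
            "_id": hit["_id"],
            # Assumption: labels are stored as ProcessInfo.processKeywords
            "doc": {field: {"processKeywords": label_fn(hit["_source"])}},
        }
        if "_type" in hit:  # mapping types only exist on older clusters
            action["_type"] = hit["_type"]
        yield action


def tag_all(es, index, label_fn):
    """Stream every document in `index` and write the labels back in bulk."""
    # Imported lazily so build_update_actions can be tried without the client installed
    from elasticsearch import helpers

    hits = helpers.scan(es, index=index, query={"query": {"match_all": {}}})
    return helpers.bulk(es, build_update_actions(hits, label_fn))


# Dry run on a made-up hit, no cluster needed
fake_hit = {"_index": "educationforumsenriched2", "_type": "whirlpool",
            "_id": "1", "_source": {"postText": "I applied yesterday"}}
for action in build_update_actions([fake_hit], lambda source: ["Applications"]):
    print(action)
```

With a live connection, `tag_all(es, "educationforumsenriched2", my_label_function)` would run the full pass, where `my_label_function` takes a document's `_source` dict and returns a list of labels.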

Edit: just a thought, but if your processing is not too complex, you could also consider building an ingest pipeline.

【Discussion】:

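Thank you. But how can I get the results from ES into a structure, such as a data frame, with columns for the fields listed in the question edit?
@BAstu What kind of data frame are we talking about: a pandas data frame? A Spark data frame? Maybe this question can help: stackoverflow.com/questions/25186148/…
Yes, a Pandas data frame. Thank you.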
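
For the pandas case, a rough sketch of flattening search hits into rows whose columns match the list in the question. It assumes each hit has the usual `_source` dict and, on older clusters, a `_type`. The helper names (`flatten`, `hits_to_rows`, `hits_to_dataframe`) and the `keep_parents` switch are made up for this example. `pandas.json_normalize` is a built-in alternative that produces dotted column names such as `CourseInfo.courses`.

```python
def flatten(source, prefix="", keep_parents=False):
    """Flatten nested dicts; column names are leaf names, or dotted paths if keep_parents."""
    row = {}
    for key, value in source.items():
        name = prefix + key if keep_parents else key
        if isinstance(value, dict):
            row.update(flatten(value, name + ".", keep_parents))
        else:
            row[name] = value
    return row


def hits_to_rows(hits, keep_parents=False):
    """Turn Elasticsearch hits into flat dicts, one per document."""
    rows = []
    for hit in hits:
        row = {"type": hit.get("_type")}
        row.update(flatten(hit["_source"], keep_parents=keep_parents))
        rows.append(row)
    return rows


def hits_to_dataframe(hits, keep_parents=False):
    """Build a DataFrame from the flat rows (assumes pandas is installed)."""
    import pandas as pd

    return pd.DataFrame(hits_to_rows(hits, keep_parents))


# Made-up hit shaped like the "atarnotes" mapping, for illustration only
sample_hits = [{
    "_type": "atarnotes",
    "_source": {
        "postDate": "2017-06-29",
        "discussionTitle": "Example thread",
        "CourseInfo": {"courses": ["example course"], "subjectKeywords": ["example"]},
        "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.25},
        "UniversityInfo": {"universities": ["example uni"], "universityKeywords": ["uni"]},
    },
}]
print(list(hits_to_rows(sample_hits)[0]))
```

Leaf names give the column list from the question directly; passing `keep_parents=True` keeps the parent object in the column name, which avoids collisions if two nested objects ever share a field name.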