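【Question title】: Generating synonyms or similar words using BERT word embeddings
【Posted】: 2020-06-18 04:54:21
【Question】:

I want to generate synonyms or similar words using BERT word embeddings. I have started using BERT for this. The later software integration has to be done in Java, so I chose easy-bert (https://github.com/robrua/easy-bert).

It seems I can get word embeddings this way: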

try (Bert bert = Bert.load(new File("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12"))) {
    // One vector per token of a single sequence
    float[][] embedding = bert.embedTokens("A sequence");
    // One matrix of token vectors per input sequence
    float[][][] embeddings = bert.embedTokens("Multiple", "Sequences");
}

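Do you know how I can get similar words from these word embeddings?

Thanks for your help!

【Comments】:

  • Did you find an approach that works for you? I'm interested in something similar.

Tags: nlp word-embedding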


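【Answer 1】:

A task similar to this one (lexical substitution) is covered by the LS07 and LS14 benchmarks. One group of researchers achieved SOTA on these benchmarks using BERT. You may be interested in reading the paper: https://www.aclweb.org/anthology/P19-1328.pdf

The authors say the following:

"[The approach] applies dropout to the target word's embedding for partially masking the word, allowing BERT to take balanced consideration of the target word's semantics and contexts for proposing substitute candidates, and then validates the candidates based on their substitution's influence on the global contextualized representation of the sentence."

I don't know how to reproduce the same results, because the implementation is not publicly available. But here is the hint: embedding dropout can be used to generate substitute candidates.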
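As a rough illustration of that hint (not the paper's implementation), partial masking can be sketched in Java as element-wise dropout on a single token vector, such as one row of the `float[][]` returned by `bert.embedTokens`. The class name, method names, and the 0.3 dropout rate are assumptions made for this example. The cosine similarity is only a simple way to see how much of the original vector survives, and is a simplification of the validation step described in the quote. Feeding the masked embedding back through BERT's layers is not shown, and easy-bert may not expose a hook for it.

```java
import java.util.Arrays;
import java.util.Random;

public class EmbeddingDropoutSketch {

    // Returns a copy of the embedding in which each dimension is zeroed
    // independently with probability `rate`. No rescaling is applied.
    public static float[] dropout(float[] embedding, double rate, Random rng) {
        float[] out = Arrays.copyOf(embedding, embedding.length);
        for (int i = 0; i < out.length; i++) {
            if (rng.nextDouble() < rate) {
                out[i] = 0.0f;
            }
        }
        return out;
    }

    // Cosine similarity of two equal-length vectors; 0 if either has zero norm.
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical 4-dimensional vector standing in for a 768-dimensional BERT embedding.
        float[] target = {0.5f, -1.0f, 2.0f, 0.25f};
        float[] masked = dropout(target, 0.3, new Random(42));
        System.out.println(Arrays.toString(masked));
        System.out.println(cosine(target, masked));
    }
}
```

A rate of 0 leaves the word fully visible and a rate of 1 hides it completely, so intermediate values trade off the target word's own meaning against its context.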

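【Comments】:

    【Answer 2】:

    I developed a way to do this using Luminoso. I work for them, so this is something of an advertisement, but it does exactly what you are asking for.

    https://www.luminoso.com/search

    Luminoso is very good at understanding conversational text such as product reviews, product descriptions, survey results, and trouble tickets. It does not require any training or ontology building, and it builds a language model around your language. You feed the text of your pages into Luminoso, and it generates a set of synonyms for the concepts used in your text.

    As a sample project, I built a search over Amazon.com beauty products. I'll copy a few of the synonym sets that were automatically generated around a few concepts. The dataset produced 17,851 synonyms.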

    scent, rose-like, not sickeningly, not nauseating, not overwhelming, herb-y, no sweetness, cucumber-y, not too citrus-y, no gardenia, not lemony, pachouli, vanilla-like, fragarance, not spicy, flowerly, musk, perfume-like, floraly, not cloyingly => scent
    recommend, recommende, advice, suggestion, highly recommend, suggest, recommeded, recommendation, recommend this product, reccommended, advise, suggest, indicated, suggestion, advice, agree, recommend, say, considering, mentioned => recommend
    bottle, no sprayer, 8-oz, beaker, decanter, push-down, dispenser, pipet, pint, not the bottle, no dropper, keg, gallon, jug, pump-top, liter, half-full, decant, tumbler, vial => bottle
    eczema, non-steroidal, ulcerative, dematitis, ecsema, Elidel, dermititis, inflammation, pityriasis, hydrocortizone, dyshidrotic, chickenpox, Stelatopia, perioral, rosacea, dry skin, nummular, ecxema, mild-moderate, ezcema => eczema
    

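    There are 800k products in this search index, so the result set is large, but this also works on small datasets.

    Besides the synonym format, you can also put the output directly into Elasticsearch and associate a specific page's synonyms with that page.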
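For readers who want to consume rules of the form `variant, variant => concept` in their own Java code rather than in a search engine, here is a minimal sketch of a parser. The class and method names are hypothetical, and it assumes that variants are comma-separated and never contain a comma or `=>` themselves, which holds for the sample lines in this answer but may not hold in general.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SynonymRuleSketch {

    // Parses one "a, b, c => target" rule into a variant -> target map,
    // preserving the order in which the variants appear.
    public static Map<String, String> parseRule(String line) {
        int arrow = line.indexOf("=>");
        if (arrow < 0) {
            throw new IllegalArgumentException("missing '=>' in rule: " + line);
        }
        String target = line.substring(arrow + 2).trim();
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String variant : line.substring(0, arrow).split(",")) {
            String trimmed = variant.trim();
            if (!trimmed.isEmpty()) {
                mapping.put(trimmed, target);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // A shortened version of the eczema rule from the sample output.
        Map<String, String> rule = parseRule("ecsema, ecxema, ezcema => eczema");
        System.out.println(rule.get("ecxema"));
    }
}
```

Looking up a misspelled query term in the resulting map gives the canonical concept to search for.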

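    Here is an example of an Elasticsearch index enhanced with the same technique. The setting is dialed up very high, so too many concepts are added, but it is only meant to show you how it finds relationships between concepts.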

    {"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414089X"}}
    {"title": "New Benefit Waterproof Automatic Eyeliner Pen - Black - BAD Gal Liner", "text": "Length : 13.5 cm\nColor: Black\n100% Brand new and unused.\nSmudge free.\nFine-tip. Easy to blend and smooth to apply\nCan make fine and bold eyeline with new texture and furnishing.\nProvide rich and consistant colour\nLongwearing and waterproof\nFregrance Free", "primary_concepts": ["not overpoweringly", "concoction", "equipped", "fine-tip", "water-resistant", "luxuriant", "make", "fixture", "☆", "not lengthen", "washable", "not too heady", "blendable", "doesn't collect", "shade", "niche", "supple", "smudge-proof", "sumptuous", "movable", "black", "over-apply", "quick", "silky", "colored", "sweatproof", "opacity", "accomodate", "fuchsia", "furnishes", "meld", "sturdily", "smear", "inch", "mid-back", "chin-length", "smudge", "alredy", "not cheaply", "long-wearing", "eyeline", "texture", "steady", "no-name", "audacious", "easy", "edgy", "is:A", "marketers", "greys", "decadent", "applicable", "Crease-free", "magenta", "free", "itIn", "stay-true", "racy", "application", "glides", "smooth", "sleek", "taupe", "grainy", "dark", "wealthy", "JP7506CF", "gray", "grayish", "width", "newness", "purfumes", "Lancme", "blackish", "easily", "doesn't smudge", "maroon", "blend", "convenient", "smoother", "Moschino", "long-wear", "mauve", "medium-length", "no raccoon", "revamp", "demure", "richly", "white", "brand", "offers", "lenght", "soft", "doesn't smear", "provide", "provides", "unusable", "eye-liner", "unopened", "straightforward", "silky-smooth", "uniting", "compactness", "bold", "fearless", "mix", "indulgent", "brash", "serviceable", "unmarked", "not musky", "constructed", "racoon", "smoothly", "sealant", "merged", "boldness", "reuse", "unused", "long", "Kors", "effortless", "luscious", "stain", "rich", "discard", "richness", "opulent", "short", "consistency", "fine", "sents", "newfound", "fade-resistant", "mixture", "hue", "sassy", "apply", "fragnance", "heathy", "adventurous", 
"not enthusiastic", "longwearing", "fregrance", "non-waterproof", "empty", "lashline", "simple", "newly", "you'r", "combined", "no musk", "mingle", "waterproof", "painless", "pinkish", "thickness", "clump-free", "gos", "consistant", "color", "smoothness", "name-brand", "new", "smudgeproof", "yaaay", "water-proof", "eyemakeup", "not instant", "spidery", "furnish", "tint", "product", "reapply", "not black", "no globs", "imitators", "blot", "cinch", "uncomplicated", "untouched", "length"], "related_concepts": ["eyeliner", "no goofs", "doesn't smear", "pen", "hundreds"]}
    {"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414643X"}}
    {"title": "Goodskin Labs Eyliplex-2 Eye Life and Circle Reducer - 10ml", "text": "Eyliplex-2 is a dual solution that focuses on the problematic eye area. This breakthrough, 24-hour system from the scientists at good skin pharmacy visibly tightens eye areas while reducing dark circles. 0.34 oz. each. 64% of subjects reported younger looking eyes immediately and a 20% reduction in the appearance of dark circles in clinical studies.", "primary_concepts": ["coloration", "Laboratories", "oncology", "cornea", "undereye", "eye", "immediately", "☆", "teen", "dry-skin", "good", "eyelids", "puffiness", "behold", "research", "temperamental", "dermatological", "breakthrough", "study", "store", "nice", "lasik", "instantaneously", "teenaged", "multi", "rheostat", "dermatology", "chemist", "invisibly", "PhD", "pharmacy", "alredy", "not cheaply", "optional", "pharmacist", "Obagi-C", "topic", "supermarket", "reversible", "studies", "Younger", "medically", "report", "thermo", "tightness", "dual", "eliminate", "researcher", "Minimization", "cutaneous", "hydration", "O2", "taupe", "increase", "moisturization", "dark", "preliminary", "excellent", "Quad", "well", "appearance", "dusky", "quickly", "instantly", "CVS", "Dermal", "great", "revolutionary", "biologist", "epidermis", "blackish", "disclosed", "problem", "youngsters", "murky", "scientific", "teenager", "oz", "dark circles", "clinically", "emphasis", "absorption", "skin", "loosen", "intractable", "technological", "reduction", "clinician", "nutritional", "forthwith", "grocer", "scientifically", "swiftly", "examination", "state-of-the-art", "not acne prone", "zone", "decrease", "younger-looking", "excellently", "troublesome", "system", "radius", "tighten", "FDA", "decent", "noticeably", "WD-40", "clearer", "scientist", "saggy", "significantly", "improvement", "Teamine", "interchangeable", "visible", "visable", "no fine line", "shortly", "minimize", "survey", "problematic", "young", "glance", "racoon", "vicinity", "youthful", 
"exacerbated", "focal", "region", "groundbreaking", "reddish", "focus", "reduce", "increments", "nad", "fasten", "area", "soon", "complexion", "squinting", "look", "grocery", "eyliplex-2", "Eyliplex-2", "subsequently", "even-toned", "bothersome", "eyes", "mitigate", "markedly", "philosophy:you", "difficult", "darkish", "bluish", "satisfactory", "darken", "epidermal", "lessen", "appearence", "ocular", "ergonomically", "diminished", "progression", "purplish", "sun-damaged", "Cellex-C", "visibly", "diagnosis", "drugstore", "under-eye", "apothecary", ":-D", "terrific", "clinical", "oz.", "Endocrinology", "time-released", "Nouriva", "tight", "adolescent", "subject", "eyeballs", "sking", "Pro-Retinol", "aggravate", "younger", "shortcomings", "solution", "assess", "promptly", "teenage", "Kinetin", "24-hour", "Mart", "youth", "visibility", "scientists", "taut", "better", "eyesight", "no dark circles", "not reduce", "photoaging", "Pending"], "related_concepts": ["A22", "A82", "Amazon", "daytime", "HK", "nighttime", "smell", "dark circles", "purchased"]}
    {"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "1304146537"}}
    

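    Luminoso uses word embeddings from ConceptNet, which it also develops, and its technology goes beyond what ConceptNet gives you. I'm biased, but I'm amazed every time I run data through it. It isn't free, but it really does work with absolutely zero pre-training on your data, and nothing is really free anyway.

    【Comments】: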
