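【Question title】: Group or sort list/array by number of matching keywords
【Posted】: 2016-10-16 23:03:42
【Question】:

In C#, what is a good, efficient way to group or sort the elements of a string array or List by the number of keywords they share with one another? Elements with the most keywords in common should be placed next to each other.

For example, if the collection is: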

string[] movies = {
    "Star Wars Episode IV A New Hope",
    "Force of Hunger",
    "The Hunger Games Mockingjay",
    "Jaws 2",
    "The Shawshank Redemption",
    "Hunger Pain",
    "The Hunger Games",
    "Jaws: The Revenge",
    "The Hunger Games Catching Fire",
    "Rogue One A Star Wars Story",
    "Aqua Teen Hunger Force",
    "The Force Awakens Star Wars",
};

then the processed result should look something like this:

{
    "The Hunger Games Mockingjay",
    "The Hunger Games Catching Fire",
    "The Hunger Games",

    "Aqua Teen Hunger Force",
    "Force of Hunger",

    "Rogue One A Star Wars Story",
    "The Force Awakens Star Wars",
    "Star Wars Episode IV A New Hope",

    "Jaws: The Revenge",
    "Jaws 2",

    "Hunger Pain",

    "The Shawshank Redemption",
};

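【Comments】:

  • I would have suggested an alphanumeric sort, but that is not what you are asking for; you need a specific grouping, which requires custom code.

Tags: c# arrays algorithm linq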


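【Solution 1】:

Here is the approach I would take:

  1. Break each title into a set of normalized words, excluding "noise" words such as "a", "an", and "the".
  2. Find the intersection (commonality) of each pair of word sets.
  3. Keep a dictionary, keyed by title, whose values are sets of intersections. Add each intersection to the sets of both titles in the pair.
  4. Finally, sort the dictionary by the size of each title's largest intersection (largest first), then by the words in that intersection, and finally by title, to get the final list of titles.

Here is what it looks like in code: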

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        string[] movies = {
            "Star Wars Episode IV A New Hope",
            "Force of Hunger",
            "The Hunger Games Mockingjay",
            "Jaws 2",
            "The Shawshank Redemption",
            "Hunger Pain",
            "The Hunger Games",
            "Jaws: The Revenge",
            "The Hunger Games Catching Fire",
            "Rogue One A Star Wars Story",
            "Aqua Teen Hunger Force",
            "The Force Awakens Star Wars",
        };

        List<HashSet<string>> titleWords = movies
            .Select(m => new HashSet<string>(
                m.Split(new char[] { ' ', ':' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(w => w.ToLower())
                .Where(w => w != "a" && w != "an" && w != "the")))
            .ToList();

        var titles = new Dictionary<string, SortedSet<Commonality>>();
        for (int i = 0; i < titleWords.Count; i++)
        {
            for (int j = i + 1; j < titleWords.Count; j++)
            {
                var wordsInCommon = titleWords[i]
                    .Intersect(titleWords[j])
                    .OrderBy(w => w)
                    .ToList();
                Commonality c = new Commonality(wordsInCommon);
                AddCommonalities(titles, movies[i], c);
                AddCommonalities(titles, movies[j], c);
            }
        }

        string[] groupedTitles = titles
            .OrderBy(k => k.Value.First())
            .ThenBy(k => k.Key)
            .Select(k => k.Key)
            .ToArray();

        Console.WriteLine(string.Join("\r\n", groupedTitles));
    }

    private static void AddCommonalities(Dictionary<string, SortedSet<Commonality>> dict, string title, Commonality c)
    {
        SortedSet<Commonality> commonalities;
        if (!dict.TryGetValue(title, out commonalities))
        {
            commonalities = new SortedSet<Commonality>();
            dict.Add(title, commonalities);
        }
        commonalities.Add(c);
    }
}

class Commonality : IComparable<Commonality>
{
    public string JoinedWords { get; private set; }
    public int WordCount { get; private set; }

    public Commonality(List<string> words)
    {
        JoinedWords = string.Join(" ", words);
        WordCount = words.Count;
    }

    public override bool Equals(object obj)
    {
        Commonality that = obj as Commonality;
        return (that != null && that.JoinedWords == JoinedWords);
    }

    public override int GetHashCode()
    {
        return JoinedWords.GetHashCode();
    }

    public int CompareTo(Commonality other)
    {
        int r = other.WordCount - WordCount;
        if (r == 0) return string.CompareOrdinal(JoinedWords, other.JoinedWords);
        return r;
    }

    public override string ToString()
    {
        return WordCount + " " + JoinedWords;
    }
}

Output:

Aqua Teen Hunger Force
Force of Hunger
The Hunger Games
The Hunger Games Catching Fire
The Hunger Games Mockingjay
Rogue One A Star Wars Story
Star Wars Episode IV A New Hope
The Force Awakens Star Wars
Hunger Pain
Jaws 2
Jaws: The Revenge
The Shawshank Redemption

Fiddle: https://dotnetfiddle.net/ksMMY6
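
To see why two titles end up next to each other, steps 1 and 2 can be run in isolation on a single pair. The following is a minimal sketch, not part of the original answer; the class name `TitleWordsSketch` and the methods `Normalize` and `CommonWords` are hypothetical names introduced only for illustration, and the noise-word list is assumed to be the same three words used above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TitleWordsSketch
{
    // Noise words skipped during normalization (assumed: the same three as in the answer).
    private static readonly HashSet<string> Noise = new HashSet<string> { "a", "an", "the" };

    // Step 1: split a title into a set of lowercase words, dropping noise words.
    public static HashSet<string> Normalize(string title)
    {
        return new HashSet<string>(
            title.Split(new[] { ' ', ':' }, StringSplitOptions.RemoveEmptyEntries)
                 .Select(w => w.ToLowerInvariant())
                 .Where(w => !Noise.Contains(w)));
    }

    // Step 2: the words two titles have in common, sorted in ordinal order.
    public static List<string> CommonWords(string first, string second)
    {
        return Normalize(first)
            .Intersect(Normalize(second))
            .OrderBy(w => w, StringComparer.Ordinal)
            .ToList();
    }
}
```

Calling `TitleWordsSketch.CommonWords("Aqua Teen Hunger Force", "Force of Hunger")` gives the two words `force` and `hunger`, which is the commonality that places those titles together in the output, while `"Hunger Pain"` and `"The Shawshank Redemption"` share no words at all.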

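【Comments】:

  • Thanks, that's great. In the meantime I had arrived at intersections myself (and also played with Jaccard similarity), but your implementation of commonalities is very cool. Is there any way to optimize it and speed it up (PLINQ or Parallel.For for the two for loops), especially if the source data set is larger?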
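
The comment above mentions Jaccard similarity and PLINQ without showing either. As a rough sketch under stated assumptions (the names `PairwiseSketch`, `Jaccard`, and `ScoreAllPairs` are hypothetical, and nothing here was benchmarked), the pairwise scoring could be moved into a parallel query like this. Only the read-only scoring runs in parallel; filling a `Dictionary` as in Solution 1 would still have to happen afterwards on one thread, because `Dictionary` does not support concurrent writes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairwiseSketch
{
    // Jaccard similarity: |A ∩ B| / |A ∪ B|; defined here as 0 when both sets are empty.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        int common = a.Count(w => b.Contains(w));
        int union = a.Count + b.Count - common;
        return union == 0 ? 0.0 : (double)common / union;
    }

    // Scores every pair (i, j) with i < j using PLINQ.
    // The result order is not guaranteed, so sort it afterwards if order matters.
    public static List<Tuple<int, int, double>> ScoreAllPairs(IList<HashSet<string>> wordSets)
    {
        return Enumerable.Range(0, wordSets.Count)
            .AsParallel()
            .SelectMany(i => Enumerable.Range(i + 1, wordSets.Count - i - 1)
                .Select(j => Tuple.Create(i, j, Jaccard(wordSets[i], wordSets[j]))))
            .ToList();
    }
}
```

The number of pairs still grows quadratically with the number of titles, so parallelism only spreads that work across cores; whether it pays off for a given data set would need to be measured.
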
【Solution 2】:
using System;
using System.Linq;
using System.Text;

string[] movies = {
    "Star Wars Episode IV A New Hope",
    "Force of Hunger",
    "The Hunger Games Mockingjay",
    "Jaws 2",
    "The Shawshank Redemption",
    "Hunger Pain",
    "The Hunger Games",
    "Jaws: The Revenge",
    "The Hunger Games Catching Fire",
    "Rogue One A Star Wars Story",
    "Aqua Teen Hunger Force",
    "The Force Awakens Star Wars",
};

// Predefined keywords to look for in each title.
string[] kw = { "Star", "Wars", "Force", "Hunger", "Games", "The", "Jaws" };

// Group titles by how many keywords they contain (case-sensitive substring match),
// with the highest counts first.
var groups = movies
    .GroupBy(p => kw.Count(k => p.Contains(k)))
    .OrderByDescending(g => g.Key);

StringBuilder sb = new StringBuilder();
foreach (var g in groups)
{
    sb.AppendLine("Group : " + g.Key);
    foreach (var s in g)
    {
        sb.AppendLine(s);
    }
}

Console.WriteLine(sb.ToString());

The result will be:

   Group : 4
   The Force Awakens Star Wars
   Group : 3
   The Hunger Games Mockingjay
   The Hunger Games
   The Hunger Games Catching Fire
   Group : 2
   Star Wars Episode IV A New Hope
   Force of Hunger
   Jaws: The Revenge
   Rogue One A Star Wars Story
   Aqua Teen Hunger Force
   Group : 1
   Jaws 2
   The Shawshank Redemption
   Hunger Pain

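【Comments】:

  • Thanks for trying. It looks like you used predefined keywords, which isn't suitable. Maybe I wasn't clear, but "keywords" just refers to the bag of words in each element of the array.
  • I think this is a good answer, even if incomplete: the question doesn't define a pattern that a standard data grouping operation could run against. Define that first; otherwise custom logic is needed. +1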
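
One way to reconcile Solution 2 with the first comment is to derive the keyword list from the titles themselves instead of hard-coding it. The sketch below is an assumption about what that could look like, not something either answerer posted; `DerivedKeywordsSketch` and `SharedWords` are hypothetical names, and "keyword" is taken to mean any non-noise word that occurs in at least two titles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DerivedKeywordsSketch
{
    // Returns the words that appear in at least two titles, ignoring case
    // and the noise words "a", "an", and "the". Each word keeps the casing
    // of its first occurrence, and words come back in first-seen order.
    public static List<string> SharedWords(IEnumerable<string> titles)
    {
        var noise = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };
        return titles
            // Count each word at most once per title.
            .SelectMany(t => t.Split(new[] { ' ', ':' }, StringSplitOptions.RemoveEmptyEntries)
                .Where(w => !noise.Contains(w))
                .Distinct(StringComparer.OrdinalIgnoreCase))
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();
    }
}
```

For the `movies` array in the question this yields Star, Wars, Force, Hunger, Games, Jaws, which is the hard-coded `kw` list minus "The"; the result could be passed to the `GroupBy` in Solution 2 in place of `kw`.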