3
votes

I know there are tons of similar questions on SO regarding this subject, but I couldn't quite find the answer I was looking for. Here's my requirement.

I have a long list of strings (easily upwards of 50,000 or even 100K items) in which I need to find the duplicate items. But just finding duplicates won't do; what I really want to do is go through the list and add an increment index at the end of each item to indicate the number of times an item repeats. To better illustrate let me take an example. My list actually contains paths, so the example roughly resembles that.

My original List:

AAA\BBB
AAA\CCC
AAA\CCC
BBB\XXX
BBB
BBB\XXX
BBB\XXX

My adjusted list with indices added:

AAA\BBB[1]
AAA\CCC[1]
AAA\CCC[2]
BBB\XXX[1]
BBB[1]
BBB\XXX[2]
BBB\XXX[3]

First I tried the following method using Linq:

List<string> originalList = new List<string>();
List<string> duplicateItems = new List<string>();

// pathList is a simple List<string> that contains my paths.
foreach (string item in pathList)
{
    // Do some stuff here and pick 'item' only if it fits some criteria.
    if (IsValid(item))
    {
        originalList.Add(item);
        int occurences = originalList.Where(x => x.Equals(item)).Count();
        duplicateItems.Add(item + "[" + occurences + "]");
    }
}

This works just fine and gives me the desired result. The problem is it's painfully slow given that my list can contain 100K items. So I looked around and learned that HashSet could be a possible alternative that's potentially more efficient. But I can't quite figure out how I would get my exact desired result using that.

I could try something like this, I guess:

HashSet<string> originalList = new HashSet<string>();
List<string> duplicateItems = new List<string>();

foreach (string item in pathList)
{
    // Do some stuff here and pick 'item' only if it fits some criteria.
    if (IsValid(item))
    {
        if (!originalList.Add(item))
        {
            duplicateItems.Add(item + "[" + ??? + "]");
        }
    }
}

Later I could add "[1]" to all items in the HashSet, but how do I get the indices right (marked by the universal sign of confusion, ???, above) when adding an item to my duplicate list? I can't keep a reference int that I can pass to my method as there could be hundreds of different repeating items, each repeating different number of times as in my example.

Could I still use HashSet, or is there a better way of accomplishing my goal? Even a slight pointer in the right direction would be a great help.

7
Do you want them all smashed into one list at the end?maccettura
Preferably, yes. But I'd consider other alternatives too if it's not too slow and not too many lists.Sach
Does the order of the elements in the result list matter?NineBerry
You can't use a HashSet<string> for your original list because HashSet<T> doesn't store duplicates.itsme86
@NineBerry no it doesn't.Sach

7 Answers

10
votes

Since you ask for fastest, the best IMO would be to use foreach loop and counting Dictionary<string, int>. It has the same time complexity as HashSet and uses much less memory than LINQ GroupBy:

var counts = new Dictionary<string, int>(pathList.Count); // specify max capacity to avoid rehashing
foreach (string item in pathList)
{
    // Do some stuff here and pick 'item' only if it fits some criteria.
    if (IsValid(item))
    {
        int count;
        counts.TryGetValue(item, out count);
        counts[item] = ++count;
        duplicateItems.Add(item + "[" + count + "]");
    }
}
3
votes

You could try this, although I have not performance tested it yet:

List<string> originalList = new List<string>()
{
    @"AAA\BBB",
    @"AAA\CCC",
    @"AAA\CCC",
    @"BBB\XXX",
    @"BBB",
    @"BBB\XXX",
    @"BBB\XXX"
};
List<string> outputList = new List<string>();

foreach(var g in originalList.GroupBy(x => x).Select(x => x.ToList()))
{   
    var index = 1;  
    foreach(var item in g)
    {
        outputList.Add(string.Format("{0}[{1}]", item, index++));
    }
}

Fiddle here

1
votes

What about this?

    static IEnumerable<string> MyCounter(IEnumerable<string> data)
    {
        var myDic = new Dictionary<string, int>();
        foreach (var d in data)
        {
            if (!myDic.ContainsKey(d))
                myDic[d] = 1;
            else
                myDic[d] = myDic[d] + 1 ;
            yield return d +"[" + myDic[d] + "]";
        }
    }
1
votes

You could iterate over the list and use a dictionary to get the count, like this:

private int GetCount(IDictionary<string, int> counts, string item)
{
  int count;
  if (!counts.TryGetValue(item, out count))
    count = 0;
  count++;
  counts[item] = count;
  return count;
}

private IEnumerable<string> GetItems(IEnumerable<string> items)
{
  // Initialize dict for counts with appropriate comparison
  var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
  foreach(var item in items)
    yield return string.Format("{0}[{1}]", item, GetCount(counts, item));
}
1
votes

You can use this crisp and crunchy code:

public static void Main()
{
    var originalList  = new List<string>()
    {
        @"AAA\BBB",
        @"AAA\CCC",
        @"AAA\CCC",
        @"BBB\XXX",
        @"BBB",
        @"BBB\XXX",
        @"BBB\XXX"
    };

    var outputList = originalList.GroupBy(x => x).SelectMany(x => x.Select((y, i) => string.Format("{0}[{1}]", y, i + 1)));     

    Console.WriteLine(string.Join("\n", outputList));
}
0
votes

You could just use Group() to pull the strings together and then project those groups using a combination of value and count.

Given your list of strings:

var listOfStrings;
var grouped = listOfStrings.GroupBy(x => x);
var groupedCount = grouped.Select(x => new {key = x.Key, count = group.Count()});
0
votes

Using a HashSet

Note: Dump() is a LinqPad method that prints the results to the screen — substitute as necessary.

void Main()
{
    var list = new List<string> {"hello", "doctor", "name", "continue", "yesterday", "tomorrow", "HELLO"};
    
    //case-insensitive string compare
    list.HasDuplicates(StringComparer.OrdinalIgnoreCase).Dump();

    //case-sensitive string compare
    list.HasDuplicates().Dump();

    //integer compare
    var list2 = new List<int> { 1,2,3,4,5,2 };
    list2.HasDuplicates().Dump();
}

public static class Test
{
    public static bool HasDuplicates<T>(this IList<T> list, StringComparer stringComparer = null)
    {
        if (typeof(T) == typeof(string))
        {
            var hash = new HashSet<string>(list.Count, stringComparer);
            foreach (var val in list) if (!hash.Add(val?.ToString())) break;
            return hash.Count != list.Count;
        }
        else
        {
            var hash = new HashSet<T>(list.Count);
            foreach (var val in list) if (!hash.Add(val)) break;
            return hash.Count != list.Count;
        }
    }
}

/*

output:

True
False
True
*/