Not to compete with other answers, but I just could not help sharing a Compile
- based solution. The solution is based on building a binary search tree, and then checking for every number in the list, whether its index in the list is the one used in building the b-tree. If yes, it is the original number, if no - it is a duplicate. What makes this solution interesting for me is that it shows a way to emulate "pass-by-reference" with Compile
. The point is that, if we inline compiled functions into other Compiled functions (and that can be achieved with an "InlineCompiledFunctions" option), we can refer in inner functions to the variables defined in outer function scope (because of the way inlining works). This is not a true pass-by-reference, but it still allows to combine functions from smaller blocks, without efficiency penalty (this is more in the spirit of macro-expnsion). I don't think this is documented, and have no idea whether this will stay in future versions. Anyways, here is the code:
(* A function to build a binary tree *)
Block[{leftchildren , rightchildren},
makeBSearchTree =
Compile[{{lst, _Real, 1}},
Module[{len = Length[lst], ctr = 1, currentRoot = 1},
leftchildren = rightchildren = Table[0, {Length[lst]}];
For[ctr = 1, ctr <= len, ctr++,
For[currentRoot = 1, lst[[ctr]] != lst[[currentRoot]],(*
nothing *),
If[lst[[ctr]] < lst[[currentRoot]],
If[leftchildren[[currentRoot]] == 0,
leftchildren[[currentRoot]] = ctr;
Break[],
(* else *)
currentRoot = leftchildren[[currentRoot]] ],
(* else *)
If[rightchildren[[currentRoot]] == 0,
rightchildren[[currentRoot]] = ctr;
Break[],
(* else *)
currentRoot = rightchildren[[currentRoot]]]]]];
], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}]];
(* A function to query the binary tree and check for a duplicate *)
Block[{leftchildren , rightchildren, lst},
isDuplicate =
Compile[{{index, _Integer}},
Module[{currentRoot = 1, result = True},
While[True,
Which[
lst[[index]] == lst[[currentRoot]],
result = index != currentRoot;
Break[],
lst[[index]] < lst[[currentRoot]],
currentRoot = leftchildren[[currentRoot]],
True,
currentRoot = rightchildren[[currentRoot]]
]];
result
],
{{leftchildren, _Integer, 1}, {rightchildren, _Integer,
1}, {lst, _Real, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}
]];
(* The main function *)
Clear[deldup];
deldup =
Compile[{{lst, _Real, 1}},
Module[{len = Length[lst], leftchildren , rightchildren ,
nodup = Table[0., {Length[lst]}], ndctr = 0, ctr = 1},
makeBSearchTree[lst];
For[ctr = 1, ctr <= len, ctr++,
If[! isDuplicate [ctr],
++ndctr;
nodup[[ndctr]] = lst[[ctr]]
]];
Take[nodup, ndctr]], CompilationTarget -> "C",
"RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True,
"InlineCompiledFunctions" -> True,
"InlineExternalDefinitions" -> True}];
Here are some tests:
In[61]:= intTst = N@RandomInteger[{0,500000},1000000];
In[62]:= (res1 = deldup[intTst ])//Short//Timing
Out[62]= {1.141,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}
In[63]:= (res2 = Tally[intTst,Equal][[All,1]])//Short//Timing
Out[63]= {0.64,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}
In[64]:= res1==res2
Out[64]= True
Not as fast as the Tally
version, but also Equal
- based, and as I said, my point was to illustrate an interesting (IMO) technique.