12
votes

Is there a function that searches a sequence of elements for a subsequence? I am looking for an analogue of StringPosition for Lists. In my current application I am working with integer lists, but I'd be interested in a general FindSequence[list, pattern, n] function which will find the first n occurrences of pattern in list.


Here's a toy example:

Generate some data:

In[1]:= $HistoryLength = 0    
Out[1]= 0

In[2]:= Timing[digits = First[RealDigits[\[Pi], 2, 10000000]];]    
Out[2]= {26.5, Null}

Let's convert it to a string so we can compare to StringPosition. This is very slow an memory hungry. (The memory is freed when the evaluation finishes.)

In[3]:= Timing[str = StringJoin[ToString /@ digits];]    
Out[3]= {43.813, Null}

I am looking for this subsequence:

In[4]:= patt = {1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 
   1, 0, 1, 1};

In[5]:= strpatt = StringJoin[ToString /@ patt];

Searching the string is very fast:

In[6]:= StringPosition[str, strpatt] // Timing    
Out[6]= {1.047, {{5737922, 5737943}}}

This is a simple implementation of searching for numerical arrays. It's slower than StringPosition:

In[7]:= Timing[
           corr = ListCorrelate[patt, digits];
           Select[Flatten@Position[corr, patt.patt], 
             digits[[# ;; # + Length[patt] - 1]] === patt &]
        ]

Out[7]= {2.234, {5737922}}

Summary:

  1. Is there a builtin that searches lists for subsequences?
  2. If there isn't, what is a fast and elegant implementation for numeric lists (my practical problem)?
  3. What about generic lists that can contain anything? (There are two possibilities here: "static" patterns only such as {1,0,1}, or general ones like {1,_,1}, though these latter ones may introduce complications.)

I expect this will have many solutions, some fast, some more elegant, some more general :-)


Related questions:

Interesting reading:


EDIT:

I just found the undocumented LongestCommonSubsequencePositions. LongestCommonSubsequencePositions[a, b] will find the longest common subsequence of the lists a and b, and return position of its first occurrence only in both a and b. (The documented LongestCommonSubsequence, which I was not aware of, will only return the subsequence itself, not its position.)

It is slower than the alternatives above, but it works on general lists that can contain any expression.

In[57]:= LongestCommonSubsequencePositions[digits, patt] // Timing
Out[57]= {5.25, {{5737922, 5737943}, {1, 22}}}
3

3 Answers

16
votes

You can use ReplaceList with a "prefix" and "suffix" of ___ and match the whole list. This gives you all the replacements that can be made (as opposed to Replace). The position of your pattern is then simply the length of the prefix + 1. It's pretty fast as well:

In[40]:= Timing[ReplaceList[digits, Join[{pre___}, patt, {___}] :> Length[{pre}]
   + 1]]

Out[40]= {1.3059, {5737922}}

Edit: figured it's slightly more elegant to use a delayed rule than to map Length afterwards.

4
votes

Please have a look at functions seqPos (general lists) and seqposC (integer lists, compiled), which do exactly what you ask for, and are fast. I used them in this answer (for the question you actually linked to).

Here are the timing results for various solutions:

In[15]:= seqPos[digits, patt] // Timing
Out[15]= {1.297, {5737922}}

In[16]:= seqposC[digits, patt] // Timing
Out[16]= {0.125, {5737922}}

In[17]:= 
Timing[corr = ListCorrelate[patt, digits];
      Select[Flatten@Position[corr, patt.patt], 
         digits[[# ;; # + Length[patt] - 1]] === patt &]]

Out[17]= {0.844, {5737922}}

In[18]:= Timing[
    ReplaceList[digits, Join[{pre__}, patt, {___}] :> Length[{pre}] + 1]]

Out[18]= {0.953, {5737922}}

In[19]:= AbsoluteTiming[cf[digits, patt]]
Out[19]= {3.1914063, 5737922}

These indicate that your approach with ListCorrelate is not bad at all. My first function seqPos (it is actually due to Norbert Pozar) is a bit slower but then it is completely general, while seqposC is much faster.

2
votes

Here is a compiled version, that avoids the String conversion but is not faster.

cf = Compile[{{in, _Integer, 1}, {patt, _Integer, 1}},
  Block[{lp, res},
   lp = Length[patt];
   res = 0;
   Do[
    If[Total[Abs[in[[i ;; i + lp - 1]] - patt]] == 0,
      res = i; Break[]];
    , {i, 1, Length[in] - lp}];
   res
   ]
  , CompilationTarget -> "C", RuntimeOptions -> "Speed"]


AbsoluteTiming[cf[digits, patt]]