1
votes

I need to find a faster way to find the swaps in a 8-11 character string, in the following manner:

Given a string 'STDILGNLYE', find all the single letter swaps for the letters:

list_AA = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M',
           'F', 'P', 'S', 'T', 'W', 'Y', 'V']

i.e, for each letter in the string, substitute each letter in the original string with one in list_aa. The output would be:

ATDILGNLYE
RTDILGNLYE
NTDILGNLYE
...
SADILGNLYE
SRDILGNLYE
SNDILGNLYE
...
...
STDILGNLYV

For a total of 200 new strings (20 swaps per each position in the string). What I have so far:

def _create_swaps(original_str):
    list_peps = []
    for i in range(len(original_str)):
        for k in range(len(list_AA)):
            list_peps.append(_insert_aa(original_str, i, list_aa[k]))

    #remove original string
    return [i for i in list_peps if i != original_str]


def _insert_aa(string, index, aa):
    list_string_elements = list(string)
    del list_string_elements[index]
    hash_string.insert(index, aa)
    return "".join(hash_string)

Since this needs to be repeated ~10**6 times, it is the slowest step in a larger project. Is there a way of finding such swaps in a faster manner (by eliminating the "".join, insert, steps/ by finding swaps on the fly)?

For reference:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
185275200  330.286    0.000  429.295    0.000 models.py:233(_insert_aa)
975240     147.322    0.000  616.979    0.001 models.py:225(_create_swaps)
185280201/185280197   59.137    0.000   59.138    0.000 {method 'join' of 'str' objects}
185275208  39.875    0.000   39.875    0.000 {method 'insert' of 'list' objects}
975240     21.027    0.000   21.027    0.000 models.py:231(<listcomp>)
186746064  18.516    0.000   18.516    0.000 {method 'append' of 'list' objects}
2
Do you need to emit all the generated strings, or just count them?Steve
@Steve I need all the strings. As you can see from the return call from _create_swaps, it returns all created strings except the original one.Carlo Mazzaferro
You may want to try to figure out a way get replace one of the operations with a map()...see this article on loop efficiency...of course profiling is always better than theory though...Albert Rothman

2 Answers

2
votes

This is is a cleaner version of what you're looking for, even though you've already selected an answer (it's not the most pythonic).

You shouldn't ever use range to get the index of an iterable, you should use enumerate if you want to be pythonic about it.

>>> def swaps(s, lst):
...   for index, _ in enumerate(s):
...     for letter in lst:
...       temp = list(s)
...       temp[index] = letter
...       yield ''.join(temp)
...
>>> list_AA = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
>>> s = 'STDILGNLYE'
>>>
>>> for _ in swaps(s, list_AA):
...   print _
...
ATDILGNLYE
RTDILGNLYE
NTDILGNLYE
..........
GTDILGNLYE
HTDILGNLYE
ITDILGNLYE

Also, a simplistic approach in python3:

>>> def swaps(s, lst):
...   for i, _ in enumerate(s):
...     yield from ['%s%s%s' % (s[:i], x, s[i+1:]) for x in lst]
...
>>> swaps(s,list_AA)
<generator object swaps at 0x10c9205c8>
>>> a=_
>>> next(a)
'ATDILGNLYE'
>>> next(a)
'RTDILGNLYE'
>>> next(a)
'NTDILGNLYE'
>>> next(a)
'DTDILGNLYE'

edit: compromising solution on speed/readability

def swap3(s, lst):
    for i, _ in enumerate(s):
        head, tail = s[:i], s[i+1:]
        yield from ['%s%s%s'%(head,c,tail) for c in lst]

And heres a benchmark test of all three:

s='STDILGNLYE'
list_AA=['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F',
        'P', 'S', 'T', 'W', 'Y', 'V']

# the correct sample size
list_new = list_AA * (10**6 // len(list_AA))

def swaps0(string, replacements):
    for i in range(len(string)):
        head = string[:i]
        tail = string[i+1:]
        for letter in replacements:
            yield head + letter + tail

def swaps1(s, lst):
  for i, _ in enumerate(s):
    yield from ['%s%s%s' % (s[:i], x, s[i+1:]) for x in lst]

def swaps2(s, lst):
  for index, _ in enumerate(s):
    for letter in lst:
      temp = list(s)
      temp[index] = letter
      yield ''.join(temp)

timeit [_ for _ in swaps0(s, list_new)]
timeit [_ for _ in swaps1(s, list_new)]
timeit [_ for _ in swaps2(s, list_new)]


In [9]: timeit [_ for _ in swaps0(s, list_new)]
1 loop, best of 3: 2.61 s per loop
In [10]: timeit [_ for _ in swaps1(s, list_new)]
1 loop, best of 3: 6.57 s per loop
In [11]: timeit [_ for _ in swaps2(s, list_new)]
1 loop, best of 3: 8.61 s per loop

Is it worth it? I'd say it depends on how much larger you expect this sample size to grow and how often you're going to run the code.

If the code isn't going to be running frequently (say, hundreds of times per hour) and the sample sample size isn't going to grow exponentially (to the order of 1050 or 10100) then I'd say go for readability.

If this is going to be computed extremely often with increasing sample size, go for performance.

Finally, we're left with a compromising solution combining enumerate with the head/tail splitting:

def swap3(s, lst):
    for i, _ in enumerate(s):
        head, tail = s[:i], s[i+1:]
        yield from ['%s%s%s'%(head,c,tail) for c in lst]

In [16]: timeit [_ for _ in swap3(s, list_new)]
1 loop, best of 3: 3.99 s per loop
1
votes

This should be faster:

def _insert_aa(string, index, aa):
    return string[0:index] + aa + string[index+1:]

EDIT: You can slice the head and tail only once and reuse like this:

def generate_all_variants(string, replacements):
    for i in range(len(string)):
        head = string[:i]
        tail = string[i+1:]
        for letter in replacements:
            yield head + letter + tail

for variant in generate_all_variants("abcd",  ['1', '2', '3']):
    print(variant)