Batch to remove only duplicate segments from strings

Question

Is there a fast script or command in batch/PowerShell that analyses all lines of a txt file, tells the duplicate segments apart from the variable ones, and removes only the duplicate segments? Example:

input file1.txt:

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

output file1.txt:

11234452232131
6176413190830
6278647822786
676122249819113

input file2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

output file2.txt:

11234452232131
6176413190830
6278647822786
676122249819113

My script:

@echo off & setlocal enabledelayedexpansion

:startline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~0,1!"=="!second:~0,1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto startline
)

:break

:endline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~-1!"=="!second:~-1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~0,-1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto endline
)

:break

exit

I think this script is too slow when it has to run on multiple files.

First, this "question" is far too broad (read the tour and How to Ask to learn why). Second, it is quite unclear: 1. do the lines always have the same length (per input file)? 2. does the duplicate part always appear at the beginning or at the end of the lines? 3. could multiple duplicate parts appear? — aschipfl
In addition, you're tag spamming. Pick the language you want to use, make an effort to write a solution to the problem in that language, and then come here and ask about a specific issue you've run into, include your code, and ask a specific question related to that code. — Ken White
So the character position can not be considered for searching the duplicate parts, right? Anyway, if you have problems with your script, please share it (edit the question for that). Perhaps it would be even better if you posted your script on Code Review... — aschipfl
You have two labels called :break in your code, which is a bad idea... better call them differently, so it is obvious where execution continues after goto. Anyway, the major problem in your code is that you are reading the input file multiple times, which makes it slow; also goto loops are quite slow... — aschipfl
Wouldn't Remove common prefix and/or suffix of unknown length from a list of strings be a title better describing your task? — user6811411

aschipfl · Accepted Answer · 2018-06-29T02:43:19

What about this (see the explanatory :: comment):

@echo off
::This script assumes that the lines of the input file (provided as command line argument)
::do not contain any of the characters `^`, `!`, and `"`. The lines may be of different
::lengths, empty lines are ignored though.
::The script processes the input file in two phases:
::1. let us call this the analysis phase, which consists of the following steps:
::    * read the first line of the file, store the string and determine its length;
::    * read the second line, walk through all characters beginning from the left and from
::      the right side within the same loop, find the character indexes that point to the
::      first left-most and the last right-most character that do not equal the respective
::      ones in the string from the first line, and store the retrieved indexes;
::    * read the remaining lines, and for each one, extract the prefix and the suffix that
::      is indicated by the respective stored indexes and compare them with the respective
::      prefix and suffix from the first line; if both are equal, stop processing here
::      and continue with the next line; otherwise, walk through all characters beginning
::      before the previous left-most and after the previous right-most character indexes
::      towards the respective ends of the string, find the character indexes that again
::      point to the first left-most and the last right-most character that do not equal
::      the respective ones in the string from the first line, and update the previously
::      stored indexes accordingly;
::2. let us call this the execution phase, which reads the input file again, extracts the
::   portion of each line that is indicated by the two computed indexes and returns it;
::The output is displayed in the console; to write it to a file, use redirection (`>`).
setlocal EnableDelayedExpansion

set "MIN=" & set "MAX=" & set /A "ROW=0"
for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
    set /A "ROW+=1" & set "STR=%%L"
    if !ROW! equ 1 (
        call :LENGTH LEN "%%L"
        set "SAV=%%L"
    ) else if !ROW! equ 2 (
        set /A "IDX=LEN-1"
        for /L %%I in (0,1,!IDX!) do (
            if not defined MIN (
                if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
            )
            if not defined MAX (
                set /A "IDX=%%I+1"
                for %%J in (!IDX!) do (
                    if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                )
            )
        )
        if not defined MIN set /A "MIN=LEN, MAX=-LEN"
    ) else (
        set "NXT=#"
        if !MIN! gtr 0 for %%I in (!MIN!) do if not "!STR:~,%%I!"=="!SAV:~,%%I!" set "NXT="
        if !MAX! lss 0 for %%J in (!MAX!) do if not "!STR:~%%J!"=="!SAV:~%%J!" set "NXT="
        if not defined NXT (
            if !MAX! lss -!MIN! (set /A "IDX=1-MAX") else (set /A "IDX=MIN-1")
            for /L %%I in (!IDX!,-1,0) do (
                if %%I lss !MIN! (
                    if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
                )
                if -%%I geq !MAX! (
                    set /A "IDX=%%I+1"
                    for %%J in (!IDX!) do (
                        if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                    )
                )
            )
        )
    )
)
if defined MAX if !MAX! equ 0 set "MAX=8192"
for /F "tokens=1,2" %%I in ("%MIN% %MAX%") do (
    for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
        set "STR=%%L"
        echo(!STR:~%%I,%%J!
    )
)

endlocal
exit /B


:LENGTH  <rtn_length>  <val_string>
    ::Function to determine the length of a string.
    ::PARAMETERS:
    ::  <rtn_length>  variable to receive the resulting string length;
    ::  <val_string>  string value to determine the length of;
    set "STR=%~2"
    setlocal EnableDelayedExpansion
    set /A "LEN=1"
    if defined STR (
        for %%C in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
            if not "!STR:~%%C!"=="" set /A "LEN+=%%C" & set "STR=!STR:~%%C!"
        )
    ) else set /A "LEN=0"
    endlocal & set "%~1=%LEN%"
    exit /B
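
For readers who find the index bookkeeping in the batch code hard to follow, the two phases described in the :: comments can be restated in a few lines of Python. This is only a sketch of the idea, not a translation of the script above: the function name strip_common_affixes is made up for illustration, and the analysis phase uses os.path.commonprefix (which compares strings character by character) instead of comparing every line against the first one.

```python
import os


def strip_common_affixes(lines):
    """Remove the prefix and the suffix shared by all non-empty lines.

    Hypothetical helper for illustration; empty lines are ignored,
    as the batch script does.
    """
    rows = [ln for ln in lines if ln]
    if not rows:
        return []

    # Analysis phase: length of the longest common prefix ...
    prefix = len(os.path.commonprefix(rows))
    # ... and of the longest common suffix (common prefix of the reversed rows).
    suffix = len(os.path.commonprefix([r[::-1] for r in rows]))

    # Keep prefix and suffix from overlapping on the shortest line.
    shortest = min(len(r) for r in rows)
    suffix = min(suffix, shortest - prefix)

    # Execution phase: cut the same number of characters from every line.
    return [r[prefix:len(r) - suffix] for r in rows]


if __name__ == "__main__":
    sample = ["11234452232131xyz", "6176413190830xyz", "6278647822786xyz"]
    for row in strip_common_affixes(sample):
        print(row)
```

Applied to the lines of file1.txt and file2.txt from the question, the function returns the output lines listed there.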

This could maybe be improved further, depending also on the data:

if the length of the first line is fixed, or the line lengths vary only within a small range, you could avoid the :LENGTH sub-routine call and use a constant value instead; if there is a known maximum length of the common prefix/suffix, the line length is not even needed at all;
instead of reading the file twice (due to the two-pass algorithm), you could read it into memory at the beginning and use that data later; for huge files this might be a bad idea though;
I used several for /L loops to walk through certain character ranges; because batch has no while loops or anything like exit for, their bodies are skipped by some if conditions once there is nothing left to do. I could have left the loops using goto, but then I would have had to put them into separate sub-routines so as not to break the outer loops. Besides, for [/L] loops finish iterating in the background even when broken by goto, although that is faster than executing the body. Together with the slow call and goto, I doubt that I would have gained much speed. Depending on the data, pure goto loops could be more efficient, as they can be left without any remaining background processing, but of course they would also need to be placed in their own sub-routines;
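
To make the last two points concrete, here is how the same task looks in a language that has while loops with early exit and can cheaply hold the file in memory. Python is used purely for illustration; the function name strip_affixes_single_read is hypothetical, and the sketch assumes the file fits into memory and is UTF-8 encoded.

```python
def strip_affixes_single_read(path):
    """Read the file once, then trim the common prefix/suffix in memory."""
    with open(path, encoding="utf-8") as handle:
        rows = [ln for ln in handle.read().splitlines() if ln]  # skip empty lines
    if not rows:
        return []

    first = rows[0]
    prefix = suffix = shortest = len(first)
    for row in rows[1:]:
        shortest = min(shortest, len(row))

        # Prefix: compare only up to the current candidate length and
        # stop at the first mismatch (the early exit batch lacks).
        i, limit = 0, min(prefix, len(row))
        while i < limit and row[i] == first[i]:
            i += 1
        prefix = i

        # Suffix: same idea, walking inwards from the right end.
        j, limit = 0, min(suffix, len(row))
        while j < limit and row[-1 - j] == first[-1 - j]:
            j += 1
        suffix = j

    # Keep prefix and suffix from overlapping on the shortest line.
    suffix = min(suffix, shortest - prefix)
    return [row[prefix:len(row) - suffix] for row in rows]
```

Each line is compared against the first line only up to the prefix and suffix lengths that are still candidates, so the amount of work per line shrinks as soon as the common parts turn out to be short.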
