1
votes

Is there a fast script or command in batch/PowerShell that finds the segments duplicated across all lines of a txt file and removes them, keeping only the variable parts? Example:

input file1.txt:

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

output file1.txt:

11234452232131
6176413190830
6278647822786
676122249819113

input file2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

output file2.txt:

11234452232131
6176413190830
6278647822786
676122249819113

My script:

@echo off & setlocal enabledelayedexpansion

:startline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~0,1!"=="!second:~0,1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto startline
)

:break

:endline

set /p first=<#SHA1.txt

set status=notequal

for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
if "!first:~-1!"=="!second:~-1!" (set status=equal) else (set status=notequal & goto break)
)

if "!status!"=="equal" (
for /f "delims=" %%a in (#SHA1.txt) do (
set second=%%a
echo !second:~0,-1!>>#SHA1.tmp
)
if exist #SHA1.tmp (del #SHA1.txt & ren #SHA1.tmp #SHA1.txt)
goto endline
)

:break

exit

I think this script is too slow when it has to run on multiple files.

4
At first, this "question" is far too broad (read the tour and How to Ask to learn why). Second, it is quite unclear: 1. do the lines always have the same lengths (per input file)? 2. does the duplicate part always appear at the beginning or at the end of the lines? 3. could there appear multiple duplicate parts? - aschipfl
In addition, you're tag spamming. Pick the language you want to use, make an effort to write a solution to the problem in that language, and then come here and ask about a specific issue you've run into, include your code, and ask a specific question related to that code. - Ken White
So the character position can not be considered for searching the duplicate parts, right? Anyway, if you have problems with your script, please share it (edit the question for that). Perhaps it would be even better if you posted your script on Code Review... - aschipfl
You have two labels called :break in your code, which is a bad idea... better call them differently, so it is obvious where execution continues after goto. Anyway, the major problem in your code is that you are reading the input file multiple times, which makes it slow; also goto loops are quite slow... - aschipfl
Wouldn't Remove common prefix and/or suffix of unknown length from a list of strings be a title better describing your task? - user6811411

4 Answers

2
votes

What about this (see the explanatory :: comments):

@echo off
::This script assumes that the lines of the input file (provided as command line argument)
::do not contain any of the characters `^`, `!`, and `"`. The lines may be of different
::lengths, empty lines are ignored though.
::The script processes the input file in two phases:
::1. let us call this the analysis phase, which consists of the following steps:
::    * read the first line of the file, store the string and determine its length;
::    * read the second line, walk through all characters beginning from the left and from
::      the right side within the same loop, find the character indexes that point to the
::      first left-most and the last right-most character that do not equal the respective
::      ones in the string from the first line, and store the retrieved indexes;
::    * read the remaining lines, and for each one, extract the prefix and the suffix that
::      is indicated by the respective stored indexes and compare them with the respective
::      prefix and suffix from the first line; if both are equal, leave the loop here
::      and continue with the next line; otherwise, walk through all characters beginning
::      before the previous left-most and after the previous right-most character indexes
::      towards the respective ends of the string, find the character indexes that again
::      point to the first left-most and the last right-most character that do not equal
::      the respective ones in the string from the first line, and update the previously
::      stored indexes accordingly;
::2. let us call this the execution phase, which reads the input file again, extracts the
::   portion of each line that is indicated by the two computed indexes and returns it;
::The output is displayed in the console; to write it to a file, use redirection (`>`).
setlocal EnableDelayedExpansion

set "MIN=" & set "MAX=" & set /A "ROW=0"
for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
    set /A "ROW+=1" & set "STR=%%L"
    if !ROW! equ 1 (
        call :LENGTH LEN "%%L"
        set "SAV=%%L"
    ) else if !ROW! equ 2 (
        set /A "IDX=LEN-1"
        for /L %%I in (0,1,!IDX!) do (
            if not defined MIN (
                if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
            )
            if not defined MAX (
                set /A "IDX=%%I+1"
                for %%J in (!IDX!) do (
                    if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                )
            )
        )
        if not defined MIN set /A "MIN=LEN, MAX=-LEN"
    ) else (
        set "NXT=#"
        if !MIN! gtr 0 for %%I in (!MIN!) do if not "!STR:~,%%I!"=="!SAV:~,%%I!" set "NXT="
        if !MAX! lss 0 for %%J in (!MAX!) do if not "!STR:~%%J!"=="!SAV:~%%J!" set "NXT="
        if not defined NXT (
            if !MAX! lss -!MIN! (set /A "IDX=1-MAX") else (set /A "IDX=MIN-1")
            for /L %%I in (!IDX!,-1,0) do (
                if %%I lss !MIN! (
                    if not "!STR:~%%I,1!"=="!SAV:~%%I,1!" set /A "MIN=%%I"
                )
                if -%%I geq !MAX! (
                    set /A "IDX=%%I+1"
                    for %%J in (!IDX!) do (
                        if not "!STR:~-%%J,1!"=="!SAV:~-%%J,1!" set /A "MAX=1-%%J"
                    )
                )
            )
        )
    )
)
if defined MAX if !MAX! equ 0 set "MAX=8192"
for /F "tokens=1,2" %%I in ("%MIN% %MAX%") do (
    for /F usebackq^ delims^=^ eol^= %%L in ("%~1") do (
        set "STR=%%L"
        echo(!STR:~%%I,%%J!
    )
)

endlocal
exit /B


:LENGTH  <rtn_length>  <val_string>
    ::Function to determine the length of a string.
    ::PARAMETERS:
    ::  <rtn_length>  variable to receive the resulting string length;
    ::  <val_string>  string value to determine the length of;
    set "STR=%~2"
    setlocal EnableDelayedExpansion
    set /A "LEN=1"
    if defined STR (
        for %%C in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
            if not "!STR:~%%C!"=="" set /A "LEN+=%%C" & set "STR=!STR:~%%C!"
        )
    ) else set /A "LEN=0"
    endlocal & set "%~1=%LEN%"
    exit /B

This could probably be improved further, depending on the data:

  • if the length of the first line is fixed, or the line lengths vary only within a small range, you could avoid the :LENGTH sub-routine call and use a constant value instead; if the maximum length of the common prefix/suffix is known, the line length is not needed at all;
  • instead of reading the file twice (because of the two-pass algorithm), you could read it into memory at the beginning and reuse that data later; for huge files this might be a bad idea though;
  • I used several for /L loops to walk through certain character ranges; because there are no while loops or anything like exit for, their bodies are skipped by if conditions once the result is found. I could have left the loops using goto, but then I would have had to put them into separate sub-routines to avoid breaking the outer loops. Besides, for /L loops finish iterating in the background even when broken by goto, although faster than when executing the body. Together with the slow call and goto, I doubt that I would have gained much speed. Depending on the data, pure goto loops could be more efficient, as they can be left without any remaining background processing, but of course they would also need to be placed in their own sub-routines.
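
The second point above (reading the file once and working in memory) can be sketched in PowerShell, which the question also allows. This is only a rough sketch under assumptions, not a drop-in replacement for the batch script: the function name Remove-CommonAffix is made up for illustration, the whole file is assumed to fit in memory, and characters are compared case-sensitively.

```powershell
# Hypothetical helper (not part of the batch script above): trims the prefix
# and the suffix shared by all lines, using lines that were read only once.
function Remove-CommonAffix {
    param([string[]] $Lines)

    if ($Lines.Count -lt 2) { return $Lines }   # nothing to compare against

    $first  = $Lines[0]
    $prefix = $first.Length    # upper bound for the common prefix length
    $suffix = $first.Length    # upper bound for the common suffix length

    foreach ($line in $Lines) {
        # Shrink the prefix bound to the part that still matches this line.
        $p = 0
        while ($p -lt $prefix -and $p -lt $line.Length) {
            if ($line[$p] -cne $first[$p]) { break }
            $p++
        }
        $prefix = $p

        # Same for the suffix, counting characters from the end.
        $s = 0
        while ($s -lt $suffix -and $s -lt $line.Length) {
            if ($line[$line.Length - 1 - $s] -cne $first[$first.Length - 1 - $s]) { break }
            $s++
        }
        $suffix = $s
    }

    foreach ($line in $Lines) {
        # Guard against prefix and suffix overlapping on very short lines.
        $start = [Math]::Min($prefix, $line.Length)
        $keep  = [Math]::Max(0, $line.Length - $prefix - $suffix)
        $line.Substring($start, $keep)
    }
}
```

Calling it as Remove-CommonAffix (Get-Content file1.txt) would write the trimmed lines to the pipeline. Unlike the batch script, it does not skip empty lines; an empty line simply forces both lengths to zero.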
2
votes

Remove common prefix and/or suffix of unknown length from a list of strings

This batch takes a quite simplistic (and probably inefficient) approach:

  • It reads the first line and iterates with a growing prefix over the first 30 characters.
  • It uses findstr to match the lines and pipes the result to find to get a line count.
  • If the line count doesn't match the file's total number of lines, the prefix has become too long and the batch continues with the next step.
  • The same procedure is then used for the suffix.
  • Finally the lines are truncated (prefix and suffix at the same time).

Pass the name of the file to process as an argument; otherwise file1.txt is used by default.

:: Q:\Test\2018\06\29\SO_51093137.cmd
@echo off & setlocal enabledelayedexpansion
Set "File=%~1"
If not defined File Set "File=file1.txt"
Echo Processing %File%

:: get number of lines
for /f %%i in ('Find /V /C "" ^<"%File%"') Do Set Lines=%%i
Echo #Lines is %Lines%

:: get 1st line
Set /P "Line1=" < "%File%"
Echo Line1 is %Line1%

:: Iterate Prefixlength until Prefix doesn't match all lines
For /L %%i in (1,1,30) Do (
    For /F %%A in ('
        Findstr /B /L "!Line1:~0,%%i!" "%File%" ^|Find  /C /V "" '
    ) Do Set "EQ=%%A"
    If %Lines% neq !EQ! (Set /A "PrefixLength=%%i -1" & Goto :Break1)
)
:Break1
Echo PrefixLength is %PrefixLength%

:: Iterate Suffixlength until Suffix doesn't match all lines
For /L %%i in (-1,-1,-30) Do (
    For /F %%A in ('
        Findstr /E /L "!Line1:~%%i!" "%File%" ^|Find  /C /V "" '
    ) Do Set "EQ=%%A"
    If %Lines% neq !EQ! (Set /A "SuffixLength=%%i +1" & Goto :Break2)
)
:Break2

Echo SuffixLength is %SuffixLength%
Echo ============
For /f "usebackqDelims=" %%A in ("%File%") Do (
    Set "Line=%%A"
    If %SuffixLength%==0 (
        Echo=!Line:~%PrefixLength%!
    ) Else (
        Echo=!Line:~%PrefixLength%,%SuffixLength%!
    )
)

Sample output:

> SO_51093137.cmd file2.txt
Processing file2.txt
#Lines is 4
Line1 is 11234452232131xyz
PrefixLength is 0
SuffixLength is -3
============
11234452232131
6176413190830
6278647822786
676122249819113
1
votes

The following probably overcomplicates things, but it pushed my limits, so it was a great learning experience for me.

$file1 = @(
    ,'abcde11234452232131' 
    ,'abcde6176413190830'
    ,'abcde6278647822786'
    ,'abcde676122249819113'
)

function Test-EqualChar
{
    param (
        [Scriptblock] $Expression,
        [Object[]] $Sequence,
        [int] $i
    )
    !(($Sequence[1..($Sequence.Length -1)] | % {(&$Expression $_ $i) -eq ($Sequence[0][$i])}) -contains $False)
}

$OneChar = {param($x, $i) $x[$i]}
$start = for($i=0;$i -lt ($file1 | % {$_.Length} | Measure -Minimum | Select -ExpandProperty Minimum);$i++) {
    if (!(Test-EqualChar $OneChar $file1 $i)) {$i; break}
}
$file1 | % {$_.Substring($start, $_.Length-$start)}

I'll leave it as an exercise to work out reversing (or padding) the strings to remove equal characters from the end of the strings.
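
One possible way to approach that exercise is the reversal idea: a common suffix is just a common prefix of the reversed strings. The sketch below is not the answer's own code; the helper names Get-CommonPrefixLength and Get-ReversedString are made up for illustration, and the sample data is file2 from the question.

```powershell
$file2 = @(
    '11234452232131xyz'
    '6176413190830xyz'
    '6278647822786xyz'
    '676122249819113xyz'
)

# Hypothetical helper: number of leading characters shared by all strings.
function Get-CommonPrefixLength {
    param([string[]] $Strings)
    $min = [int](($Strings | Measure-Object -Property Length -Minimum).Minimum)
    for ($i = 0; $i -lt $min; $i++) {
        foreach ($s in $Strings) {
            if ($s[$i] -cne $Strings[0][$i]) { return $i }
        }
    }
    return $min
}

# Hypothetical helper: returns the characters of a string in reverse order.
function Get-ReversedString {
    param([string] $Text)
    $chars = $Text.ToCharArray()
    [array]::Reverse($chars)
    -join $chars
}

# A common suffix is a common prefix of the reversed strings.
$reversed = $file2 | ForEach-Object { Get-ReversedString $_ }
$end = Get-CommonPrefixLength $reversed
$file2 | ForEach-Object { $_.Substring(0, $_.Length - $end) }
```

Calling Get-CommonPrefixLength on the original strings gives the start offset as well, so both ends could be trimmed with one Substring call per line, provided the prefix and suffix do not overlap.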

1
votes

This solution uses a different approach. IMHO this is the fastest way to process a file.

@echo off
setlocal EnableDelayedExpansion

if "%~1" equ "" echo Usage: %0 filename & goto :EOF
if not exist "%~1" echo File not found: "%~1" & goto :EOF

rem Read first two lines and get their base 0 lengths
( set /P "line1=" & set /P "line2=" ) < %1
call :StrLen0Var len1=line1
call :StrLen0Var len2=line2

rem Extract the largest *duplicate segment* from first two lines
set "maxDupSegLen=0"
for /L %%i in (0,1,%len1%) do (
   for /L %%j in (0,1,%len2%) do (
      if "!line1:~%%i,1!" equ "!line2:~%%j,1!" (
         rem New duplicate segment, get its length and keep the largest one
         set /A "maxLen=len1-%%i+1, maxLen2=len2-%%j+1"
         if !maxLen2! gtr !maxLen! set "maxLen=!maxLen2!"
         for /L %%l in (1,1,!maxLen!) do (
            if "!line1:~%%i,%%l!" equ "!line2:~%%j,%%l!" set "dupSegLen=%%l"
         )
         if !dupSegLen! geq !maxDupSegLen! (
            set /A "maxDupSegLen=dupSegLen, maxDupSegPos=%%i"
         )
      )
   )
)
set "dupSeg=!line1:~%maxDupSegPos%,%maxDupSegLen%!"

rem Process the file removing duplicate segments
for /F "delims=" %%a in (%1) do (
   set "line=%%a"
   echo !line:%dupSeg%=!
)

goto :EOF


rem Get the length base 0 of a variable

:StrLen0Var len= var
setlocal EnableDelayedExpansion
set "str=!%2!"
set "len=0"
for /L %%a in (12,-1,0) do (
   set /A "newLen=len+(1<<%%a)"
   for %%b in (!newLen!) do if "!str:~%%b,1!" neq "" set "len=%%b"
)
endlocal & set "%1=%len%"

input1.txt:

abcde11234452232131
abcde6176413190830
abcde6278647822786
abcde676122249819113

output:

11234452232131
6176413190830
6278647822786
676122249819113

input2.txt:

11234452232131xyz
6176413190830xyz
6278647822786xyz
676122249819113xyz

output:

11234452232131
6176413190830
6278647822786
676122249819113

"The rows have variable length and multiple duplicate parts may occur".

input3.txt:

abcde11234452232131
6176abcde4131908abcde30
6278647abcde822786
676122249819113abcde

output:

11234452232131
6176413190830
6278647822786
676122249819113
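
For comparison, the same idea (take the largest segment shared by the first two lines, then delete it wherever it occurs) might look like this in PowerShell. This is only a sketch under assumptions: Get-LongestCommonSubstring is a made-up helper name, it uses the usual dynamic-programming table rather than the nested for /L loops above, and, like the batch version, it assumes that the segment found in the first two lines is the one to remove from every line.

```powershell
# Hypothetical helper: longest substring that occurs in both strings
# (dynamic-programming table, kept one row at a time).
function Get-LongestCommonSubstring {
    param([string] $A, [string] $B)
    $prev = @(0) * ($B.Length + 1)
    $bestLen = 0
    $bestEnd = 0
    for ($i = 1; $i -le $A.Length; $i++) {
        $curr = @(0) * ($B.Length + 1)
        for ($j = 1; $j -le $B.Length; $j++) {
            if ($A[$i - 1] -ceq $B[$j - 1]) {
                # Extend the match that ended at the previous character pair.
                $curr[$j] = $prev[$j - 1] + 1
                if ($curr[$j] -gt $bestLen) { $bestLen = $curr[$j]; $bestEnd = $i }
            }
        }
        $prev = $curr
    }
    $A.Substring($bestEnd - $bestLen, $bestLen)
}

$lines = @(
    'abcde11234452232131'
    '6176abcde4131908abcde30'
    '6278647abcde822786'
    '676122249819113abcde'
)

$dupSeg = Get-LongestCommonSubstring $lines[0] $lines[1]
if ($dupSeg.Length -gt 0) {
    # String.Replace removes every occurrence of the segment in a line.
    $lines | ForEach-Object { $_.Replace($dupSeg, '') }
} else {
    $lines
}
```

If the variable parts of the first two lines happen to share a run longer than the real duplicate segment, that run would be picked instead, so the result is worth checking against the other lines.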