0
votes

I want to match two strings that differ only in tags, spaces and newlines.

$string1 = "perl is <match>scripting language</match>";
$string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";

Note: spaces, <TAG> and newlines can appear anywhere in $string2. A space may or may not be present in $string2; for example, in $string2 above the space between the words "scripting" and "language" is missing. We have to ignore spaces, tags and newlines while matching $string1 against $string2. The <match> tag in $string1 marks the data to be matched against $string2.

Output required:
The whole content of $string2, with the <match> tag added:
perl<TAG> is<TAG> <match>scr<TAG>ipt<TAG>inglanguage</match>

Code I tried:

while ($string1 =~ /<match>(.*?)<\/match>/gs)
{
   my $data_to_match = $1;
   $data_to_match = add_pat($data_to_match);

   $string2 =~ s{($data_to_match)}
   {
      "<match>$&<\/match>"
   }esi;
}

sub add_pat
{
   my ($data) = (@_);
   my @array = split//,$data;

   foreach my $each(@array)
   {
       $each = quotemeta $each;
       $each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';
   }

   $data = join '',@array;
   return $data;
}

Problem: because the space is missing in $string2, the pattern does not match. I tried making the space optional when appending the pattern to each character, but with the space optional the pattern match just keeps running.

In reality, I have a large string to match, and these spaces are causing the problem. Please suggest a solution.

2
Could you remove all the tags and spaces from the two strings and then just check if they are equal? s/</?.*?>//g; s/\s+//g; - hmatt1
@Matt I can't remove the tags because we want to retain them in the final output - vivek
@vivekpro In that case you can use Myforwik's answer and just copy the strings first. If the copies with everything removed match, then the original strings fulfill your requirement. - DeVadder
@DeVadder In Myforwik's answer all the tags are removed, but the tags cannot be removed; we want to retain them in the final output - vivek

2 Answers

1
votes

Use regular expressions to remove all the characters that you wish to ignore from both of the strings. Then compare the remaining values of the two strings.

So you will end up with both strings reduced to the same value, for example:

'perlisscriptinglanguage' and 'perlisscriptinglanguage'

If you want, you can also convert them to upper or lower case so that the comparison ignores case.

If they match then just return the original string 2.
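The strip-and-compare idea above can be sketched roughly as follows. This is only an illustration: the helper name normalize is hypothetical, and it assumes that anything between < and > counts as a tag to ignore. Because it works on copies, the original $string2 is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy with everything the question says to ignore removed.
sub normalize {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;   # assumption: anything in angle brackets is a tag
    $text =~ s/\s+//g;       # drop spaces and newlines
    return lc $text;         # optional: make the comparison case-insensitive
}

my $string1 = "perl is <match>scripting language</match>";
my $string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";

# Both copies reduce to 'perlisscriptinglanguage'; the originals are unchanged.
if (normalize($string1) eq normalize($string2)) {
    print "$string2\n";
}
```

Note that this only tells you whether the strings agree; it does not by itself say where to put the <match> tag in $string2.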

0
votes

I think it's weird that you are expected to "match" when $string2, with the tags taken out, doesn't match the original string.

Anyway, since your code is tolerant of additional spaces and tags in $string2, you can wipe all spaces (and tags, if applicable) from $string1.

I added $data_to_match =~ s/ +//; before your call to add_pat. That didn't quite work, because the line "$each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';" adds the (?:(<TAG>|\s)+)? pattern even before the first letter of the match from $string1. You actually have a lot of redundant TAG patterns: you add one to the front and one to the back of each letter. I don't know what quotemeta does, so I'm not sure how to fix the code there. I just added the line
$data_to_match =~ s/\Q(?:(<TAG>|\s)+)?\E//; after the call to add_pat to strip the first TAG pattern off the front of the pattern. Otherwise it matches too early and outputs 'perl<TAG> is<match><TAG> scr<TAG>ipt<TAG>inglanguage</match>'.

Really, you should only be putting one "(?:(<TAG>|\s)+)?" between each pair of letters of the $string1 match, and more importantly, you should not be putting "(?:(<TAG>|\s)+)?" before the first letter or after the last letter.
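A minimal sketch of that suggestion, assuming the only tag to skip is the literal <TAG>: the hypothetical build_pattern below stands in for add_pat. It drops whitespace from the text to match, escapes each character with quotemeta, and joins the characters with a single optional separator, so nothing is added before the first letter or after the last one. Using one separator per gap, instead of two adjacent optional groups, may also help with the pattern that "goes on running", though that has not been measured here.

```perl
use strict;
use warnings;

# Hypothetical replacement for add_pat: separators go between characters only.
sub build_pattern {
    my ($data) = @_;
    my @chars = grep { !/\s/ } split //, $data;   # ignore whitespace in the text to match
    # quotemeta escapes regex metacharacters so each character matches literally
    return join '(?:<TAG>|\s)*', map { quotemeta } @chars;
}

my $string1 = "perl is <match>scripting language</match>";
my $string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";

while ($string1 =~ m{<match>(.*?)</match>}gs) {
    my $pattern = build_pattern($1);
    # wrap the first occurrence in $string2, keeping its tags and spaces
    $string2 =~ s{($pattern)}{<match>$1</match>}i;
}

print "$string2\n";
# prints: perl<TAG> is<TAG> <match>scr<TAG>ipt<TAG>inglanguage</match>
```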