0
votes

Hello everyone and good day! I have the following question: I have a word list that consists of normal words as well as artificially generated words.

example:

Ford
09mKGmaePnCmjkxm
Opel
0AACyvG0FtRHAU7i
Audi
0AR6V7cCy2phgXcv
BMW
0bDOlBY5VGAe5Vai
Alfa-Romeo
Mercedes
Pegout-323
0BDTwSCCrCy4VgEc
0cmolI8g4CerXKaH
0dL2m36014PmOetH
0dqjCZU7ZeRuovFF
0ekelbAnWcGC1c7n
Lada 2109
Lada 2106
0ER4tS8jhESXuISp
0Gao8qHgbEyZ06Bh
0j1pjZBAW2avxU6Z
0j5zBVhdPDyaVoZL
Toyouta
0Jn0qoKdnM6neGdx
0KlzXttiw81AvU2C
0kXzuEtHxiWfECw7
mitsubisi
0l8qW9Uv0V1DZPei
0LJQxUNuEp42txme
jeep
0m8G1GUytcETbtWv
0MexVW3TQ2sRqLjr

I want to remove all artificially generated words from this list. I have converted these words to regex patterns and saved them in a new file, "generic.txt":

[0-9][0-9][a-z][A-Z][A-Z][a-z][a-z][a-z][A-Z][a-z][A-Z][a-z][a-z][a-z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][a-z][A-Z][0-9][A-Z][a-z][A-Z][A-Z][A-Z][A-Z][0-9][a-z]
[0-9][A-Z][A-Z][0-9][A-Z][0-9][a-z][A-Z][a-z][0-9][a-z][a-z][a-z][A-Z][a-z][a-z]
[0-9][a-z][A-Z][A-Z][a-z][A-Z][A-Z][0-9][A-Z][A-Z][A-Z][a-z][0-9][A-Z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][A-Z][a-z][A-Z][a-z][0-9][A-Z][a-z][A-Z][a-z]
[0-9][a-z][a-z][a-z][a-z][A-Z][0-9][a-z][0-9][A-Z][a-z][a-z][A-Z][A-Z][a-z][A-Z]
[0-9][a-z][A-Z][0-9][a-z][0-9][0-9][0-9][0-9][0-9][A-Z][a-z][A-Z][a-z][a-z][A-Z]
[0-9][a-z][a-z][a-z][A-Z][A-Z][A-Z][0-9][A-Z][a-z][A-Z][a-z][a-z][a-z][A-Z][A-Z]
[0-9][a-z][a-z][a-z][a-z][a-z][A-Z][a-z][A-Z][a-z][A-Z][A-Z][0-9][a-z][0-9][a-z]
[0-9][A-Z][A-Z][0-9][a-z][A-Z][0-9][a-z][a-z][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][a-z]
[0-9][A-Z][a-z][a-z][0-9][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][0-9][0-9][A-Z][a-z]
[0-9][a-z][0-9][a-z][a-z][A-Z][A-Z][A-Z][A-Z][0-9][a-z][a-z][a-z][A-Z][0-9][A-Z]
[0-9][a-z][0-9][a-z][A-Z][A-Z][a-z][a-z][A-Z][A-Z][a-z][a-z][A-Z][a-z][A-Z][A-Z]
[0-9][A-Z][a-z][0-9][a-z][a-z][A-Z][a-z][a-z][A-Z][0-9][a-z][a-z][A-Z][a-z][a-z]
[0-9][A-Z][a-z][a-z][A-Z][a-z][a-z][a-z][a-z][0-9][0-9][A-Z][a-z][A-Z][0-9][A-Z]
[0-9][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][A-Z][a-z][0-9]
[0-9][a-z][0-9][a-z][A-Z][0-9][A-Z][a-z][0-9][A-Z][0-9][A-Z][A-Z][A-Z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][a-z][A-Z][a-z][0-9][0-9][a-z][a-z][a-z][a-z]
[0-9][a-z][0-9][A-Z][0-9][A-Z][A-Z][a-z][a-z][a-z][A-Z][A-Z][a-z][a-z][A-Z][a-z]
[0-9][A-Z][a-z][a-z][A-Z][A-Z][0-9][A-Z][A-Z][0-9][a-z][A-Z][a-z][A-Z][a-z][a-z]

Now I want to delete from the word list "base.txt" all words that match these patterns. The generated words can also be longer than 16 characters! I use the following command:

LC_ALL=C grep -F -f generic.txt base.txt > test.txt

Unfortunately, I get no results, but also no error messages. What am I doing wrong? Basically, I want grep to check "base.txt" against every pattern in "generic.txt" and write the lines that do not match to a new file.

The following list should remain at the end:

Ford
Opel
Audi
BMW
Alfa-Romeo
Mercedes
Pegout-323
Lada 2109
Lada 2106
Toyouta
mitsubisi
jeep

TIA Sergio

2
Are the grossly misspelled car brands examples of machine-generated text, or the opposite? - tripleee
All the obviously machine-generated tokens begin with 0 and have a fixed length; isn't that enough to exclude them? grep -Ev '^0[A-Za-z0-9]{15}$' base.txt (or maybe remove the initial 0 and change the repetition count from 15 to 16 to capture all strings without spaces or punctuation which are exactly 16 characters long). - tripleee
I need a solution for strictly defined regex lines. Some lines also start with [a-z] or [A-Z], and some machine-generated text has more than 15 characters... :-( - Master-Lomaster
"grep -v -E -f generic.txt base.txt > new.txt" works! - Master-Lomaster

2 Answers

0
votes

The problem is the definition of a "word": why should Ford be a valid word while, e.g., F0rd is not? That said, for your given list, you could use

^[a-zA-Z]+(?:[- ]\w+)?$

See a demo on regex101.com.

Alternatively, to skip every 16-character line that starts with a digit and keep everything else:


^[0-9].{15}$(*SKIP)(*FAIL)|^.+

See another demo for this one on regex101.com.
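
Both patterns above use PCRE syntax ((?:...), \w, (*SKIP)(*FAIL)), which grep -E does not understand. The following is a rough sketch of how they might be translated to POSIX extended regular expressions for use with grep. The sample file is an abbreviated, recreated copy of the list from the question, and the temporary directory and variable names are only for illustration.

```shell
# Abbreviated sample of the word list from the question (illustration only).
tmp=$(mktemp -d)
printf '%s\n' Ford 09mKGmaePnCmjkxm Alfa-Romeo 'Lada 2109' 0AACyvG0FtRHAU7i jeep > "$tmp/base.txt"

# First pattern: keep lines that look like a word, optionally followed by a
# hyphen or space and a second token. (?:...) becomes a plain group and
# \w becomes [[:alnum:]_].
kept_words=$(LC_ALL=C grep -E '^[a-zA-Z]+([- ][[:alnum:]_]+)?$' "$tmp/base.txt")

# Second pattern: (*SKIP)(*FAIL) has no ERE counterpart, but inverting the
# match with -v has much the same effect: drop every 16-character line that
# starts with a digit and print the rest.
kept_inverse=$(LC_ALL=C grep -Ev '^[0-9].{15}$' "$tmp/base.txt")

printf '%s\n' "$kept_words"
```

On this sample both commands keep the same four lines. On real data they can differ, since the first is a whitelist of "word-like" lines and the second is a blacklist of one specific shape.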

0
votes

The immediate error is that the -F option disables regular expressions entirely, and requires the text to match the pattern literally. (So for example [0-9] matches the literal string [0-9] and no other strings.)
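
To make the difference concrete, here is a small sketch comparing -F and -E on a tiny, made-up sample. The file contents are hypothetical, and the pattern file holds a single shortened pattern instead of the full list from the question. It also shows -v, which the original command lacks and which is needed to print the lines that do not match.

```shell
# Hypothetical miniature versions of base.txt and generic.txt.
tmp=$(mktemp -d)
printf '%s\n' Ford 09mKGmaePnCmjkxm Opel '[0-9][0-9]' > "$tmp/base.txt"
printf '%s\n' '[0-9][0-9]' > "$tmp/generic.txt"

# -F treats each pattern as a fixed string: only the line that contains the
# literal text "[0-9][0-9]" matches.
fixed=$(LC_ALL=C grep -F -f "$tmp/generic.txt" "$tmp/base.txt")

# -E treats each pattern as a regular expression: any line that contains two
# consecutive digits matches.
regex=$(LC_ALL=C grep -E -f "$tmp/generic.txt" "$tmp/base.txt")

# -v inverts the match, so the lines matching no pattern are printed.
kept=$(LC_ALL=C grep -v -E -f "$tmp/generic.txt" "$tmp/base.txt")

printf 'fixed: %s\nregex: %s\n' "$fixed" "$regex"
```

Note that the patterns are not anchored with ^ and $, so with -E they match any line that contains such a sequence anywhere, including lines longer than the pattern itself.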

Probably a better approach entirely is to try to generalize this absurd list of patterns to a single pattern, or a very small list of patterns. How did you come up with this list?

For example

grep -E '^[A-Za-z0-9]{16}$' base.txt

seems to extract only the (apparent) generated patterns in your example.
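
Since the goal is to remove those lines, not extract them, the same pattern can be combined with -v. The sketch below also tries a {16,} variant for the "longer than 16 characters" case mentioned in the question. The 19-character token in the sample is made up for illustration, and the looser variant would also drop any genuine word made of 16 or more letters and digits, so treat it as a starting point only.

```shell
# Abbreviated sample; the 19-character token is a hypothetical longer one.
tmp=$(mktemp -d)
printf '%s\n' Ford 09mKGmaePnCmjkxm 'Lada 2109' 0AACyvG0FtRHAU7iXYZ jeep > "$tmp/base.txt"

# Drop lines that consist of exactly 16 letters or digits.
exact=$(LC_ALL=C grep -Ev '^[A-Za-z0-9]{16}$' "$tmp/base.txt")

# Drop lines that consist of 16 or more letters or digits.
longer=$(LC_ALL=C grep -Ev '^[A-Za-z0-9]{16,}$' "$tmp/base.txt")

printf '%s\n' "$longer"
```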