I have to find all initialisms (words in capital letters, such as SAP, JSON or XML) in my plain text files. Is there a ready-made script for this? Ruby, Python, Perl: the language doesn't matter. So far, I've found nothing.
Regards,
Stefan
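A simpler version of Conspicuous Compiler's answer uses the -n
flag to cut out all that ugly loop code. (With -p, Perl would print every input line unchanged, whether or not it matched.) This prints each match on its own line:
perl -n -e 'print "$1\n" while m/\b([[:upper:]]{2,})\b/g' input.txt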
Here's a Python 2.x solution that allows for digits (see example). Update: Code now works for Python 3.1, 3.0 and 2.1 to 2.6 inclusive.
dos-prompt>type find_acronyms.py
import re
try:
    set
except NameError:
    try:
        from sets import Set as set # Python 2.3
    except ImportError:
        class set: # Python 2.2 and earlier
            # VERY minimal implementation
            def __init__(self):
                self.d = {}
            def add(self, element):
                self.d[element] = None
            def __str__(self):
                return 'set(%s)' % self.d.keys()
word_regex = re.compile(r"\w{2,}", re.LOCALE)
# min length is 2 characters
def accumulate_acronyms(a_set, an_iterable):
    # updates a_set in situ
    for line in an_iterable:
        for word in word_regex.findall(line):
            if word.isupper() and "_" not in word:
                a_set.add(word)
test_data = """
A BB CCC _DD EE_ a bb ccc k9 K9 A1
It's a CHARLIE FOXTROT, said MAJ Major Major USAAF RETD.
FBI CIA MI5 MI6 SDECE OGPU NKVD KGB FSB
BB CCC # duplicates
_ABC_DEF_GHI_ 123 666 # no acronyms here
"""
result = set()
accumulate_acronyms(result, test_data.splitlines())
print(result)
dos-prompt>\python26\python find_acronyms.py
set(['CIA', 'OGPU', 'BB', 'RETD', 'CHARLIE', 'FSB',
'NKVD', 'A1', 'SDECE', 'KGB', 'MI6', 'USAAF', 'K9', 'MAJ',
'MI5', 'FBI', 'CCC', 'FOXTROT'])
# Above output has had newlines inserted for ease of reading.
# Output from 3.0 & 3.1 differs slightly in presentation.
# Output from 2.1 differs in item order.
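For anyone on a current Python 3, here is a minimal sketch of the same idea that reads the files named on the command line. It is not the script above, just one way it might look today: newer Python 3 releases reject re.LOCALE on str patterns, so this version drops that flag. The names ACRONYM_RE, find_acronyms and main are made up for illustration, and the rule "two or more ASCII capitals or digits, starting with a capital" is an assumption that is slightly stricter than the isupper() test above (it would skip something like 9K).

```python
import re
import sys
from collections import Counter

# Assumption: an acronym is a standalone run of 2+ ASCII capitals/digits
# that starts with a capital. "_" counts as a word character, so _DD and
# EE_ have no word boundary next to the underscore and are skipped.
ACRONYM_RE = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def find_acronyms(lines):
    """Return a Counter mapping each acronym to its number of occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(ACRONYM_RE.findall(line))
    return counts


def main(paths):
    # Accumulate counts over every file given on the command line.
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            counts.update(find_acronyms(handle))
    # One acronym per line, alphabetically, followed by its count.
    for word, n in sorted(counts.items()):
        print(f"{word}\t{n}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, find_acronyms3.py (a hypothetical file name), it would be run as: python find_acronyms3.py notes.txt report.txt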