Improving Accuracy for a Substitution Cipher using Frequency Analysis

Question

So, in my C++ code, I take a text file in ordinary English, analyzed the frequency of the letters in the English Alphabet through the file and stored them into a vector. After getting the frequencies, I replaced each of the letters starting with the most frequent of them with the most frequent of the English Alphabet. I used strings like "ETAOINSHRDLUCMFWYPVBGKJQXZ" and "EOTHASINRDLUYMWFGCBPKVJQXZ" to represent the most frequent of the Alphabet, and then I go through the most frequent of the text one by one (sorted Vector by greater than comparison) and replace each of them with the letters in the strings above. Ultimately, the accuracy of such a naive approach is dependent on the size of the file; I wanna see if I can make it more accurate while maintaining this approach. Like, after I run through the text again to substitute in the new letters, I get a new file that has the new (not real) words in them. Due to the accuracy of such an approach as follows

E 326 E
O 288 T
A 271 A
T 257 O
I 243 I
R 235 N
N 208 S
S 205 H
L 140 R
D 129 D
M 112 L
U 110 U
H 107 C
C 103 M
G 92 F
P 91 W
Y 73 Y
W 58 P
B 53 V
F 51 B
K 29 G
V 22 K
X 15 J
J 6 Q
Q 6 X
Z 1 Z

for a text of moderate length, I get a resulting text that has words like

REANSISF FTARH  from LEARNING GOALS
REANS YTU A CAHGERR VY LINAS RIWTKAMA from Learn You a Haskell by Miran Lipovaca

Notice how some words were pretty close. Like learn or you or by. Somewhere along those lines I can maybe "bruteforce" my way to replacing those spellings with the actual word. How then, could I improve the accuracy so its at least 50% close to the original text? I just need ideas for the time being. Whether it be implementing a dictionary to find common letter patterns or using maps as dictionaries in C++, any advice would be appreciated. Thanks.

exactly as shown above. The left side shows the frequency of letters in the text, the middle number is the the frequency amount, and the right side is the string given for the most frequency of the English alphabet. It was a then a simple substitution for the letters in the text to the new letters. — Kthieu

MSalters MSalters · Accepted Answer · 2015-03-10T23:12:59

What you're basically discovering is that a "partially good" solution already returns those words correctly when just the letters used in the word are correctly substituted. It doesn't matter to much if you have the Q and X mixed up, which is a real risk since they're both rare.

So, as a measure of proximity, you can use how many of the words in the attempted decryption occur in a dictionary. You'll find that the words that do occur have a much higher occurrence of some letters, and those letters are probably correct. Just trying all 12 orders of "EATO" will get you lots of words.

But how do you generate a few more hypotheses? Your first attempt gets you one order. You can generate reasonable variations by swapping pairs of letters that are almost as common. Start with the most common letters, as that gets you more words.

Improving Accuracy for a Substitution Cipher using Frequency Analysis

2 Answers