Things Have History
Al-Kindi's frequency analysis, or how counting letters broke every substitution cipher




The letter that appears most often in Arabic prose is alif. It turns up in the definite article, in verb endings, in the most common words; count the letters in any Arabic text longer than a paragraph and alif will be near the top. Around 850 CE, a scholar in Baghdad named Abu Yusuf al-Kindi realized this fact alone was enough to break almost any cipher then in use.

Al-Kindi — known as “the philosopher of the Arabs” — worked at the House of Wisdom, the great Abbasid research and translation library in Baghdad, under Caliphs al-Ma’mun and al-Mu’tasim. He wrote over 290 works covering medicine, astronomy, mathematics, music, and philosophy. The treatise he produced on cryptanalysis, Risalah fi Istikhraj al-Mu’amma (“A Manuscript on Deciphering Cryptographic Messages”), is the oldest surviving work of systematic codebreaking — predating any other by at least three centuries (Wikipedia).

His method was elegant precisely because it was statistical. A substitution cipher works by replacing each letter with a symbol or a different letter: A becomes Q, B becomes X, and so on. For centuries this seemed impenetrable, because seeing QNXJ tells you nothing obvious. But al-Kindi spotted the flaw. The substitution scrambles which symbol represents which letter — but it cannot scramble how often each symbol appears. Whatever symbol stands for alif will appear roughly as often as alif does in ordinary Arabic text. Take a long enough ciphertext, count the symbol frequencies, compare them against the known letter frequencies of Arabic prose, and the cipher dissolves into arithmetic (Muslim Heritage).
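To see why the counts survive encryption, here is a minimal Python sketch. It uses the English alphabet rather than Arabic, and the helper names (make_key, encrypt, frequency_profile) and the sample sentence are invented for illustration; none of them come from al-Kindi's treatise.

```python
import random
import string
from collections import Counter

def make_key(seed=0):
    """Build a random substitution key: each letter maps to exactly one letter."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))

def encrypt(plaintext, key):
    """Swap every letter for its substitute; spaces and punctuation pass through."""
    return "".join(key.get(ch, ch) for ch in plaintext.lower())

def frequency_profile(text):
    """Letter counts sorted high to low, ignoring which letter owns each count."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return sorted(counts.values(), reverse=True)

# Hypothetical sample message, chosen only for the demonstration.
plaintext = "the substitution hides the letters but it cannot hide how often they appear"
key = make_key()
ciphertext = encrypt(plaintext, key)

print(ciphertext)
# The labels have changed, but the shape of the counts has not.
print(frequency_profile(plaintext) == frequency_profile(ciphertext))
```

Whatever key the shuffle produces, the final comparison prints True, because a one-to-one swap only relabels the counts. That unchanged profile is the opening al-Kindi exploited.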

His instructions in the treatise are almost shockingly plain: “One way to solve an encrypted message, if we know its language, is to find a different plaintext of the same language long enough to fill one sheet or so, and then we count the occurrences of each letter.” That’s it. Count two documents, match the patterns, read the message. The intensive linguistic study of the Quran had given Arab scholars unusually precise knowledge of Arabic letter frequencies — which made frequency analysis not just possible but practically routine in 9th-century Baghdad, long before it occurred to anyone in Europe (Simon Singh, Arab Code Breakers).
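The two-count procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of al-Kindi's own working: it uses English instead of Arabic, a simple three-letter shift stands in for the unknown cipher, and the function names and both sample texts are hypothetical.

```python
import string
from collections import Counter

def ranked_letters(text):
    """Letters of `text`, ordered from most to least frequent."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [letter for letter, _ in counts.most_common()]

def first_guess_key(ciphertext, reference_text):
    """Pair the k-th most common cipher symbol with the k-th most common reference letter."""
    return dict(zip(ranked_letters(ciphertext), ranked_letters(reference_text)))

def apply_key(ciphertext, guess):
    """Replace each cipher symbol with its guessed letter; leave anything unmatched as-is."""
    return "".join(guess.get(ch, ch) for ch in ciphertext.lower())

def caesar(text, shift):
    """Stand-in cipher for the demo: shift every letter a fixed distance down the alphabet."""
    return "".join(
        chr((ord(ch) - 97 + shift) % 26 + 97) if ch in string.ascii_lowercase else ch
        for ch in text.lower()
    )

# Hypothetical reference sheet: ordinary prose in the same language as the message.
reference_text = (
    "a long sample of ordinary prose in the same language as the hidden message "
    "gives the analyst a table of how often each letter normally appears"
)
# Hypothetical intercepted message, encrypted here so the example is self-contained.
ciphertext = caesar("meet me at the eastern gate before the evening prayer", 3)

guess = first_guess_key(ciphertext, reference_text)
print(ciphertext)
print(apply_key(ciphertext, guess))
```

On texts this short the first guess will usually be only partly right, since rare letters tie or swap ranks. With a full sheet of reference text and a longer ciphertext the match tightens, and the remaining gaps can be filled in by reading the half-decoded words.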

The manuscript then disappeared for over a millennium. It resurfaced when Professor Mohammed Mrayati, an engineer working with the United Nations in Lebanon, located it in the Sulaimaniyyah Ottoman Archive in Istanbul, where it had sat unrecognized in the collection for centuries. The Arab Academy of Damascus published it in 1987. The founding text of a technique that had arguably decided the fate of military secrets for over a thousand years had been sitting, anonymous, in a Turkish library.

What al-Kindi unlocked was not just a method for reading other people’s mail. He had established that language carries statistical structure, and that structure cannot be fully concealed by substitution alone. Every cipher built on single-letter swaps — the Caesar cipher, centuries of diplomatic correspondence, royal intrigues from Baghdad to London — became vulnerable the moment someone learned to count. The same principle, scaled up by machines and higher mathematics, would eventually crack Enigma at Bletchley Park.

The House of Wisdom burned in the Mongol sack of Baghdad in 1258, and most of what al-Kindi wrote burned with it. Among the few works that survived happened to be the one that mattered most to cryptography.

