Capstone project for the GoDataDriven Accelerator Program.
This concludes a year of training in Data Science and Engineering capabilities.
This repository contains an application that deciphers text in real time.
## How to run
Note: it can take some time for the image to build (downloading wordlists) and to start running (pre-loading wordlists).
Pull from Docker Hub: the image is automatically built and pushed to Docker Hub as `nielsdenissen/decipher-capstone`:
docker pull nielsdenissen/decipher-capstone
docker run -p 8080:8080 nielsdenissen/decipher-capstone
Build locally: Docker must be installed to run this project. Once installed, the following commands build the app and run it:
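The local build commands themselves appear to be missing here; a typical invocation would look like the following (assuming the Dockerfile sits at the repository root; the local tag name is arbitrary):

```shell
# Build the image from the repo root and tag it locally
docker build -t decipher-capstone .
# Run it, exposing the app's port 8080 on the host
docker run -p 8080:8080 decipher-capstone
```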
Finally, there are 2 ways to access the application: via a webpage or using an API.
You can visit a small frontend on http://localhost:8080/
There's also an API endpoint available, which takes the following arguments:
- text: the text you'd like deciphered
- language: the language code this text is written in (English=en, Dutch=nl, German=de, etc.)

You can also use `detect` as the language, causing the program to try to detect the language itself (this is a bit slower though).
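As a sketch of calling the API, assuming the endpoint lives at the root path and takes its arguments as query parameters (the exact path is not stated here, so treat it as a placeholder):

```python
from urllib.parse import urlencode

# Hypothetical request URL; "text" and "language" are the documented arguments.
params = {"text": "ifmmp xpsme", "language": "en"}
url = "http://localhost:8080/?" + urlencode(params)
print(url)  # http://localhost:8080/?text=ifmmp+xpsme&language=en
```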
## How does it work

This section roughly explains how the engineering and science behind this application work.
The application runs in a docker container. During build this container downloads the wordlists for the 3 most
common languages used in our example: English, Dutch and German. The application will download others as needed.
The application runs a Tornado app serving an API and HTML pages on port 8080.
The science side of things involves the wordlists downloaded for each language.
First we'll need to index the words we've downloaded so they can be searched more easily.
These words are indexed using 3 characteristics of a word:
- Length of the word
- Amount of different letters
- Location of duplicate letters
Example: the word `logging` would result in the key `(7, 5, [(2, 3, 6)])`.
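A minimal sketch of computing such an index key (the actual representation in the repo may differ; positions here are 0-based, which matches the `logging` example above):

```python
from collections import defaultdict

def word_key(word):
    """Index key for a word: (length, number of distinct letters,
    positions of every letter that occurs more than once)."""
    positions = defaultdict(list)
    for i, ch in enumerate(word):
        positions[ch].append(i)
    duplicates = [tuple(p) for p in positions.values() if len(p) > 1]
    return (len(word), len(positions), duplicates)

print(word_key("logging"))  # (7, 5, [(2, 3, 6)])
```

Words sharing a key are candidate decodings of each other, which is what makes the search below feasible.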
- First, a check is done whether the words already belong to the language, meaning no cipher was used.
- Otherwise we try to decipher the text. This is done by walking through the letters one by one and
trying to decode them. We do this in a smart way, starting with the letters we know most about. The philosophy behind this
is that letters are distributed over the encoded words and these words have a limited set of possible decodings.
Picking a letter that occurs in a lot of words narrows the possibilities. We start with the
most promising letter:
- We order the set of encoded words by the number of possible decodings they have, from least to most.
- We pick a number of encoded words to start decoding with. We start with a large set size, since this
means higher accuracy (when decoding a letter, the more words you have as a reference the better).
- Next we walk through all characters and determine a score based on the number of possible decodings of the words
this letter occurs in. This gives us a sense of the complexity of decoding a letter
(no. possibilities word 1 × no. possibilities word 2 × ...). Letters with the lowest complexity are easier to work out,
as they have only a few options. So we start with those, decreasing the set size whenever the complexity gets too high.
- This gives us the currently most promising letter along with the set of words it is contained in. Finally we
try to decode the letter by decoding the word set that contains this character:
- Recursively walk through all encoded words with their possible decodings, leading to intermediate ciphers.
- For all possibilities, keep track of score (number of correctly translated words).
- Choose a branch (combinations of decodings for encoded words) that resulted in highest score.
- Return the cipher that complies with this branch (a list of words with their decodings).
- Take the decoding of the letter we were looking for and add it to the cipher found so far.
- Continue this process (the cipher found so far heavily reduces the remaining possibilities),
lowering the minimum set size as needed (when all letters have too many possibilities).
Finally, the found cipher and the deciphered text are returned.
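The scoring and branch-search steps above can be sketched as follows. None of these function names are taken from the repo; this is a simplified illustration of the idea, with the cipher represented as a letter-to-letter dict and `candidates` mapping each encoded word to its possible decodings:

```python
from math import prod

def letter_complexity(letter, encoded_words, possibilities):
    """Score for decoding `letter`: the product of the number of candidate
    decodings of every encoded word that contains it (lower = easier)."""
    return prod(possibilities[w] for w in encoded_words if letter in w)

def consistent_extension(cipher, encoded, plain):
    """Extend the letter cipher with encoded->plain; return None on conflict."""
    if len(encoded) != len(plain):
        return None
    new = dict(cipher)
    used = set(new.values())
    for e, p in zip(encoded, plain):
        if e in new:
            if new[e] != p:
                return None
        elif p in used:
            return None  # two encoded letters would decode to the same letter
        else:
            new[e] = p
            used.add(p)
    return new

def best_branch(words, candidates, cipher=None):
    """Recursively try the candidate decodings of each encoded word and keep
    the branch that scores the most correctly translated words.
    Returns (score, cipher)."""
    cipher = dict(cipher or {})
    if not words:
        return 0, cipher
    head, rest = words[0], words[1:]
    best = best_branch(rest, candidates, cipher)  # branch: head stays untranslated
    for plain in candidates.get(head, []):
        extended = consistent_extension(cipher, head, plain)
        if extended is None:
            continue  # decoding conflicts with the cipher built so far
        score, sub = best_branch(rest, candidates, extended)
        if score + 1 > best[0]:
            best = (score + 1, sub)
    return best
```

For example, with ROT13 input `best_branch(["uryyb", "jbeyq"], {"uryyb": ["hello"], "jbeyq": ["world", "xxxxx"]})` scores 2, since only `world` is consistent with the mapping implied by `hello`. The real implementation additionally prunes by `letter_complexity` and uses the wordlist index, but the branch-and-score structure is the same.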
## Accuracy vs Speed
The algorithm can be adjusted in terms of the maximum complexity it allows when searching for solutions.
A higher complexity will search more elaborately, and therefore often finds better solutions, but takes significantly longer.
This can be adjusted via the `max_complexity` argument of `solve` from the `Solver` class.
The solution should be able to handle any language that uses regular ASCII characters. The language should be given via a language code.
The program will try to download the wordlist for the language itself; however, caching these will improve performance. For this purpose you
could run the `download_wordlist` script yourself upfront in the `data/` folder (example for Dutch):
sh download_wordlist.sh nl ./
The first argument is the language code for which to download the wordlist; the second is the target directory.