Have you played WORDLE yet? It’s a clever word game that’s basically Mastermind with 5-letter words.
Each day a new 5-letter word is chosen and you have 6 tries to try to guess it using 5-letter words. The colors of the tiles will change to show how close your guess was to the word.
My friends and I have been playing it for a few days and I’ve been doing pretty good, averaging around 3 or 4 guesses per word. We take screenshots of our guesses each day and compare them.
Naturally the question has come up of what the best starting word is in WORDLE? Which word is most likely to set you up for success?
In the Wheel of Fortune bonus round they start the contestants off with certain letters - RSTLNE. I always assumed this is because these were the most common letters in the English language. If you use those letters, you can form a five letter word that has unique letters in that set - RENTS.
But it turns out that in terms of letter frequency, these aren’t the most common, at least according to an analysis of all the words in the Concise Oxford Dictionary.
Another strategy would be to pick a 5-letter word that contains the most vowels, like ADIEU. I’ve had mixed results with this starting word, sometimes it’s great, sometimes it’s not so helpful.
I think we can do better. If you View Source on the WORDLE page (possibly a crime in some states), you’ll see that the HTML is pretty barebones and mostly just links to a single Javascript file.
We can run this through js-beautify like this:
js-beautify main.js > main-beautified.js
which results in this somewhat easier to read source code.
About halfway down the page, we find two giant arrays:
var Ls = ["cigar", "rebut", "sissy", ...]
As = ["aahed", "aalii", "aargh", ...]
Searching for how these two arrays of 5-letter words are used, it becomes clear that the first array, Ls, is a list of solutions. Each day a different solution is chosen from this array. The way that each day’s solution is chosen is simple. A starting date is specified: June 19, 2021. This corresponds to the first entry in the array, “cigar”. It takes the current date and determines how many days exist between the starting date and today’s date. For example, today is November 24, 2021, so 158 days exist between those two dates. The entry at the 158 index is “retch”, so today’s solution is “retch”. If more days have passed since the starting date than solutions exist, it will start to loop around again.
The second array, As, is a list of 5-letter words that the game thinks are valid, but will never be the solution for the day.
We can then extract these words and put them in a JSON file, with two keys, solutions
and herrings
.
Now that we have the data set, we can easily calculate the letter frequencies for the actual solutions:
var words = require('./words.json');
var solutions = words.solutions;
var letterExistsInWord = function(letter, word) {
return word.indexOf(letter) > -1;
}
var letterCounts = {};
// 97 is the ASCII character code for lower case a, this simply loops over every letter in the alphabet
// https://www.asciitable.com/
for (var charCode = 97; charCode < 123; charCode++) {
var letter = String.fromCharCode(charCode);
letterCounts[letter] = 0;
for (var solution of solutions) {
if (letterExistsInWord(letter, solution)) {
letterCounts[letter]++;
}
}
}
const letterFrequency = Object.fromEntries(
Object.entries(letterCounts).sort(([, a], [, b]) => b - a)
);
console.log(letterFrequency);
Note: Github Copilot generated a disturbing amount of this code with little prodding.
This results in:
{
e: 1056,
a: 909,
r: 837,
o: 673,
t: 667,
l: 648,
i: 647,
s: 618,
...
}
So in this game, RSTLNE are not the most common letters. Instead it’s more like EAROTL. Is LATER a good starting word?
Let’s approach it from another angle. If we loop over every 5-letter word that the game knows about and compared it against each solution, calculating the number of “in the word, right spot” and “in the word, wrong spot” results, this would tell us which of the words, on average, produces the most successful result.
var words = require('./words.json');
var solutions = words.solutions;
var herrings = words.herrings;
String.prototype.replaceAt = function (index, replacement) {
return this.substr(0, index) + replacement + this.substr(index + replacement.length);
}
var processGuess = function (guess, solution) {
var correctSpot = 0;
var wrongSpot = 0;
// deep copy
var usedSolution = (' ' + solution).slice(1);
for (var i = 0; i < guess.length; i++) {
if (guess[i] === solution[i]) {
correctSpot++;
usedSolution = usedSolution.replaceAt(i, '0');
} else {
for (var j = 0; j < solution.length; j++) {
if (i === j) {
continue;
}
if (usedSolution[j] !== '0' && guess[i] === solution[j]) {
usedSolution = usedSolution.replaceAt(j, '0');
wrongSpot++;
break;
}
}
}
}
return [correctSpot, wrongSpot];
};
var wordResults = {};
for (var solution of solutions) {
for (var guess of solutions) {
var results = processGuess(guess, solution);
if (wordResults[guess]) {
wordResults[guess][0] += results[0];
wordResults[guess][1] += results[1];
} else {
wordResults[guess] = results;
}
}
for (var guess of herrings) {
var results = processGuess(guess, solution);
if (wordResults[guess]) {
wordResults[guess][0] += results[0];
wordResults[guess][1] += results[1];
} else {
wordResults[guess] = results;
}
}
}
Determining how many wrong spots exist is a little tricky - let’s say that the guess is “array” and the solution is “ropes”. When determining if each letter in the guess exists but is in the wrong spot, the correct spot has to be marked as used. So when processing the second “r” in the guess, it should not count as a wrong spot match, because the “r” in ropes was used as a wrong spot for the first “r” in array. The way that I’m maintaining this is by marking the letter position with the character “0” in the solution to signify it has been used.
Running this code will calculate the right spot and wrong spot results for all the possible 5-letter words and store them in wordResults
, but it doesn’t sort it.
Now we have a bit of ambiguity of what we consider to be the “best” result, considering there are two sets of results for each word.
If we think the most right spots is the best, we can sort by those first and if they’re the same, sort by wrong spots:
const wordResultsFrequency = Object.fromEntries(
Object.entries(wordResults).sort(([, a], [, b]) => (b[0] - a[0] === 0) ? b[1] - a[1] : b[0] - a[0] )
);
console.log(wordResultsFrequency);
This results in:
{
saree: [ 1575, 2366 ],
sooey: [ 1571, 1459 ],
soree: [ 1550, 2155 ],
saine: [ 1542, 2238 ],
soare: [ 1528, 2565 ],
...
But maybe maximizing the most right spots isn’t the most important thing. Let’s say that they’re twice as important as wrong spots:
const wordResultsFrequency = Object.fromEntries(
Object.entries(wordResults).sort(([, a], [, b]) => (b[0]*2 + b[1]) - (a[0]*2 + a[1]))
);
console.log(wordResultsFrequency);
This instead results in:
{
soare: [ 1528, 2565 ],
saree: [ 1575, 2366 ],
seare: [ 1491, 2450 ],
stare: [ 1326, 2761 ],
roate: [ 1254, 2888 ],
...
So under this arbitrary definition of best, the best starting word in WORDLE is SOARE, which is an obsolete word that means “a young hawk”.