In the last week, I have started playing the online word game Wordle by Josh Wardle. I was lured in after getting curious about some strange Twitter status updates that showed rows of green, grey and yellow blocks. It turns out it’s a fun game, too.
The basic idea is to try to guess a five-letter word, and you get six guesses. Each day there is a new word, and everyone gets to guess the same one. After each guess (which must be an actual word), you get some information on how close the guess was because the letters in a guess are shown as green (correct letter in correct position), yellow (correct letter in incorrect position) or grey (incorrect letter). After you’ve finished guessing the word, you can share a status update that shows how well you went, in a way that doesn’t give away any information about the word. That’s what I was seeing on Twitter.
I’ve done it four times now, and a natural question is what word should be the first guess. At that point in time, there is no information about the daily word, so it makes sense to me that the first guess should be the same each day. However, what is the best word to use for that first guess?
The conclusion I’ve reached is that the best word should have five different letters which, together, are the five letters most likely to appear in a word, i.e. maximise the chance of getting yellows. Additionally, each of those letters should ideally sit in the position where it is most likely to match, i.e. maximise the chance of getting greens.
To figure this out properly, I would need to know the word list being used by Wordle, which unfortunately I don’t. In fact, there may be two word lists: the word list used to allow guesses, and the word list used to pick the daily word. So, I’ll make a big assumption and use the Collins Scrabble Words from July 2019.
My tool of choice is going to be zsh on my MacBook Air. It doesn’t require anything sophisticated. Also, I’ve removed any extra headers from my word list, and run it through dos2unix to ensure proper end-of-line treatment.
First job is to extract just the 5 letter words:
% grep '^.....$' words.txt > words5.txt
%
Now we need to figure out how many words each letter of the alphabet appears in:
% for letter in {A..Z}
for> do
for> echo $letter:`grep -c -i $letter words5.txt`
for> done | sort -t : -k 2 -n -r | head -n 10
S:5936
E:5705
A:5330
O:3911
R:3909
I:3589
L:3114
T:3033
N:2787
U:2436
%
That wasn’t very efficient, but it doesn’t need to be. We have our answer – the most popular letters are S, E, A, O and R. Putting these letters into a free, online anagram tool, it turns out that there are three words made up from these letters: AEROS, AROSE and SOARE.
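The loop above runs grep 26 times over the file. As a rough sketch of a single-pass alternative, the same per-letter word counts can be gathered with awk. The function name `letter_counts` is only an illustrative choice, and the sketch assumes one word per line.

```shell
# letter_counts FILE: print "LETTER:COUNT" lines, where COUNT is the number of
# words (lines) in FILE containing that letter at least once, most common first.
# The name letter_counts is illustrative, not part of any existing tool.
letter_counts() {
  awk '
    {
      word = toupper($0)
      for (i = 1; i <= 26; i++) {
        letter = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", i, 1)
        # index() is non-zero when the letter occurs, so a word with a
        # repeated letter is still only counted once for that letter
        if (index(word, letter) > 0) count[letter]++
      }
    }
    END { for (letter in count) print letter ":" count[letter] }
  ' "$1" | sort -t : -k 2 -n -r
}
```

Something like `letter_counts words5.txt | head -n 10` should then give the same top ten as the loop.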
Okay, so while only one of these is a word that you’d actually use, it turns out that Wordle accepts them all. It looks like Wordle might use the Scrabble word list for its guesses.
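The anagram lookup does not strictly need an online tool. As a minimal sketch, assuming the letters are all different (as they are here), the word list itself can be filtered with grep. The function name `anagrams_distinct` is made up for this example.

```shell
# anagrams_distinct LETTERS FILE: print words in FILE that use each of LETTERS
# exactly once. Only valid when LETTERS contains no repeated letter.
# The name anagrams_distinct is illustrative.
anagrams_distinct() {
  letters=$1
  file=$2
  # Keep words made only of the given letters, with the same length
  result=$(grep -i "^[$letters]\{${#letters}\}\$" "$file")
  # Then require every letter to be present; with distinct letters and a
  # fixed length, that leaves only the anagrams
  for letter in $(printf '%s' "$letters" | fold -w 1); do
    result=$(printf '%s\n' "$result" | grep -i "$letter")
  done
  if [ -n "$result" ]; then
    printf '%s\n' "$result"
  fi
}
```

Running `anagrams_distinct SEAOR words5.txt` should list the same three words, provided they are in the word list.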
In any case, this looks like a pretty good set of letters, as the words in the word list are highly likely to have at least one of these letters:
% grep -c . words5.txt
12972
% grep -c -i -e A -e R -e O -e S -e E words5.txt
12395
%
Of the 12,972 words in the word list, 12,395 (96%) will have at least one letter match!
The next job is to figure out which of these three words is most likely to have letters in the same position as other words in the word list.
% grep -c -e A.... -e .E... -e ..R.. -e ...O. -e ....S words5.txt
6578
% grep -c -e A.... -e .R... -e ..O.. -e ...S. -e ....E words5.txt
3742
% grep -c -e S.... -e .O... -e ..A.. -e ...R. -e ....E words5.txt
5726
%
We have a winner! A letter in AEROS is in the right position for 6,578 words (51%).
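The three greps above can be generalised so that any candidate word can be scored the same way. This is only a sketch: the function name `position_matches` is an assumption of mine, and unlike the greps it ignores case.

```shell
# position_matches WORD FILE: count the words in FILE that share at least one
# letter in the same position as WORD (a green in Wordle terms).
# The name position_matches is illustrative.
position_matches() {
  awk -v guess="$1" '
    BEGIN { guess = toupper(guess) }
    {
      word = toupper($0)
      for (i = 1; i <= length(guess); i++)
        if (substr(word, i, 1) == substr(guess, i, 1)) {
          hits++      # count the word once, however many positions match
          break
        }
    }
    END { print hits + 0 }
  ' "$2"
}
```

For example, `for w in AEROS AROSE SOARE; do echo "$w:$(position_matches $w words5.txt)"; done` should print one count per word, comparable to the grep results above.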
So, it looks like using AEROS as your first guess in Wordle is a pretty good choice. Just don’t tell anyone that’s what you’re doing; otherwise, if you share the standard Wordle status update, it will actually contain spoilers.
Hi Andrew,
As you’ve seen from me under different cover, we have a precise list of the 2,315 words used in the puzzle as solutions, allowing a precise ordering of the letter frequency distribution, which runs as follows:
EAROT LISNC UYDHP MGBFK WVZ XQJ
In terms of matches to the word list, owing to letter duplications (a five-letter word can have two or three letters the same) this re-orders slightly:
EAROT LISNU CYHDP GMBFK WVX ZQJ
There are several good combinations of letters, and surprisingly the one using the first five letters EAROT (oater, orate, roate) isn’t actually the best, although it makes 2,120 partial matches to the 2,315 words of the wordlist. The letter mix of EAROS (aeros, arose, soare) matches 2,132 words, and the mix of EARIS (aesir, arise, raise, serai) matches 2,147 words.
As far as exact letter matches go, the word soare has exact matches with 1,166 words, or over half of the wordlist. This is because S is the most frequent first letter found in English words by a considerable margin; and E is very frequently a final letter.
First letter frequencies:
SCBTP AFGDM RLWEH VONIU QKJ YZ (no X words!)
Last letter frequencies:
EYTRL HNDKA OPMGS CFWIB XZU (no words ending in J, Q, or V)
I haven’t looked at common digraphs except for adjacent double letters, which in order of most frequent to least are:
LEOST FRNDP MBGZC IV (and no double letters for any others)
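For anyone wanting to reproduce orderings like the first-letter and last-letter ones above, here is a rough sketch using cut, sort and uniq. The function name `position_ranking` and the file name `answers.txt` are hypothetical, and the result depends entirely on which word list is supplied.

```shell
# position_ranking N FILE: print the letters found at position N of the words
# in FILE as one string, most frequent first (ties in alphabetical order).
# The name position_ranking is illustrative.
position_ranking() {
  cut -c "$1" "$2" |
    tr '[:lower:]' '[:upper:]' |
    sort | uniq -c |                 # one "COUNT LETTER" line per letter
    sort -k 1,1nr -k 2 |             # highest count first
    awk '{ printf "%s", $2 } END { print "" }'
}
```

With the solution list saved as `answers.txt`, `position_ranking 1 answers.txt` would give the first-letter ordering and `position_ranking 5 answers.txt` the last-letter ordering.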
At any rate, until someone does a much more exhaustive comparison of possible guessing words (a supplementary list of 10,657 words can be legitimately entered as guesses), soare would have strong claims as the most useful first word for Wordle.
Given your point that S is so rarely the last letter in the Wordle answer word list, I’ve moved from AEROS to SOARE as my standard first guess. However, I’m now thinking that I should have a standard first and second guess. It’s not in the spirit of “hard mode”, but there’s a trade-off between the strategies of finishing the puzzle as quickly as possible (minimise average number of guesses) and finishing the puzzle as reliably as possible (maximise chance of guessing answer in six attempts). Having now failed to guess the answer one day, I’d rather follow the latter strategy than the former.
Well, here we are a month later; the New York Times has bought Wordle for their Games website, and already they’ve tinkered with the wordlists slightly, making at least one common word unplayable in its traditional UK English spelling – without disclosing which one it is, it’s a word ending in ‘-bre’ which must instead be played as ‘-ber’. Twenty-five of the nearly thirteen thousand words were removed (for the most part these were no great loss).
What new things have we learned? I’ve refined some of those greps a little recently, having read the flaming manual. egrep is nice in that, provided you’re not using shell variables, you can combine multiple searches in one pattern like so (here we output the number of words with exact matches, followed by the number of words with at least a partial match):
egrep -ci 'S....|.O...|..A..|...R.|....E' words5.txt; egrep -ci 'S|O|A|R|E' words5.txt
Another nice thing to do with follow-up guesses is to see how many words have not been matched at all by the words you’ve used, which can be implemented neatly with the -v flag which inverts the lines selected by the matched pattern:
egrep -cvi 'S|O|A|R|E|C|L|I|N|T' words5.txt
CLINT as a follow-up word for SOARE seems to be the strongest contender, though some other two-word combinations are quite nice: CRUET/SHALY; SLATE/COURD; STARE/COLIN. Why are the names ‘Clint’ and ‘Colin’ available as search words? Best not to ask!
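To compare two-word combinations like these without retyping the pattern each time, the inverted count can be wrapped in a small function. This is a sketch only, and `unmatched_count` is a name invented for the example.

```shell
# unmatched_count FILE GUESS...: count the words in FILE that share no letter
# with any of the GUESS words. The name unmatched_count is illustrative.
unmatched_count() {
  file=$1
  shift
  # Join all guesses into one run of letters for a bracket expression
  letters=$(printf '%s' "$@" | tr -cd '[:alpha:]')
  # -v inverts the match, -c counts lines, -i ignores case
  grep -cvi "[$letters]" "$file"
}
```

For example, `unmatched_count words5.txt SOARE CLINT` and `unmatched_count words5.txt CRUET SHALY` can be run side by side; the lower the number, the fewer words are left completely untouched by the pair.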
There’s now been enough time for mathematical and computational research into the game to have been conducted.
With a brute-force method, each letter of a guess has three possible states (unmatched, matched but incorrectly located, and matched in the right place), so each guess partitions the word space into 3⁵–5 = 238 sub-groups (excluding 5 impossible states consisting of four exact matches plus a fifth misplaced letter). That makes the number of ways Wordle can be played very large: equal to 12,972ⁿ × (3⁵–5)ⁿ⁻¹, where n ≤ 6 is the number of guesses. At n = 4, 5, or 6, it might be thought that finding optimal solutions is far too computationally expensive.
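To get a feel for how quickly that formula grows, it can be evaluated with awk. This sketch prints a floating-point approximation rather than an exact integer, and the function name `game_count` is just an illustrative choice.

```shell
# game_count N: approximate value of 12972^N * (3^5 - 5)^(N-1), the number of
# ways N guesses can play out under the formula above.
# The name game_count is illustrative; output is rounded scientific notation.
game_count() {
  awk -v n="$1" 'BEGIN { printf "%.3e\n", 12972 ^ n * (3 ^ 5 - 5) ^ (n - 1) }'
}
```

Calling `game_count 1` simply gives the size of the guess list, and each further guess multiplies the total by 12,972 × 238, i.e. roughly three million.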
In fact, a lot of short-cutting has been possible to make the analysis tractable. Notably, in normal mode the game can be brute-forced for all possible 12,900+ words in six moves, but not five. In terms of the smaller subset of words that form the actual game solutions in Wordle, the fastest decision tree can reach all words in an average of 3.4212 guesses, but with a worst case of taking 5 moves; the best decision tree that can reach every word in 5 guesses in hard mode does so with an average of 3.5085 guesses. There is no decision tree that can go faster than 3.4212, or with 4 guesses as the worst case. A nice summary of some of this is found at: https://www.poirrier.ca/notes/wordle-optimal/
It’s pretty cool that people have found some optimal decision trees for Wordle, but I don’t see how that would be helpful in practice for most people doing their daily Wordle. It would be interesting to see work on some simple strategies people could use, that require remembering less than (say) 10 words.
I also find it interesting that if a larger dictionary is used (i.e. remove the assumption that the dictionary of Wordle answers is known), there is no optimal solution to solve it in 6 guesses in hard mode. There is probably another approach, where a dictionary of Wordle answers is generated from a wider set but according to constraints that match the existing Wordle answer set, e.g. commonly used words, no plurals, no rude words, etc. This would test whether strategies are robust to changes that NYT might make in future.
It also suggests to me that hard mode is too hard, if it is provably unsolvable (with a wider dictionary). Still, I find hard mode offers a good challenge when completing the game. Having a variant of hard mode that adds challenge but doesn’t make the game unsolvable would be good. Personally, I sometimes use a variant where I don’t follow hard mode on guess #4 only.