How to Find a Short Username

There are 11,881,376 five-letter combinations and I wanted one of them as my username.

Why five? Well, all of the two-, three- and four-letter ones are gone. Even with five, there are probably no combinations left that are also words in a dictionary. So finding a username that is at least (semi-)pronounceable is as good as it gets.

The goal of this post is to document how I overcomplicated a simple matter and to showcase the tools that made it possible, so you can overcomplicate things too.

Certainly the most important tool is the username checker at KnowEm, which lets you check the availability of a username across multiple sites. However, since availability on Twitter usually implies availability on most other services, filling out Twitter’s signup form is slightly faster.

Randomness

The natural way to start is at random.org. It offers a random string generator that can generate up to 10,000 random five-letter strings at a time.
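A similar list can also be produced locally. The following is only a minimal sketch and not how random.org works: the names `alphabet` and `random-string` are made up for illustration, and it relies on Clojure's ordinary pseudo-random `rand-nth`.

```clojure
(ns random-usrnames
  (:require [clojure.string :as str]))

;; The 26 lowercase letters as a vector of characters.
(def alphabet (vec "abcdefghijklmnopqrstuvwxyz"))

(defn random-string
  "Returns a string of n letters, each picked uniformly at random."
  [n]
  (str/join (repeatedly n #(rand-nth alphabet))))

;; Print ten candidates, one per line.
(println (str/join "\n" (repeatedly 10 #(random-string 5))))
```

Every letter is equally likely here, which is exactly why most of the output is unpronounceable.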

This is obviously the most naive approach and most of the results won’t be pronounceable, but at least it is relatively easy to find a combination that is available (as of early 2015).

Basically I could have stopped there, but with lots of tools available, doing something more advanced was just too tempting.

UNIX words

If you are using a UNIX-based operating system, there is a file containing a lot of English words on your computer. It’s located at either /usr/dict/words or /usr/share/dict/words.

As discussed earlier, using dictionary words isn’t possible, but applying a little transformation can turn them into something more likely to be available. The transformation is to remove all the vowels from the word, which leaves us with something that is still kind of pronounceable while also having a higher chance of not being taken.

At first I implemented this in JavaScript, but then I realized that transforming data like this is a nice fit for Clojure. Since I’m still in the process of learning Clojure, I thought this would be some good practice too. In fact, this was my first “real world” application of Clojure.

(ns usrnames)

(require '[clojure.string :as str])

(def file (slurp "/usr/share/dict/words" :encoding "ASCII"))
(def words (str/split-lines file))

(defn remove-vowels
  "Lower-cases word and drops the vowels a, e, i, o and u."
  [word]
  (->>
    word
    str/lower-case
    (remove #{\a \e \i \o \u})
    str/join))

(->>
  words
  (map remove-vowels)
  (map vector words) ; zip with original
  (filter (fn [[_ usrname]] (= 5 (count usrname))))
  (group-by (fn [[_ usrname]] usrname))
  (map (fn [[usrname group]]
         (str usrname " (" (str/join ", " (map first group)) ")")))
  (str/join "\n")
  (spit "usrnames.txt"))

Here are 10 random results from this approach (you can look at all 35,102 resulting names here):

trtht (tritheite)
jwstn (Jewstone)
hwkng (hawking)
plrdr (Pleurodira, pleurodire)
drsnz (deresinize)
nlwry (inlawry)
nrckn (unreckon)
bdlmc (Bedlamic)
mrmps (Mormoops)
clvry (Calvary, clovery)

The problem with this approach is that words without vowels sound pretty leetspeak-ish, and there’s also no good reason why a username should have no vowels in it.

Markov chains

Another way to generate pronounceable words is the use of Markov chains. A Markov chain is basically a probabilistic model for generating sequences. You can “train” the model with a sample and then have it generate sequences that are “similar” to that sample.

The way it works is actually really simple: We start with some letter x (not necessarily the letter x, x is a variable) and spin a roulette wheel that has 26 slices with the letters a through z on them. The size of each slice is proportional to the number of times the letter on that slice appeared after x in the sample data. Whichever letter results from the spin becomes the new x and the process starts anew until we’ve picked five letters.
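To make the slice sizes concrete, here is a minimal sketch that counts followers in a tiny sample. The one-word sample "abacab" and the name `transitions` are made up for illustration. The real model is trained on a whole dictionary.

```clojure
(ns markov-sketch)

;; Hypothetical one-word sample, chosen only to keep the table small.
(def sample "abacab")

;; For every letter, count how often each other letter follows it.
(def transitions
  (reduce (fn [acc [l next-l]]
            (update-in acc [l next-l] (fnil inc 0)))
          {}
          (partition 2 1 sample)))

(println transitions)
```

In this sample a is followed by b twice and by c once, so on the wheel for a the b slice is twice as large as the c slice. The wheels for b and c each have a single slice, a.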

Here is some Clojure that does the magic. It is a modified version of the code outlined in this blog post.

(ns markov-usrnames)

(require '[clojure.string :as str])

(def file (slurp "/usr/share/dict/words" :encoding "ASCII"))
(def words (str/split-lines file))

(defn generate-markov-nodes
  [words]
  (->>
    words
    (map str/lower-case)
    (str/join \space)
    (partition 2 1)
    (reduce
      (fn [acc [l next-l]] (update-in acc [l next-l] (fnil inc 0)))
      {})))

(defn wrand
  "given a vector of slice sizes, returns the index of a slice given a random
  spin of a roulette wheel with compartments proportional to slices."
  [slices]
  (let [total (reduce + slices)
        r (rand total)]
    (loop [i 0 sum 0]
      (if (< r (+ (slices i) sum))
        i
        (recur (inc i) (+ (slices i) sum))))))

(defn generate-usrname [nodes]
  (loop [node (nodes \space)
         acc []]
    (let [probabilities (vec (vals node))
          index (wrand probabilities)
          letter (nth (keys node) index)
          next-node (nodes letter)]
      (if (= 5 (count acc))
        (str/join acc)
        (if (= letter \space)
          (recur node acc)
          (recur next-node (conj acc letter)))))))

(def nodes (generate-markov-nodes words))

(->>
  (repeatedly (partial generate-usrname nodes))
  distinct
  (take 10000)
  (str/join "\n")
  (spit "markov-usrnames.txt"))

These are the first ten usernames generated by the above implementation:

jonin
lluly
sinda
gafro
apera
erenk
desio
hyman
woqur
flsmf

Looks pretty good. Unfortunately they are so good that all of them except flsmf were already taken. Apparently too many people are building Markov chains based on English dictionaries when signing up for Twitter… Obviously I needed to use a different dictionary!

A Japanese text written in the Latin alphabet, as suggested here, yields more Japanese-sounding results:

mekot
nikin
oraus
yayah
sarik
akuta
nyena
nomen
nzuis
orita

But again nine out of ten were already taken (only nzuis was available).

Further down the rabbit hole

At this point I decided that maybe being pronounceable was more important than being short. So I increased my target length to six letters, which gives us a staggering 297,034,400 additional possibilities.
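The numbers are easy to check. This is a small sketch, with `combinations` being a helper name invented here:

```clojure
(ns usrname-counts)

(defn combinations
  "Number of strings of length n over a 26-letter alphabet."
  [n]
  (reduce * (repeat n 26)))

;; Five-letter strings.
(println (combinations 5))

;; Strings gained by also allowing six letters.
(println (- (combinations 6) (combinations 5)))
```

The first value is the 11,881,376 from the introduction, the second is the 297,034,400 additional six-letter strings.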

The next thing I did was to modify the Markov chain implementation to consider the previous two letters instead of just one when spinning the roulette wheel. This should increase the quality of the results a little bit (for example, it makes triple letters much less likely).
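The effect of the longer context shows up even on a toy sample. This is a minimal sketch under assumptions: the sample "abacab" and the name `transitions` are hypothetical, and the keys are vectors for brevity, whereas the functions in this post use lists.

```clojure
(ns markov-order-2-sketch)

;; Hypothetical one-word sample, padded with two leading spaces so that
;; the first letter also has a two-character context.
(def sample "  abacab")

;; Map each pair of consecutive characters to counts of the character
;; that follows the pair.
(def transitions
  (reduce (fn [acc [a b next-l]]
            (update-in acc [[a b] next-l] (fnil inc 0)))
          {}
          (partition 3 1 sample)))

(println (transitions [\b \a]))
```

Looking only at a single previous letter, a can be followed by either b or c in this sample. With two letters of context, "ba" is only ever followed by c, so the generated sequences stay closer to the training data.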

Here are the modified functions:

(defn generate-markov-nodes
  [words]
  (->>
    words
    (map str/lower-case)
    (str/join "  ")
    (partition 3 1)
    (map #(list (take 2 %1) (nth %1 2)))
    (reduce
      (fn [acc [l next-l]] (update-in acc [l next-l] (fnil inc 0)))
      {})))

(defn generate-usrname [nodes]
  (loop [node (nodes (list \space \space))
         acc []]
    (let [probabilities (vec (vals node))
          index (wrand probabilities)
          letter (nth (keys node) index)
          next-node-key (list (or (last acc) \space) letter)
          next-node (nodes next-node-key)]
      (if (= 6 (count acc))
        (str/join acc)
        (if (= 1 (count node))
          (str/join acc)
          (if (= letter \space)
            (recur node acc)
            (recur next-node (conj acc letter))))))))

Long story short, here are the first 10 results (based on Japanese text):

kimaha
nariya
nisaki
inakak
osogir
nokono
bosiki
notoro
nosiko
utokok

While this is certainly a cool fake word generator, it didn’t help me with my username: among the names I generated, I wasn’t able to find one that I liked and that was also available.

Wrap up

As you’ve probably guessed from the URL, none of the generated usernames ultimately made it. At some point I just concluded that my real name is as good as any, and without the vowels it’s even just six letters long.

Anyway, what’s important is that I spent a considerable amount of time doing something utterly useless while playing with some cool tech and math. I might even have learned something.



© 2017. All rights reserved.
