Standard artificial neural networks are prediction machines that can learn to map some input to some output, given enough examples of each. Recently, as people have figured out how to train deep (multi-layered) neural nets, very powerful models have been created, increasing the hype surrounding so-called deep learning. In some sense the deepest of these models are Recurrent Neural Networks (RNNs), a class of neural nets that feed their state at the previous timestep into the current timestep. These recurrent connections make these models well suited for operating on sequences, like text.
We can show an RNN a bunch of sentences, and get it to predict the next word, given the previous words. So, given a string of words like “Which Disney Character Are __”, we want the network to produce a reasonable guess like “You”, rather than, say, “Spreadsheet”. If this model can learn to predict the next word with some accuracy, we get a language model that tells us something about the texts we trained it on. If we ask this model to guess the next word, and then add that word to the sequence and ask it for the next word after that, and so on, we can generate text of arbitrary length. During training, we tweak the weights of this network so as to minimize the prediction error, maximizing its ability to guess the right next word. Thus RNNs operate on the opposite principle of clickbait: What happens next may not surprise you.
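The "guess a word, append it, guess again" loop is easy to sketch on its own. The snippet below is only an illustration: it swaps the RNN for a crude bigram counter (which only looks at the single previous word), and the function names, the `<s>`/`</s>` markers and the three toy headlines are all made up for the example.

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(headlines):
    """Count which word follows which. A crude stand-in for the RNN."""
    counts = defaultdict(Counter)
    for headline in headlines:
        words = ["<s>"] + headline.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_words=20):
    """Sample a next word, append it to the sequence, and repeat."""
    words = ["<s>"]
    while len(words) <= max_words:
        followers = counts[words[-1]]
        choices = list(followers)
        weights = [followers[w] for w in choices]
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == "</s>":  # the model decided the headline is over
            break
        words.append(nxt)
    return " ".join(words[1:])

# Made-up toy data, not the real training set.
toy_headlines = [
    "Which Disney Character Are You",
    "Which Disney Movie Should You Watch",
    "Which Movie Character Are You",
]
model = train_bigram_model(toy_headlines)
print(generate(model, random.Random(0)))
```

An RNN plays the same role as `counts` here, except that its guess depends on the whole sequence so far rather than just the last word.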
I based this on Andrej Karpathy’s wonderful char-rnn library for Lua/Torch, but modified it to be more of a “word-rnn”, so it predicts word-by-word, rather than character-by-character. (The code is available on GitHub.) Predicting word-by-word will use more memory, but means the model does not need to learn how to spell before it learns how to perform modern journalism. (It still needs to learn some notion of grammar.) Some more changes were useful for this particular use case. First, each input word was represented as a dense vector of numbers. The hope is that having a continuous rather than discrete representation for words will allow the network to make better mistakes, as long as similar words get similar vectors. Second, the Adam optimizer was used for training. Third, the word vectors went through a particular training rigmarole: They received two stages of pretraining, and were then frozen in the final architecture. More details on this follow later in the article.
One Neat Trick Every 90s Connectionist Will Know
Whereas traditional neural nets are built around stacks of simple units that do a weighted sum followed by some simple non-linear function (like a tanh), we’ll use a more complicated unit called Long Short-Term Memory (LSTM). This is something two Germans came up with in the late 90s that makes it easier for RNNs to learn long-term dependencies through time. The LSTM units give the network memory cells with read, write and reset operations. These operations are differentiable, so that during training, the network can learn when it should remember data and when it should throw it away.
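To make the read, write and reset operations a bit more concrete, here is a minimal sketch of one timestep of a single LSTM unit. It assumes the common formulation with input, forget and output gates and uses scalar inputs and state to keep things short; the `params` layout and gate names are my own for illustration, and the details may differ from what char-rnn actually implements.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One timestep of a single LSTM unit with scalar input and state.

    `params` maps each gate name to a (w_x, w_h, bias) tuple.
    This layout is hypothetical, chosen only for the example.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # write: how much new data to store
    f = sigmoid(pre("forget"))        # reset: how much old memory to keep
    o = sigmoid(pre("output"))        # read: how much memory to expose
    g = math.tanh(pre("candidate"))   # the value proposed for writing

    c = f * c_prev + i * g            # new memory cell contents
    h = o * math.tanh(c)              # new output of the unit
    return h, c

# Hand-picked weights that make the unit hold on to its memory.
remember = {"input": (0, 0, -20), "forget": (0, 0, 20),
            "output": (0, 0, 20), "candidate": (1, 0, 0)}
h, c = lstm_step(0.5, 0.0, 0.8, remember)
print(c)  # stays very close to the previous cell value 0.8
```

Every operation above is a smooth function of the weights, which is what lets gradient descent learn when to keep and when to discard the memory.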
To generate clickbait, we’ll train such an RNN on ~2 000 000 headlines, scraped from Buzzfeed, Gawker, Jezebel, Huffington Post and Upworthy.
How realistic can we expect the output of this model to be? Even if it can learn to generate text with correct syntax and grammar, it surely can’t produce headlines that contain any new knowledge of the real world? It can’t do reporting? This may be true, but it’s not clear that clickbait needs to have any relation to the real world in order to be successful. When this work was begun, the top story on BuzzFeed was “50 Disney Channel Original Movies, Ranked By Feminism”. More recently they published “22 Faces Everyone Who Has Pooped Will Immediately Recognized”. It’s not clear that these headlines are much more than a semi-random concatenation of topics their userbase likes, and as seen in the latter case, 100% correct grammar is not a requirement.
The training converges after a few days of number crunching on a GTX980 GPU. Let’s take a look at the results.
With some experimentation, I ended up with the following architecture and training procedure. The initial RNN had 2 recurrent layers, each containing 1200 LSTM units. Each word was represented as a 200-dimensional word vector, connected to the rest of the network via a tanh. These word vectors were initialized to the pretrained GloVe vectors released by GloVe's inventors, trained on 6 billion tokens from Wikipedia. GloVe, like word2vec, is a way of obtaining representations of words as vectors. These vectors were trained for a related task on a very big dataset, so they should provide a good initial representation for our words. During training, we can follow the gradient down into these word vectors and fine-tune the vector representations specifically for the task of generating clickbait, thus further improving the generalization accuracy of the complete model.
It turns out that if we then take the word vectors learned from this model of 2 recurrent layers, and stick them in an architecture with 3 recurrent layers, and then freeze them, we get even better performance. Trying to backpropagate into the word vectors through the 3 recurrent layers turned out to actually hurt performance.
To summarize the word vector story: Initially, some good guys at Stanford invented GloVe, ran it over 6 billion tokens, and got a bunch of vectors. We then took these vectors, stuck them under 2 recurrent LSTM layers, and optimized them for generating clickbait. Finally we froze the vectors, and put them in a 3 LSTM layer architecture.
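"Freezing" just means that the word vectors stop receiving gradient updates while everything else keeps training. Here is a rough sketch of that idea, using plain gradient descent instead of Adam to keep it short. The parameter group names and the numbers are invented for the example and have nothing to do with the real model.

```python
def sgd_update(params, grads, learning_rate, frozen=()):
    """Return updated parameters, leaving groups named in `frozen` untouched."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)  # copied as-is, no gradient step
        else:
            updated[name] = [v - learning_rate * g
                             for v, g in zip(values, grads[name])]
    return updated

# Made-up parameters and gradients, purely for illustration.
params = {"word_vectors": [0.5, -0.2], "lstm": [1.0]}
grads = {"word_vectors": [1.0, 1.0], "lstm": [2.0]}

# 2-layer stage: gradients flow all the way into the word vectors.
fine_tuned = sgd_update(params, grads, 0.1)

# 3-layer stage: the word vectors are frozen, only the LSTM weights move.
final = sgd_update(fine_tuned, grads, 0.1, frozen=("word_vectors",))
print(final["word_vectors"] == fine_tuned["word_vectors"])
```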
The network was trained with the Adam optimizer. I found this to be a Big Deal: It cut the training time almost in half, and found better optima, compared to using rmsprop with exponential decay. It’s possible that similar results could be obtained with rmsprop had I found a better learning rate and decay rate, but I’m very happy not having to do that tuning.
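For the curious, this is roughly what a single Adam update looks like for one scalar parameter, following the standard published update rule. The default hyperparameters below are the commonly quoted ones, not necessarily what I trained with, and the quadratic being minimized is just a toy problem.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # correct the bias from starting at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def minimize_toy_quadratic(steps=2000, lr=0.1):
    """Minimize f(x) = (x - 3)^2 starting from x = 0. A toy problem."""
    x, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (x - 3)
        x, m, v = adam_step(x, grad, m, v, t, lr=lr)
    return x

print(minimize_toy_quadratic())  # should end up close to 3
```

Because each step is scaled by the recent gradient magnitude, the step size is far less sensitive to the raw scale of the gradients, which is a big part of why less tuning is needed.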
Building The Website
While many headlines produced by this model are good, some of them are rambling nonsense. To filter out the nonsense, we can do what Reddit does and crowdsource the problem.
To this end, I created Click-o-Tron, possibly the first website in the world where all articles are written in their entirety by a Recurrent Neural Network. New articles are published every 20 minutes.
Any user can vote articles up and down. Each article gets an associated score determined by the number of votes and views the article has gotten. This score is then taken into account when ordering the front page. To get a trade-off between clickbaitiness and freshness, we can use the Hacker News algorithm, which divides an article's score by a power of its age, so that older articles gradually sink down the page.
In practice, this can look like the following in PostgreSQL:
CREATE FUNCTION hotness(articles) RETURNS double precision
LANGUAGE sql STABLE AS $_$
  SELECT $1.score / POW(1 + EXTRACT(EPOCH FROM (NOW() - $1.publish_date)) / (3*3600), 1.5)
$_$;
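The same calculation can be written outside the database, which makes it easier to poke at. This is a direct transcription of the SQL above into Python; the function signature and the example timestamps are my own choices for illustration.

```python
from datetime import datetime, timedelta, timezone

def hotness(score, publish_date, now=None):
    """Score divided by (1 + age in units of 3 hours) to the power 1.5."""
    if now is None:
        now = datetime.now(timezone.utc)
    age_seconds = (now - publish_date).total_seconds()
    return score / (1 + age_seconds / (3 * 3600)) ** 1.5

# Example timestamps, chosen arbitrarily.
now = datetime(2015, 10, 1, 12, 0, tzinfo=timezone.utc)
fresh = hotness(10, now, now)                       # brand new article
stale = hotness(10, now - timedelta(hours=9), now)  # same score, 9 hours old
print(fresh, stale)
```

With the same score, an article that is 9 hours old gets divided by (1 + 3) to the power 1.5, which is 8, so it needs eight times the score of a brand new article to rank equally.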
The articles are a result of three separate language models: One for the headlines, one for the article bodies, and one for the author name.
The article body neural network was seeded with the words from the headline, so that the body text has a chance to be thematically consistent with the headline. The headlines were not used when training this network.
For the author names, a character level LSTM-RNN was trained on a corpus of all first and last names in the US. It was then asked to produce a list of names. This list was then filtered so that the only remaining names were the ones where neither the first nor the last name was in the original corpus. This creates a nice list of plausible, yet original names, such as Flodrice Golpo and Richaldo Aariza.
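The filtering step is simple set membership. Below is a small sketch of it, assuming the corpus is treated as one pool of known names, that matching is case-insensitive, and that each generated name is exactly two words. Those are my simplifications for the example, and the tiny corpus is obviously made up.

```python
def filter_original_names(generated, first_names, last_names):
    """Keep only names whose first and last parts are both absent from the corpus."""
    known = {n.lower() for n in first_names} | {n.lower() for n in last_names}
    kept = []
    for full_name in generated:
        parts = full_name.split()
        if len(parts) != 2:
            continue  # skip anything that is not "First Last"
        first, last = parts
        if first.lower() not in known and last.lower() not in known:
            kept.append(full_name)
    return kept

# Made-up miniature corpus and candidate list.
candidates = ["Flodrice Golpo", "Richaldo Smith", "Mary Aariza"]
print(filter_original_names(candidates, ["Mary", "John"], ["Smith", "Jones"]))
```

Only "Flodrice Golpo" survives here, since "Smith" and "Mary" both appear in the toy corpus.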
Finally, each article’s picture is found by searching the Wikimedia API with the headline text, and selecting the images with a permissive license.
In total, this gives us an infinite source of useless journalism, available at no cost. If I remember correctly from economics class, this should drive the market value of useless journalism down to zero, forcing other producers of useless journalism to produce something else.