Which neural network architectures perform best at sentiment analysis?
This essay was my main project for my module on Machine Learning for Natural Language Processing at Cambridge. It assumes some familiarity with NLP and deep learning.
Over the last few years, deep neural networks have produced state-of-the-art results in many tasks associated with natural language processing. This comes in conjunction with their excellent results in other areas of machine learning, perhaps most notably computer vision. Different types of neural networks have been particularly successful in different areas. For example, CNNs are the tool of choice in image recognition problems; their internal structure has distinct parallels with the human visual system.
Two other neural network architectures have achieved particular success in NLP; these are recurrent neural nets (RNNs) and recursive neural nets (here abbreviated RSNNs). It seems at first glance that the structures of RNNs and RSNNs are better suited to processing language than CNNs; however, empirical results have been mixed. In this essay I analyse these empirical results to see what conclusions can be drawn. I chose to focus on sentiment analysis for a number of reasons. Firstly, it is a task with direct applicability, rather than an intermediate stage in language processing. Secondly, as a discrete categorisation task, results are not too subjective; and there are standard corpora against which performance has been measured. Thirdly (and admittedly a little vaguely), it is neither "too easy" nor "too hard": the sentences used generally have a clear intended sentiment, but to extract that it's necessary to deal with negations, qualifications, convoluted sentence structure, etc.
This essay is organised into three sections. Firstly, I explain the key differences between the three architectures mentioned above, and how those differences would theoretically be expected to influence performance. Secondly, I cite and explain various results which have been achieved in sentiment analysis using these neural networks. Lastly, I discuss how we should interpret these results.
Overview of Neural Network Architectures
Modern neural networks designed for NLP tasks generally use compositional representations. Individual words are represented as dense vectors in an embedded space, rather than using one-hot encodings or n-grams. These word vectors are then combined using some composition function to create internal node representations which lie in the same vector space (Young et al, 2017).
Word vectors can be learned via unsupervised training prior to the main training phase, and are usually distributional, i.e. based on the contexts in which words are found in the relevant corpus (Turney and Pantel, 2010). Performance has been improved by also taking into account morphological features (Luong et al, 2013). This preliminary learning is often followed by adjustments to improve performance on specific tasks. For example, words often appear in the same contexts as their antonyms, but in sentiment analysis it is particularly important to ensure that such opposing pairs are represented by different vectors (Socher et al, 2011a). It is possible to fix learned word vectors at any point, but in recent papers it is more common for word vectors to be adjusted along with the rest of the neural net as the main training occurs. It is also possible to initialise word vectors randomly so that they are learned throughout main training, but this generally harms performance.
Recurrent Neural Networks
RNNs are able to process ordered sequences of arbitrary length, such as words in a sentence. At each stage, the neural network takes as input the given word, and a hidden representation of all words so far. An output of this then becomes an input into the next stage. Theoretically, it should be possible for standard RNNs to process arbitrarily long sequences; however, in practice, they suffer from the 'vanishing gradient problem' which causes the beginnings of long input sequences to be forgotten. To combat this, almost all RNN implementations use LSTM (long short-term memory) units, which help propagate gradients further.
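As a rough illustration of the recurrence described above, here is a minimal sketch of a plain (non-LSTM) RNN in pure Python. The embeddings and weights are hand-picked, hypothetical values rather than anything learned, and the names rnn_step and encode are my own; the sketch only shows how each step combines the current word with the hidden state from the previous step.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One step of a plain RNN: h = tanh(W_x x + W_h h_prev + b)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(W_x[i], x))
            + sum(wh * hi for wh, hi in zip(W_h[i], h_prev))
            + b[i]
        )
        for i in range(len(b))
    ]

def encode(sequence, W_x, W_h, b):
    """Run the RNN over a list of word vectors; return the final hidden state."""
    h = [0.0] * len(b)  # initial hidden state: no words seen yet
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Hypothetical 2-dimensional embeddings and weights (assumptions, not learned).
embeddings = {"not": [-1.0, 0.5], "very": [0.1, 0.9], "good": [1.0, 0.2]}
W_x = [[0.5, -0.3], [0.2, 0.8]]
W_h = [[0.9, 0.1], [-0.4, 0.6]]
b = [0.0, 0.0]

h_a = encode([embeddings[w] for w in ["not", "very", "good"]], W_x, W_h, b)
h_b = encode([embeddings[w] for w in ["very", "good", "not"]], W_x, W_h, b)
print(h_a, h_b)  # same words, different order, different final states
```

Because every step multiplies by W_h and squashes through tanh, the influence of early words shrinks as the sequence grows, which is the intuition behind the vanishing gradient problem mentioned above.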
Another difficulty with standard RNNs is the fact that items later in the sequence cannot affect the classification of items earlier in the sequence. This is particularly problematic in the context of language, where the interpretation of the first few words of a sentence often throws up ambiguities which are resolved by later words. Examples include garden path sentences such as "Man who enjoys garden path sentences friends to listen to his puns", or more simply any sentence beginning with "This can". One solution is to use bi-directional RNNs, which pass both forwards and backwards through sentences. However, in sentiment analysis tasks where only a single classification of the entire sentence is required, this is less necessary.
Convolutional Neural Networks
CNNs only accept fixed-length inputs, which means some modifications are required to allow them to accept sentences. Historically, the standard way to convert a sentence into a fixed-length input was to use continuous bag-of-words (CBOW) or bag-of-ngrams models (Pang and Lee, 2008). For instance, the CNN input could be the sum of vectors representing each word in the sentence. However, the resulting loss of word order became too great a price to pay, and various alternatives have emerged (which I will explore in the next section).
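To make the loss of word order concrete, here is a small sketch of the "sum of word vectors" representation. The three-dimensional embeddings are made-up values and bag_of_words is a hypothetical helper name; the point is only that two sentences containing the same words receive identical representations.

```python
def bag_of_words(sentence, embeddings):
    """Sum the word vectors of a sentence into one fixed-length vector."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in sentence.split():
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    return total

# Hypothetical embeddings, chosen by hand for illustration.
embeddings = {
    "dog": [0.25, -0.5, 1.0],
    "bites": [-0.75, 0.5, 0.0],
    "man": [0.5, 0.25, -0.25],
}

a = bag_of_words("dog bites man", embeddings)
b = bag_of_words("man bites dog", embeddings)
short = bag_of_words("dog bites", embeddings)
print(a == b)               # the two sentences are indistinguishable
print(len(a), len(short))   # the output length does not depend on sentence length
```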
CNNs are feed-forward, which means there are no links back from later layers to earlier layers. Internally, CNNs generally contain convolutional layers, pooling layers, and fully-connected layers. (Goldberg, 2015) summarises their advantages as follows: "Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. For example, in a document classification task, a single key phrase (or an ngram) can help in determining the topic of the document."
In sentiment analysis, there are some cases where these strong local clues exist. For example, words such as "exhilarating" or "abhorrent" would almost always indicate positive and negative sentiment respectively, regardless of where in a sentence they are found. However, most words are able to indicate either positive or negative sentiment depending on their context, and in general we would expect CNN layers to lose valuable contextual information. This effect may be lessened in longer documents, in which words with the same sentiment as the overall document usually end up predominating, so that order effects aren't as important (this also makes longer documents more amenable to bag-of-words approaches).
Recursive Neural Networks
RSNNs only accept tree-structured inputs; a major reason why they seem promising in NLP tasks is that this structure matches the inherently recursive nature of linguistic syntax. It also means that sentences need to be preprocessed into trees by some parsing algorithm before being input to an RSNN (in the special case where the algorithm always returns an unbranched tree, RSNNs are equivalent to RNNs). This requirement may be disadvantageous, for example on inputs such as tweets which are not easily parsed. However, knowing the structure of a sentence is very useful in many cases. A sentence which is of the form (Phrase1 but Phrase2) usually has the same overall sentiment as Phrase2 - for instance, "The actors were brilliant but even they couldn't save this film." Similarly, negations reverse the sentiment of the phrase which follows them. Both of these inferences rely on knowing the scope to which these words apply - i.e. knowing where in the parse tree they are found.
Many of the adaptations which were designed for RNNs, such as bi-directionality and memory units, can also be used in RSNNs.
Experimental Results in Sentiment Analysis
In this section I describe key details of various architectures which have achieved state of the art results in sentiment classification, as well as a few less successful architectures for comparison. A major resource in sentiment classification is the Stanford Sentiment Treebank, introduced in (Socher et al, 2013); I will use performance on this as the main evaluative criterion.
Recursive Neural Networks
I will first discuss three algorithms published by Richard Socher for using RSNNs to classify sentiments. All three are composition-based: they start with dense embeddings of words, which propagate upwards through the parse tree of a sentence to give each internal node a representation with the same dimensions as the word representations. The basic technique in (Socher et al, 2011a) is simply to calculate the representation of a parent node by concatenating two vectors representing its child nodes, multiplying that by a weight matrix, then applying a nonlinearity (note that the weight matrix and nonlinearity are the same for all nodes). The downside of this method is that the representations of the child nodes only interact via the nonlinearity, which may be quite a simple function.
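The basic composition step can be sketched in a few lines. This is only an illustration under my own assumptions: the tree is given as nested 2-tuples, tanh is used as the nonlinearity, bias terms are omitted, and the embeddings and weight matrix W are hand-picked rather than learned.

```python
import math

def compose(left, right, W):
    """Parent vector = tanh(W [left; right]), where W has shape d x 2d."""
    children = left + right  # list concatenation gives the 2d-vector
    return [math.tanh(sum(w * c for w, c in zip(row, children))) for row in W]

def encode(tree, embeddings, W):
    """Encode a binary parse tree given as nested 2-tuples of words."""
    if isinstance(tree, str):
        return embeddings[tree]  # leaf: look up the word vector
    left, right = tree
    return compose(encode(left, embeddings, W), encode(right, embeddings, W), W)

# Hypothetical 2-dimensional embeddings and a shared 2 x 4 weight matrix.
embeddings = {"not": [-1.0, 0.5], "very": [0.1, 0.9], "good": [1.0, 0.2]}
W = [[0.5, -0.3, 0.8, 0.1], [0.2, 0.7, -0.4, 0.6]]

# Same words, different bracketing: "not (very good)" vs "(not very) good".
right_branching = encode(("not", ("very", "good")), embeddings, W)
left_branching = encode((("not", "very"), "good"), embeddings, W)
print(right_branching, left_branching)
```

The same W is applied at every internal node, so the only thing distinguishing the two encodings is the shape of the parse tree.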
A more complicated algorithm in (Socher et al, 2012) uses a Matrix-Vector Recursive Neural Network (MV-RNN). This technique represents each node using both a vector and a matrix; instead of concatenating the vectors as above, the representation of a parent node with two children is calculated by multiplying the matrix of each child with the vector of the other (then applying the weight matrix and nonlinearity as usual). However, this results in a very large number of parameters, since the MV-RNN needs to learn a matrix for every word in the vocabulary.
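A sketch of the matrix-vector composition may help here. The vector part follows the description above; the rule for the parent's matrix (a second weight matrix W_M applied to the stacked child matrices) follows my reading of (Socher et al, 2012) and should be treated as an assumption, as should the omission of bias terms. All numbers are hand-picked so that "not" acts as a pure operator which flips the vector of its sibling.

```python
import math

def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mv_compose(a, A, b, B, W, W_M):
    """Compose two (vector, matrix) children into a parent (vector, matrix)."""
    # Each child's matrix modifies the other child's vector.
    modified = matvec(B, a) + matvec(A, b)            # 2d-vector [Ba; Ab]
    p = [math.tanh(v) for v in matvec(W, modified)]   # parent vector, length d
    # Parent matrix: W_M (d x 2d) times the child matrices stacked (2d x d).
    P = matmul(W_M, A + B)
    return p, P

# Hypothetical values: "good" has an identity matrix, "not" a negating one.
good_vec, good_mat = [1.0, 0.2], [[1.0, 0.0], [0.0, 1.0]]
not_vec, not_mat = [0.0, 0.0], [[-1.0, 0.0], [0.0, -1.0]]
W = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]    # keeps only the Ab half
W_M = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # averages A and B

p, P = mv_compose(not_vec, not_mat, good_vec, good_mat, W, W_M)
print(p)  # tanh of the negated "good" vector
print(P)
```

The parameter cost is visible in the data: every word carries a d x d matrix in addition to its d-dimensional vector.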
The third architecture, and the one which achieved the best results on the Stanford Sentiment Treebank, is a Recursive Neural Tensor Network (RNTN). As with the first architecture, nodes are simply represented by vectors, and the same function is applied to calculate every parent node - however, this function includes a more complicated tensor product as well as a weight matrix and nonlinearity. This change significantly improved performance; on a difficult subset of the corpus which featured negated negative sentiments, RNTN accuracy was 20 percentage points higher than that of MV-RNNs, and over three times that of a reference implementation of Naive Bayes with bigrams. Overall accuracy was 80.7% for fine-grained sentiment labels and 85.4% for positive/negative sentence classification, a 5 percentage point increase on the state of the art at the time (Socher et al, 2013).
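The tensor composition can be sketched as follows, assuming the usual formulation in which each output dimension i has its own 2d x 2d slice V[i] of the tensor, and the parent is tanh(c^T V[i] c + (W c)[i]) with c the concatenation of the two children. Bias terms are omitted and all values are hypothetical. Setting the tensor to zero recovers the basic recursive network, which makes the extra multiplicative interaction easy to see.

```python
import math

def rntn_compose(a, b, V, W):
    """Parent[i] = tanh(c^T V[i] c + (W c)[i]), with c = [a; b]."""
    c = a + b
    n = len(c)
    parent = []
    for V_i, W_i in zip(V, W):
        # Bilinear term: lets entries of the two children multiply each other.
        bilinear = sum(c[j] * V_i[j][k] * c[k] for j in range(n) for k in range(n))
        # Linear term: the same as in the basic recursive network.
        linear = sum(w * x for w, x in zip(W_i, c))
        parent.append(math.tanh(bilinear + linear))
    return parent

d = 2
a, b = [1.0, 0.2], [-1.0, 0.5]                      # hypothetical child vectors
W = [[0.5, -0.3, 0.8, 0.1], [0.2, 0.7, -0.4, 0.6]]  # hypothetical d x 2d weights

# With an all-zero tensor the RNTN reduces to the basic composition.
zero_V = [[[0.0] * (2 * d) for _ in range(2 * d)] for _ in range(d)]
plain = rntn_compose(a, b, zero_V, W)

# One non-zero entry: a[0] and b[0] now interact multiplicatively in output 0.
V = [[row[:] for row in V_i] for V_i in zero_V]
V[0][0][2] = 1.0
interacting = rntn_compose(a, b, V, W)
print(plain, interacting)
```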
I'll briefly mention one more model. (Kokkinos and Potamianos, 2017) use a bi-directional RSNN with gated recurrent units (GRUs) and a structural attention mechanism. This sophisticated setup was state of the art in early 2017, with 89.5% on the Stanford corpus. A brief explanation of the terms is warranted. Bi-directionality means that when calculating node activations, the standard propagation of information from the leaves (representing words) upwards through the tree structure is followed by a propagation of information downwards from the root (representing the whole sentence) (Irsoy and Cardie, 2013). GRUs serve a similar role to LSTMs in helping store information for longer (Chung et al, 2014). The structural attention mechanism aggregates the most informative nodes to form the sentence representation, and is a generalisation of (Luong et al, 2015).
Convolutional Neural Networks
(Kalchbrenner et al, 2014) discuss several adapted CNN architectures: firstly, Time-Delay Neural Networks (TDNNs); next, Max-TDNNs; and last, their own Dynamic Convolutional Neural Network (DCNN). Each of these uses a one-dimensional convolution, which is applied over the time dimension of a sequence of inputs. For example, let the input be a sentence of length s, with each word represented as a vector of length d. A convolution multiplies each k-gram in the sentence by a d x k dimension filter matrix m. The sentence is padded with 0s so that each weight in the filter can reach each word (known as wide convolution), which results in a d x (s + k - 1) dimension matrix.
However, this results in a matrix whose size varies based on input length. To make further processing easier, in the Max-TDNN architecture the convolution is immediately followed by a max-pooling layer. Specifically, for each of the d rows (each corresponding to one dimension in the vector space that the words are embedded in), only the highest value is retained. In the DCNN architecture this is further refined with "dynamic k-max pooling", which retains the top k values in each row (where this k is a pooling parameter, distinct from the filter width above, which depends on the length of the input sentence). This architecture outperformed the RSNNs previously discussed by 1.4 percentage points on the Stanford Sentiment Treebank. It also performed well on the Twitter sentiment dataset described in (Go et al, 2009).
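The convolution and pooling steps can be sketched as below. This is a simplified illustration, not the authors' implementation: each of the d rows is convolved with its own width-k filter, the pooling parameter is fixed rather than computed from the sentence length, and the inputs are made-up numbers. As I understand the DCNN, k-max pooling keeps the selected values in their original order, so that is what the sketch does. Note that with full zero-padding each output row has s + k - 1 entries.

```python
def wide_convolution(row, kernel):
    """Wide (fully zero-padded) 1D convolution of one embedding dimension."""
    s, k = len(row), len(kernel)
    padded = [0.0] * (k - 1) + list(row) + [0.0] * (k - 1)
    return [
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(s + k - 1)
    ]

def max_pool(row):
    """Max-TDNN style pooling: keep only the largest value in the row."""
    return max(row)

def k_max_pool(row, k):
    """Keep the k largest values of the row, preserving their original order."""
    top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in sorted(top)]

# A hypothetical sentence with s = 4 words and d = 2 embedding dimensions,
# stored as d rows of length s, plus one width-2 filter per row.
sentence = [[1.0, -2.0, 3.0, 0.5], [0.0, 1.0, -1.0, 2.0]]
filters = [[1.0, 1.0], [1.0, -1.0]]

feature_map = [wide_convolution(r, f) for r, f in zip(sentence, filters)]
pooled = [max_pool(r) for r in feature_map]
k_pooled = [k_max_pool(r, 2) for r in feature_map]
print(feature_map)  # d rows, each of length s + k - 1 = 5
print(pooled)       # one value per row, whatever the sentence length
print(k_pooled)     # two values per row, in sentence order
```

Either pooling step produces an output whose size no longer depends on s, which is what allows fixed-size fully-connected layers to follow.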
I will briefly discuss a second (also highly-cited) paper, (Kim, 2014). This uses an architecture similar to Max-TDNN to push the state of the art on the Stanford Sentiment Treebank up by around another percentage point from Kalchbrenner et al. The main improvement in his system is starting from pre-trained word vectors, specifically Google's word2vec, which had been trained on 100 billion words. By contrast, in (Kalchbrenner et al, 2014), word vectors were randomly initialised.
While these architectures seem to have successfully circumvented some of the limitations of CNNs, it's worth noting that they still have drawbacks. While evaluating k-grams means that local word order matters, this architecture still can't model direct relationships between words more than k spaces away from each other, nor the absolute position of words in a sentence.
Recurrent Neural Networks
(Wang et al, 2015) introduced the use of LSTMs for Twitter sentiment prediction. They used word2vec software to train word embeddings from the Stanford Twitter Sentiment vocabulary, and achieved very similar results to Kalchbrenner et al on that corpus.
(Radford et al, 2017) use a multiplicative LSTM (mLSTM) which processes text as UTF-8 encoded bytes and was trained on a corpus of 82 million Amazon reviews. While the main focus was creating a generic representation, it achieves 91.8% on the Stanford Sentiment corpus, beating the previous state of the art, 90.2% (Looks et al, 2017). This seems to be the current record. Notably, they found a single unit within the mLSTM model which directly corresponded to sentiment; simply observing that unit achieved a test accuracy almost as high as the whole network.
While RNNs are more sensitive to word order than CNNs, they have a bias towards later input items (Mikolov et al, 2011). For example, the mLSTM in (Radford et al, 2017) had a notable performance drop when moving from sentence to document datasets, which they hypothesised was because it focused more on the last few sentences.
Other Architectures
A few other approaches are worth mentioning briefly:
- Recursive auto-encoders as explored in (Hermann and Blunsom, 2013) and (Socher et al, 2011b). The former uses an unusual combination of neural networks and formal compositional derivations.
- Dynamic Memory Networks, as in (Kumar et al, 2015), which achieved state of the art performance in question-answering in general, with sentiment analysis as a particular example.
- Dynamic Computation Graphs of (Looks et al, 2017), which may have been the first architecture to break 90% on the Stanford Sentiment Treebank.
Discussion
The overwhelming impression from the previous section is that of a field in flux, with new techniques and architectures emerging almost on a weekly basis. Looking only at the last year or two, there is a comparative lack of CNNs amongst the highest-performing models. However, it should be noted that later models may perform better without their underlying architectures being any more promising, not least because they may have increased processing power and financial resources behind them (OpenAI spent a month training their model).
It is therefore difficult to tell whether this shift away from CNNs is a long-term trend or merely a brief fad. However, Geoffrey Hinton at least is pessimistic about the future of CNNs; he has written that "The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster."
Perhaps the most relevant previous work is the head-on comparison done by (Yin et al, 2017) between three architectures: CNN, LSTM and GRU. Their conclusion was that relative performance in different tasks "depends on how often the comprehension of global/long-range semantics is required." Sentiment analysis turned out to require more global semantics, therefore placing the CNN at a disadvantage. Final performance was around 86% for GRU, 84% for LSTM and 82% for CNN. (This contrasted with Answer Selection and Question Relation Matching, where the CNN beat the other two).
Another complicating factor is the question of how much progress in sentiment analysis has been driven not by improvements in overall architectures, but rather by improved word embeddings. (Le and Mikolov, 2014) achieve results very close to (Socher et al, 2013) simply by using a logistic regression on their "Paragraph Vector" representation. Meanwhile, (Hill et al, 2016) find that an embedding based on dictionary definitions has the overall best performance out of a number of strategies that they tested. The "Skip-thought vectors" introduced by (Kiros et al, 2015) also have very good overall performance.
This proliferation of different algorithms for computing word representations makes it more difficult to compare different architectures directly. However, this is counterbalanced by the fact that the existence of standard corpora of word vectors such as word2vec and GloVe allows results to be replicated while holding word embeddings constant.
Yin et al sidestep this issue by using no pretraining of word embeddings in any of their experiments. However, it's not clear that this creates a fair comparison either: it would give an advantage to architectures like MV-RNN which can't use pretraining as effectively.
To draw slightly less tentative conclusions, it may be instructive to consider these models in the context of human language processing. While the use of one-dimensional convolutions and pooling layers was a successful workaround to the problem of CNNs requiring fixed-length inputs, it is nevertheless clear that this is very different to the way that humans understand sentences: we do not consider each k-gram separately. Instead, when we hear language, we interpret the words sequentially, in the fashion of an RNN. If the example of human brains is still a useful guide to neural network architectures, then we have a little more reason to favour RNNs over CNNs.
Of course it is debatable whether biology is a good guide. Yet so far it has served fairly well: neural nets in general, and more specifically CNNs, were designed with biological inspiration in mind. Further, the capsule networks recently introduced by (Sabour et al, 2017) were quite explicitly motivated by the failings of CNNs in comparison with vision systems in humans and other animals (Hinton, 2014); if they are able to replicate early successes, that is another vindication of biologically-inspired design.
In arbitrating between RNNs and RSNNs, however, there are further considerations. Since linguistic syntax is recursively structured, it seems quite plausible that our brains use similarly recursive algorithms at some point to process it. However, we must take into account the fact that current RSNNs require sentences to be parsed before processing them. State of the art accuracy for parsers is around 94%, which is only a few percent higher than that for sentiment analysis. This may make parsing quality a limiting factor in future attempts to use RSNNs for sentiment analysis. Further, it suggests that RSNNs are also less biologically plausible than RNNs. Humans are able to understand sentences as we hear them, inferring syntax as we go along; while it's possible to imagine an RSNN system re-evaluating a sentence after each word is added, this is a somewhat ugly solution and surely not how our brains manage it.
Moving back to concrete experimental results: while (Kokkinos and Potamianos, 2017) achieve excellent results with their RSNN variant, they have already been beaten by RNNs which don't incorporate any of their most advantageous features (bi-directional + GRUs + attention). If the current state-of-the-art RNNs added these features, it is quite plausible that they would achieve even better results. RNNs may still face the problem of forgetfulness on long inputs - however, since sentence length is effectively bounded in the double digits for almost all practical purposes, this is not a major concern. While it's overly ambitious to make any concrete predictions, based on the analysis in this essay I am leaning towards the conclusion that RNNs using either GRUs or LSTMs will continue to have an advantage in sentiment analysis over rival architectures for the foreseeable future.
References
- Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
- A Go, R Bhayani, and L Huang. 2009. Twitter sentiment classification using distant supervision. Processing, pages 1–6.
- Yoav Goldberg. 2015. A Primer on Neural Network Models for Natural Language Processing.
- Karl Moritz Hermann and Phil Blunsom. 2013. The Role of Syntax in Vector Space Models of Compositional Semantics.
- Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning Distributed Representations of Sentences from Unlabelled Data.
- G Hinton. 2014. What is Wrong with Convolutional Neural Nets? https://www.youtube.com/watch?v=rTawFwUvnLE
- G Hinton. Online statements. https://www.reddit.com/r/MachineLearning/comments/21mo01/ama_geoffrey_hinton/clyj4jv
- Ozan Irsoy and Claire Cardie. 2013. Bidirectional recursive neural networks for token-level labeling with structure. CoRR, abs/1312.0493.
- N. Kalchbrenner, E. Grefenstette, and P. Blunsom, 2014, “A convolutional neural network for modelling sentences,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Available: http://goo.gl/EsQCuC
- Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar. Association for Computational Linguistics.
- Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors.
- F Kokkinos and A Potamianos. 2017. Structural Attention Neural Networks for improved sentiment analysis.
- Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2015. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.
- Q Le and T Mikolov. 2014. Distributed Representations of Sentences and Documents.
- Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. 2017. Deep Learning with Dynamic Computation Graphs.
- M Luong, R Socher and C Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology.
- Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In ICASSP, pages 5528–5531. IEEE.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. "Effective approaches to attention-based neural machine translation". CoRR, abs/1508.04025.
- B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.
- S Sabour, N Frosst, and G Hinton. 2017. Dynamic Routing Between Capsules.
- Socher, R., Lin, C. C.-Y., Ng, A. Y., and Manning, C. D. (2011a). Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Getoor, L., and Scheffer, T. (Eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 129–136. Omnipress.
- R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011b. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP.
- R. Socher, B. Huval, C.D. Manning, and A.Y. Ng. 2012. Semantic compositionality through recursive matrix vector spaces. In EMNLP.
- Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
- A Radford, R Jozefowicz, and I Sutskever. 2017. Learning to Generate Reviews and Discovering Sentiment.
- P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
- Wang, X., Liu, Y., SUN, C., Wang, B., and Wang, X. (2015). Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1343–1353, Beijing, China. Association for Computational Linguistics.
- Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative Study of CNN and RNN for Natural Language Processing.
- Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2017. Recent Trends in Deep Learning Based Natural Language Processing.