Thoughts on Taking the Probabilistic Graphical Models Course

三日月綾香

Simplified Chinese version

P.S.: This article can be read as an interim summary of my study of machine learning.

When I was taking this course, I spent the whole semester struggling with it. I kept asking myself what the point was of spending so many times more effort than on any other course on something I might never use later. Since my goal is natural language processing, I know that the current approach there is deep learning, solving problems with ‘pre-trained models + fine-tuning on downstream tasks’, and that the content of this course is outdated in that field.

However, it was this course that prompted me to think about how the field of machine learning has evolved to where it is today. Instead of just thinking about how to tune neural networks, I started to think about what the past approaches were like, why they were superseded by new ones, and where we are going from our current position.

I first began my journey into natural language processing because I thought languages were quite regular, which corresponds to the traditional approach of writing rules. But as the models became more precise, it became increasingly apparent that there were always spots the rules could not cover. This is where the advantage of deep learning comes into play. The shift from writing rules to deep learning is a shift in the way of thinking: we used to tell machines directly what the rules of a language are, but then we realised that these rules cannot be written down exhaustively. So we switched to studying ‘what structures are necessary for machines to learn languages’, and then let the machines learn the language themselves.

The probabilistic graphical models taught in this course, on the other hand, belong to traditional machine learning, which sits between rule-writing and deep learning in the history of the field. Broadly, approaches to natural language processing fall into two categories: rule-based and statistics-based. Writing rules is obviously rule-based, while traditional machine learning grew out of statistics and deep learning grew out of traditional machine learning, so both are statistics-based. The shift from rules to statistics is a concession to the fact that ‘human language is not always regular’, hence the quip ‘every time I fire a linguist, the performance of our speech recognition system goes up’. So what is the motivation behind the shift from traditional machine learning to deep learning?

Upon reflection, I think the reason is that traditional machine learning still constructs models from assumptions; the only thing that changed is that ‘grammatical rules’ were replaced by ‘mathematical rules’. In a rule-based approach, the grammar rules are the assumptions: the input is assumed to be grammatical, and the program processes it according to the rules. In a traditional machine learning approach, the input is likewise assumed to have certain statistical properties, and the model is built on those properties before producing a result. For example, in POS tagging, the current POS tag is assumed to depend only on the previous POS tag, and the current word only on the current POS tag, so that the Markov assumption and the output independence assumption hold, a hidden Markov model can be written down, and the Viterbi algorithm can be applied to produce the result. However, there are always exceptions to the assumptions, and that is where the problems lie. I remember a question on the Internet asking what is actually difficult about machine learning. The asker said, ‘I feel that the content of these algorithms is, to put it bluntly, not difficult at all, or even too easy.’ My first impression on reading this was that the person must be very good at maths, but then I thought it is about more than maths. Perhaps the asker finds the formulae simple, in the way that a log is simple. But when I try to apply these methods to natural language processing, I start to wonder: which part of human grammar corresponds to a log? If there is no such correspondence, then the problem is complex.
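To make the POS-tagging example above concrete, here is a minimal sketch of Viterbi decoding under exactly those two assumptions. The function, array names, and toy dimensions are my own illustration, not anything from the course:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely tag sequence for an observed sentence under an HMM.

    obs : word indices of the sentence, length T
    pi  : (K,)   initial tag probabilities
    A   : (K, K) transition probabilities, A[i, j] = P(tag j | previous tag i)  (Markov assumption)
    B   : (K, V) emission probabilities,   B[i, w] = P(word w | tag i)          (output independence)
    """
    K, T = len(pi), len(obs)
    delta = np.full((T, K), -np.inf)    # best log-score of any path ending in tag k at step t
    psi = np.zeros((T, K), dtype=int)   # backpointer: best previous tag
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)    # (previous tag, current tag)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    tags = [int(delta[-1].argmax())]    # backtrack from the best final tag
    for t in range(T - 1, 0, -1):
        tags.append(int(psi[t][tags[-1]]))
    return tags[::-1]
```

The two assumptions are visible directly in the shapes: A only ever relates a tag to the previous tag, and B only ever relates a word to its own tag.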

The difference between deep learning and traditional machine learning is the shift from bottom-up ‘hypothesis + model’ to top-down ‘intuition + structure’. Intuition tells us that ‘the same pattern can appear at different places in the input’, hence the convolutional neural network; intuition tells us that ‘humans read sentences in order, word by word’, hence the recurrent neural network; intuition tells us that ‘words in a sentence may depend on each other regardless of their order’, hence attention. In other words, we can use our intuition to build models from the data itself, choosing the network structure according to the characteristics of the data. Moreover, the various training methods are unified into one: backpropagation.
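A small sketch of what ‘different structures, one training method’ can look like in practice. The shapes, layer sizes, and the choice of PyTorch here are my own illustrative assumptions:

```python
import torch
from torch import nn

x = torch.randn(8, 20, 64)        # toy batch: 8 sentences, 20 tokens, 64-dim embeddings
target = torch.randn(8, 20, 64)   # toy regression target, just to have a loss

conv = nn.Conv1d(64, 64, kernel_size=3, padding=1)               # "the same pattern appears in different places"
rnn = nn.GRU(64, 64, batch_first=True)                           # "read the sentence in order, word by word"
attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)  # "any word may relate to any other word"

y_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, length)
y_rnn, _ = rnn(x)
y_attn, _ = attn(x, x, x)

# Whatever the structure, training is the same: compute a loss and backpropagate.
loss = sum(nn.functional.mse_loss(y, target) for y in (y_conv, y_rnn, y_attn))
loss.backward()
```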

But ‘letting go of the rules’ has one downside: it turns deep learning into an unexplainable ‘black box’. In traditional machine learning, you can look at a formula and explain a model’s output. For example, you can point to the Kullback-Leibler divergence and say, ‘this is why the EM algorithm must converge’. In deep learning, there is no such formula. On the other hand, if you want to explain ‘why deep learning works’, or ‘why deep learning can replace so many traditional machine learning methods’, the answer is the universal approximation theorem: a large enough neural network can approximate any continuous function to arbitrary precision. So the ‘interpretability of machine learning’ can be understood at both a micro and a macro level: micro means ‘pointing at a formula and saying why the model outputs what it does’, while macro means ‘how much the model’s learning capacity grows as the neural network grows’.
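For reference, the formula-level argument alluded to above is the standard decomposition used to show that EM never decreases the likelihood; the notation below is mine, with q denoting an arbitrary distribution over the latent variable z:

```latex
\log p(x;\theta)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x,z;\theta)}{q(z)}\right]}_{\mathcal{L}(q,\theta)}
  + \underbrace{\mathrm{KL}\!\left(q(z)\,\middle\|\,p(z\mid x;\theta)\right)}_{\ge 0}
```

The E-step chooses q to be the posterior over z at the current parameters, which makes the KL term zero, and the M-step then maximises the first term over the parameters; because the KL term is never negative, the log-likelihood cannot decrease from one iteration to the next.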

From this we can see that the essence of moving from writing rules, to traditional machine learning, to deep learning is ‘letting go’. Writing rules is humans teaching machines, by hand, how to write a sentence. Traditional machine learning takes a step back: given statistical assumptions, we construct a statistical model and let the machine work out statistically, under that model, how to write a sentence. Deep learning takes another step back: we build a suitable neural network based on the characteristics of natural language itself, and the machine eventually acquires the ability to learn how to write a sentence. In the future, the work should be about constructing new neural network structures and letting the machines learn by themselves.

In addition, statistics is divided into the frequentist and the Bayesian schools, depending on whether the model parameters are treated as constants or as random variables themselves. In machine learning, the frequentist school developed into statistical machine learning, while the Bayesian school developed into probabilistic graphical models. Current deep learning models mainly descend from the frequentist line, but some are developed from the Bayesian one, and introducing Bayesian methods into deep learning is worthwhile. For example, from a Bayesian perspective, sentence generation can be viewed as sampling from a target distribution, which can lead to better results on certain tasks.
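A minimal sketch of the ‘generation as sampling’ view, assuming we already have a model that produces next-word scores; the logits and the temperature parameter here are illustrative, not from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_word(logits, temperature=1.0):
    """Draw the next word from p(word | context) instead of always taking the argmax.

    logits: unnormalised next-word scores from some language model (assumed given).
    """
    probs = np.exp((logits - logits.max()) / temperature)  # softmax, numerically stabilised
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Repeatedly sampling and appending the chosen word yields a sentence drawn from the
# model's distribution; greedy decoding would instead always pick the single best word.
```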

(Written on 22 November 2021, published on 27 November 2021, translated on 29 December 2021)