A Primer on Supervised and Unsupervised Machine Learning Models

As an aspiring machine learning engineer, I always tell people that I like to make computers think like humans.

…Okay, I knew I couldn’t fool you. To tell you the truth, what we do isn’t anything fancy, and it’s certainly not magic. This isn’t to say that what ML can accomplish isn’t remarkable: these mathematical models optimize our internet searches, drive cars, tailor education experiences to all sorts of individuals, predict illnesses, and classify objects with uncanny accuracy, passing milestones at a startling pace.

Academics tout their “AI solutions” as homologs to human reasoning, but the comparison holds only insofar as these algorithms learn and improve based on previous iterations. We operate on introspection, planning, higher-order thinking, and self-reflection. We take in sequential information well. The machines that we train rely on probability, pattern recognition, and the ability to take in feedback on a continuous, cyclical basis. Much of the time, the “artificial mind” crunches numbers and “things” in a manner that remains a mystery to the general public.

So, what processes do ML algorithms undertake to achieve such eerie results? Some build internal models of the data (sensory information) and look for an underlying structure, stitching patterns together from the top down. This is what is called an unsupervised learning task. Others take a bottom-up approach and predict outputs for new inputs based on the correct answers they have learned before, similar to how we are taught in a classroom. This is the heart of supervised learning. Of course, these different workflows are saturated with moving parts beyond what most of us can comprehend, but the fundamental goals may not be as worthy of the reverence we hold them in.

I will be going over the mechanics of some elementary machine learning models, differentiating between the unsupervised and supervised ones, and discussing why each of them falls into its respective category. Let’s start with some supervised algorithms.

Supervised Learning 🟢

Linear Regression

In simple terms, linear regression finds the best-fit line for describing the relationship between two or more variables. The line is found by minimizing the sum of the squared residuals, that is, the squared distances from the observations to the predicted line. The resulting line can be expressed as follows:

E(Y | X_1, …, X_p) = β_0 + β_1X_1 + … + β_pX_p, or equivalently Y = β_0 + β_1X_1 + … + β_pX_p + ε, where:

  • E is the expected value.
  • Y is the dependent variable, modeled as a function of the predictors.
  • X_i, i = 1, …, p are the predictors (independent variables).
  • β_0 is the intercept, and β_i, i = 1, …, p are the coefficients: the change in the expected outcome for every unit of change in the predictor X_i, with the other predictors held fixed.
  • ε is the disturbance, which represents variation in Y that is not explained by the predictors.
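
To make the least-squares idea concrete, here is a minimal sketch in plain Python for the single-predictor case (p = 1). The function names and the toy data are made up for illustration, and in a real project you would more likely reach for a library such as NumPy or scikit-learn.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the line through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """The quantity that the least-squares line minimizes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, roughly y = 1 + 2x with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = fit_simple_ols(xs, ys)
ssr = sum_squared_residuals(xs, ys, b0, b1)
print(f"intercept={b0:.2f}, slope={b1:.2f}, SSR={ssr:.3f}")
```

Because the fitted coefficients minimize the sum of squared residuals, nudging either the intercept or the slope away from the fitted values can only make that sum larger.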

Note that a linear regression model has three main types of assumptions:

  1. Linearity assumption: There is a linear relationship between the dependent and independent variables.
  2. Error term assumptions: ε^(i) are normally distributed, independent of each other, have a mean of zero, and have constant variance, σ^2 (homoscedasticity).
  3. Estimator assumptions: Independent variables are independent of each other (no multicollinearity) and measured without error.

Linear regression is supervised 🟢. You are trying to predict a real number from the model you trained on your dataset of known dependent variables.

Image

Visual from Backlog

Linear Support Vector Machines (SVMs)

SVMs separate the classes in training data by finding the optimal hyperplane between them. This hyperplane is optimal in the sense that it maximizes the margin, or separation, between the classes. SVMs can be used in both classification and regression problems, but they typically exemplify the idea of a large-margin classifier.

The geometric intuition behind classifier optimality is shown below. There are clearly infinitely many separators for the two classes (blue circles and X’s), but there is one that is equidistant from the two “blobs”. For that separator, the margin between the blue circles and the X’s is maximized. As for how this hyperplane is found, we take the subset of the training samples that lie closest to the decision surface, regardless of the class. These samples are called support vectors. Intuitively, SVM draws two parallel lines through the support vectors of each class. These two parallel lines, in red below, mark the edges of the margin.

Image

Image created by author

Eventually, the margin is maximized, and the line is drawn with the help of the support vectors. A wide margin tends to improve the classifier’s accuracy on new data points.

In our case, the dataset is linearly separable and has no noise or outliers, so a hard-margin SVM suffices. However, soft-margin SVMs are preferred in practice because they are less susceptible to overfitting than hard-margin SVMs and are more versatile, as they can tolerate some misclassified points and outliers.
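
The margin idea can be checked numerically without training anything. The sketch below, in plain Python, takes a candidate hyperplane w·x + b = 0 and reports its margin (the smallest distance from any point to the hyperplane) together with the points that attain it, which are the ones an SVM would treat as support vectors. The toy points, the two candidate hyperplanes, and the helper names are assumptions for illustration. This only measures a given separator and is not an SVM solver.

```python
import math


def signed_distances(points, labels, w, b):
    """Distance of each point to the hyperplane w.x + b = 0.

    Labels are +1 or -1; a positive distance means the point lies on
    the correct side of the hyperplane for its label.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    ]


def margin_and_closest(points, labels, w, b, tol=1e-9):
    """Return the margin and the points that sit exactly on it."""
    dists = signed_distances(points, labels, w, b)
    margin = min(dists)
    closest = [p for p, d in zip(points, dists) if abs(d - margin) <= tol]
    return margin, closest


# Hypothetical, linearly separable toy data.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (3.0, 3.0), (4.0, 4.0), (4.0, 3.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Two separators that both classify every point correctly.
diag_margin, diag_closest = margin_and_closest(points, labels, w=(1.0, 1.0), b=-4.0)
vert_margin, vert_closest = margin_and_closest(points, labels, w=(1.0, 0.0), b=-2.0)

print("diagonal line x + y = 4:", diag_margin, diag_closest)
print("vertical line x = 2:    ", vert_margin, vert_closest)
```

Both lines separate the classes, but the diagonal one leaves more room on either side, which is exactly the property a large-margin classifier looks for.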

Clearly, SVM is supervised 🟢. It requires the dataset to be fully labelled, hence the blue circles and the X’s. The only answer you’re trying to procure from SVM is the function describing the separation between the two known classes.

Image

Image from Velocity Business Solutions

Naive Bayes

You probably know that Naive Bayes operates on Bayes’ Theorem, which gives you the posterior probability of an event given prior knowledge. We mathematically define it as follows:

P(A|B) = P(B|A)P(A) / P(B), where A and B are events

In a testing context, it can also be expressed as the rate of true positives divided by the sum of the rates of true positives and false positives, where each rate is weighted by the prior probability of the event occurring or not occurring. What’s interesting about Bayes’ Theorem is that it separates a test for an event from the event itself. You can read more about how this works here.
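
Here is a small worked example of the theorem in a testing setting, written as a plain Python sketch. The function name and the numbers (a 1% prior, a 95% true positive rate, and a 5% false positive rate) are hypothetical and chosen only to illustrate the arithmetic.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(event | positive test) via Bayes' Theorem."""
    # P(B): a positive test comes either from a true positive
    # or from a false positive (law of total probability).
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


p = posterior(prior=0.01, true_positive_rate=0.95, false_positive_rate=0.05)
print(f"P(event | positive test) = {p:.3f}")
```

Even with a fairly accurate test, the posterior here comes out at only about 0.16, because the event is rare and false positives from the much larger unaffected group outnumber the true positives.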

Naive Bayes is a binary and multi-class classification algorithm in which predictions are made by calculating, for each data record, the probability that it belongs to each class. The class with the largest probability is chosen as the record’s predicted class. Naive Bayes is an uncomplicated and fast algorithm and is a common baseline for text categorization tasks. It works well even with very large volumes of data.

Why is Naive Bayes “naive”, however? It calculates the conditional probability of a class by multiplying together the individual probabilities of each feature, which assumes that the features are independent of one another given the class, a condition we’ll almost never encounter in real life.
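
The counting behind Naive Bayes can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the function names, the tiny weather-style dataset, and the use of add-one (Laplace) smoothing, which is a common way to avoid zero probabilities but is not something the description above requires.

```python
import math
from collections import Counter, defaultdict


def train_nb(records, labels):
    """Count how often each class and each (feature value, class) pair occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict_nb(model, record):
    """Return the most probable class and the log-score of every class."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        # log P(class) plus the sum of log P(feature value | class).
        # Summing logs is the "naive" product of independent probabilities.
        score = math.log(n_label / total)
        for i, value in enumerate(record):
            numerator = value_counts[(i, label)][value] + 1  # add-one smoothing
            denominator = n_label + len(vocab[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical labeled records: (outlook, wind) -> activity.
records = [
    ("sunny", "calm"), ("sunny", "windy"), ("sunny", "calm"),
    ("rainy", "windy"), ("rainy", "calm"), ("rainy", "windy"),
]
labels = ["play", "play", "play", "stay", "stay", "stay"]

model = train_nb(records, labels)
prediction, scores = predict_nb(model, ("sunny", "windy"))
print(prediction, {k: round(math.exp(v), 3) for k, v in scores.items()})
```

Note that training is nothing more than counting labeled examples, which is why the algorithm is fast and why it cannot work without labels.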

We can infer from looking at the formula that Naive Bayes is supervised 🟢. We need labels in our records to compute the probabilities for values of a certain feature given the label.

Unsupervised Learning 🔴

Principal Component Analysis (PCA)

Principal Component Analysis attempts to reduce the number of dimensions in a dataset while preserving as much variance as possible. It does so by re-expressing the data along new axes named “principal components”, which are ordered by how much of the dataset’s variance they carry, from most to least. Visually, the data points are projected onto the principal axes. Below, PC 1 is the axis that preserves the maximum variance.

Finding PC 2 requires a linear orthogonal transformation. The number of axes that PCA obtains is equal to the number of dimensions of the dataset, and every principal component is orthogonal to all the others.

Image

Image from StatistiXL

Step by step, PCA begins by computing the covariance matrix of the data, an example of which is shown below. Next, we decompose the covariance matrix to find its eigenvalues, denoted by λ, and corresponding eigenvectors. Geometrically, an eigenvector is a direction that the transformation leaves unchanged: the vector is only stretched by a factor given by its eigenvalue. Therefore, each eigenvector gives us the direction of a principal component, while its eigenvalue tells us how much variance lies along that direction.

Image

Image from Wikipedia

Mathematically, we can express these terms as follows. Consider the linear differential operator d/dx, which scales its eigenvector (or eigenfunction). For example, d/dx e^(λx) = λe^(λx). In terms of matrices, let A be an n×n matrix representing a linear transformation; then the eigenvalue equation can be written as the matrix multiplication Ax = λx, where x is a nonzero eigenvector. The set of all eigenvectors of a matrix associated with the same eigenvalue, together with the zero vector, is called an eigenspace.

This math might look menacing, but you’ll have a firm understanding after you’ve completed one linear algebra course. It all boils down to geometric intuition.
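
For two-dimensional data, the whole procedure fits in a short plain-Python sketch, because a symmetric 2×2 matrix has a closed-form eigendecomposition. The function names and the toy points are hypothetical, and for real datasets with more dimensions you would normally use a linear algebra library instead of this hand-rolled version.

```python
import math


def covariance_2d(points):
    """Sample covariance terms (var_x, cov_xy, var_y) of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return var_x, cov_xy, var_y


def principal_components_2d(points):
    """Eigenvalues and unit eigenvectors of the covariance matrix, largest first."""
    a, b, d = covariance_2d(points)  # matrix [[a, b], [b, d]]
    center = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    # Angle of the direction along which the variance is largest.
    theta = 0.5 * math.atan2(2 * b, a - d)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))  # orthogonal to pc1
    return [(center + radius, pc1), (center - radius, pc2)]


# Hypothetical toy data lying close to the line y = x.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8)]

(lam1, pc1), (lam2, pc2) = principal_components_2d(points)
explained = lam1 / (lam1 + lam2)
print(f"PC 1 direction: ({pc1[0]:.3f}, {pc1[1]:.3f}), share of variance: {explained:.3f}")
```

Since the points nearly fall on a line, almost all of the variance ends up on PC 1, which is the situation where dropping PC 2 loses very little information.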

Note that PCA won’t work well if you:

  1. Don’t have a linear relationship between your variables.
  2. Don’t have a large enough sample size.
  3. Don’t have data suitable for dimensionality reduction.
  4. Have significant outliers.

PCA is an unsupervised algorithm 🔴. It learns without any target variable. We tend to associate unsupervised learning with clustering techniques, and people have debated whether PCA can be considered a machine learning method at all since it’s used largely for preprocessing, but you can use the resultant eigenvectors to explain behavior in your data.

K-Means

Clustering explores a dataset and discovers natural groupings, or clusters, within it; depending on the method, these are not necessarily disjoint. K-Means accomplishes this by computing the distance from each record to a set of points called centroids, one per cluster. As records are assigned to clusters, the centroids are redefined to equal the means of their corresponding groups. This is essentially the high-level description of the algorithm.

Theoretically, K-Means works to minimize its objective function, which is the sum of squared errors. This quantity expresses intra-cluster variance.

Image

Image created by author

  1. J is the objective function.
  2. k is the predefined number of clusters.
  3. n is the number of records.
  4. x_i^(j) is the i-th point in the j-th cluster.
  5. c_j is the j-th centroid.
  6. ||x_i^(j) - c_j||^2 is the squared Euclidean distance between a point and the centroid of its cluster.

In practice:

  1. k points are selected at random as cluster centroids.
  2. Each object is assigned to the cluster whose centroid is nearest in Euclidean distance.
  3. The centroids are updated: each one is recalculated as the mean of the objects in its corresponding cluster.
  4. Steps 2 and 3 repeat until a stopping condition is met: the centroids do not change, or the objects remain in the same cluster, or the maximum number of iterations is reached.
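
The four steps translate almost directly into plain Python. In the sketch below, the function names and the toy points are made up, and the starting centroids are passed in explicitly so the run is reproducible; the random choice described in step 1 could be done with something like `random.sample(points, k)`.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]  # step 1 (given here)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    # Objective J: total squared distance from points to their centroids.
    objective = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, objective


# Hypothetical toy data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

centroids, assignment, objective = k_means(points, initial_centroids=[points[0], points[3]])
print(centroids, assignment, round(objective, 3))
```

No labels appear anywhere in the code: the grouping comes entirely from distances between points, and the returned objective is the quantity J that the algorithm drives down with each iteration.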

Below is a useful flowchart that visualizes the k-means algorithm:

Image

K-Means clustering is one of the most popular and straightforward unsupervised learning algorithms 🔴. It infers the grouping solely from the Euclidean distance, and not from any initial labels. However, there are semi-supervised variations of k-Means, such as semi-supervised k-means++, that adopt partially labeled datasets to add “weights” to cluster assignment. Much of the time, “supervised” and “unsupervised” labels shouldn’t serve as limitations for machine learning algorithms, and certainly shouldn’t forestall the development of alternative methods for the same dataset.

Association Rules

Finding interesting associations among data items is the leading unsupervised learning method after clustering. Association learning algorithms track how often items occur together in a dataset, and check that these co-occurrences are more frequent than random chance would produce. This approach is rule-based, so it scales to categorical databases. I’ll briefly discuss its motivation and intuition.

Consider a database of 10,000 transactions from a store. It is found that 5,000 customers bought Item A and 8,000 bought Item B. At first glance, these purchases might appear statistically independent, but a second look at the dataset reveals that 3,500 customers bought both Items A and B. So 50% of the customers bought Item A, 80% bought Item B, and 35% bought both, whereas independence would predict 50% × 80% = 40%. This kind of information is useful for marketing strategies.

Rules are developed from these uncovered relationships. The composition of an association rule is as follows: an antecedent implies a consequent, and the items in both of these sets together make up an itemset. For example:

{A,B}⇒{C}; I={A,B,C}

Various metrics such as support, lift, and confidence help quantify the strengths of found associations. Support is expressed as the frequency of an itemset in a database, as calculated by the following:

supp(X) = |{t ∈ T : X ⊆ t}| / |T|, where T is a database of transactions, and X is the itemset in question.

Confidence is the rate at which a rule such as X ⇒ Y is found to be true. It’s expressed as the proportion of transactions that contain X that also contain Y.

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X), where X ∪ Y is the itemset containing every item in X and in Y.

Lift of a rule is the ratio of the actual confidence and the expected confidence. How likely is Item B to be purchased when Item A is purchased, while still adjusting for how popular Item B is? It’s defined as follows:

lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
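
All three metrics are simple ratios of counts, as the following plain-Python sketch shows. The function names and the five shopping baskets are hypothetical and exist only to illustrate the definitions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Observed confidence relative to what the consequent's popularity alone predicts."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

supp_both = support(transactions, {"bread", "butter"})
conf_rule = confidence(transactions, {"bread"}, {"butter"})
lift_rule = lift(transactions, {"bread"}, {"butter"})
print(f"supp={supp_both:.2f}, conf={conf_rule:.2f}, lift={lift_rule:.2f}")
```

A lift above 1 means the two itemsets appear together more often than they would if they were independent, a lift of 1 means no association, and a lift below 1 means they tend to avoid each other.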

Some well-known algorithms that discover these rules are Apriori, Eclat, and FP-growth. Again, no class labels are assigned to the datasets in use, and the method works solely within its own constraints to find relationships, so it is unsupervised 🔴.

Conclusion

I hope you found this article useful in differentiating unsupervised and supervised learning, and in seeing what separates machine learning algorithms from our internal processes of problem solving and building intuition. That said, the upper echelons of the artificial intelligence world are rapidly developing, with self-aware agents, deep learning, attention mechanisms, and so much more, aiming to mimic our own cognitive systems.