Supervised vs. Unsupervised Machine Learning: A Non-Technical Walkthrough
Supervised learning (SL) and unsupervised learning (UL) are two fundamental approaches to machine learning (ML). One of the most significant distinctions between them is that SL uses labelled data to generate outputs or predictions that conform to those defined labels, while UL does not. But there is more to the distinction than the nature of the data: the two approaches are used to perform related but distinct tasks. In this article, we provide a non-technical walkthrough of SL and UL, with the goal of building your understanding of these two approaches and how they can be used in a commercial context.
Supervised Learning explained
As mentioned above, supervised learning uses labelled data to generate predictions about some new and unseen data. To do so, SL uses a training set to “teach” models to learn specific patterns, trends and relationships in order to generate the desired output. This training dataset includes inputs and correct outputs, which is what enables a model to learn over time.
Solidifying your understanding of this concept may require a bit of background in statistics. For example, say the department of health wants to examine how obesity level relates to age and exercise frequency within a given population. The result might be obesity = x * age + y * frequency. The coefficients x and y, known as weights, are what supervised learning often tries to calculate. This is a drastic oversimplification: real-life use-cases that leverage supervised learning are often much more complicated, involve immense data sets that entail extensive processing time, and involve a greater number of unknowns. However, the underlying principle of learning the relationships contained within one data set and generalising them to another holds true for supervised machine learning.
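For readers who would like to see the idea in code, here is a minimal sketch of how the weights in obesity = x * age + y * frequency could be calculated. Everything in it is an assumption made for illustration: the five records are invented, the function name fit_two_weights is hypothetical, and the scores were deliberately generated with x = 0.5 and y = -2.0 so that we know what a correct fit should recover. It uses ordinary least squares, which is one common way of fitting such weights, not the only one.

```python
# Hypothetical, made-up records: (age, weekly exercise sessions, obesity score).
# The scores were generated as 0.5 * age - 2.0 * frequency, so we know
# which weights a correct fit should recover.
records = [
    (25, 3, 6.5),
    (40, 1, 18.0),
    (55, 4, 19.5),
    (30, 0, 15.0),
    (62, 2, 27.0),
]


def fit_two_weights(rows):
    """Least-squares fit of score = x * age + y * frequency (no intercept)."""
    # Sums that make up the 2x2 "normal equations" of least squares.
    s_aa = sum(a * a for a, f, s in rows)
    s_af = sum(a * f for a, f, s in rows)
    s_ff = sum(f * f for a, f, s in rows)
    s_as = sum(a * s for a, f, s in rows)
    s_fs = sum(f * s for a, f, s in rows)
    # Solve the two equations for the two unknown weights (Cramer's rule).
    det = s_aa * s_ff - s_af * s_af
    x = (s_as * s_ff - s_fs * s_af) / det
    y = (s_fs * s_aa - s_as * s_af) / det
    return x, y


x, y = fit_two_weights(records)
print("weight for age:", round(x, 3))
print("weight for exercise frequency:", round(y, 3))

# Generalise to a new, unseen person: age 45, exercising twice a week.
predicted_score = x * 45 + y * 2
print("predicted obesity score:", round(predicted_score, 2))
```

The labelled records play the role of the training set, and the final prediction shows the learned relationship being applied to data the model has not seen before.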
There are two types of supervised learning: classification and regression.
Regression
Regression is used to understand the relationship between dependent and independent variables, as in the example given above. The simplest way of understanding regression is that it answers ‘what comes next?’ and ‘how much or how many?’ using historical data. Other commercial applications of supervised machine learning include modelling customer lifetime value (LTV), predicting customer churn, forecasting sales and revenue, dynamic pricing, and risk analysis.
Classification
Classification uses an algorithm to sort data into specific categories or groups by recognising specific entities within the dataset and drawing conclusions as to how those entities should be labelled or defined: it answers the question ‘is this A or B?’ Classification is a valuable tool when the data must be sorted into a limited number of well-defined categories (e.g., a self-driving car accurately identifying other cars, pedestrians and road signs).
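To make the ‘is this A or B?’ question concrete, here is a small sketch of one of the simplest classification techniques, a nearest-neighbour classifier. It is an illustration under stated assumptions, not how a self-driving car actually works: the two features (height in metres and speed in km/h), the six labelled examples and the function name classify are all invented for this example.

```python
import math

# Hypothetical labelled examples: ((height in metres, speed in km/h), label).
labelled_examples = [
    ((1.7, 5.0), "pedestrian"),
    ((1.8, 6.0), "pedestrian"),
    ((1.6, 4.0), "pedestrian"),
    ((1.5, 50.0), "car"),
    ((1.4, 60.0), "car"),
    ((1.6, 45.0), "car"),
]


def classify(point, examples):
    """Return the label of the labelled example closest to `point`."""
    nearest = min(examples, key=lambda example: math.dist(point, example[0]))
    return nearest[1]


# Two new, unlabelled observations.
slow_object = classify((1.7, 5.5), labelled_examples)
fast_object = classify((1.5, 55.0), labelled_examples)
print("slow object is classified as:", slow_object)
print("fast object is classified as:", fast_object)
```

The model can only ever answer with a label it has already been given, which is why classification suits problems with a limited number of well-defined categories.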
Unsupervised learning
In contrast with the use of labelled data in SL, unsupervised machine learning involves analysing and categorising unlabelled data sets. Fundamentally, UL works in a very similar way to SL: both involve analysing patterns and discovering relationships in data. However, the techniques that are used, what they accomplish, and the nature of human involvement differ greatly from SL.
There are two main types of UL: clustering and dimension reduction.
Clustering
Clustering is a UL technique that involves organising data points into subsets or ‘clusters’ based on shared similarities between those data points. Clustering answers the question ‘how is this (data) organised?’ Whilst very similar to SL classification in terms of organising data into groups, there is a very important difference: clustering doesn’t use labelled data.
The implication is that, without human labels to guide their analysis, machines can detect invisible or seemingly arbitrary relationships and form groups based on them. A great example comes from marketing, where clustering is used to understand patterns in customer behaviour, create customer segments on that basis, and then target those segments with specific, relevant offers.
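The customer segmentation idea can be sketched with k-means, one widely used clustering algorithm. This is a simplified illustration and rests on assumptions: the six customers (orders per month, average basket value) are invented, the function name k_means is our own, and the starting centres are picked by hand so the example is repeatable, whereas real implementations usually choose them more carefully.

```python
import math

# Hypothetical customers: (orders per month, average basket value).
# Note that there are no labels anywhere in this data.
customers = [
    (1, 20), (2, 25), (1, 30),       # look like occasional, low-spend shoppers
    (9, 200), (10, 220), (8, 210),   # look like frequent, high-spend shoppers
]


def k_means(points, centroids, rounds=10):
    """Group points around the nearest centre, then move each centre
    to the average of its group; repeat for a fixed number of rounds."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Assumption: start from the first and last customer as initial centres.
clusters, centroids = k_means(customers, [customers[0], customers[-1]])
for segment, centre in zip(clusters, centroids):
    print("segment:", segment, "centre:", centre)
```

The algorithm is never told what the two segments mean. It is still up to a person to look at the groups and decide, for instance, that one represents occasional shoppers and the other loyal, high-value customers.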
Dimension Reduction
Dimension reduction involves reducing the number of features (within a dataset) that a machine learning model must analyse, keeping the main features that explain most of the trends and patterns observed within the data. Dimension reduction answers the question ‘how can this be simplified?’ It is commonly used where a data set may contain significant noise (i.e., information that is not relevant to generating an accurate prediction), or to simplify the visual representation of data.
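As a rough illustration of ‘how can this be simplified?’, the sketch below reduces two features to one using principal component analysis (PCA), a common dimension reduction technique. The data is invented: we assume five website visits described by pages viewed and minutes on site, two features that rise and fall together. The two-feature shortcut used to find the main direction is specific to this small case; real data sets with many features need a general-purpose library.

```python
import math

# Hypothetical visits: (pages viewed, minutes on site). The two features
# carry almost the same information, so one combined feature may suffice.
visits = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.0)]

n = len(visits)
mean_pages = sum(p for p, m in visits) / n
mean_minutes = sum(m for p, m in visits) / n

# Variances and covariance of the two features.
var_pages = sum((p - mean_pages) ** 2 for p, m in visits) / n
var_minutes = sum((m - mean_minutes) ** 2 for p, m in visits) / n
covariance = sum((p - mean_pages) * (m - mean_minutes) for p, m in visits) / n

# For two features, the direction of greatest spread has a closed form.
angle = 0.5 * math.atan2(2 * covariance, var_pages - var_minutes)
direction = (math.cos(angle), math.sin(angle))

# Replace each two-feature visit with a single number along that direction.
reduced = [
    (p - mean_pages) * direction[0] + (m - mean_minutes) * direction[1]
    for p, m in visits
]

# Share of the original variation that the single new feature retains.
explained = (sum(r * r for r in reduced) / n) / (var_pages + var_minutes)
print("one-feature version of the data:", [round(r, 2) for r in reduced])
print("share of variation kept:", round(explained, 4))
```

When the share of variation kept is close to 1, very little is lost by working with one feature instead of two, which is the sense in which the data has been simplified.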
Supervised vs. Unsupervised learning
From a technical perspective, supervised learning uses labelled data to infer a conditional probability distribution (i.e., what happens given something) and establish the relationship between the dependent and independent variable(s) to answer “what is the probability of B given A?” Therefore, SL is used for regression and classification challenges where labelled examples pairing the independent variables with the dependent variable are available.
In contrast, the goal of UL is to infer an a priori probability distribution; it asks “what sort of probability outcomes exist in the data already?” The fundamental assumption of UL is that the probability outcomes identified in the training data exist within the testing data in the same proportions. If this assumption is violated, the algorithm performs worse, as it will be unable to correctly identify the outcomes within the test data set (which is used to test the accuracy of models post training). An example of this is anomaly detection, where an algorithm can miss anomalies that were not represented in the training data.
The Key Thing to Remember: Labelled Data
The technical differences between the two approaches aside, the key distinction between SL and UL that most people should keep in mind is whether labelled data is used: an SL model uses labelled input and output data, while a UL model does not.
In supervised learning, the algorithm “learns” from the training dataset by iteratively making predictions on the data and adjusting towards the correct answer (that is, adjusting to find the optimal weights). However, supervised learning requires human involvement in labelling data to give it a specific meaning (for example, labelling an image as a “dog” or the word “house” as a noun). This enables SL to generate predictions calibrated to perform tasks that involve an understanding, or at least a label, of real-world objects and concepts. To further illustrate this, think of the navigation app you use on your smartphone. Using supervised learning, it might be able to predict your commute time based on time of day, weather, reported car accidents and so on. But first, the model must be shown past trips together with their actual recorded travel times. Supplying those known outcomes is the action of labelling data, and it is from them that the model learns, for instance, that rainy weather extends driving time.
In contrast, UL models do not look at data through the lens prescribed by human labels. UL models work on their own to discover the underlying structure and patterns of unlabelled data. However, human involvement is still required to validate the output of UL models and assess the usefulness and relevance of the identified pattern. In an e-commerce example, a UL model may reveal some previously unknown relationship in the buying patterns of certain shoppers, such as purchasing a particular set of products at the same time (for example, diapers, applesauce and sippy cups), and this insight can subsequently be used to recommend those products to other online shoppers with a similar profile.
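The ‘predict, compare, adjust’ loop behind supervised learning can be sketched in a few lines. This is a deliberately tiny illustration using gradient descent, one common way of adjusting weights. It assumes four invented past trips in which travel time is exactly 2.5 minutes per kilometre, and a single weight to learn. The names trips, learning_rate and minutes_per_km are our own.

```python
# Hypothetical labelled trips: (distance in km, actual travel time in minutes).
# The recorded travel times are the labels the model learns from.
trips = [(2, 5.0), (4, 10.0), (6, 15.0), (8, 20.0)]

minutes_per_km = 0.0   # the weight, starting from an uninformed guess
learning_rate = 0.01   # how large each adjustment is (an assumed value)

for step in range(50):
    # 1. Predict with the current weight and measure the average error signal.
    gradient = sum(
        2 * (minutes_per_km * km - minutes) * km for km, minutes in trips
    ) / len(trips)
    # 2. Adjust the weight a little in the direction that reduces the error.
    minutes_per_km -= learning_rate * gradient

print("learned minutes per km:", round(minutes_per_km, 3))
print("predicted time for a 10 km trip:", round(minutes_per_km * 10, 1))
```

Each pass through the loop nudges the weight closer to the value that best matches the labelled trips, which is what “learning over time” means in practice.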
Other differences
- Goals: Supervised learning predicts outcomes for new data, and you know up front what kind of result to expect. Thus, supervised learning can be thought of as having a target to aim for. Unsupervised learning algorithms generate insights from large volumes of new data, and it is the algorithm which determines what is unique or interesting in the dataset. In this sense, UL is untargeted and exploratory.
- Applications: Supervised learning models are ideal for spam detection, sentiment analysis, weather forecasting, pricing predictions and other tasks that use labelled data. Unsupervised learning is a great fit for anomaly detection, recommendation engines, customer personas and medical imaging.
- Complexity: Supervised learning is a more straightforward method for machine learning, whereas unsupervised learning requires powerful tools for working with large amounts of unclassified data, as well as human expertise to interpret the trends that are identified.
- Drawbacks: Supervised learning models can be time-consuming to train, and labelling the input and output variables requires expertise. Unsupervised learning methods can produce inaccurate results that make no sense unless experienced professionals validate the output.
Which approach to machine learning is best?
The best approach to ML depends entirely on the nature of the business challenge and the characteristics and criteria of the use-case. Here are some actions to take and questions to ask:
- Define your goal: do you know what to expect? Do you have a recurring well-defined problem, or do you need to solve new problems?
- Examine the input data: is it labelled data? If not, do you have the expertise to label it later?
- The algorithm: do you have an algorithm that supports the volume and structure of your data?
Final Word
You now have a solid foundation in the principles of SL and UL, their respective techniques and the types of business challenges they can be applied to solve. However, consider this information as ‘nice to know’. From a commercial perspective, understanding the technology and how it works isn’t critical: a well-defined business challenge and a clear understanding of your data are.
Do you have a business challenge you think could be a good candidate for machine learning? Contact us at [email protected] to discuss with a member of our team.
Written by: Clayton Black and Alex Shen