Machine learning can be broadly classified into two types: supervised learning and unsupervised learning.
Supervised learning: It can be thought of as a teacher supervising the learning process. Given a training dataset with feature values (x) and a target/output (y), the algorithm tries to learn a mapping function from x to y so that, given a new input x, it can predict an approximate value of y. Supervised learning problems can be further grouped into regression and classification.
Unsupervised learning: It can be thought of as being given learning material and having to work out how many different syllabi it can form based on some pattern. Here, we only have input data (x) and no corresponding target/output (y) variables. There are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data. Unsupervised learning problems can be further grouped into clustering and association problems.
Applications of Unsupervised methods:
- Marketing: In marketing segmentation, when a company wants to segment its customers to better adjust products and offerings
- Insurance: Identifying groups of motor insurance policyholders with a high average claim cost.
- City-planning: Identifying groups of houses according to their house type, value, and geographical location.
- Social network analysis: Identifying possible friends as well as targeted marketing campaigns
- Anomaly detection: Identifying anomalies in any process with unsupervised learning
In this article, we will walk through one of the main applications of unsupervised learning, i.e., clustering.
When you’re trying to learn about something, say music, one approach might be to look for meaningful groups or collections. You might organize music by singer, while your friend might organize it by decade. How you choose to group items helps you understand more about them as individual pieces of music. In both cases, you and your friend have learned something interesting about music, even though you took different approaches. When examples are unlabeled, clustering relies on unsupervised machine learning. If the examples are labeled, then clustering becomes classification. Clustering algorithms work by computing the similarity between all pairs of examples (input instances).
Types of Clustering:
These are some of the well-known types of clustering algorithms.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. k-means is the most widely used centroid-based clustering algorithm. In these models, the number of clusters required at the end must be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
Distribution models: This clustering approach assumes the data is composed of distributions (for example, Gaussian/normal distributions) and asks how probable it is that all the data points in a cluster belong to the same distribution. As the distance from the distribution’s center increases, the probability that a point belongs to the distribution decreases. These models often suffer from overfitting. A popular example is the Expectation-Maximization (EM) algorithm, which uses multivariate normal distributions.
Connectivity models: These are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
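As a quick illustration of the first (bottom-up, or agglomerative) approach, scikit-learn's AgglomerativeClustering merges the closest clusters step by step; the two-blob dataset here is synthetic:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of 2-D points (toy data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# Bottom-up (agglomerative) clustering: each point starts as its own
# cluster and the closest pairs are merged until 2 clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)
print(labels)
```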
Note: There are two ways of assigning instances to clusters: an instance can belong to exactly one cluster, or an instance can fall into each cluster with some probability. The first type is called hard clustering and the second type is called soft clustering.
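The distinction can be seen with scikit-learn: KMeans produces a single hard label per point, while GaussianMixture (a distribution model) produces a probability per cluster. The toy data below is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

# Hard clustering: each point gets exactly one cluster label.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # shape (100, 2); each row sums to 1
print(hard[:3], probs[0])
```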
This article focuses on k-means because it is an efficient, effective, and simple clustering algorithm.
K-means is an iterative clustering algorithm that implements the procedure outlined below.
Outline of the algorithm:
Assume we have input data points x1, x2, x3, …, xn and a value of K (the number of clusters needed). We follow the procedure below:
1. Pick K points as the initial centroids from the dataset, either randomly or the first K.
2. Find the Euclidean distance of each point in the dataset to each of the K centroids.
3. Assign each data point to the closest centroid using the distances found in the previous step.
4. Find the new centroid of each cluster by taking the average of the points assigned to it.
5. Repeat steps 2 to 4 for a fixed number of iterations or until the centroids no longer change.
Euclidean distance between two points in space: if p = (p1, p2) and q = (q1, q2), then the distance is given by

$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$
Assigning each point to the nearest cluster: if each cluster centroid is denoted by $c_i$, then each data point $x$ is assigned to the cluster

$\arg\min_{c_i \in C} \, \mathrm{dist}(c_i, x)^2$

where dist() is the Euclidean distance.
Finding the new centroid from the clustered group of points:

$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$

where $S_i$ is the set of all points assigned to the $i$th cluster.
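The steps above can be sketched as a minimal NumPy implementation (illustrative only; for brevity it does not handle clusters that become empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points from the dataset as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: new centroid = mean of the points in each cluster
        # (note: this sketch assumes no cluster ends up empty).
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop early once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For example, `kmeans(X, 2)` on a dataset with two well-separated groups should recover those groups.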
Advantages of k-means:
- Relatively simple to implement
- Scales to large data sets
- Guarantees convergence
- Easily adapts to new examples
- Can be generalized to clusters of different shapes and sizes, such as elliptical clusters
Disadvantages of k-means:
- Choosing k manually
- k-means has trouble clustering data where the clusters have varying sizes and densities
- Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored
- As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given pair of examples (the curse of dimensionality); reducing dimensionality with PCA before clustering helps
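One way to apply the PCA suggestion, sketched on synthetic high-dimensional data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic high-dimensional data: 200 samples, 50 features.
X = rng.normal(size=(200, 50))

# Project onto the top principal components before clustering,
# so the distance computations happen in a lower-dimensional space.
X_reduced = PCA(n_components=5, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, labels.shape)
```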
Practical considerations for K-Means:
- k-means cannot handle missing values, so they must be dealt with before running the algorithm
- k-means computes distances at its core, so categorical variables must be converted into numerical form for it to work properly
- Normalizing the data is generally considered good practice, since otherwise features on larger scales dominate the distance computation
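These considerations can be wired together with scikit-learn's preprocessing tools; the small customer table below is hypothetical:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

# Hypothetical customer table with a missing value and a categorical column.
df = pd.DataFrame({
    "age":    [25, 34, None, 52, 41, 29],
    "income": [30000, 58000, 45000, 90000, 72000, 38000],
    "region": ["north", "south", "south", "north", "east", "east"],
})

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    # Categorical column: convert to numeric via one-hot encoding.
    ("cat", OneHotEncoder(), ["region"]),
])

X = preprocess.fit_transform(df)  # 2 scaled numeric + 3 one-hot columns
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(X.shape, labels)
```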
Elbow Method in K-means:
The algorithm is somewhat naive: it clusters the data into k clusters, even if k is not the right number of clusters to use. Therefore, when using k-means clustering, users need some way to determine whether they are using the right number of clusters. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to n), and for each value of k calculate the within-cluster sum of squares (WCSS).
One method to validate the number of clusters is the elbow method, which uses a plot called a scree plot: the number of clusters (K) goes on the X-axis and the within-cluster sum of squares (WCSS) on the Y-axis.
You know by now that as the number of clusters increases, the WCSS keeps decreasing. This decrease is initially steep, and then the rate of decrease slows down, giving the plot an elbow shape. The number of clusters at the elbow usually indicates the optimum number of clusters.
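A sketch of computing WCSS for a range of k with scikit-learn (`inertia_` is sklearn's name for WCSS); the three-blob dataset is synthetic, so the elbow should appear near k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated clusters of 50 points each.
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares

# WCSS keeps decreasing as k grows; look for where the drop flattens out.
for k, w in zip(range(1, 9), wcss):
    print(k, round(w, 1))
```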
K-Means implementation using python sklearn:
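A minimal sketch of the sklearn API, using synthetic data in place of a real customer dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Toy 2-D dataset with two obvious groups (stand-in for real data).
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(8, 1, size=(100, 2))])

# k-means++ initialization and 10 restarts to avoid poor local optima.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(labels[:5])               # cluster index for the first five points
```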
I hope this blog was useful to you. Please leave comments or send me an email if you think I missed any important details or if you have any other questions or feedback about this topic.