The Python programming language has been used to analyze the data of the online retail shop. The k-means clustering method has been used in the program to group the customers into clusters according to the chosen criteria. The frequency and recency of each customer have been calculated in order to build the clusters with the k-means method.
List of Figures
Figure 1: Visualizing the cluster of data present in the data set
Figure 2: Removing the null values in the data set
Figure 3: Implementing the k-means clustering method
Figure 4: Generating the clusters based on customers
Figure 5: Generating the clusters using k-means
Figure 6: Dendrograms displaying the linked clusters
The data set has been imported into the Python notebook in order to understand its properties. These properties help in choosing suitable analysis methods for the data set. A distribution analysis shows how the values are distributed in each column of the data set.
A box plot has been used to understand the distribution of the variables in the data set. This kind of data exploration reveals the median, spread, and outliers of each variable. Python offers several data visualization techniques that support this type of analysis.
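The quantities that a box plot draws can be sketched with pandas alone. The column name `amount`, the sample values, and the helper `box_stats` are hypothetical and only illustrate the idea; the whisker rule used here is the common 1.5 × IQR convention, which is an assumption about the plot in the report.

```python
import pandas as pd

# Hypothetical sample; the real columns of the retail data set are not shown in the report.
df = pd.DataFrame({"amount": [10.0, 12.5, 11.0, 13.0, 12.0, 95.0, 9.5, 14.0]})


def box_stats(series: pd.Series) -> dict:
    """Return the values a standard (1.5 * IQR) box plot draws for one column."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations that lie inside the fences.
    inside = series[(series >= lower_fence) & (series <= upper_fence)]
    outliers = series[(series < lower_fence) | (series > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }


summary = box_stats(df["amount"])
print(summary)
```

With matplotlib installed, `df.boxplot(column="amount")` would draw the corresponding figure from the same data.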
The above image shows the removal of the null values from the columns of the data set. The columns containing null values have been dropped with the help of the drop() function in the Python notebook (Bogdándy et al. 2020). Null values can cause errors during the analysis of the data set. The redundant columns have also been dropped along with the null values in order to reduce the data set. Removing null and redundant values from the data set is known as data cleaning.
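A minimal sketch of this cleaning step is given here, assuming pandas. The column names and values are hypothetical, and the choice of `Description` as the redundant column is an assumption. The sketch uses `drop()` for the column and `dropna()` for the rows with missing values, which is one possible way to carry out the step and not necessarily the exact calls used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical extract of a retail transaction table (names and values are assumptions).
raw = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", "A3", "A4"],
    "CustomerID": [17850.0, np.nan, 13047.0, 12583.0],
    "Quantity": [6, 2, 8, 3],
    "UnitPrice": [2.55, 3.39, 2.75, np.nan],
    "Description": ["mug", "lamp", None, "clock"],
})

# Count the missing values per column before cleaning.
print(raw.isnull().sum())

# Drop a column treated as redundant for clustering (assumed here to be Description).
cleaned = raw.drop(columns=["Description"])

# Drop the rows that still contain nulls in the columns needed for the analysis.
cleaned = cleaned.dropna(subset=["CustomerID", "UnitPrice"]).reset_index(drop=True)

print(cleaned)
```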
The k-means model has been used to create the clusters that divide the customer list. The above image shows the different clusters along with their silhouette scores. The k-means clustering algorithm can be used to find groups that are not labeled in the data set. This kind of analysis can support business decisions once the type or identity of the previously unknown groups has been analyzed. It is an unsupervised machine learning method that draws inferences from the data set without referring to known labels. The task has therefore been implemented with this clustering method, and the result can be seen in the above image.
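The combination of k-means and silhouette scores can be sketched as follows, assuming scikit-learn. The data are synthetic (three generated customer groups in a recency, frequency, amount space), and the range of k, the scaling step, and the `random_state` values are assumptions made for illustration, not settings taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (recency, frequency, amount).
rng = np.random.default_rng(0)
centers = np.array([[10.0, 50.0, 5000.0], [100.0, 10.0, 800.0], [300.0, 2.0, 100.0]])
spread = np.array([5.0, 1.0, 50.0])
X = np.vstack([c + rng.normal(size=(50, 3)) * spread for c in centers])

# k-means is distance based, so the features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per candidate k and record its mean silhouette score.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("k with the highest silhouette score:", best_k)
```

A silhouette score close to 1 indicates compact, well separated clusters, so the k with the highest score is a reasonable candidate for the number of customer groups.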
The box plots have been used to display the clusters for each customer criterion. They show the frequency, recency, and total amount spent by each customer in the store. The above result is based on the “cluster_id” and “amount” columns, with cluster_id on the x-axis and amount on the y-axis. This kind of plot shows how the data are spread out within each cluster, and the box plots have been produced using the Python programming language.
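The three per-customer criteria can be derived from a transaction table roughly as follows. This is a sketch assuming pandas; the column names, the sample invoices, and the use of the latest invoice date as the reference date for recency are assumptions, not details taken from the report.

```python
import pandas as pd

# Hypothetical transactions (column names and values are assumptions).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-05", "2011-06-15",
        "2011-12-01", "2011-12-08", "2011-12-09",
    ]),
    "Amount": [20.0, 35.0, 15.0, 50.0, 10.0, 40.0],
})

# Reference date for recency: the most recent invoice in the data (an assumption).
snapshot = tx["InvoiceDate"].max()

rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("InvoiceNo", "nunique"),                            # number of distinct invoices
    amount=("Amount", "sum"),                                      # total amount spent
)

print(rfm)
```

Once a `cluster_id` column has been added to such a table, `rfm.boxplot(column="amount", by="cluster_id")` would give one box per cluster with cluster_id on the x-axis and amount on the y-axis.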
The k-means clustering method has been used to display the cluster line of the data in the data set. The cluster line helps to identify the average clusters of the customers that have been grouped according to the criteria. The above graph is obtained from the results of the k-means clustering process and indicates the number of clusters present in the data set.
The dendrogram shown in the above image has been generated to display the links between the clusters (Garain et al. 2018). A dendrogram is an effective way to display the hierarchical relations between correlated data points in the data set.
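A dendrogram of this kind can be sketched with SciPy's hierarchical clustering functions. The points are synthetic, and the Ward linkage method and the cut into two clusters are assumptions, since the report does not state which settings were used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Two synthetic groups of points standing in for scaled customer features.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(5, 2)),
])

# Build the merge hierarchy; each row of Z records one merge of two clusters.
Z = linkage(points, method="ward")

# no_plot=True returns the tree layout without drawing; omit it to draw with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

print("leaf order:", tree["leaves"])
print("flat cluster labels:", labels)
```

The height at which two branches join reflects how dissimilar the merged clusters are, so long vertical gaps in the drawn tree suggest natural places to cut it.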
The required result has been obtained after performing the data analysis on the data set of the online retail shop; it can be observed that the customer is the main aspect of the data design. The results of the analysis have shown the dimensions and metrics that were used to develop it. The float data type has been assigned to the values in the columns of the data set.
Bogdándy, B., Kovács, Á. and Tóth, Z., 2020, September. Case Study of an On-premise Data Warehouse Configuration. In 2020 11th IEEE International Conference on Cognitive Infocommunications (CogInfoCom) (pp. 000179-000184). IEEE.
Garain, N., Chattopadhyay, S., Mahapatra, G., Chatterjee, S. and Mondal, K.C., 2018, July. Design and implementation of an improved data warehouse on clinical data. In International Conference on Computational Intelligence, Communications, and Business Analytics (pp. 278-290). Springer, Singapore.