K-nearest neighbors (KNN) is a non-parametric, instance-based learning algorithm built on the principle that similar data points tend to have similar values or belong to the same class. The algorithm is commonly employed for regression and classification tasks. This article discusses the main aspects of KNN and recent developments.
Introduction to KNN
In pattern recognition, the KNN algorithm classifies objects based on their closest training examples in the feature space. An object is classified by a majority vote of its neighbors and assigned to the most common class among its K nearest neighbors, where K is a positive integer. For instance, the classification of a new test feature vector is determined by the classes of that vector's K nearest neighbors. The Euclidean distance metric is typically used to locate the nearest neighbors.
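To make this concrete, the following is a minimal sketch of the majority-vote rule using the Euclidean distance, written with NumPy; the toy data and variable names are purely illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote of its k nearest training points."""
    # Euclidean (straight-line) distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example with two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([4.9, 5.1]), k=3))  # expected: 1
```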
To select the right K for a dataset, the KNN algorithm is run multiple times with various K values, and the K that minimizes the number of errors is selected while preserving the algorithm's ability to make accurate predictions on unseen data. Although several distance measures can be applied depending on the problem being solved, the straight-line (Euclidean) distance is used most extensively.
Predictions gradually become less stable as K decreases toward 1 and more stable as K increases, because majority voting (or averaging) smooths out noise. Accuracy improves up to a specific point, after which errors increase; this point indicates that the appropriate K value has been exceeded. The K value is typically chosen as an odd number to provide a tiebreaker in scenarios where a majority vote is required among labels, such as when selecting the mode in a classification problem.
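In practice, a suitable K is often found by scoring several candidate values on held-out data. The sketch below illustrates one such approach using scikit-learn's KNeighborsClassifier with five-fold cross-validation on the Iris dataset; the range of K values and the dataset are arbitrary choices for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd K values only, so a majority vote cannot tie between two classes
scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, cross-validated accuracy = {scores[best_k]:.3f}")
```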
KNN remains effective with large numbers of training examples and is robust to noisy training data. It does not require building a model, making additional assumptions, or tuning several parameters. This supervised machine learning algorithm is simple, versatile, and easy to implement, and can be used effectively for search, regression, and classification problems. However, the number of nearest neighbors (the value of the parameter K) and the type of distance to be used must be determined.
Additionally, the computation time is lengthy because the distance from every query instance to all training samples must be computed, which makes the algorithm significantly slower as the number of predictors (independent variables) and/or examples increases. This makes KNN impractical in environments where predictions must be made rapidly, since faster algorithms that can generate more accurate classification and regression results are available.
Despite this disadvantage, KNN remains effective for problems whose solutions depend on identifying similar objects, provided sufficient computing resources are available to process the data and make predictions rapidly. It is therefore used for classification, regression, and search problems.
Classification and Regression
Nearest neighbor classification is based on the idea that the patterns closest to the target pattern x, for which the label is sought, deliver crucial label information. KNN assigns the class label held by the majority of the K nearest patterns in data space. A similarity measure therefore has to be defined in the data space.
KNN is also applicable to multi-class classification problems. For an unknown pattern x, multi-class KNN predicts the class label held by the majority of the K nearest patterns, using an indicator function that returns one if its argument is true and zero otherwise to count the votes for each class.
Similarly, KNN regression computes the mean of the function values of the K nearest neighbors of an unknown pattern x. The averaging concept is based on the assumption that the function is locally smooth in label and data space. KNN has proven effective in a range of applications, such as quasar detection based on spectroscopic data.
The neighborhood size K is a crucial parameter in regression as well: for K = 1, KNN regression overfits to the label of x's nearest neighbor, while for K = N, where N is the total number of training patterns, it simply averages over all training patterns.
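The following sketch illustrates KNN regression as described above, predicting the mean label of the K nearest neighbors. The noisy sine data is an assumed toy example, and the two calls contrast K = 1 (overfitting to the single nearest label) with K = N (averaging over all training patterns).

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict the mean label of the k training patterns nearest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Noisy samples of a simple 1-D function
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, size=50)

x = np.array([2.5])
print(knn_regress(X_train, y_train, x, k=1))   # follows the single nearest label (overfits)
print(knn_regress(X_train, y_train, x, k=50))  # K = N: averages over all training patterns
```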
KNN Variants
Model-based KNN: The idea of model-based KNN is to replace the training set with a set of reference points (codebook vectors) that attains the same prediction results. The concept is related to the landmark variant of unsupervised kernel regression. The collection of landmark points is referred to as a model.
The selection of a set of landmarks is treated as an optimization problem: a subset of landmark vectors is sought that achieves the same nearest-neighbor result as KNN on the complete pattern set. First, a similarity matrix of the dataset is computed; then a search is performed for the neighborhood that encompasses the largest number of neighbors with the same label. The resulting model consists of a selection of landmark vectors that can be used as a surrogate for the original KNN model, as in the sketch below.
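The exact landmark-selection procedure varies between formulations; the following is only a simplified greedy interpretation of the idea described above, in which the pattern whose neighborhood covers the most same-label patterns is repeatedly kept as a landmark until every training pattern is covered.

```python
import numpy as np

def select_landmarks(X, y, k=5):
    """Greedy sketch: keep patterns whose neighborhoods cover the most
    same-label patterns until every training pattern is covered."""
    # Pairwise Euclidean distance matrix of the training set
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each row holds the point itself plus its k nearest neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k + 1]
    uncovered = np.ones(len(X), dtype=bool)
    landmark_idx = []
    while uncovered.any():
        # For each still-uncovered candidate, count the uncovered same-label
        # patterns inside its neighborhood; already-covered candidates are skipped
        gains = [np.sum(uncovered[nb] & (y[nb] == y[i])) if uncovered[i] else -1
                 for i, nb in enumerate(neighbors)]
        best = int(np.argmax(gains))
        landmark_idx.append(best)
        # The chosen landmark now represents its same-label neighbors
        covered = neighbors[best][y[neighbors[best]] == y[best]]
        uncovered[covered] = False
    return X[landmark_idx], y[landmark_idx]
```

The returned landmark set can then be used with a plain KNN classifier in place of the full training set.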
Distance-weighted KNN: KNN induces locally constant outputs, which, from an optimization perspective, implies an output space with plateaus. In KNN regression, only a limited number of distinct output values are possible for a neighborhood size K and N patterns. Plateaus hinder optimization methods from quickly approximating the optimal solution, as little information about promising search directions can be gained during optimization.
Distance-weighted KNN rules were therefore introduced in the 1970s to smooth the prediction function by weighting the prediction with the similarity of the nearest patterns to the target x. Patterns nearest to the target contribute more to the prediction than far-away patterns. Similarity can be defined using the distance between patterns.
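One common weighting scheme, used here purely as an illustrative assumption, is inverse-distance weighting; kernel-based weights are another option. The sketch below modifies plain KNN regression so that nearer neighbors contribute more to the prediction.

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, k=5, eps=1e-12):
    """Distance-weighted KNN regression: nearer neighbors receive larger weights."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Inverse-distance weights; eps avoids division by zero on exact matches
    weights = 1.0 / (distances[nearest] + eps)
    return np.sum(weights * y_train[nearest]) / np.sum(weights)
```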
Propagating 1-nearest Neighbor: Propagating 1-nearest neighbor is primarily an approach for semi-supervised learning, which involves both a set of labeled patterns and a set of unlabeled patterns. Unlabeled data in semi-supervised learning amplifies learning by revealing hidden structure in the data. In propagating 1-nearest neighbor, the unlabeled pattern closest to any of the labeled patterns is chosen at every step, assigned the label of that nearest labeled pattern, and moved to the labeled set.
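A minimal sketch of this procedure is given below, assuming plain Euclidean distance; at each step, the closest unlabeled pattern adopts the label of its nearest labeled neighbor and joins the labeled set, so labels gradually propagate through the data.

```python
import numpy as np

def propagate_1nn(X_labeled, y_labeled, X_unlabeled):
    """Iteratively label the unlabeled pattern closest to any labeled pattern."""
    X_lab, y_lab = X_labeled.copy(), list(y_labeled)
    X_unl = X_unlabeled.copy()
    while len(X_unl) > 0:
        # Distance from every unlabeled pattern to every labeled pattern
        dists = np.linalg.norm(X_unl[:, None, :] - X_lab[None, :, :], axis=2)
        u, l = np.unravel_index(np.argmin(dists), dists.shape)
        # Adopt the label of the closest labeled neighbor, then move the pattern over
        X_lab = np.vstack([X_lab, X_unl[u]])
        y_lab.append(y_lab[l])
        X_unl = np.delete(X_unl, u, axis=0)
    return X_lab, np.array(y_lab)
```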
Recent Developments
A hybrid classifier designated deep k-nearest neighbors (DkNN), which combines the KNN algorithm with the representations of the data learned by each layer of a deep neural network (DNN), was introduced. A test input is compared to its neighboring training points according to the distance that separates them in these representations.
The labels of these neighboring points provide confidence estimates for inputs outside the model's training manifold, including malicious inputs such as adversarial examples. This offers protection against inputs outside the model's understanding, as the nearest neighbors can be used to estimate how nonconformal a prediction is with the training data.
Moreover, the neighbors also provide human-interpretable explanations of predictions. Evaluation of the DkNN algorithm on several datasets showed that the confidence estimates accurately identify inputs outside the model's training distribution, and that the explanations provided by the nearest neighbors are useful and intuitive for understanding model failures.
The quantum kNN (QkNN), a quantum analog of classical KNN, was introduced using fidelity as the similarity measure. The QkNN algorithm can be reduced to an instance of the quantum k-maxima algorithm.
In this reduction, the fidelity between the training and test states is encoded as the amplitude of a quantum state. Converting this amplitude-encoded information to a digital format enables efficient comparison of the fidelities between the test state and all the training states, thus completing the reduction.
Unlike the existing quantum and classical KNN algorithms, this algorithm can be directly used on quantum data, avoiding expensive processes such as quantum state tomography. The algorithm has been successfully employed in quantum state discrimination and entanglement classification.
A study published in the Journal of Robotics and Control proposed an optimized KNN-based breast cancer detection model and evaluated its detection performance on a test dataset. The optimized model achieved 94.35% detection accuracy.
In conclusion, the simple and versatile KNN algorithm tackles diverse tasks, including classification and regression. However, its drawback of becoming slower as the size of the data grows must be factored in during implementation.
References and Further Reading
Assegie, T. A. (2021). An optimized K-Nearest Neighbor based breast cancer detection. Journal of Robotics and Control (JRC), 2(3), 115–118. https://doi.org/10.18196/jrc.2363
Kramer, O. (2013). K-Nearest Neighbors. Intelligent Systems Reference Library, 13–23. https://doi.org/10.1007/978-3-642-38652-7_2
Boateng, E. Y., Otoo, J., & Abaye, D. A. (2020). Basic tenets of classification algorithms K-nearest-neighbor, support vector machine, random forest and neural network: a review. Journal of Data Analysis and Information Processing, 8(4), 341–357. https://doi.org/10.4236/jdaip.2020.84020
Basheer, A., Afham, A., & Goyal, S. K. (2020). Quantum k-nearest neighbors algorithm. ArXiv. https://doi.org/10.48550/arXiv.2003.09187
Papernot, N., & McDaniel, P. (2018). Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. https://www.researchgate.net/publication/323746954_Deep_k-Nearest_Neighbors_Towards_Confident_Interpretable_and_Robust_Deep_Learning