arXiv:2601.08781v1 Announce Type: new 
Abstract: The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor–Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.
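The norm-sorting idea in the abstract can be illustrated with a small radius search in the Manhattan distance. This is a minimal sketch, not the CLASSIX implementation: it shows only the pruning step, and the function names (`l1_norm`, `manhattan`, `neighbors_within`) are hypothetical. It relies on the reverse triangle inequality, | ||x||₁ − ||y||₁ | ≤ ||x − y||₁, so once the norm gap along the sorted order exceeds the radius, no later point can be a neighbor.

```python
def l1_norm(x):
    """Manhattan (L1) norm of a vector given as a sequence of numbers."""
    return sum(abs(a) for a in x)


def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def neighbors_within(points, radius):
    """Return {index: [neighbor indices]} for all pairs within `radius`.

    Points are visited in order of increasing L1 norm. For a fixed point,
    the scan over later points stops as soon as the norm gap exceeds
    `radius`, because the reverse triangle inequality guarantees that
    every remaining point is farther away than `radius`.
    """
    order = sorted(range(len(points)), key=lambda i: l1_norm(points[i]))
    norms = [l1_norm(points[i]) for i in order]
    result = {i: [] for i in range(len(points))}

    for pos, i in enumerate(order):
        for pos2 in range(pos + 1, len(order)):
            # Norms are sorted ascending, so this gap only grows with pos2.
            if norms[pos2] - norms[pos] > radius:
                break  # early termination: no later point can qualify
            j = order[pos2]
            if manhattan(points[i], points[j]) <= radius:
                result[i].append(j)
                result[j].append(i)
    return result


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (5, 5)]
    print(neighbors_within(pts, radius=1))
```

The norm gap is only a necessary condition, so the exact distance is still computed for every candidate that survives the check; the saving comes from never touching points beyond the cutoff.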

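The CLASSIX clustering algorithm has been extended to support the Manhattan and Tanimoto distances while remaining fast and explainable. Instead of sorting by the first principal component, the extension sorts data points by an appropriate norm and uses the triangle inequality to terminate the neighbor search early; for the Tanimoto distance, a sharper intersection inequality prunes the search further. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is reported to be about 30 times faster than Taylor–Butina and about 80 times faster than DBSCAN, with higher-quality clusters in both cases.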
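For binary fingerprints, a size-based bound gives a feel for how Tanimoto pruning can work. The sketch below is an assumption-laden illustration and may differ from the intersection inequality actually used in the paper: it uses the well-known bound T(a, b) ≤ min(|a|, |b|) / max(|a|, |b|), which follows from |a ∩ b| ≤ min(|a|, |b|) and |a ∪ b| ≥ max(|a|, |b|). Fingerprints are represented as Python sets of on-bit indices, and the names `tanimoto_distance` and `tanimoto_pairs` are hypothetical.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance 1 - |a & b| / |a | b| for sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def tanimoto_pairs(fps, radius):
    """Return sorted index pairs (i, j), i < j, with distance <= radius.

    Fingerprints are visited in order of increasing bit count. With
    |a| <= |b|, the similarity is at most |a| / |b|, so a pair can only
    lie within `radius` if |a| >= (1 - radius) * |b|. Because |b| grows
    along the sorted order, the scan can stop at the first violation.
    """
    order = sorted(range(len(fps)), key=lambda i: len(fps[i]))
    sizes = [len(fps[i]) for i in order]
    pairs = []

    for pos, i in enumerate(order):
        for pos2 in range(pos + 1, len(order)):
            if sizes[pos] < (1.0 - radius) * sizes[pos2]:
                break  # every later fingerprint is too large to match
            j = order[pos2]
            if tanimoto_distance(fps[i], fps[j]) <= radius:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    fingerprints = [
        {1, 2, 3, 4},
        {1, 2, 3},
        {1, 2},
        {7, 8, 9, 10, 11, 12, 13, 14, 15},
    ]
    print(tanimoto_pairs(fingerprints, radius=0.5))
```

Sorting by bit count plays the role that norm sorting plays for the Manhattan distance: the bound is cheap to evaluate, and the exact set operations run only for candidates that pass it.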

Fast and explainable clustering in the Manhattan and Tanimoto distance
