arXiv:2511.06973v2 Announce Type: replace 
Abstract: Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. Our method converts each spreadsheet into cell-level embeddings and then aggregates them with set-to-set measures such as the Chamfer and Hausdorff distances. Experiments across template families show better unsupervised clustering performance than the graph-based Mondrian baseline, with perfect template reconstruction on the FUSTE dataset (Adjusted Rand Index of 1.00 versus 0.90). Our approach supports large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
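
To make the aggregation step concrete, here is a minimal sketch of the Chamfer and Hausdorff distances between two sets of cell vectors. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how cells are embedded, so each cell is represented by a hypothetical toy vector `(row, column, is_numeric)`, and the Chamfer distance uses one common convention (the sum of the two directed mean nearest-neighbour distances).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def directed_min_dists(src: Sequence[Vector], dst: Sequence[Vector]) -> List[float]:
    """For each vector in src, the Euclidean distance to its nearest vector in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Sum of the two directed mean nearest-neighbour distances (one common convention)."""
    d_ab = directed_min_dists(a, b)
    d_ba = directed_min_dists(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)


def hausdorff_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Largest nearest-neighbour distance in either direction (worst-case mismatch)."""
    return max(max(directed_min_dists(a, b)), max(directed_min_dists(b, a)))


# Hypothetical toy cells: (row, column, is_numeric). A real pipeline would
# presumably use richer vectors (semantic embedding + type + position).
sheet_a = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
sheet_c = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (4, 1, 1)]  # one cell moved 3 rows down

print(chamfer_distance(sheet_a, sheet_a))    # identical layouts
print(chamfer_distance(sheet_a, sheet_c))    # averaged mismatch
print(hausdorff_distance(sheet_a, sheet_c))  # worst single-cell mismatch
```

In this toy case the Chamfer distance averages the displacement over all cells, while the Hausdorff distance is driven entirely by the single displaced cell, which is why the two can rank near-duplicate templates differently.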

A new hybrid distance metric for measuring spreadsheet similarity has been introduced, enhancing the identification of structurally similar spreadsheets. This method outperforms traditional approaches by achieving perfect template reconstruction on the FUSTE dataset, which is significant for automated template discovery and applications like data cleaning and model training.

Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery
