Intra-tree Column Subsampling Hinders XGBoost Learning of Ratio-like Interactions

arXiv — cs.LG · Wednesday, January 14, 2026
  • A recent study finds that intra-tree column subsampling in XGBoost (the per-level and per-node variants, `colsample_bylevel` and `colsample_bynode`) can hinder the model's ability to learn ratio-like interactions, in which a meaningful signal emerges only by combining several raw measurements. Using synthetic data with cancellation-style structure, the authors show that subsampling degrades the model's ability to recover such signals; a minimal sketch of the setup follows this list.
  • The finding matters for practitioners using XGBoost: when a target depends on ratios or rates of raw features, per-node or per-level subsampling can prevent the relevant features from being available together while a branch is grown, plausibly keeping the cancellation structure from ever being split on effectively.
  • The implications extend to the broader field of machine learning, where feature engineering and model optimization are central. The result raises questions about how current boosting methods handle cancellation structure and points to new strategies, such as explicitly constructed ratio features or relaxed subsampling, that could improve performance in similar settings.
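
The claimed mechanism lends itself to a quick experiment. Below is a minimal sketch, not the paper's actual setup: the synthetic target, the noise columns, and all hyperparameters are illustrative assumptions. Two XGBoost regressors differ only in `colsample_bynode`, the per-node flavor of intra-tree column subsampling, on a target driven by a ratio of two raw features.

```python
# Minimal sketch (illustrative, not the paper's experiment): a target
# driven by the ratio x0/x1 plus distractor noise columns. With per-node
# column subsampling, splits are often evaluated without both x0 and x1
# available in the same branch, which can obscure cancellation structure.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, n_noise = 20_000, 8
x0 = rng.uniform(1.0, 10.0, n)
x1 = rng.uniform(1.0, 10.0, n)
noise = rng.normal(size=(n, n_noise))
X = np.column_stack([x0, x1, noise])
y = x0 / x1 + 0.1 * rng.normal(size=n)  # ratio-like (cancellation-style) signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for colsample in (1.0, 0.3):  # full columns vs. aggressive per-node subsampling
    model = xgb.XGBRegressor(
        n_estimators=300, max_depth=6, learning_rate=0.1,
        colsample_bynode=colsample, random_state=0,
    )
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"colsample_bynode={colsample}: test RMSE = {rmse:.4f}")
```

Comparing the two held-out RMSE values gives a rough read on how much the subsampled model loses on this kind of target; the paper's synthetic designs are presumably more carefully controlled.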
— via World Pulse Now AI Editorial System

Continue Reading
Regression-adjusted Monte Carlo Estimators for Shapley Values and Probabilistic Values
Positive · Artificial Intelligence
A new study introduces regression-adjusted Monte Carlo estimators for Shapley values and probabilistic values, making these computations in explainable AI more sample-efficient. The method combines Monte Carlo sampling with a regression adjustment; the adjustment can be fit with various function families, from linear models to tree-based models such as XGBoost, while preserving unbiasedness. A toy sketch of the control-variate idea follows.
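
Below is a toy sketch of the regression-adjusted idea under loud assumptions: the cooperative game, the permutation sampler, and the linear control variate are illustrative, not the paper's construction, and fitting the adjustment on the same samples introduces a small finite-sample bias that a careful estimator would avoid.

```python
# Toy regression-adjusted (control-variate) Monte Carlo Shapley estimator.
# The game and the linear adjustment are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_players = 5
w = np.array([1.0, 2.0, 0.5, 3.0, 1.5])

def value(coalition):
    # Toy value function: mildly nonlinear in the members' total weight.
    s = w[coalition].sum()
    return s + 0.1 * s**2

def shapley_mc(player, n_samples=2000, adjust=True):
    feats, marginals = [], []
    for _ in range(n_samples):
        perm = rng.permutation(n_players)
        pos = int(np.where(perm == player)[0][0])
        S = perm[:pos]                       # players preceding `player`
        m = value(np.append(S, player)) - value(S)
        z = np.zeros(n_players)
        z[S] = 1.0
        feats.append(z[np.arange(n_players) != player])  # drop own column
        marginals.append(m)
    feats, marginals = np.array(feats), np.array(marginals)
    if not adjust:
        return marginals.mean()
    # Control variate: regress marginal contributions on coalition
    # indicators. Under uniform random permutations each other player
    # precedes `player` with probability 1/2, so the expectation of the
    # fitted prediction is known in closed form.
    reg = LinearRegression().fit(feats, marginals)
    expected_pred = reg.intercept_ + reg.coef_ @ np.full(n_players - 1, 0.5)
    return (marginals - reg.predict(feats)).mean() + expected_pred

phi_plain = shapley_mc(player=3, adjust=False)
phi_adj = shapley_mc(player=3, adjust=True)
print(f"plain MC: {phi_plain:.4f}   regression-adjusted: {phi_adj:.4f}")
```

The adjustment subtracts the part of the marginal contribution that the regression explains and adds back its known expectation, which reduces variance without changing the target quantity; swapping the linear model for a tree-based one such as XGBoost is the kind of generalization the paper describes.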
