A Robust and Explainable Query Optimization Cost Model Based on Bidirectional Graph Neural Networks

Chang, Baoming2024-11-052024-11-052024-11-05http://hdl.handle.net/10393/49824https://doi.org/10.20381/ruor-30662Learning representations for query plans play a pivotal role in machine learning-based query optimizers of database management systems. To this end, particular model architectures are proposed in the literature to transform the tree-structured query plans into representations with formats learnable by downstream machine learning models. However, existing research rarely compares and analyzes the query plan representation capabilities of these tree models and their direct impact on the performance of the overall optimizer. In this thesis, we conduct a comparative study of existing tree models and innovatively propose a tree model architecture based on Bidirectional Graph Neural Networks (Bi-GNN) aggregated by Gated Recurrent Units (GRUs) and experimentally show its significant improvement in the performance of cost estimation and plan selection tasks. The primary objective of query optimizers in relational database management systems (RDBMSs) is to identify the optimal execution plan with the lowest estimated cost for a given query. However, the inherent uncertainty in data and model parameters often leads to inaccuracies in cost estimation, resulting in the selection of suboptimal execution plans and not sufficiently robust query performance. In this thesis, we propose a novel learning-to-rank cost model that effectively quantifies the uncertainty associated with query execution cost predictions. This model integrates the predicted costs and their uncertainties using a learning-based mechanism and adapts by analyzing pairwise comparisons of integrated values, enabling more precise and adaptive uncertainty quantification and integration. We provide experimental evidence that our model significantly enhances query optimization's robustness, surpassing the capabilities of existing state-of-the-art techniques. In addition, the decision-making processes of machine learning models often present an inherent lack of transparency, making their predictions difficult for humans to understand or trust. This issue exists in learning-based query optimization cost models as well. To address this, we propose a novel explainability technique specifically designed for learning-based cost models. This is the first technique to incorporate explainability into the learning-based cost model by assessing the influence of subtrees in the query plan on the final cost prediction. It explains the impact of specific subgraphs or nodes by analyzing the inclusion relationships between subtrees. This technique can be seamlessly integrated into any learning-based cost model and be trained concurrently with it. Our experiments demonstrate that combining this technique can substantially improve the explainability and credibility of the cost model while further improving its cost estimation performance. By incorporating these innovations, we propose a comprehensive cost model for a Robust and Explainable Query Optimizer, Reqo, which simultaneously improves the performance of three critical dimensions of query optimization: cost estimation accuracy, plan selection robustness, and explainability. Our experimental results demonstrate that each dimension surpasses the corresponding state-of-the-art cost models.enAttribution 4.0 Internationalhttp://creativecommons.org/licenses/by/4.0/Query OptimizationCost Model ExplainabilityQuery Plan RepresentationLearning-Based Cost ModelCost Model RobustnessBidirectional Graph Neural NetworksA Robust and Explainable Query Optimization Cost Model Based on Bidirectional Graph Neural NetworksThesis