Distributed and Parallel Databases, Vol. 6, No. 4, pp. 373-420, 1998
-----------------------------

Solving Local Cost Estimation Problem for Global Query Optimization in Multidatabase Systems

Qiang Zhu
Department of Computer and Information Science
The University of Michigan - Dearborn
Dearborn, MI 48128, USA

Per-Ake Larson*
Department of Computer Science
University of Waterloo
Waterloo, Ontario N2L 3G1, Canada

* Current address: Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA

ABSTRACT

To meet users' growing needs for accessing pre-existing heterogeneous databases, the multidatabase system (MDBS), which integrates multiple databases, has recently attracted considerable research attention. A key feature of an MDBS is local autonomy. For a query retrieving data from multiple databases, global query optimization should be performed to achieve good system performance. Global query optimization in an MDBS poses a number of new challenges. A major one is that some local optimization information, such as local cost parameters, may not be available at the global level because of local autonomy. This makes it difficult to find a good decomposition of a global query during query optimization. To tackle this challenge, a new query sampling method is proposed in this paper. The idea is to group component queries into homogeneous classes, draw a sample of queries from each class, and use the observed costs of the sample queries to derive a cost formula for each class by multiple regression. The derived formulas can then be used to estimate the cost of a query during query optimization. The relevant issues, such as query classification rules, sampling procedures, and cost model development and validation, are explored in this paper. To verify the feasibility of the method, experiments were conducted on three commercial database management systems supported in an MDBS.
Experimental results demonstrate that the proposed method is quite promising in estimating local cost parameters in an MDBS.

Keywords: multidatabase, global query optimization, cost model, query sampling, multiple regression
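
The query sampling idea summarized in the abstract (classify queries, sample each class, regress observed costs on explanatory variables) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two class names, the choice of operand size and result size (in thousands of tuples) as explanatory variables, the linear form of the formula, and the synthetic "observed" costs, which stand in for timings of real sample queries on a local DBMS. The regression is plain ordinary least squares solved through the normal equations.

```python
import random

def fit_cost_formula(samples):
    """Fit cost = b0 + b1*x1 + ... + bk*xk by ordinary least squares.

    samples: list of (explanatory_values, observed_cost) pairs.
    Returns the coefficient list [b0, b1, ..., bk].
    """
    rows = [[1.0] + list(xs) for xs, _ in samples]
    costs = [cost for _, cost in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) b = X^T y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, costs)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for c in range(col, k):
                a[row][c] -= factor * a[col][c]
            rhs[row] -= factor * rhs[col]
    # Back substitution
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (rhs[i] - tail) / a[i][i]
    return coeffs

def estimate_cost(coeffs, xs):
    """Evaluate a fitted cost formula for one query's explanatory values."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))

def draw_sample_costs(true_coeffs, n_queries, rng):
    """Stand-in for running sample queries on a local DBMS and timing them."""
    samples = []
    for _ in range(n_queries):
        operand = rng.uniform(1.0, 50.0)           # operand size, thousands of tuples
        result = operand * rng.uniform(0.01, 1.0)  # result size, thousands of tuples
        noise = rng.gauss(0.0, 0.05)               # synthetic measurement noise
        cost = estimate_cost(true_coeffs, (operand, result)) + noise
        samples.append(((operand, result), cost))
    return samples

rng = random.Random(0)
# Hypothetical query classes with made-up hidden behaviour (synthetic data only).
hidden_behaviour = {
    "unary_sequential_scan": [0.5, 0.02, 0.08],
    "unary_index_scan": [0.2, 0.001, 0.10],
}
# One regression cost formula per class, derived from 60 sample queries each.
cost_formulas = {
    query_class: fit_cost_formula(draw_sample_costs(coeffs, 60, rng))
    for query_class, coeffs in hidden_behaviour.items()
}
# Estimate the cost of a new query in one class during "optimization".
predicted = estimate_cost(cost_formulas["unary_sequential_scan"], (20.0, 5.0))
print(cost_formulas)
print(predicted)
```

In this sketch the global level never sees the hidden coefficients directly; it only observes sample costs and recovers an approximate formula per class, which mirrors the autonomy constraint described above. A real setting would need the classification rules, sampling procedure, and model validation that the paper goes on to develop.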