Proc. of 4th IEEE Int'l Conf. on Parall. and Distr. Inf. Syst., pp. 220-31, Dec. 18-20, 1996
------------------------------

Building Regression Cost Models for Multidatabase Systems

Qiang Zhu
Department of Computer and Information Science
The University of Michigan - Dearborn
Dearborn, MI 48128, USA

Per-Ake Larson*
Department of Computer Science
University of Waterloo
Waterloo, Ontario N2L 3G1, Canada

* Current address: Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA

ABSTRACT

A major challenge for performing global query optimization in a multidatabase system (MDBS) is the lack of cost models for local database systems at the global level. In this paper we present a statistical procedure based on multiple regression analysis for building cost models for local database systems in an MDBS. Explanatory variables that can be included in a regression model are identified, and a mixed forward and backward method for selecting significant explanatory variables is presented. Measures for developing useful regression cost models, such as removing outliers, eliminating multicollinearity, validating regression model assumptions, and checking significance of regression models, are discussed. Experimental results demonstrate that the presented statistical procedure can develop useful local cost models in an MDBS.

KEYWORDS: multidatabase system, global query optimization, cost model, cost estimation, multiple regression