Integrating Heterogeneous

Data Sources

Mala Krishnamurthy

John Lifter

CS 561

Advanced Database Systems

March 26, 1997


Integrating Heterogeneous Data Sources

The objective of database integration efforts is to make data distributed over a group of distinct, heterogeneous data sources accessible through a single user interface. This capability will become increasingly more important as the quantity of information available in publicly accessible databases increases as a result of public funding of multi-year data gathering research programs, and the number of individuals and organizations attempting to access and use this information become more numerous.

Three approaches to database integration can be envisaged: merging of the individual schema into a consensus schema before data entry; transformation of the data in the individual databases into a common database with a rich enough schema to provide for all of the collected data; and real-time querying of multiple heterogeneous databases and coordinated presentation of the selected data to the user. The third approach could be extended to include construction and maintenance of a local user specific data warehouse.

The students are familiar with the work of Buneman, Davidson, Overton, Hart and Wong at the University of Pennsylvania, in which a programming language (CPL: Collection Programming Language) has been developed to transform data between multiple data formats and to serve as a query interface to data stored in these formats. Through an analysis of this work, the essential requirements for a real-time multi-database query tool have been defined.

The query tool must be capable of:

The objective of this course project will be to use commercially available application development software [ Windows 95; Sybase SQL Anywhere (RDBMS); dBASE; PowerBuilder; PowerBuilder Internet Developer Toolkit; WebSite (server); and Netscape (browser) ] to develop a WWW based application that will allow users to perform multi-database queries. This report summarizes the current status of the project.

Equipment Selection

The application will be developed on stand-alone Pentium chip based computers. No attempt will be made to network multiple computers. Consequently, all of the application's software components will be running simultaneously on a single computer. This arrangement allows each student to develop components of the application while having the capability of running and evaluating the complete application.

Software Selection

PowerBuilder 5.0 Desktop and the PowerBuilder Internet Developer Toolkit, from the PowerSoft Division of Sybase, will be used to develop the database management, common gateway interface, and database management components of the application. PowerBuilder is an industry standard tool for developing client/server applications. Both students have had experience using a earlier release, PowerBuilder 4.0. The Internet Developer Toolkit provides methods to transform client/server applications into common gateway interface applications; this is a new product with which the students have no prior experience. These packages are described more completely in a following section.

WebSite hypertext transfer protocol server software is marketed by the publishers O'Reilly and Associates. This is also a product with which the students have no prior experience.

The relational database management system will be Sybase SQL Anywhere, a 32-bit Windows 95/NT application. While this software is available as a stand-alone product, it is also included as part of the PowerBuilder application. The students have had experience using an earlier release of the product, Watcom SQL, which was a 16-bit application for Windows 3.1 and PowerBuilder 4.0.

Corel Quattro Pro7 will be used to create non-relational databases. Files will be saved as dBASEIII files, as this is a data source format supported by PowerBuilder.

Model Databases

Although the students have an interest in applying the techniques developed in this project to querying multiple public databases containing information related to the human genome program, this project will be developed using model databases stored on the local computer. This approach was required as the selected version of PowerBuilder only supports local database access, and it is unlikely that user accounts on genome center databases can be established as part of this project.

The model relational database was obtained from a material tracking application used in a pharmaceutical manufacturing operation. The database includes three tables containing part descriptions, lot descriptions, and product component listings.

The following screen-shot shows the relationships between the tables.

For the purposes of this project, two non-relational databases have been created: a parts description database, which contains copies of the part_number and description columns of the relational database parts_list table; and a suppliers database which contains a listing of part numbers and suppliers (there can be multiple suppliers for a single part number).

PowerBuilder 5.0 Desktop

PowerBuilder 5.0 Desktop is a single user tool set for building Windows applications. As with other application development tools, e.g., Microsoft Visual Basic or Visual C++, the developer has access to a large variety of objects, e.g., command buttons, single-line edit boxes, static text. However the PowerBuilder product line has unique features to simplify the development of client/server applications that access databases.

Firstly, managing database connections is accomplished through either a built-in Open Database Connectivity (ODBC) interface or direct connection via a database management system specific database interface. When using the ODBC interface, the developer uses the PowerBuilder application to define a data source name, source file, and associated database management system. Direct connections to network databases are made through add-on modules which connect to the database through the database vendor's application programming interface. The PowerBuilder Desktop product can only utilize the ODBC interface to local databases; interfaces to Access, dBASE, and FoxPro, in addition to 16-bit and 32-bit Sybase SQL Anywhere (Watcom) interfaces, are included with the product. Direct database connection is supported in the PowerBuilder Professional and Enterprise editions. Applications developed with the PowerBuilder Desktop edition can be transported to the more advanced editions. Consequently, adapting the application developed in this project to network databases is uncomplicated.

Secondly, retrieving and manipulating data from a database is accomplished through the PowerBuilder datawindow object. The datawindow object maintains an association between the database ODBC interface, the query statement, and the presentation style. Additionally, the datawindow object automatically codes SQL update statements following modification of the data at the client computer. Multiple datawindow objects can interact; data can be passed from one datawindow object to a second datawindow object, providing a simple technique to use output from one query as a selection criterion for a subsequent query.

Because the PowerBuilder product line was introduced to develop business applications, the capability to generate a variety of printed report formats has been included in the product. Basically, the developer can edit the style, format, and content of the data within the datawindow object and create an attractive and effective report. Using the Internet Developer Toolkit (described in following section), the formatted report can be converted into HTML for viewing with web browsers.

Thus PowerBuilder provides the capabilities to meet all of the requirements of a multi-database query tool.

PowerBuilder Internet Developer Toolkit

The Internet Developer Toolkit provides three approaches to providing database data via the WWW:

Both the "window plug-in" and "datawindow plug-in" approaches require additional software on the user's computer which is accessed by the browser. While this requirement would be acceptable in a private corporate environment, it is not suitable for a public WWW interface.

The Web.PB approach is similar to common gateway interface applications; the browser calls an application program located on the web server. The Web.PB approach differs from common gateway interface applications in that the application program located on the web server assumes the client/server role of "client," and all data processing and database accesses are performed by the database server and a finished report, converted to HTML, is transmitted back to the web server and browser. Because all of the software required to support this approach can be installed at the source, the Web.PB approach is suitable for a public WWW interface.

This project will utilize the Web.PB approach to provide information from the model databases to the requesting browser.

Client/Server Application

As an initial step in the development program, we used PowerBuilder to create a client/server application which retrieved information into two datawindow controls. The application was written as a multi-document interface, i.e., separate documents within a single application are used to execute independent queries. Queries that extracted data from either one or two databases were written; either the relational or non-relational database could serve as the data source for the independent or dependent query.

The transaction object specifies the parameters that PowerBuilder uses to connect to a database. PowerBuilder has a built-in transaction object (SQLCA: SQL Communication Area) that is created automatically when the application is started. In order to connect to two databases simultaneously, a second transaction object must be created by defining a global object inherited from the built-in transaction object. The variables of each transaction object are populated with the specifics of the database connection.

For this developmental application, two transaction objects were used; the first transaction object, SQLCA, maintains information related to the relational database; the second transaction object, trans2db, is provided with information related to the non-relational database.

// PowerBuilder provides a transaction
// object -- SQLCA
// There is no need to create SQLCA
// Connect to Material.DB
SQLCA.DBMS = "ODBC"
SQLCA.DBParm = "Connectstring='DSN=Material;UID=DBA;PWD=SQL'"

Connect Using SQLCA;

// Test for successful connection
If SQLCA.SQLCODE = -1 Then
Messagebox("Database","Unable to connect to Material DB" , information!)
End IF

// Create a second transaction object
// trans2db has been defined as a global variable
trans2db = CREATE transaction

// Connect to Parts.DB
trans2db.DBMS = "ODBC"
trans2db.DBParm = "Connectstring='DSN=Parts;UID=DBA;PWD=SQL'"

Connect using trans2db;

// Test for successful connection
If trans2db.SQLCODE = -1 Then
Messagebox("Database","Unable to connect with Parts DB", information! )
End IF

The important code in this excerpt are the assignments to the DBMS and DBParm variables.

SQLCA.DBMS = "ODBC"
SQLCA.DBParm = "Connectstring='DSN=Material;UID=DBA;PWD=SQL'"

and

trans2db.DBMS = "ODBC"
trans2db.DBParm = "Connectstring='DSN=Parts;UID=DBA;PWD=SQL'"

These assignment statements associate the ODBC interface and the data source name with each transaction object. Within PowerBuilder, the data source name "Material" had been previously associated with the Sybase SQL Anywhere ODBC driver and Material.DB file and the name "Part" associated with the dBASEIII ODBC driver and the directory containing the non-relational databases.

Three queries have been coded. The first provides a listing of part numbers and descriptions (top datawindow), and for each part number a listing of the lots of that part number (bottom datawindow). Scrolling through the parts listing causes the appropriate lot listing to be retrieved and displayed. A representative screen may be viewed.

The second query lists part numbers and descriptions, and for each part a list of suppliers is retrieved and displayed. A representative screen is reproduced below.

The third query lists the lots of material in inventory, and for each lot provides a listing of products manufactured from that material. To list the products in the dependent datawindow, this query extracts data from multiple tables of the relational database, as described in the following SQL statement.

SELECT "parts_list"."part_number",
"components"."lot_number",
"parts_list"."description"
FROM "components",
"lot_list", "parts_list"
WHERE ( "lot_list"."lot_number" = "components"."lot_number" ) and
( "parts_list"."part_number" = "lot_list"."part_number" ) and
( ( "components"."component" = :lot ) )
ORDER BY "parts_list"."part_number" ASC,
"components"."lot_number" ASC

The argument :lot, corresponds to the lot number of the selected part and is passed to the query from the independent datawindow. A representative screen is reproduced below.

The complete design of the client/server application is documented in the following library listing.

The PowerScript coding associated with each object is presented in the Appendix.

Continuing Work

Initial coding of the client/server application was completed on March 10th. Throughout the remainder of March we will be learning the basics of web server maintenance (WebSite) and the details of the Internet Developer Toolkit. Initial WWW pages will be designed and we will gain experience with forms programming.

In order to convert this client/server application into a web application, the Web.PB approach of the Internet Developer Toolkit will be utilized. We anticipate that the user's initial interaction with the application will be a WWW page allowing selection of one of the pre-coded or a user initiated query. To mimic multiple datawindow controls on a single screen, we will utilize "frame tags" to divide the response WWW page into multiple sections.

During the first two weeks of April, we will transpose the client/server application into a Web.PB application. During the third week of April, we will code queries in which the user provides a part number, lot number or part description and the database application provides some relevant information directly into the Web.PB application, and the last week will be spent completing the project documentation.

References

Buneman, P, Davidson, SB, Hart, K, Overton, C, & Wong, L. A Data Transformation System for Biological Data Sources. In: Proc. of the 21th VLDB Conference. Zurich, Switzerland. 1995.

Davidson, S, Hart, K, & Wong, L. Using CPL. Department of Computer and Information Sciences, University of Pennsylvania.

Davidson, SB, Overton, C, & Buneman, P. Challenges in Integrating Biological Data Sources. Department of Computer and Information Sciences and Department of Genetics, University of Pennsylvania.

Hart, K, & Wong, L. CPL as a Query Language for Genetic Databases. Department of Computer and Information Sciences, University of Pennsylvania.

Hart, K, & Wong, L. A Query Interface for Heterogenous Biological Data Sources. Department of Computer and Information Sciences, University of Pennsylvania.

Musciano, C. & Kennedy, B. HTML: The Definitive Guide. O'Reilly and Associates, Sebastopol, CA. 1996.

Peck, SB, & Arrants, S. Building Your Own WebSite. O'Reilly and Associates, Sebastopol, CA. 1996.


Appendix

The entire application is stored in a single PowerBuilder library file, "project.pbl." Within this file, objects, corresponding to windows, menus, and data window objects (the database queries), etc., are stored. Each object has an associated list of events. For example, the list of events associated with a window object includes "open," "close," "clicked," etc. To carry out the logic of the program, the application programmer associates code with selected events; not all events require code.

The application object, "project," is a nonvisual object whose name becomes the name of the application program. All application specific visual and nonvisual objects are must be associated with an application object. The code presented on page 6 of this report is contained within the "open" event of the application object. Within the "close" event, connections to the databases are released and the second transaction object is destroyed.

// Disconnect from the databases
disconnect using sqlca;
disconnect using trans2db;

// Destroy the second transaction object
destroy trans2db

// PowerBuilder destroys SQLCA automatically

Five window objects are currently required for the application. "w_about" simply presents version information for the application. None of the associated window events contain code.

"w_mdi_frame" is the multi-document interface window, the window in which each of the query windows is displayed. The only coding associated with this window is contained within the "open" event. When this window first opens, no query windows are opened. Consequently, coding within the "open" event resets the opened-window-counter to zero and disables the "File-Close" menu item. [ uo_sheetcounter is a nonvisual object which encapsulates a single variable representing the number of opened query windows. The object also encapsulates three functions which operate on the variable: zero, which resets the variable to zero; increment, which adds one (1) the variable; and decrement, which subtracts one (1) from the variable. As query windows open and close, the increment and decrement functions will be called. ]

uo_sheetcounter.zero ()

// Disable the Close menu item
m_mdi.m_file.m_close.enabled = false

The "open" and "close" events are coded for each of the query windows: "w_parts_lots;" "w_products;" and "w_suppliers." The code within the windows' "open" event increments the opened-window-counter, deactivates a window specific "File-Open-" menu item, associates the datawindow controls with the appropriate transaction object (database), and retrieves data into the independent datawindow.

For the "Parts and Associated Lots" query window, the "open" event code is:

// Increment the sheet counter and enable
// the Close menu item
uo_sheetcounter.increment ()

// Deactivate the open menu
m_mdi.m_file.m_open.m_lotlisting.enabled = false

// Associate datawindow controls with transaction objects
dw_partslisting.settransobject (trans2db)
dw_lotlisting.settransobject (sqlca)

// Retrieve data into the parts datawindow
dw_partslisting.retrieve ()
dw_partslisting.setfocus ()

For the "Active Parts and Suppliers" query window, the "open" event code is:

// Increment the sheet counter and enable
// the Close menu item
uo_sheetcounter.increment ()

// Deactivate the open menu
m_mdi.m_file.m_open.m_supplierlisting.enabled = false

// Associate datawindow controls with transaction objects
dw_suppliers.settransobject (trans2db)
dw_active_parts.settransobject (sqlca)

// Retrieve data into the parts datawindow
dw_active_parts.retrieve ()
dw_active_parts.setfocus ()

And for the "Products" query window, the "open" event code is:

// Increment the sheet counter and enable
// the Close menu item
uo_sheetcounter.increment ()

// Deactivate the open menu
m_mdi.m_file.m_open.m_products.enabled = false

// Associate datawindow controls with transaction objects
dw_lots.settransobject (sqlca)
dw_products.settransobject (sqlca)

// Retrieve data into the parts datawindow
dw_lots.retrieve ()
dw_lots.setfocus ()

The code within the"close" event decrements the opened-window-counter and enables the associated "File-Open" menu item. For the "Products" query window, the "close" event code is:

// Decrement the sheet counter
uo_sheetcounter.decrement ()

// Activate the open menu
m_mdi.m_file.m_open.m_products.enabled = true

Similar code is contained in the "close" event scripts for the other query windows.

As with the application object and windows, events are also associated with objects positioned on the window. The independent datawindow control on each window has associated events which may contain application specific code. For this application, the "rowfocuschanged" event of the independent datawindow contains code which retrieves data into the dependent datawindow. The data retrieved into the dependent datawindow is specific to the data in the selected row of the independent datawindow.

For the w_parts_lots query window, the coding in the "rowfocuschanged" event of the independent datawindow (dw_partslisting) is:

string ls_partname , ls_partnumber

// Highlight the selected row
this.selectrow ( 0 , false )
this.selectrow ( currentrow , true )

// Retrieve the part number and description from the row
ls_partname = this.getitemstring ( currentrow , 2 )
ls_partnumber = this.getitemstring ( currentrow , 1 )

// Set the title of the lot listing data window
dw_lotlisting.title = "Lot Listing for " + ls_partname

// If no lots are retrieved, notify the user
if dw_lotlisting.retrieve ( ls_partnumber ) = 0 then
dw_lotlisting.title = "There are no lots of " + ls_partname + " in inventory."
end if

As the user scrolls through the data, or clicks on a row of data, the "rowfocuschanged" event is triggered. With each execution of the "rowfocuschanged" event script the selected row is highlighted, information is extracted from the data in the selected row, the title of the dependent datawindow (dw_lotlisting) is changed to indicate the selected row, and, finally, data is retrieved into the dependent datawindow with the function call:

dw_lotlisting.retrieve ( ls_partnumber ).

Similar coding is contained within the "rowfocuschanged" event of the independent datawindow (dw_lots) of the w_products query window. The only difference in the script is the syntax of the function call which retrieves data into the dependent datawindow (dw_products):

dw_products.retrieve ( ls_lotnumber ).

While the coding is nearly identical the sources of data differs for the two query windows. w_parts_lots retrieves data from the nonrelational data base into the independent datawindow and from the relational database into the dependent datawindow. w_products retrieves data from the relational database into both datawindows. The appropriate database connections were established within the "open" event of the query window through the function call:

For the w_parts_lots query window:

dw_partslisting.settransobject (trans2db)
dw_lotlisting.settransobject (sqlca)

For the w_products query window:

dw_lots.settransobject (sqlca)
dw_products.settransobject (sqlca)

The transaction object "sqlca" contains connection information to the relational database and transaction object "trans2db" contains connection information to the nonrelational database.

Coding for the "rowfocuschanged" event of the independent datawindow (dw_active_parts) of the w_suppliers query window is slightly simpler:

string ls_partname , ls_partnumber

// Highlight the selected row
this.selectrow ( 0 , false )
this.selectrow ( currentrow , true )

// Retrieve the part number and description from the row
ls_partname = this.getitemstring ( currentrow , 2 )
ls_partnumber = this.getitemstring ( currentrow , 1 )

// Retrieve the listing of manufacturers
dw_suppliers.retrieve ( ls_partnumber )

Data is retrieved into the dependent datawindow (dw_suppliers) through the function call:

dw_suppliers.retrieve ( ls_partnumber )

The appropriate database connections for the w_suppliers query window were established within the window "open" event through the function calls:

dw_suppliers.settransobject (trans2db)
dw_active_parts.settransobject (sqlca)

During the application design process, the programmer associates the database queries with the datawindow object. These pre-defined datawindow objects are then placed into a datawindow control on the appropriate query windows. In event scripts, functions reference the datawindow control, as this is the object on the window, not the datawindow object.

The following table summarizes the relationships between the query windows, the datawindow objects, the datawindow controls, and the database queries.

A single common menu is used throughout the application. A submenu under the "File-Open-" menu allows opening of the query windows. As each window is opened, the corresponding submenu entry is disabled during execution of the "open" event of the query window; as each window is closed, the corresponding submenu entry is enabled during execution of the "close" event of the query window. The script within each submenu item is:

opensheet ( w_parts_lots , w_mdi_frame , 0 , original ! ),
for query window w_parts_lots;

opensheet ( w_suppliers , w_mdi_frame , 0 , original ! ),
for query window w_suppliers;

and

opensheet ( w_products , w_mdi_frame , 0 , original! ),
for query window w_products.

The "File-Close" menu item is disabled as the application starts. Opening a query window enables the "File-Close" menu item during execution of the function call -- uo_sheetcounter.increment (). When a query window is closed, execution of the function call -- uo_sheetcounter.decrement () -- disables the "File-Close" menu item if no other query windows remain open. The script of the "File-Close" menu item first obtains a reference to the active query window and then closes the window. The code for this action is:

close ( parentwindow.getactivesheet ( ) ).

Finally, the "File-Exit" menu item exits the application. The code for this script is:

close ( parentwindow ).

No menu selections exist under the "Window" menu item. As each query window is opened, a reference to the window is added to the "Window" menu item. The "0" within the argument list for the opensheet ( ) function indicates that a reference to the opened window should be added to the next to last menu item on the menu bar (typically the "Window" menu item).

The "Help" menu item has a single selection "About...," which opens the informational window w_about through the script:

open (w_about).