CS4432 (DBII) LAB #2

CS4432 (DBII) LAB #2 for D term 2001

OUT: Thursday, March 29, 2001.
DUE: Monday, April 9, 2001 at midnight.
TO BE SUBMITTED INTO MyWPI
POINTS: max of 100pts.

Your second programming lab is to extend the functionality of the basic MyDB database engine that you built in the first lab. In particular, you need to add a B+-tree index to support more efficient search over the data files and to avoid having to scan the complete file each and every time you are looking for a particular record. You are again required to use C++ or C, and no other languages are permitted.

Synopsis

Your program will be able to set up one or several persistent B-tree indices, to load up the B-tree index over a given database file, and to utilize the B-tree index to perform more efficient searches over the datafile thereafter. The searches could be queries that either look for tuples based on matching one equality condition (such as Give me all students in CS4432-relation where color of hair = "green") or a range condition (such as Give me all students in CS4432-relation where raising-hand-in-class >= "1" and raising-hand-in-class <= "100"). In the third programming lab, you may add some more advanced query support to your database program such as for example join processing.

Indexing of MyDB Data Files

Your program will create a persistent B+-tree index over a given datafile MyDB database implemented using lab 1. Please attempt to make use of the persistent file structure and all its basic access routines that you have been creating for lab 1. Note here that in principle both data files as well as index files should be supportable using the same persistent file structure from lab 1. You should however make sure you can handle for example several index files over one data file, as well as more than one data file. So some management of which files are in your database should be added.

You can continue to make the simplifying assumptions from lab1 in terms of needing to support only the string data format for attributes still hold, being able to use fixed-format records to lay out your persistent file, having the first attribute of each record of the database represent the key field, and such so.

To set up a B-tree index for a given MyDB datafile, you will be given:

Name of MyDB data file,
Name of index file for the MyDB data file
Number of the attribute (field) over which this index is defined.
Size of a node of the B+tree index specified in terms of how many search key values each index node can maximally hold (and not in terms of bytes).

For example, given the Movie database with schema from lab1, we have:

Movie( Key char(6), Name char(60), Year char(4), Ranking char(3), Score char(3)),

You may have an index defined as:

Name of MyDB data file: MovieDB.
Number of the attribute (field): 3 (3 here means that attribute "ranking" is the search field because we number attributes starting with position 0).
Name of index file: index.MovieDB.ranking
Size of the B+tree nodes: 4

The B+-tree index itself needs to be made persistent, i.e., it needs to be stored in a persistent (possibly fixed format) datafile similar to the MyDB data files created in lab1. You are free to design your own B+-tree node formats as long as you respect the requirement that the nodes of a given B+-tree must hold at most the number of key fields as indicated by "size" above.

Possible Design of B+-tree Nodes.

We sketch out one possible design for your B+-tree node formats below which you can adopt or adapt to your needs:

type char(1): to indicate if the node is an internal node ("1") or a leaf node ("0").
fill char(2): indicate the number of key values <= size that are actually currently filled in the node.
Ranking char(3): in particular, for the above example you would have 4 slots with 3 char each for 4 different ranking values for a total of 4*3 bytes.
Fileptr char(10): in particular, for the above example, you would have 5 slots with 10 char each for 5 possible file pointers to other locations in a datafile. If the node is an internal node, then all 5 of those file pointers may be utilized pointing to other locations in the index file that currently hold a child B+-tree node. If the node is a leaf node, then at most 4 of those file pointers may be utilized. Each of them now would refer to locations in the associated data file where they would now point to the particular record that holds the given search field value.

Indicator

Fill

Set of Data Items (d)

Set of Ptrs (d+1)

0 or 1

number from [1,size]

LIST of d SEARCH-KEYS

LIST OF d+1 POINTERS

Indicator	Fill	Set of Data Items (d)	Set of Ptrs (d+1)
0 or 1	number from [1,size]	LIST of d SEARCH-KEYS	LIST OF d+1 POINTERS

In other words, the "size" of an index node is not restricted in terms of the number of bytes, but only in terms of the number of different search key values it can hold. You are free to modify the above proposed format of the B+-tree nodes to include additional auxillary information that is useful to your design.

Possible Design of B+-tree Internal Node.

Continuing the above index example, an internal node in that B+-tree would have the following format:

type char(1), fill char(2), Ranking char(3), Ranking char(3), Ranking char(3), Ranking char(3), Fileptr char(10), Fileptr char(10), Fileptr char(10), Fileptr char(10), Fileptr char(10).,

If that node happens to be half-full only, then we may have:

Indicator

Fill

Set of Data Items (d)

Set of Ptrs (d+1)

"1"

"02"

< "va1", "val2", BLANK(3), BLANK(3) >

< "pointer-01", "pointer-02", "pointer-03", BLANK(10), BLANK(10) >

Indicator	Fill	Set of Data Items (d)	Set of Ptrs (d+1)
"1"	"02"	< "va1", "val2", BLANK(3), BLANK(3) >	< "pointer-01", "pointer-02", "pointer-03", BLANK(10), BLANK(10) >

In this example, note that the type of internal nodes is indicated by indicator="1". In this case, max is 4 and fill=2, meaning that only 2 of the 4 key value fields have actual values in it. In this example, the first file pointer "pointer-01" would point to a node in the index tree that denotes the search space of all search fields that have RANKING <= "va1", "pointer-02" would point to an index node capturing all records with search fields "va1" < RANKING <= "va2", and "pointer-03" would point to an index node for all search fields with "va2" < RANKING.

Possible Design of B+-tree Leaf Node.

Continuing the above index example, a leaf node in that B+-tree would have the same format as above however with a different interpretation. First, its indicator = "0" to indicate that it is a leaf. Second, the file pointers now all (but the last one) are directly pointing to a position in the corresponding data file holding a data record that is being indexed by the given key value. The last file pointer on the other hand would point to the sibling leaf node of the index tree to its right. This way all the children of the B+-tree are linked together in a sorted order.

If that node happens to be also half-full only, we may have:

Indicator

Fill

Set of Data Items (d)

Set of Ptrs (d+1)

"0"

"02"

< "va1", "val2", BLANK(3), BLANK(3) >

< "pointer-01", "pointer-02", BLANK(10), BLANK(10) "pointer-05", >

Indicator	Fill	Set of Data Items (d)	Set of Ptrs (d+1)
"0"	"02"	< "va1", "val2", BLANK(3), BLANK(3) >	< "pointer-01", "pointer-02", BLANK(10), BLANK(10) "pointer-05", >

In this example, the first file pointer "pointer-01" would point to a location in the associated data file where the corresponding record is located with the RANKING attribute equal to the search key value "va1", i.e., record has value RANKING="va1". The second file pointer "pointer-00" would point to a location in the associated data file where the corresponding record is located with the RANKING attribute equal to the search key value "va1", i.e., its RANKING="va2".

And, we note that the last and thus 5th file pointer of this leaf node would instead point to the right sibling leaf node in the index tree.

Search Using Index of MyDB Data Files.

Your program should support additional search functions of data files that now would utilize a tree index for data file access, whenever one is available on a given data file. In particular, you should now support search functions that either process an equality-based condition or a range-condition using your B+-tree, if one is available.

Header Files

Besides the functionality already provided by the first prototype version of your MyDB database engine, the extended MyDB database engine interface should now support additional functions as specified in the header file MyDB.h, which you will later ``#include'' in the main program. The MyDB.hshould export the following functions that are further described below:

// Original Functions

int CreateDB( char* dbName, char* datafile );

int PrintDB( char* dbName );

int RecordScan( char* dbName, int attrNumber, char* value );

int KeyScan( char* dbName, char* value);

int InsertRecord (char* dbName, char* value);

int DeleteRecord (char* dbName, char* keyValue);

int ModifyRecord (char* dbName, char* keyValue, int attrNumber, char* newValue);

void error( int rc );

// New MyDB functions: these are the functions you need to implement for lab2.

int CreateIndex( char* dbName, char* IndexName, int attrNumber, int indexNodeSize);

int InsertRecordIntoIndex (char* dbName, char* indexName, char* value);

int PrintIndex( char* dbName, char* indexName, int attrNumber );

int IndexScan( char* dbName, char* indexName, int attrNumber );

int IndexScanPlusAccess( char* dbName, char* indexName, int attrNumber );

int EqualitySearch( char* dbName, char* indexName, int attrNumber, char* value);

int RangeSearch( char* dbName, char* indexName, int attrNumber, char* value-min, char* value-max );

As in the first project, all functions that return an int, return a ``0'' when the operation was successful or error codes (defined by you) in any other case.

A Closer Look at the New MyDB Functions

int CreateIndex( char* dbName, char* indexName, int attrNumber, int indexNodeSize);

This function creates an index named indexName for the attribute in position attrNum of the database dbName . The index will be a persistent database file that consists of B+-tree nodes as explained above. The search keys stored in those B+-tree nodes will now correspond values of the indexed attribute at attrNum position of the database. For example, for the MovieDB database listed earlier, the call CreateIndex( ``MovieDB'', ``index.MovieDB.ranking'', 3, 4 ) will create an index on the Ranking attribute field of the MovieDB database listed earlier. And, each node of this index will hold at most 4 search keys. And, a second index could be created for the same MovieDB database with the call CreateIndex( ``MovieDB'', ``index.MovieDB.year, 2, 5 ) will create an index on the Year attribute field of the MovieDB database listed earlier. Each node of this index will hold at most 5 search keys. Details of how this index and its nodes may look like are given earlier in this document.

int InsertRecordIntoIndex (char* dbName, char* indexName, char* value);

This function should insert a reference to a record with values as given in the input string pointed to by value into the corresponding index indexName. For this, it would first check that this record with value value does indeed exist in the database file dbName. Then i would verify that a reference to this record has not yet been inserted into the index, because we want to avoid referencing the same data record twice within the index.

int PrintIndex( char* dbName, char* indexName, int attrNumber );

This function would print a description of the index indexName defined over the database relation dbName . You can design your own strategy of how best to display the index so that its structure as well as content are clear for the DB administrator. At a minimum, you need to display statistics about the tree, such as its size per node, the search key field it is defined over, the height of the tree, the number of leaf nodes, as well as some description of each node in the tree such as its content as well as its children. You can get fancy if you want and think about some graphical display of the tree for the DB Administrator to observe its properties better, but you do not need to do so.
int IndexScan( char* dbName, char* indexName, int attrNumber );
This function would traverse your B+-tree down from the Root node to the left most leaf node and then would start printing all leaf index nodes from left to the right in some format that reveals the content of each node in the index. You can design your own format to print out the content of each index node.
int IndexScanPlusAccess( char* dbName, char* indexName, int attrNumber );
This function would traverse your B+-tree from the root down to the left most leaf node and then would start printing the data records of the database dbName in the order in which they have been indexed by the index (leaf nodes). You can design your own format to print out the content of each data record. Note that this function would result in a sorted output of your database file if the index works correctly. Hence you may want to use this function when testing for the correctness of the working of your index structure and algorithm.
int EqualitySearch( char* dbName, char* indexName, int attrNumber, char* value );
This function would search for (and then also print) all records of the database dbName for which the attribute in the attrNumber position (starting with the first attribute key being at position zero) has the same value as the null-terminated string pointed by value. If no index is available on the database dbName, this function should revert back to utilizing the int RecordScan() function from lab1. However, if an index file indeed exists on this particular database and the given search key, then this index is selected by your optimizer and made use of for executing this search and locating the desired records.

int RangeSearch( char* dbName, char* indexName, int attrNumber, char* value-min, char* value-max );

This function would search for (and then print) all records of the database dbName for which the attribute in the attrNumber position (starting with the first attribute key being at position zero) has a value that is larger equal to the string pointed by value-min and smaller equal to the string pointed by value-max. Again if a suitable index is available it must be used. And, if not available, you may consider how to create a variation of the int RecordScan() function that now can handle such range search also without an index.

Sample Application

Below we give a very very simple application that uses your database engine to create and utilize an index over the Movie database from project1. We assume here that the Movie file is called movies.txt. This program prints all records in the database, prints all records with score of 8.9, and prints the record of the movie with `107712'' as key. Then the program inserts a new record with value "59734&Deep Blue See&2000&2&8.8", deletes record with `107712'' as key, modifies the rank of record with `97141'' as key to be "8.8", and at the end prints all records in the database after all previous operations. Thereafter, we set up two different indices over two different attributes of this one database, and print them out respectively. Then we insert one new databse record and also make sure that record gets inserted into both indices. We then do four different type of searches with the help of our indices above. One, we simply search for a sorted list of all references of data records and then for a sorted list of the actual records. Then we conduct one equality-based search based on the condition "rating = 111" and one range query based on the predicate "year >= 1999" and "year <= 2000".

#include "MyDB.h" main() { int rc; if (rc=CreateDB( "mydb", "movies.txt" )) { error(rc); } if (rc=PrintDB("mydb")) { error(rc); } if (rc=RecordScan( "mydb", 4, "8.9" )) { error(rc); } if (rc=KeyScan( "mydb", "107712" )) { error(rc); } if (rc= InsertRecord("mydb","59734&Deep Blue See&2000&2&8.8"){error(rc); } if (rc= DeleteRecord ("mydb","107712") { error(rc); } if (rc=ModifyRecord ("mydb", "97141", 4, "8.8") { error(rc); } if (rc=PrintDB("mydb")) { error(rc);} if (rc= CreateIndex(``mydb'',``index.MovieDB.ranking'',3,4) { error(rc); } if (rc= PrintIndex(``mydb'',``index.MovieDB.ranking,3) { error(rc); } if (rc= CreateIndex(``mydb'',``index.MovieDB.year,2,5) { error(rc); } if (rc= PrintIndex(``mydb'',``index.MovieDB.year,2) { error(rc); } if (rc= InsertRecord("mydb","58834&Deep Blue Ocean&2001&2&9.9"){ error(rc); } if (rc= InsertRecordIntoIndex ("mydb","index.MovieDB.ranking","58834&Deep Blue Ocean&2001&2&9.9"){ error(rc); } if (rc= InsertRecordIntoIndex ("mydb","index.MovieDB.year","58834&Deep Blue Ocean&2001&2&9.9"){ error(rc); } if (rc= IndexScan(``mydb'',``index.MovieDB.year,2) { error(rc); } if (rc= IndexScanPlusAccess(``mydb'',``index.MovieDB.ranking,3){ error(rc); } if (rc= EqualitySearch(``mydb'',``index.MovieDB.ranking,3,"111"){ error(rc); } if (rc= RangeSearch(``mydb'',``index.MovieDB.year,2,"1999","2000"){ error(rc); }}

Documentation

Each student is expected to submit a description of your design in a plain text file called README, as well as to include comments in the code. For the format of this document, please refer to CS Department Standard Document. We point out in particular that this means that you have to provide example testing cases you have worked with to evaluate your program, including testing runs that illustrate that your tests were successful. A discussion of what features work and do not work must be included.

Grading

The grade of your program will be based on four aspects:

Functionality - that the program correctly implements the specifications.
Quality - elegant design and modularity (to allow you to extend this functionality for example for lab3 in this class) .
Documentation - as described above, including discussion of your design choices and limitations and commenting of your code.
Evaluation - testing quality includes a clear description of the test runs utilized for assuring the functionality of your system, as well as scripts clearly illustrating that your program indeed worked for complicated cases and with proper error handling.

Submission

Submission of the assignment will take place electronically via myWPI. Each team turns in one copy of their MyDB system project. Please make sure to include your names and the term "CS4432 project 2" on top of each of your files to avoid any confusion. Ideally, we suggest that you give a "name" to your team by which you identify yourselves, like "Hackers101" or "DBWizards", and then utilize this team name for example for the README.teamname. Instructions for the submission process will be made available at myWPI. Your complete work including your README documentation file must be submitted in one package that has been zipped. Several individual files by one team will not be accepted. Make sure that your zipped file has a clear name of the form "project2.teamname".