PRASAN ROY

Network Data and Services Research Department

Bell Laboratories

600 Mountain Avenue

Murray Hill, NJ 07974 USA

mailto:prasan@research.bell-labs.com

Tel: +1-908-582 7332

Fax: +1-908-582 7308

 

Research Interests

 

My primary area of interest is database systems. My recent research has focused on:

 

XML Data Synopsis and Storage

·         Design of a simple and effective statistical model for XML data.

·         Design of efficient algorithms for deriving the most effective schema for storing XML documents in relational stores.

 

Query Processing and Optimization

·         Design of an efficient, scalable framework and algorithms for multi-query optimization.

·         Using the above multi-query optimization framework for designing efficient intelligent algorithms for related problems, such as query result caching and view/index selection and maintenance.

 

Online Garbage Collection

·         Design of a transactional reference counting algorithm for online garbage collection in object-oriented databases.

 

Education

 

Jan'98 -- Mar'01

Ph.D. (Computer Science and Engineering),

Indian Institute of Technology, Bombay

Thesis: Multi-Query Optimization and Applications

Advisor: Prof. S. Sudarshan

 

Jul'95 -- Jan'98

M. Tech. (Computer Science and Engineering),

Indian Institute of Technology, Bombay

Thesis: Optimization of DAG-structured Query Evaluation Plans

Advisor: Prof. S. Seshadri

 

Jul'90 -- May'94

B. Tech. (Computer Science and Engineering),

Indian Institute of Technology, Delhi

Thesis: Motion Parameter Estimation

Advisor: Prof. Kanad K. Biswas

 

Patents

 

1.    Prasan Roy, S. Seshadri, Avi Silberschatz and S. Sudarshan,

Garbage Collection in Object Oriented Databases Using Transactional Cyclic Reference Counting,

2002 U.S. Patent 6,363,402

2.    Phil Bohannon, Juliana Freire, Prasan Roy and Jerome Simeon,

A Cost-Based Approach to XML Storage, in preparation.

 

Publications

 

Refereed Journal Publications

1.    Prasan Roy and S. Sudarshan,

Multi-Query Optimization: Algorithms and Applications,

invited survey paper, Information Systems, in preparation.

2.    Nilesh Dalvi, Sumit Sanghai, Prasan Roy and S. Sudarshan,

Pipelining in Multi-Query Optimization,

invited paper, Journal of Computing and Systems Sciences (special issue on 2001 ACM Symposium on Principles of Database Systems (PODS)), in preparation.

Expanded and improved version of 6 below.

3.    Prasan Roy, S. Seshadri, Avi Silberschatz, S. Sudarshan and S. Ashwin,

Garbage Collection in Object Oriented Databases Using Transactional Cyclic Reference Counting,

invited paper, VLDB Journal (special issue on the 23rd International Conference on Very Large Data Bases, ed. Matthias Jarke), Vol. 7, No. 3, August 1998, pp. 179--193.

Expanded and improved version of 9 below.

 

Refereed Conference Publications

4.    Juliana Freire, Jayant Haritsa, Maya Ramanath, Prasan Roy and Jerome Simeon,

StatiX: Making XML Count,

2002 ACM SIGMOD Conference on Management of Data, Madison, WI, June 2002

5.    Phil Bohannon, Juliana Freire, Prasan Roy and Jerome Simeon,

From XML Schema to Relations: A Cost-Based Approach to XML Storage,

2002 International Conference on Data Engineering, San Jose, CA, March 2002

6.    Nilesh Dalvi, Sumit Sanghai, Prasan Roy and S. Sudarshan,

Pipelining in Multi-Query Optimization,

2001 ACM Symposium on Principles of Database Systems (PODS), Santa Barbara, CA, May 2001

Ranked as one of the top five papers in the conference.

Extended and improved version (2 above) was invited to a special issue of Journal of Computing and Systems Sciences.

7.    Hoshi Mistry, Prasan Roy, K. Ramamritham and S. Sudarshan,

Materialized View Selection and Maintenance using Multi-Query Optimization,

2001 ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, May 2001

8.    Prasan Roy, S. Seshadri, S. Sudarshan and Siddhesh Bhobe,

Efficient and Extensible Algorithms for Multi-Query Optimization,

2000 ACM SIGMOD Conference on Management of Data, Dallas, TX, May 2000

9.    S. Ashwin, Prasan Roy, S. Seshadri, Avi Silberschatz and S. Sudarshan,

Garbage Collection in Object Oriented Databases Using Transactional Cyclic Reference Counting,

23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, August 1997.

Ranked as one of the top five papers among the 335 papers submitted to the conference.

Extended and improved version (3 above) was invited to a special issue of VLDB Journal.

 

Technical Reports

10.Prasan Roy, K. Ramamritham, S. Seshadri, Pradeep Shenoy and S. Sudarshan,

Don't Trash your Intermediate Results, Cache 'em,

Technical Report, Dept. of Computer Science and Engineering, IIT-Bombay.

 

Thesis

11.Prasan Roy,

Multi-Query Optimization and Applications,

Ph.D. Thesis, Dept. of Computer Science and Engineering, IIT-Bombay, December 2000.

12.Prasan Roy,

Optimization of DAG-structured Query Evaluation Plans,

M.Tech. Thesis, Dept. of Computer Science and Engineering, IIT-Bombay, January 1998.

 

A discussion of the above papers appears later in the section Summary of Research Work.

 

Honors and Awards

 

·         Recipient of IBM graduate fellowship for Ph.D. studies.

·         Ranked 38th among approximately 4,000 examinees in GATE 1995, the Graduate Aptitude Test in Engineering in the stream Computer Science and Engineering.

·         Ranked 67th among approximately 80,000 examinees in JEE 1990, the Joint Entrance Examination for admission to the undergraduate program at the IITs.

·         Recipient of National Talent Search Scholarship in 1988 by the National Council for Educational Research and Training, New Delhi.

 

Employment History

 

Dec'00 -- present

Designation: Member of Technical Staff

Network Data and Services Research Department,

Bell Laboratories, Murray Hill, NJ (USA)

Working on:

- XML storage and querying.

- Database management and policy management aspects of IN Services.

Supervisor: Dr. Richard B. Hull

 

Jun'99 -- Aug'99

Designation: Research Intern

Microsoft Corporation,

Redmond, WA (USA)

Prototyped multi-query optimization algorithms on Microsoft SQL-Server.

Supervisor: Dr. Paul Larson

 

Jan'98 -- Apr'98

Designation: Teaching Assistant

Department of Computer Science and Engineering,

Indian Institute of Technology, Bombay (India)

Course: CS602: Applied Algorithms

Supervisor: Prof. Sundar Vishwanathan

 

Jun'97 -- Jan'98

Designation: Research Associate

Department of Computer Science and Engineering,

Indian Institute of Technology, Bombay (India)

Worked on multi-query optimization algorithms.

Sponsor: Engage Technologies/Redbrick Systems.

Supervisor: Prof. S. Seshadri/Prof. S. Sudarshan

 

Aug'95 -- Jun'97

Designation: Research Associate

Department of Computer Science and Engineering,

Indian Institute of Technology, Bombay (India)

Worked on the development of Brahma, a parallel database system.

Sponsor: Board of Research in Nuclear Sciences (BRNS), Department of Atomic Energy, India.

Supervisor: Prof. S. Seshadri/Prof. S.S.S.P. Rao

 

Dec'94 -- Jun'95

Designation: Research Associate

Department of Computer Science and Engineering,

Indian Institute of Technology, Bombay (India)

Worked on the development of a re-engineering tool for legacy COBOL software.

Sponsor: Associated Cement Companies Ltd., India.

Supervisor: Prof. D.B. Phatak

 

Aug'94 -- Nov'94

Designation: Assistant Systems Analyst

Tata Consultancy Services, Bombay (India)

 

Professional Activities

 

·         Member, Program Committee, ICDE 2003

·         Member, Program Committee, ACM SAC 2003 (Web Technologies and Applications track)

·         Member, Demonstrations Committee, SIGMOD 2003

·         External reviewer for ICDE, SIGMOD and VLDB for the past several years

·         Member of the ACM SIGMOD and SIGACT and the IEEE Computer Society

·         Organization volunteer for 22nd International Conference on Very Large Data Bases (VLDB) held at Mumbai in 1996

 

References

 

S. Sudarshan

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Powai, Mumbai 400 076 India

Tel: +91-22-576 7714, +91-22-572 2545 (ext 7714)

Fax: +91-22-572 0290

mailto:sudarsha@cse.iitb.ac.in

 

S. Seshadri

Chief Operating Officer

Strand Genomics

146, 5th Cross, RMV Extension

Bangalore 560 080 India

Tel: +91-80-346 7567

Fax: +91-80-346 7568

mailto:seshadri@strandgenomics.com

 

Krithi Ramamritham

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Powai, Mumbai 400 076 India

Tel: +91-22-576 7740, +91-22-572 2545 (ext 7740)

Fax: +91-22-572 0290

mailto:krithi@cse.iitb.ac.in

 

Summary of Systems Implementation Work

 

Query Optimizer

I have designed and implemented an extensible top-down query optimizer. The implementation is primarily based on the Volcano optimizer generator framework (Graefe and McKenna [ICDE93]).

 

The optimizer is being used as a research platform in the department. In particular, it forms the testbed for our research in multi-query optimization. We have extended the basic Volcano based framework in order to integrate the optimizer with the multi-query optimization algorithms and heuristics developed by us. This is discussed later in the section Summary of Research Work.

 

The optimizer is written in C++ and is approximately 17,000 lines of code. Certain ideas implemented as a part of this work were prototyped on Microsoft SQL-Server during the summers of 1999.

 

Query Result Cache Manager

As a part of the Exchequer project at IIT-Bombay, we are considering the problem of caching intermediate results of query evaluation for reuse in subsequent queries in the workload.

 

I have designed and implemented a cache manager that is capable of performing the above task. This tightly coupled system supports cache-aware query optimization as well as optimization-aware cache replacement. This is discussed later in the section Summary of Research Work.

 

Parallel Database Systems

I was a key architect of a prototype database management system (called Brahma).

 

This involved the design of an OS microkernel suited to a DBMS, a generic parallel object storage manager that formed the lowest layer of the DBMS, and a parallel query processing engine. The work brings together, in one system, a lot of interesting ideas that have appeared in recent literature.

 

I was responsible for the project as a whole and was actively involved in the overall design and development with other graduate students. I have developed the parts of the system that deal with storage management, object management, cache management and cache coherence. I have been very closely involved in the implementation of concurrency control and recovery protocols (implemented an extension of ARIES protocol of Mohan et al. [TODS92] for parallel shared-disk systems), the generic distributed lock manager (support for adaptive locking proposed by Joshi [VLDB91]) and the parallel query processing engine (based on the operator model of Volcano proposed by Graefe [SIGMOD90]).

 

Additionally, I have implemented GiST (Generalized Search Tree) which encapsulates the maintenance algorithms and concurrency control protocols for index structures of the B-Tree family (B+-Tree, R-Tree, etc.). The implementation is based on the work of Hellerstein et al. [VLDB95] and was done on top of the Brahma storage manager. The system is written in C++ and is approximately 65,000 lines of code. It is presently used as a platform for instruction and research in the department.

 

Summary of Research Work

 

I have been working on XML storage and querying, on multi-query optimization and its applications, and on online garbage collection algorithms for object oriented databases.

 

XML Data Synopsis

StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. This information can be used to build concise, yet accurate, statistical summaries for XML data.  StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this work, we develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.

 

This work was published in the ACM SIGMOD Conference held at Madison, WI (USA) in June 2002.

 

XML Storage

As Web applications manipulate an increasing amount of XML, there is a growing interest in storing XML data in relational databases. However, due to the mismatch between XML's tree structure and flat relational tables, there are many ways to store the same document(s) in a RDBMS, resulting in very different performances. LegoDB is a cost-based XML storage engine that addresses this problem. It explores a space of possible XML to relational mappings and selects the best mapping for a given application. LegoDB leverages current XML and relational technologies. It models the target application using with an XML Schema, XML data statistics, and an XQuery workload. The space of configurations is generated through XML-Schema rewritings. Finally, the best among of the derived configurations is selected using cost estimates obtained through a standard relational optimizer.

 

This work was published in the IEEE International Conference on Data Engineering (ICDE) held at San Jose, CA (USA) in March 2002.

 

Materialized View Selection and Maintenance

Materialized views are increasingly being supported by commercial database/data warehouse systems. In particular, in data warehouses the window for view maintenance is small, so efficient view maintenance techniques are essential. We extend our multi-query optimization framework to find an efficient plan for maintenance of a set of views, by exploiting common subexpressions between different view maintenance expressions. These common subexpressions may be materialized temporarily during view maintenance. Our algorithms also choose subexpressions/indices to be materialized permanently (and maintained along with other materialized views), to speed up view maintenance. Our algorithms can be used to efficiently select materialized views to speed up workloads containing queries, in addition to speeding up view maintenance.

 

This work was published in the ACM SIGMOD Conference held at Santa Barbara, CA (USA) in May 2001.

 

Multi-Query Optimization

We consider the problem of query optimization where either a single query has repeated subexpressions or multiple queries have common subexpressions. This problem is known as multi-query optimization, and is of importance because of the increasing use of complex decision support queries on large sets of data, particularly in data warehouses. Such queries typically benefit a lot from elimination of redundancies due to recomputation of repeated and shared subexpressions.

 

Traditional query optimizers are not appropriate for optimizing queries with common subexpressions, since they make locally optimal choices, and may miss globally optimal plans. We propose several novel and practical cost-based algorithms for multi-query optimization, based on the compact AND-OR DAG representation of queries.

 

We have implemented an optimizer based on the Volcano framework, and extended it to implement the multi query optimization algorithms, along with the above-mentioned extensions, in a seamless fashion; we believe the engineering aspect of our work is also one of our contributions. We present a performance study comparing the heuristics, using a workload consisting of queries based on the TPC-D benchmark. Our study shows that multi-query optimization can provide significant benefits over traditional optimization, at a very acceptable overhead.

 

This work was published in the ACM SIGMOD Conference held at Dallas, TX (USA) in May 2000. Extensions to this work were presented in the ACM Symposium on PODS held in Santa Barbara, CA (USA) in May 2001.

 

Query Result Caching

In data warehouse and data mart systems, queries often take a long time to execute due to their complex nature. Query response times can be greatly improved by caching final/intermediate results of previous queries, and using them to answer later queries. This is termed query result caching.

 

Query result caching is closely related to the multi-query optimization problem. The difference arises primarily from the fact that instead of a pre-defined batch of queries, there exists a continuous workload of queries. Therefore, instead of taking a one-time decision on picking the subexpressions to materialize as in multi-query optimization, the decision on what to keep in the cache needs to be revisited continuously. This work shows how to adapt the multi-query framework in this context, while keeping the overheads acceptable.

 

We have developed a query result caching system called Exchequer, based on our multi-query optimization framework. Using an optimizer framework for cache management in Exchequer results in its ability to do cache-aware optimization as well as optimization-aware cache management. In existing work, such a close coupling is absent. Furthermore, existing approaches are either restricted to cube queries, or to caching just the final query results. Our work is extensible and in fact presents a data-model independent framework and algorithm.

 

This work is in progress.

 

Garbage Collection

As a part of our work on garbage collection in object oriented databases, we have developed the Transactional Cyclic Reference Counting algorithm, a transaction-aware variant of a cyclic reference counting algorithm proposed for functional programming languages. The algorithm keeps track of auxiliary reference count information to detect and collect cyclic garbage. It works correctly in the presence of concurrently running transactions, and system failures. It does not obtain any long term locks, thereby minimizing interference with transaction processing. It uses recovery subsystem logs to detect pointer updates; thus, existing OODB code need not be rewritten to make pointer updates explicit. Finally, it exploits schema information, if available, to reduce costs.

 

We have integrated the algorithm with the Brahma storage manager. We have also implemented the Partitioned Mark-and-Sweep algorithm proposed by Amsaleg et al. [VLDB95]. We performed a comparative study based on the above implementations; the study clearly illustrates the benefits of our algorithm.

 

We have also provided a formal proof of correctness. We show that the algorithm is safe: it does not collect live objects; complete: it eventually collects all garbage objects; and bounded: it does a bounded amount of work in absence of concurrent update transactions.

 

Further, we have extended the algorithm for client-server environments. Units of data shipping (like page or object) or whether or not data is cached at the client do not affect our techniques. We have proved the correctness of the algorithm in the client-server setting also.

 

This work was published in the 23rd International Conference on Very Large Data Bases (VLDB) held at Athens, Greece in September 1997. It was selected as among the best five papers submitted to the conference and an extended and improved version appears as an invited paper in a special issue of the VLDB Journal on the conference (Vol. 7, No. 3, August 1998).