A Plan-X Project

DSL for Merging of Tree-Structured Data [mergelang]

Overview

"DSL for Merging of Tree-Structured Data", Bachelors thesis. Plan-X project. To be submitted at DIKU. Plan-X is a joint effort with IT-U.

Motivation

The widespread use of XML as a data transport- and storage format as well as the recognition of the shortcommings of the relational data model when working with semistructured data have created the need for tools for merging of tree structured data.

In addition, as the shift towards pervasive computing will bring massive, weakly coupled, distributed systems, we believe data synchronization will emerge as a key factor in the realization of that scheme. As the penetration of technologies related to XML (or other semistructured storage formats) increases the amount of data represented in this form grows as well, making resolution of merging conflicts based on user interaction infeasible and exclusive in-memory solutions undesirable.

Also, the hitherto popular differencing, merging, and collaboration tools (e.g. GNU diff and patch) assume a linear data representation and thus do not acknowledge either the structure nor the typically rich semantics present in this type of documents.

Problem Statement

Semistructured documents such as collections of computer programs or legal texts making up a software/document configuration may be updated indpendently and concurrently by multiple users and thus need to be synchronized by merging the multiple updates to arrive at joint . An example is using cvs for merging directories, where files are treated as lists of strings (each string representing a line in the input).

Design and implement a simple domain specific language (DSL) for specifying how to merge multiple updates of tree-structured data into a single compound update. Exemplify and demonstrate update merging of document-oriented and data-oriented XML documents.

Research Thesis

The resolution of conflicts that arise during merging of tree structured data depends, in the general case, on the context of the merge. That is, merging operators must take into consideration the semantics inherent in the problem domain to be able to reconcile detected conflicts. Thus, it is not possible to construct an universal, syntax based algorithm for meaningful merging of tree structured data.Therefore, a parameterized tool taking into account domain specific semantics is needed.

Goals

Notational Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Human Resources

Supervisor

Professor Fritz Henglein (DIKU).

Advisors

Assistant professor Henning Niss (IT-U).

Team

Andreas Christian Olsen (DIKU). [himself@diku.dk]

Morten Møller Holst (DIKU). [mmh@hippocampus.dk]

Approach

Value Oriented Paradigm

From the Plan-X project we inherit a value oriented programming model. Thus, our mergelang prototype abstract machine will be implemented in an applicative language, SML, as this is in keeping with the value oriented approach of Plan-X, and because the high abstraction level and the declarative and functional style are very close to the characteristics we want mergelang to posses. Operational semantics are furthermore very easily translated to SML code. In addition, the rapid prototyping facilities (very fast edit-compile-execute-test iterations) of SML will provide for fast development.

Domain Analysis

We intend to obtain (domain) knowledge by surveying existing litterature on merging of (semistructured) data, the design of DSLs, and by studiying several domains where semistructured data is merged. On the basis of this knowledge we will perform an analysis that shall seek to identify classes of merge problems, and thereafter develop paradigmatic examples representing each merge problem class. These key examples shall fuel the prototyping of merge operators as well as act as a foundation for the evaluation of the constructed language in a later phase.

Domain Specific Language

By specifying desired results of merging in each class of problems and thereby describing the required expressive power of domain specific merge operators, we hope to lay the groundwork for a DSL. The developed operators must provide means for specifying context sensitive notions of e.g. update detection and desired conflict reconciliation with respect to domain specific semantics. The created operators will be combined to form a DSL.

[deursen:2000a] propose the following definition of a DSL: A DSL is programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain.We note that DSLs usually are declarative and therefore they can be regarded as specification languages as well.

By adapting methodologies for developing DSLs proposed by [thibault:1998a, consel:1998a], we hope to address several software development concerns.

From a programming language perspective, DSLs are claimed to be more attractive than GPLs for many applications since they provide for easier programming, systematic reuse of common operations, and easier verification. From a software architecture perspective, DSLs can be regarded as a parameterization mechanism, and an interface to a program library. We adopt the standpoint, that DSL benefits are obtained through familiar program notation, design reuse, high-level abstraction, and program analysis. The latter is due to restricted expressiveness and makes automatic program verification a realistic goal. One can argue, that DSLs are superior to GPLs in many scenarios due to domain specific abstraction mechanisms, a "rich" execution model, declarative style (specifications as opposed to imperatives) and (possibly) controlled access to storage.

Formal Specification

Mergelang will be described using formal methods of specification of syntax (EBNF) and semantics (natural operational semantics). This shall among other things enable us to analyse fundamental attributes of the language as well as preparation for future work, e.g. an semantics conforming, effective compiler for mergelang.

Abstract Machine

As a proof of concept as well as means of evaluation we will construct a simple prototype implementation of the DSL in the form of a interpreter based directly on the operational semantics. We intend to develop an adapter for XMLStore, utilizing an SML interface for XMLStore written by H. Niss, as well; making it possible to test our mergelang interpreter against a XMLStore. In addition, we will use the prototype to run more examples and gain additional insights in the capabilities of the (expressiveness) of the created DSL.

Evaluation

Finally, we intend to evaluate existing merge algorithms and tools by using them on the beforementioned cases thereby performing a limited comparative study.

Plan

Ordered list of activities that we plan to work on during the period September - December 2004. We expect to have most of the project finished medio December. Which gives us about a month to finetune the details.

Description of Phases

I. Inception

Feasibility

A very simple language of specific merge operators that merges xml trees is an acceptable outcome of the project. The project team regards this as a realistic goal.

Essential Requirements

II. Elaboration : Specification

The analysis will be case-driven. We will seek to identify usage-patterns that shall act as foundation for an use-case model. The analysis will roughly be carried out as follows:

  1. Identify essential problems (Doc vs. Data, needed input (metadata) for merging, syntactical merging feasible in the general case?)
  2. Identify usage scenarios. Document centric, data centric, configuration management, personla information mangers etc.
  3. Identify/create use cases. Create examples in XML
    • Data centric (database schemas, CMS exchange, RDL (Ressource Description Language)
    • Document centric (articles), "wikipeda style" encoclypedia entries, articles (online newspapers etc.)
    • Program code (cvs style usage, as text and as (abstract) syntax trees)
    • Law text
    • PIM's (calendars, contacts, etc,)
  4. Specify wanted behaviour.
    • Informal description
    • Formal description
    • Test cases (from formal description)
  5. Prototype merging operators. Implement prototype merging operators in SML

Artefacts

Project artefacts. Our report, created software etc. will be made available here as it becomes ready.


Bibliography 25/05-2004
Digester, a frontend for XMLStore 16/09-2004 Digests XML documents and saves them in an XMLStore (for mosml)

Pointers