DSL for Merging of Tree-Structured Data [mergelang]
Overview
"DSL for Merging of Tree-Structured Data", Bachelors thesis. Plan-X project. To be submitted at DIKU. Plan-X is a joint effort with IT-U.
Motivation
The widespread use of XML as a data transport- and storage format as well as the recognition of the shortcommings of the relational data model when working with semistructured data have created the need for tools for merging of tree structured data.
In addition, as the shift towards pervasive computing will bring massive, weakly coupled, distributed systems, we believe data synchronization will emerge as a key factor in the realization of that scheme. As the penetration of technologies related to XML (or other semistructured storage formats) increases the amount of data represented in this form grows as well, making resolution of merging conflicts based on user interaction infeasible and exclusive in-memory solutions undesirable.
Also, the hitherto popular differencing, merging, and collaboration tools (e.g. GNU diff and patch) assume a linear data representation and thus do not acknowledge either the structure nor the typically rich semantics present in this type of documents.
Problem Statement
Semistructured documents such as collections of computer programs or legal texts making up a software/document configuration may be updated indpendently and concurrently by multiple users and thus need to be synchronized by merging the multiple updates to arrive at joint . An example is using cvs for merging directories, where files are treated as lists of strings (each string representing a line in the input).
Design and implement a simple domain specific language (DSL) for specifying how to merge multiple updates of tree-structured data into a single compound update. Exemplify and demonstrate update merging of document-oriented and data-oriented XML documents.
Research Thesis
The resolution of conflicts that arise during merging of tree structured data depends, in the general case, on the context of the merge. That is, merging operators must take into consideration the semantics inherent in the problem domain to be able to reconcile detected conflicts. Thus, it is not possible to construct an universal, syntax based algorithm for meaningful merging of tree structured data.Therefore, a parameterized tool taking into account domain specific semantics is needed.
Goals
- Identification of classes of merging problems
- Creation of representative, paradigmatic problem instances for each class
- Specification of abstract machine for DSL
- Construction of simple, prototype interpreter of DSL
- Implementation of adapter for interfacing XMLStore with interpreter
- Compilation of real-world usage scenarios for testing and evaluation of DSL
- Analysis and evaluation of fundamental properties of created DSL
Notational Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Human Resources
Supervisor
Professor Fritz Henglein (DIKU).
Advisors
Assistant professor Henning Niss (IT-U).
Team
Andreas Christian Olsen (DIKU). [himself@diku.dk]
Morten Møller Holst (DIKU). [mmh@hippocampus.dk]
Approach
Value Oriented Paradigm
From the Plan-X project we inherit a value oriented programming model. Thus, our mergelang prototype abstract machine will be implemented in an applicative language, SML, as this is in keeping with the value oriented approach of Plan-X, and because the high abstraction level and the declarative and functional style are very close to the characteristics we want mergelang to posses. Operational semantics are furthermore very easily translated to SML code. In addition, the rapid prototyping facilities (very fast edit-compile-execute-test iterations) of SML will provide for fast development.
Domain Analysis
We intend to obtain (domain) knowledge by surveying existing litterature on merging of (semistructured) data, the design of DSLs, and by studiying several domains where semistructured data is merged. On the basis of this knowledge we will perform an analysis that shall seek to identify classes of merge problems, and thereafter develop paradigmatic examples representing each merge problem class. These key examples shall fuel the prototyping of merge operators as well as act as a foundation for the evaluation of the constructed language in a later phase.
Domain Specific Language
By specifying desired results of merging in each class of problems and thereby describing the required expressive power of domain specific merge operators, we hope to lay the groundwork for a DSL. The developed operators must provide means for specifying context sensitive notions of e.g. update detection and desired conflict reconciliation with respect to domain specific semantics. The created operators will be combined to form a DSL.
[deursen:2000a] propose the following definition of a DSL: A DSL is programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain.We note that DSLs usually are declarative and therefore they can be regarded as specification languages as well.
By adapting methodologies for developing DSLs proposed by [thibault:1998a, consel:1998a], we hope to address several software development concerns.
From a programming language perspective, DSLs are claimed to be more attractive than GPLs for many applications since they provide for easier programming, systematic reuse of common operations, and easier verification. From a software architecture perspective, DSLs can be regarded as a parameterization mechanism, and an interface to a program library. We adopt the standpoint, that DSL benefits are obtained through familiar program notation, design reuse, high-level abstraction, and program analysis. The latter is due to restricted expressiveness and makes automatic program verification a realistic goal. One can argue, that DSLs are superior to GPLs in many scenarios due to domain specific abstraction mechanisms, a "rich" execution model, declarative style (specifications as opposed to imperatives) and (possibly) controlled access to storage.
Formal Specification
Mergelang will be described using formal methods of specification of syntax (EBNF) and semantics (natural operational semantics). This shall among other things enable us to analyse fundamental attributes of the language as well as preparation for future work, e.g. an semantics conforming, effective compiler for mergelang.
Abstract Machine
As a proof of concept as well as means of evaluation we will construct a simple prototype implementation of the DSL in the form of a interpreter based directly on the operational semantics. We intend to develop an adapter for XMLStore, utilizing an SML interface for XMLStore written by H. Niss, as well; making it possible to test our mergelang interpreter against a XMLStore. In addition, we will use the prototype to run more examples and gain additional insights in the capabilities of the (expressiveness) of the created DSL.
Evaluation
Finally, we intend to evaluate existing merge algorithms and tools by using them on the beforementioned cases thereby performing a limited comparative study.
Plan
Ordered list of activities that we plan to work on during the period September - December 2004.- Get the SML interface for XMLStore, and the XMLStore itself up and running
- Establish prototyping environment
- Analyse the problem of merging, and identify classes of problems
- Find and describe reallife problems related to each of the classes of problems
- Create informal description of DSL
- Develop a formal description of the DSL (syntax & semantics)
- Prototyping, while documenting the DSL
Description of Phases
I. Inception
Feasibility
A very simple language of specific merge operators that merges xml trees is an acceptable outcome of the project. The project team regards this as a realistic goal.Essential Requirements
- Formal description of syntax (EBNF)
- Formal description of semantics (operational semantics (?))
- Prototype interpreter (at least incorporating language constructs developed in iteration alpha-00)
- Feasible (polynomial) complexity
II. Elaboration : Specification
The analysis will be case-driven. We will seek to identify usage-patterns that shall act as foundation for an use-case model. The analysis will roughly be carried out as follows:
- Identify essential problems (Doc vs. Data, needed input (metadata) for merging, syntactical merging feasible in the general case?)
- Identify usage scenarios. Document centric, data centric, configuration management, personla information mangers etc.
- Identify/create use cases. Create examples in XML
- Data centric (database schemas, CMS exchange, RDL (Ressource Description Language)
- Document centric (articles), "wikipeda style" encoclypedia entries, articles (online newspapers etc.)
- Program code (cvs style usage, as text and as (abstract) syntax trees)
- Law text
- PIM's (calendars, contacts, etc,)
- Specify wanted behaviour.
- Informal description
- Formal description
- Test cases (from formal description)
- Prototype merging operators. Implement prototype merging operators in SML
Artefacts
Project artefacts. Our report, created software etc. will be made available here as it becomes ready.
| Item | Update Date | Description and Remarks |
| Bibliography | 25/05-2004 | |
| Digester, a frontend for XMLStore | 16/09-2004 | Digests XML documents and saves them in an XMLStore (for mosml) |
Pointers
- 3DM. XML diff and merge tool [Lindholm]; based on a grammatical approach.
- bepjea (plan-X). Another plan-X project working on merging of semistructured data, but in the context of concurrency control.
- FXP XML parser for Moscow ML by Andrzej Wasowski.
- Harmony. A project, headed by Benjamin C. Pierce, that aims to develop implementation architectures and conceptual foundations for a broad class of synchronizers, among them one specialized for tree structured data. A subproject within Harmony is a DSL for bi-directional tree transformations.
- Moscow ML. Light-weight implementation of Standard ML.