Central article: Baumann, Fennestad, Thorn, "A distributed, value-oriented XML Store", MS thesis, ITU, August 2002 Questions: 3 1 Summarize the principal advantages of value-oriented programming (share-and-create style programming, with atomic reference updates) vis-a-vis object-oriented (imperative copy-and-update style programming, with array updates and file updates). What are its practical disadvantages? (You may take current hardware and OS design, legacy problems, and even social and historical developments into account.) 4 1 Consider DOM (Document Object Model). Each DOM node (corresponding to a node in an XML tree/info set) has, amongst others, the following instance fields and methods: Document ownerDocument // the XML document root node of current node Node parentNode // parent node of current node Node insertBefore(Node newChild, Node refChild) // inserts newChild before refChild of current node, // returns inserted node // ... more child manipulation methods like insertBefore Node cloneNode(boolean deep) // clones a node shallow or deep A node can be thought of as containing: a pointer to its root node, a pointer to its immediate parent, and a pointer to its children. The children can be thought of implemented using a doubly-linked list connecting the children. The newChild argument node in the insertBefore method is unlinked from its parent before being inserted into the list of children of the current node. Why? If we want to a DOM node n to occur twice in the list of children of another node p, how can this be programmed in DOM? In the latter case, how do you program an update of the node n; e.g., changing its chardata from "blib" to "blob"? 4 2 Imagine a distributed implementation of DOM for this question (that is, the pieces of a document). Explain how the interface of Question 4.1 supports or counteracts some central ideas from value-oriented programming, as exemplified in the XML Value Store design: sharing of XML (sub)trees, coalescing of values (only one copy of an XML element is stored), caching and reuse of cached copies without cache coherence protocol, programmatic support for atomic updates (abort/commit of complex updates), migration of XML (sub)trees from one computing node to another. 7 1 BFT describe a red-black tree based implementation of child lists of nodes. What is the computational complexity of their implemenation of executing a single this.indexOf(childNode) operation on a node with n children? (Ref.: Chap. 7.5. The Java source code is in edu.it.xmlstore.xml in the appendix to BFT. NB: On p. 91, BFT write the indexOf operation has worst case running time O(n), but that is not correct here.) What is the complexity of an insert operation? Describe a data structure that improves the asymptotics of the ChildList API. (You may give an extended API for this, e.g., for bulk-inserting; that is, inserting more than one child at a time.) 7 2 BFT write: "Since we make use of sharing of subtrees, parents and siblings are not uniquely determined. Therefore, we only have the child axis [in the location step]." and describe methods for XPath based methods. Show that these also make sense for parent and sibling axes. (Hint: Think of pairs consisting of (current node, list of nodes from root note to current node) as describing the current node's placement in the the "current tree".) 8 (Chord questions: Will be posed in connection with the lecture on routing.) 9 1 BFT use SHA-1 as the under hash function for computing a value reference for a given value. Given an XML tree, describe how the hash function is used. Is the result the same as applying the hash function to the linearized XML form of the tree? A faster hash function than SHA-1 is exclusive or: Split the input to which the hash function is applied into 160-bit blocks (possibly padding the last block with zeroes) and exclusively-or all the 160-bits blocks with each other. Compare this function, XOR160, with SHA-1 in terms of its usability. (Hint: think of collisions and security.) 9 2 Collisions in SHA-1 are extremely unlikely, but nonetheless possible. Sketch how you would change/extend the XML Store if total (100,0000000000000%) probability of detecting and handling collisions is required. 9 3 BFT's implementation stores each value in a separate file. In a file system with block allocation size 16 KB how much disk space is used to store a typical XML document of size 1 MByte? (Make some assumptions, or take some test data (e.g., by looking at a bunch of HTML files), about "typical" XML documents of 1 MByte size --- how many elements and chardata segments they have, how big the latter are on average --- and determine how much disk space the implementation is bound to take in this case. (You may make assumptions about the implementation instead of looking at the implementation.) 10 1 BFT store only immutable nodes (values) in their XML Store. Sketch how you would make extend the XML Store design with mutable nodes (cells). How would you handle caching/replication and coalescing for cells? BFT employ a centralized name server; that is, the map from symbolic names to value references is kept on a single machine. Explain how you can make the name server distributed, using the remaining parts of the XML Store, plus cells. (Ref. Section 10.4) 10 2 BFT employ an update operation with functionality void update(name, new, expected). The mapping for name is only updated to new, if the current value for name equals expected. Why do they not use the simplified update function void update(name, new), which updates name to new regardless of the current value for name. (This is just a teaser. More on this problem in the lecture on wait-free synchronization!) 11 1 BFT use synchronous remote procedure call built on top of TCP. Every load and save block until an element is stored on disk. Sketch a way of implementing asynchronous saving and discuss its advantages and disadvantages vis a vis doing it synchronously. In asynchronous message passing, messages may overtake each other; that is, if a process first sends m1 and then m2 to the same receiver, the receiver may process m2 before m1. How does that affect your way of saving asynchronously? (Do you need sequence numbers, message buffers, etc.?) 14 1 The prototype implementation saves only 1 to 4 XML values per second. This is at least 3 orders of magnitude (that is, a factor of at least 1000) too slow to be competetive performance-wise on a modern LAN. Describe appropriate changes to the XML Store and to its implementation and make back-of-the-envelope calculations to figure how much speed-up you can get? Can you improve the performance by 3 orders? (Hint: You're allowed to change their network hardware. Assume their network is not switched, but only hubbed.) 2 BFT write on p. 114 that the time it takes to save a changed document is linear in the size of the changes made to the document. Is this strictly correct?