Plan-X: Plagiarism Detection Tool
Overview
XMLStore is an application configurable, distributed, mobile persistence layer with a value-oriented interface for storing and retrieving semi-structured data, in particular XML documents. Transparent sharability and actual sharing of stored data is a central benefit of the value-oriented approach in XMLStore, embodied in its enforced data immutability and exploited in XMLStore by storing a unique data item only once and having all occurrences represented by pointers to the stored item.
The general goal of this project is to develop a tool that exploits the degree of sharing detected and represented in XMLStore to flag suspicious similarity of parts of tree-structured documents; to apply this to measuring similarity of Standard ML programs authored by different persons; and to compare this sharing detection technique with 'fingerprinting' (hashing) techniques used otherwise for copy detection [1].
Goals
-
to design and implement a modular, configurable copy detection tool for Standard ML programs that can be used to flag suspicious similarities between groups of programs as a filter for (manual) plagiarism analysis. The tool should allow configuration by, e.g., parameter setting to use it interactively and 'tune it' to a particular input set (of programs). The tool should be modular to enable (re)use with other data than Standard ML programs.
-
to evaluate the tool on Standard ML programs that are representative of solutions to small to mid-size programming exercises (20 to 1000 lines of code).
-
to identify weakness of the method proposed and to pinpoint promising remedies.
-
to compare this sharing detection method with fingerprinting techniques, notably winnowing, with respect to expressive power (what kinds of similarities can be discovered, what kinds of counter measures can foil copy detection; which counter measures can copy detection 'see through'?) and efficiency.
-
to give an appraisal of the viability of copy detection by sharing detection and to present important problems that should be solved to increase its applicability.
Method
-
Get familiar with XMLStore and value-oriented programming in general by reading. Then get familiar with the XMLStore implementation and try it out in a simple configuration (e.g., with raw disk XML store alone). Look at the source code to understand how it works.
-
Get MOSS, install it, and experiment with it using SML programs.
-
Design a modular software architecture for a plagiarism detection tool:
SML program --(parser)--> SML abstract syntax tree --(dvm-convertere) --> document value model --(save)--> XMLStore, or similar. In particular, find a suitable SML parser and adapt it (e.g., an SML parser written in Java, or an SML parser written in Standard ML, unparse to XML document; then use Java-based tools to read XML document in and store in XMLStore). Please see figure 1. -
Analyze counter measures against copy detection and design counter counter measures by normalizing the input documents as preprocessing; e.g. combinations of:
-
semantics-preserving preprocessing: standard naming for variables (to detect alpha-converted versions of programs); sorting of lists of declarations, etc.
-
semantics-changing preprocessing making each variable the same; removing parts of input programs/ documents, sorting without regard to correctness, elimination of multiple elements etc. Preservation of semantics is not considered important as the user is expected to look in the original files helped by the information given by this tool.
-
-
Incorporate important counter counter measures (given above) into the tool and evaluate its 'expressive' strength on input data (SML programs).
-
Design an extension to XMLStore that discovers and outputs sharing information in the stored data; e.g., given a root set of nodes (representing documents/programs by pairwise distinct authors):
-
compute the set of root node parents for each stored node;
-
compute size, minimum and maximum depth
-
design and compute other parameters (if any) for nodes that may be interesting to use in copy detection.
-
Output should be easy understandable for the user. It could be either text on screen, giving which document have something in common and where, or as HTML with the common parts in each document highlighted.
-
Implement it by extending XMLStore with new methods.
-
-
Design and implement a similarity measure for groups of 2 or more documents (root nodes); evaluate
-
Find input (SML programs representing solutions to a given programming exercise; e.g., from Dat0 and Dat1E) for tests and generate (synthesize) new tests (simulating plagiarized). Evaluate the results.
-
Compare expressive power of the plagiarism detection tool with MOSS. Analyze strengths and weakness of the tool relative to MOSS and relative to its use as a plagiarism detection tool (that is, how easily it can be foiled by counter measures). Identify important problems that remain for using this tool as a plagiarism detection tool in general; as a plagiarism detection for computer programs in particular; and, even more specifically, as a plagiarism detection tool for SML programs.
Milestones & Dates
This section contains a list of the milestones we have found. Each milestone represents what we think is an important step towards the completion of our plagiarism detection tool. Each item in the below list is described more closely earlier in this project plan. The dates should rather be seen as guidelines than as final dates.
- Install XMLStore and play with it.
05.03.2004
- Understand relevant parts of the XMLStore source code.
Learn to work with MOSS.
05.03.2004 - 16.03.2004
- Parsing of SML-programs. Analysis and implementation.
Further of analysis of XMLStore and computation. Implementation.
Finding SML-programs to be used as input.
16.03.2004 - 23.03.2004
- Conversion of SML tree to DVM tree. Implementation.
Further of analysis of XMLStore and computation. Implementation.
23.03.2004 - 09.04.2004
- Test and evaluation. Included is tuning of parameters.
09.04.2004 - 16.04.2004.
- Debugging. Detection of weaknesses and strengths. Implementation of
counter counter measures.
16.04.2004 - 23.04.2004
- Compare with MOSS. Identification of problems that remain. (Further
designing of counter counter measures.) Completion of report.
23.04.2004 ->
Literature
- Alex Aiken, Saul Schleimer, Daniel Wilkerson, "Winnowing: Local Algorithms
for Document Fingerprinting." Proceedings of the ACM SIGMOD International
Conference on Management, June 2003
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf - Kasper Bøgebjerg Pedersen and Jesper Tejlgaard Pedersen, "Value-oriented
XML Store", Master's thesis, ITU and DTU, 2002.
http://www.it-c.dk/~kasperp/xmlstore/pdf/thesis.pdf - XMLStore source code - http://plan-x.org/xmlstore/
- Moss (Measure Of Software Similarity) - http://www.cs.berkeley.edu/~aiken/moss.html
- Lawrence C. Paulsen: "ML for the working programmer," 2nd Edition
- Comparator - http://www.catb.org/~esr/comparator/comparator.html
- Cormen, Leiserson, Rivest, Stein: "Introduction to Algorithms," 2nd Edition, MIT Press, Cambridge 2001.
Paul Clough: "Plagiarism in natural and programming languages: an overview of
current tools and technologies," July 2000. - http://www.dcs.shef.ac.uk/~cloughie/papers/Plagiarism.pdf
People
Christa Fotel, stud.scient - christa @ diku.dk
Lars Langer, stud.scient - langer @ diku.dk
Supervisor: Prof. Fritz Henglein, Department of Computer Science, Copenhagen University.
Appendix
Figure 1: The procedure of our detection tool - first we parse and normalize the SML-program and dump XML code by using the MosML compiler. The generated XML documents are saved by XMLStore. After this we analyse the references computed by XMLStore and based on the input parameters given, we generate a list of possible plagiarism.
