1. Project Overview

1.1  Introduction

The Merriam-Webster Collegiate Dictionary defines plagiarism as follows:

Plagiarize:

Transitive senses: to steal and pass off (the ideas or words of another) as one's own: use (another’s production) without crediting the source.

Intransitive senses: to commit literary theft: present as new and original an idea or product derived from an existing source.

Plagiarism remains one of the greatest temptations facing students, and the only barrier to its frequent use is the fear of detection. With the spread of the Internet, the ability to share bodies of work has increased manyfold, and the risk of detection has diminished almost proportionally. A search of the Internet reveals many sources of plagiarizable papers: many sites sell papers, and several smaller ones offer smaller collections for free. As a direct consequence of the alarming increase in plagiarism, writers and scholars are being discouraged from sharing their work directly over the Internet.

Several tools are available to combat plagiarism.

·  Turn-it-in.com: A well-respected tool used widely by many institutions. The University of Dayton has a site license for this plagiarism tool.

·  Essay Verification Engine (EVE): Available at www.canexus.com, it accepts popular formats such as Word and plain text and returns a report with the suspected source URLs and other statistical data. However, it requires a subscription fee.

·  Moss: A tool developed at the University of California, Berkeley, to detect similarities in software programs.

Our aim at Kansas State University was to develop our own plagiarism detection tool. The tool was to use the Google search engine to search the web for related documents and to offer an intuitive graphical user interface (GUI) that was simple yet attractive. It should reduce complex remote file system operations to simple drag-and-drop operations on the GUI and should analyze source documents at the click of a mouse button.

1.2 Architecture

Figure 1-1 shows the basic architecture of the entire application.

Figure 1-1 Architecture

The web server acts as an interface between the client and the database/file system. Most of the middle tier consists of SOAP services running on the web server. The client runs an application delivered via Java Web Start.

The client application authenticates the user's login, fetches file system data from the server as XML, and displays it as a tree structure. The application also recurses through the client file system to produce a second tree structure representing a hierarchical view of the local files. The user can then move files from the local directory to his web directory on the server, feed files into the IDM search engine, and start the search operation using simple drag-and-drop operations and mouse clicks.
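The recursive walk over the client file system described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the class name LocalFileTree and its method are hypothetical, and it assumes the tree is held in Swing's DefaultMutableTreeNode, the usual backing model for a JTree.

```java
import java.io.File;
import java.util.Arrays;
import javax.swing.tree.DefaultMutableTreeNode;

/** Hypothetical helper: mirrors a local directory as a hierarchy of Swing tree nodes. */
final class LocalFileTree {

    /** Recursively converts a file or directory into a tree node. */
    static DefaultMutableTreeNode build(File file) {
        // Label the node with the last path component; a file system root has an empty name.
        String label = file.getName().isEmpty() ? file.getPath() : file.getName();
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(label);
        if (file.isDirectory()) {
            File[] children = file.listFiles();   // null if the directory cannot be read
            if (children != null) {
                Arrays.sort(children);            // lexicographic order by path
                for (File child : children) {
                    node.add(build(child));
                }
            }
        }
        return node;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        DefaultMutableTreeNode tree = build(root);
        System.out.println(tree + " has " + tree.getChildCount() + " direct children");
    }
}
```

The resulting root node can be passed to a JTree constructor for display. The sketch follows symbolic links and builds the whole tree eagerly, so a real client would probably load subdirectories lazily as the user expands them.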

2  The Existing Scenario

2.1  Current IDM Interface

In the fall of 2001, Sorel Robeldo of Kansas State University, under the guidance of Dr. Daniel Andersen, developed a web-based tool capable of searching the Internet for any documents that could have been totally or partially copied by a student.

Figure 2-1 Screen shot of the old system's index page

Figure 2-2 Screen shot of the old system's search start page

Figure 2-2 shows a sample screen shot of the old IDM tool. The start page is a simple HTML page with text boxes in which the user selects files from his file system to upload. When the upload button is pressed, the files are uploaded to the server using JSPSmartUpload, and the next screen shows which files have been uploaded along with a search button. Pressing the search button starts the search operation, and the results are presented to the user.

While the IDM tool is easy to use, it has some inherent drawbacks. It provides no persistent file store, so a user cannot upload and store documents in a web directory before adding them to the search engine. There was also a need to restrict access to the tool, which requires the ability to create user accounts and logins. Moreover, the current interface was thought to be too simplistic, and a newer graphical user interface was proposed in which the user can view his local system and his web directory in a Windows-like hierarchical view. Instead of the simple upload mechanism shown in Figure 2-1, a drag-and-drop interface was proposed for uploading files for searching.

The other main issue to be addressed was obtaining permission from Google before running the search. Because of the large number of search queries generated by the system, Google blocked the IP address from which the requests originated once a maximum number of automated requests had been served. It was resolved to use the Google API instead of the current method of querying the Google web site directly. Every user of the API is provided with a license key, which entitles him to a maximum of 1000 queries per day.
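One way a client could stay within such a per-day allowance is to count queries locally before sending them. The sketch below is only an illustration under assumptions: the class DailyQueryBudget is hypothetical and not part of the Google API, and it assumes the allowance resets when the calendar date changes, which may not match how Google actually counts a day.

```java
import java.time.LocalDate;

/** Hypothetical helper: counts queries against a fixed per-day allowance. */
final class DailyQueryBudget {
    private final int limitPerDay;
    private LocalDate currentDay;   // day the counter currently refers to
    private int usedToday;

    DailyQueryBudget(int limitPerDay) {
        if (limitPerDay < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limitPerDay = limitPerDay;
    }

    /** Returns true and records one query if the allowance for 'today' is not used up. */
    synchronized boolean tryAcquire(LocalDate today) {
        if (!today.equals(currentDay)) {   // first call, or the date has changed
            currentDay = today;
            usedToday = 0;
        }
        if (usedToday >= limitPerDay) {
            return false;
        }
        usedToday++;
        return true;
    }

    public static void main(String[] args) {
        DailyQueryBudget budget = new DailyQueryBudget(1000);
        LocalDate today = LocalDate.now();
        int sent = 0;
        for (int i = 0; i < 1200; i++) {
            if (budget.tryAcquire(today)) {
                sent++;   // a real client would issue the search request here
            }
        }
        System.out.println("queries allowed today: " + sent);   // prints 1000
    }
}
```

Passing the date in as a parameter keeps the counter easy to test; the count is held in memory only, so it would be lost if the server restarted.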

2.2  The Intended Audience

The intended audience for the new IDM system is mainly instructors and their assistants at Kansas State University and at universities throughout the United States. Users are assumed to have a minimal knowledge of file system operations in a Windows-type GUI and should be able to install, or already have installed, Java Web Start on their machines.

For UNIX users, Netscape is expected to be in the user’s classpath. The tool is easy to use and its interface is intuitive, so people with minimal computer knowledge can use it.


3. Tools and Technologies Used

3.1  J2SE/JDK1.3

Java was the natural choice of programming language, since the application is graphics-intensive and is to be delivered via the web. As Sun Microsystems advertises on its web site, “Java 2 Platform, Standard Edition (J2SE™) software is the premier solution for rapidly developing and deploying mission-critical, enterprise applications. Version 1.4.1 builds upon Java technology's cross-platform support and robust security model with new features and functionality, enhanced performance and scalability, and improved reliability and serviceability. Version 1.4.1 advances rich client application development and provides the foundation for standards-based, interoperable Web services that can be built and deployed today!”

J2SE 1.4.1, the Java 2 Standard Edition, is Sun Microsystems’ latest release, and JDK 1.1.8 is the development kit that is associated with J2SE. Java is fairly easy to understand and has a rich set of APIs for application development. In particular, our application draws heavily on the Java Swing API for its rendering and functioning.

3.2  Java Web Start

Java Web Start greatly simplifies the deployment of Java applications. It includes the security features of the Java platform and allows the user to use the latest Java 2 technology with any web browser. It automatically downloads any files required to run an application and caches them locally for faster subsequent launches.

Java Web Start can also run applications independently of a web browser. Applications can be launched through desktop shortcuts, making the launch of a web-deployed application similar to launching a native application.

From the security point of view, Java Web Start applications can run outside the typical sandbox environment to which an applet is subjected, provided they request unrestricted access and are signed. Thus, applications deployed via Web Start can have access to system resources.

3.3  Webserver (Resin)

Resin is a fairly lightweight, easy-to-configure web server from Caucho Technology. Resin includes a full-featured HTTP/1.1 web server dedicated to the fast serving of dynamic Java content. It supports the latest Servlet 2.3 specification from Sun. Resin simplifies development by automatically recompiling and reloading Java classes when the source changes. It has very short response times and, most importantly, it is free.

3.4 SOAP

SOAP is a lightweight, XML-based protocol for the exchange of information in a decentralized, distributed environment. Data encoded in a SOAP message can be used in a variety of situations, such as message passing and remote procedure calls. SOAP can potentially be used in combination with a variety of other protocols. SOAP itself does not define any programming model or implementation-specific semantics. Instead, it defines a simple mechanism consisting of a modular packaging model and encoding mechanisms for encoding data within modules. SOAP consists of three parts:

1)  The SOAP envelope, which is the top-level (root) element of an XML-encoded SOAP message.

2)  The SOAP encoding rules, which define a serialization mechanism.

3)  The SOAP RPC representation, which defines a convention for representing remote procedure calls and responses.
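To make the envelope and RPC parts concrete, the following sketch builds a SOAP 1.1 message for a single remote call using only the standard DOM API. It is an illustration rather than the services' actual code: the class name, the service namespace urn:example:idm, and the method and parameter names are all hypothetical. Only the envelope namespace URI is taken from the SOAP 1.1 specification.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: builds a SOAP 1.1 envelope carrying one RPC-style call. */
final class SoapEnvelopeSketch {
    /** Namespace of the SOAP 1.1 Envelope and Body elements. */
    static final String SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    /** Placeholder namespace for the service's own method elements (an assumption). */
    static final String SERVICE_NS = "urn:example:idm";

    static Document buildCall(String method, String paramName, String paramValue)
            throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        // Part 1: the envelope is the root element of the message.
        Element envelope = doc.createElementNS(SOAP_ENV_NS, "SOAP-ENV:Envelope");
        doc.appendChild(envelope);
        Element body = doc.createElementNS(SOAP_ENV_NS, "SOAP-ENV:Body");
        envelope.appendChild(body);

        // Part 3: by the RPC convention, the call is an element named after the
        // method, with one child element per parameter.
        Element call = doc.createElementNS(SERVICE_NS, "m:" + method);
        body.appendChild(call);
        Element param = doc.createElementNS(null, paramName);
        param.setTextContent(paramValue);
        call.appendChild(param);
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = buildCall("login", "username", "alice");
        System.out.println("root element: " + doc.getDocumentElement().getTagName());
    }
}
```

A SOAP toolkit would normally generate this structure and also apply the encoding rules (part 2) to typed parameters; the sketch passes a plain string and omits type attributes.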

3.4  JavaMail API

The JavaMail API provides a set of abstract classes that model a mail system and can be used to build Java technology-based mail and messaging applications.

3.5  JavaBeans Activation Framework (JAF)

The JavaBeans Activation Framework enables developers to take advantage of standard services to determine the type of an arbitrary piece of data, encapsulate access to it, and instantiate the appropriate bean to perform the required operations.

3.5 Database (Oracle)

Oracle is a robust relational database management system. It was the obvious choice for our system since it was already installed on the CIS system at Kansas State University.

3.6 XML

Extensible Markup Language (XML) is the universal format for data on the Web. XML allows developers to easily describe and deliver rich, structured data from any application in a consistent way. Structured information contains both content and some indication of what role that content plays. More and more developers are migrating towards XML in their applications.

XML helps in creating richer indexes, databases, and content management systems. It lowers switching costs by letting software systems talk to each other, such as a car manufacturer's system talking to a parts supplier's system. Data represented using XML can be displayed on a variety of target devices: the same XML document can be used to target a PDA or a PC.

3.7  Old IDM System

Rather than develop an all-new backend for the document matching system, we have reused, with some changes, the backend of the document matching system developed by Sorel Robeldo. The IDM system, written mainly in Java, uses a mix of Java Servlet technology, JavaServer Pages, and static HTML files.

This system is described in detail in his report “Internet Document Matching (IDM)” submitted as part of the requirements for his Master of Science Degree at Kansas State University.

3.8  Rational Rose

Rational Rose Enterprise Edition was used to generate the class diagram.

4. Detailed Architecture and Implementation

4.1  Database Configuration

The database has a single table called the login table, which stores information about the users of the system. It currently has four fields: Username, Password, Social Security Number (SSN), and Date-of-Birth.

Of the four fields in the login table, only two, Username and Password, are used to authenticate the user when the application starts. The other two fields, Social Security Number and Date-of-Birth, currently serve only as additional information about the user and can be used to validate a user in case he forgets his Username or Password.

4.2  Web-Server Configuration and Database Connectivity

The system can be broadly divided into the following: static HTML pages, JNLP files, jar files, the backend IDM system, SOAP Services running on the server end and the client end application packaged as a jar file.

4.2.1  HTML pages

There is just one static HTML page, the index page, which contains a link to the main JNLP file.

4.2.2  JNLP files

A JNLP file is basically an XML document. The following shows a complete example of the Login.jnlp file:

Login.jnlp

<?xml version="1.0" encoding="utf-8"?>
<!-- JNLP File for Login Demo Application -->
<jnlp
  spec="1.0+"
  codebase="http://acrux.cis.ksu.edu:7070/test/SOAPlogin/"
  href="Login.jnlp">
  <information>
    <title>Login Window</title>
    <vendor>shahid, Inc.</vendor>
    <description> Login Frame</description>
    <description kind="short">Login into the IDM search .</description>
    <offline-allowed/>
  </information>
  <security>
    <all-permissions/>
  </security>
  <resources>
    <j2se version="1.4.0-beta3" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2se"/>
    <j2se version="1.4.0-beta2" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2se"/>
    <j2se version="1.4" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2se"/>
    <j2se version="1.3+" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2se"/>
    <j2se version="1.3+" max-heap-size="256M"/>
    <j2se version="1.3+"/>
    <j2se version="1.3.1" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2se"/>
    <j2se version="1.3" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2
    <j2se version="1.4.1+" max-heap-size="256M" href="http://java.sun.com/products/autodl/j2se"/>
    <jar href="Login.jar" main="true" download="eager"/>
    <extension name="Additional Jars" href="additional.jnlp"/>
    <extension name="Additional Jars" href="additionaltwo.jnlp"/>
  </resources>
  <application-desc main-class="LoginUser"/>
</jnlp>
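Because a JNLP file is ordinary XML, any XML parser can read it. The sketch below, which is not part of the application, uses the standard DOM parser to pull the codebase and the jar references out of a JNLP document. The class name JnlpReader is hypothetical, and the sample in main is a shortened, well-formed excerpt of Login.jnlp.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads the codebase and jar hrefs from JNLP text. */
final class JnlpReader {

    /** Returns the codebase attribute followed by the href of every jar element. */
    static List<String> codebaseAndJars(String jnlpXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(jnlpXml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();   // the <jnlp> element

        List<String> result = new ArrayList<>();
        result.add(root.getAttribute("codebase"));
        NodeList jars = root.getElementsByTagName("jar");
        for (int i = 0; i < jars.getLength(); i++) {
            result.add(((Element) jars.item(i)).getAttribute("href"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
            + "<jnlp spec=\"1.0+\""
            + " codebase=\"http://acrux.cis.ksu.edu:7070/test/SOAPlogin/\""
            + " href=\"Login.jnlp\">"
            + "<resources>"
            + "<j2se version=\"1.3+\"/>"
            + "<jar href=\"Login.jar\" main=\"true\" download=\"eager\"/>"
            + "</resources>"
            + "<application-desc main-class=\"LoginUser\"/>"
            + "</jnlp>";
        System.out.println(codebaseAndJars(sample));
    }
}
```

Java Web Start performs this parsing itself; a reader like this is mainly useful for checking a JNLP file for well-formedness before deploying it.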

Some of the elements and attributes, as defined by the Java Web Start Developer's Guide, are described below:

The JNLP Element

codebase attribute: All relative URLs specified in href attributes in the JNLP file use this URL as a base.

href attribute: This is a URL pointing to the location of the JNLP file itself. The Java Web Start software requires this attribute to be set in order for the application to be included in the Application Manager.
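The way relative href values combine with the codebase can be illustrated with standard URI resolution. This is a sketch of the general rule using java.net.URI, not a description of Java Web Start's internal code; the class name CodebaseResolver is hypothetical.

```java
import java.net.URI;

/** Hypothetical helper: resolves a JNLP href against the codebase URL. */
final class CodebaseResolver {

    /** Relative hrefs are resolved against the codebase; absolute hrefs are returned as-is. */
    static URI resolve(String codebase, String href) {
        return URI.create(codebase).resolve(href);
    }

    public static void main(String[] args) {
        String codebase = "http://acrux.cis.ksu.edu:7070/test/SOAPlogin/";
        // prints http://acrux.cis.ksu.edu:7070/test/SOAPlogin/Login.jar
        System.out.println(resolve(codebase, "Login.jar"));
        // prints http://acrux.cis.ksu.edu:7070/test/SOAPlogin/additional.jnlp
        System.out.println(resolve(codebase, "additional.jnlp"));
        // An absolute href is unaffected by the codebase.
        System.out.println(resolve(codebase, "http://java.sun.com/products/autodl/j2se"));
    }
}
```

Note that the trailing slash on the codebase matters: under standard URI resolution, a base without it would have its last path segment replaced, so Login.jar would resolve to the parent directory instead.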

The Security Element

Each application is, by default, run in a restricted execution environment, similar to the Applet sandbox. The security element can be used to request unrestricted access.

If the all-permissions element is specified, the application will have full access to the client machine and local network. If an application requests full access, then all of its JAR files must be signed. The user will be prompted to accept the certificate the first time the application is launched.

The Resources Element