Lab 2 – LASI Prototype Product Specification

Red Team

Brittany Johnson

CS411W

Janet Brunelle

April 8, 2013

Version 1

Table of Contents

1 Introduction

1.1 Purpose

1.2 Scope

1.3 Definitions, Acronyms, and Abbreviations

1.4 References

1.5 Overview

2 General Description

2.1 Prototype Architecture Description

2.2 Prototype Functional Description

List of Figures

Figure 1. Prototype Major Functional Component Diagram

Figure 2. GUI Site Map

Figure 3. Prototype Hardware and Software Component Diagram

Figure 4. Nouns

Figure 5. Verbs

Figure 6. Phrase

List of Tables

Table 1. Feature comparison between full product and prototype

1  Introduction

Linguistic analysis is the contextual study of written works and how their words combine to form an overall meaning. Themes are the subject-object-verb relationships that help the reader comprehend and summarize what has been read. The complexity of a topic and the reader’s familiarity with it play an important role in comprehension, and the reader’s comprehension, along with the ability to summarize the material, is essential to communicating the content of a document. Identifying a theme becomes even more difficult as the number of documents increases, because the theme across all of the documents may not be the theme of each individual document. Thus, it is often difficult for people to identify a common theme over a large set of documents in a timely, consistent, and objective manner. LASI will be a decision support tool to assist users in determining common themes across multiple documents.

1.1  Purpose

LASI stands for Linguistic Analysis for Subject Identification. It is a stand-alone theme-finding application conceived by the Old Dominion University CS410 Red Group. It is designed to be a decision support tool for large, multi-document linguistic analysis that allows for more accurate and consistent results. LASI will be able to detect themes across many documents and can provide both individual and cross-document analysis to determine a single theme.

LASI’s ability to analyze multiple documents to find a common theme makes it a useful decision support tool for teachers, students, research analysts, and others who need to read through large sets of documents on a frequent basis. Teachers, for example, would be able to use LASI for an initial analysis of student papers to check whether each paper is consistent with its assigned topic. Both students and research analysts could use LASI to quickly assess the usefulness of scientific and literary publications for the topic that they are researching.

1.2  Scope

Prototype features will differ from the real-world product in scale. Some features have been eliminated from the prototype due to limited development time. A complete list of features is available in Table 1.

Table 1. Feature comparison between full product and prototype

The document types that the LASI prototype accepts have been limited to DOC and DOCX. Scanned text recognition has been removed from the prototype because there is not enough time to get the OCR software fully functioning. The prototype will limit the number of documents that can be added to one project to between three and five, and each document is limited to 10 pages to ensure that the algorithm can run in a timely manner.

Rather than covering every part of speech, the LASI prototype will focus on noun-verb binding. Several of the more complex features, such as user-defined dictionaries, synonym identification, and content assumption, were also left out of the prototype.
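The limits above can be expressed as a simple pre-check on a project. The following is a minimal sketch in Python, not the prototype's C# code; the names `Document` and `validate_project` are hypothetical, and the upper bound of five documents is an assumption taken from the top of the stated three-to-five range.

```python
from dataclasses import dataclass
from pathlib import Path

# Limits taken from the scope description; the constant names are hypothetical.
ALLOWED_EXTENSIONS = {".doc", ".docx"}
MAX_DOCUMENTS = 5   # assumption: upper end of the "three to five" range
MAX_PAGES = 10


@dataclass(frozen=True)
class Document:
    path: str
    page_count: int


def validate_project(documents):
    """Return a list of problems; an empty list means the project is acceptable."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(f"too many documents: {len(documents)} > {MAX_DOCUMENTS}")
    for doc in documents:
        extension = Path(doc.path).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{doc.path}: unsupported file type {extension!r}")
        if doc.page_count > MAX_PAGES:
            problems.append(f"{doc.path}: {doc.page_count} pages exceeds {MAX_PAGES}")
    return problems
```

Returning a list of problems rather than raising on the first one would let the GUI report every rejected document at once.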

1.3  Definitions, Acronyms, and Abbreviations

A.I.D.: Assessment Improvement Design

A.I.D. Process: A process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.

Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.

Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.

Head word: A locally distinct word within a phrase which, by its syntactic associations, determines the category of the phrase itself.

LASI: Linguistic Analysis for Subject Identification

Linguistic Analysis: The scientific analysis of a language.

Parser: Takes in DOC and DOCX files and converts them to TXT files.

Part of Speech Tagger: Software utility that associates words with the parts-of-speech in a sentence.

Phrase: An instance of the Phrase class.

Phrase: (Linguistically) A group of words standing together as a conceptual unit.

Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles at the phrase level and whose instances contain a collection of Words which together represent a linguistic phrase.

Semantic Analysis: Relating the syntactical structure of words to their language independent meanings.

SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts-of-speech.

Strategic Document: Document produced by a client that defines their Goals, Visions and Missions.

Subject Identification: The process by which the subject matter and thematic content of documents is determined.

Syntactic Analysis: Identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.

.TAGGED: The type of file that stores the output of the part-of-speech tagger, containing all of the text of the document with embedded syntactic annotations.

Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set.

Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected element in a document.

Tagged Set: A group of words, whose part of speech and location in a sentence have been identified by the parser.

WordNet: Compiler and provider of the data files which form the basis for the LASI thesaurus.

Word Class: The root of the taxonomy of class types which correspond to parts-of-speech at the word level and whose instances encapsulate each occurrence of a textually identified word.

Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, indicating its significance.

1.4  References

Johnson, Brittany. (2013). Lab 1 – LASI Product Description.

SharpNLP. (n.d.). Retrieved from http://sharpnlp.codeplex.com/

Office binary to open xml. (n.d.). Retrieved from http://b2xtranslator.sourceforge.net/

1.5  Overview

This product specification provides the hardware and software configuration, external interfaces, capabilities and features of the LASI prototype. The information provided in the remaining sections of this document includes a detailed description of the hardware, software and interface of the LASI prototype as well as the key features of the prototype.

2  General Description

The following sections describe the prototype in more detail. Section 2.1 identifies and describes each architectural component of the prototype. Section 2.2 explains the prototype’s functional requirements. Lastly, Section 2.3 describes the external interfaces of the prototype.

2.1  Prototype Architecture Description

The architecture for the LASI prototype consists of three major components: a graphical user interface (GUI), an algorithm, and a file management system. Figure 1 shows a major functional component diagram of the prototype.

The first major component is the graphical user interface. The LASI user interface is a Windows Presentation Foundation (WPF) project that uses XAML to define the structure of the views and C# to provide the interactivity. The LASI prototype GUI contains a Start-up Screen, a Create Project View, a Project Preview, an In Progress View, and a Results View.

Figure 2. GUI Site Map

As shown in Figure 2, results can be viewed in three different formats: Top Results, Word Relationships, and Word Count and Weighting. The top results will be represented graphically using the user’s preferred chart type; the charting engine used for this feature is part of the WPF Toolkit. The word relationships will also be displayed for each document, with each word colorized based on its part-of-speech. This will allow the user to see the relationships between all of the words and phrases in a document. Results will also be displayed by individual word count and weight, where the displayed weight is the one assigned by the weighting algorithm. Results can be either printed or exported as PDF, JPG, or PNG.

The second major component is the file management system, which manages converting files and invoking the tagger. It contains the file converter and the parts-of-speech tagger. The file converter that the LASI prototype uses is B2XTranslator, third-party open-source software that can convert DOC and DOCX files into an XML file. The parts-of-speech tagger is SharpNLP, an open-source C# natural language processing tool. The SharpNLP POS Tagger tags words and phrases with their respective parts-of-speech for use by the LASI algorithm. SharpNLP uses the Penn Treebank tag set to define the parts of speech.

The last major component is the algorithm. The LASI prototype algorithm is written in C#. The algorithm, as shown in Figure 1, contains a Tag Parser which converts the text into word and phrase types representative of their parts-of-speech. A Word, in reference to word types, is the root of the classification of class types which correspond to parts-of-speech at the word level and whose instances encapsulate each occurrence of a textually identified word. Figure 4 shows all of the Word types in the LASI prototype. Every word that is tagged by the part-of-speech tagger has a corresponding Word type.
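The mapping from a part-of-speech tag to a Word type can be sketched as follows. This is an illustrative Python sketch, not the prototype's C# implementation: the subclass names `Noun`, `Verb`, `Adjective`, and `Adverb` and the function `make_word` are hypothetical, and the lookup relies only on the fact that Penn Treebank noun, verb, adjective, and adverb tags begin with `NN`, `VB`, `JJ`, and `RB` respectively.

```python
class Word:
    """Root of the word-level taxonomy; one instance per occurrence in the text."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"{type(self).__name__}({self.text!r})"


# Hypothetical subclasses; the prototype's full set is the one shown in its figure.
class Noun(Word):
    pass


class Verb(Word):
    pass


class Adjective(Word):
    pass


class Adverb(Word):
    pass


# Penn Treebank tags share a two-letter prefix within each family,
# e.g. NN/NNS/NNP/NNPS for nouns and VB/VBD/VBG/VBN/VBP/VBZ for verbs.
TAG_PREFIX_TO_CLASS = {"NN": Noun, "VB": Verb, "JJ": Adjective, "RB": Adverb}


def make_word(text, tag):
    """Build the Word subtype for a tag, falling back to the root Word type."""
    word_class = TAG_PREFIX_TO_CLASS.get(tag[:2], Word)
    return word_class(text)
```

For example, `make_word("documents", "NNS")` would yield a `Noun` instance, while a determiner tagged `DT` falls back to the root `Word` type in this sketch.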

Figure 4. Word

A Phrase, as shown in Figure 5, is the root of the classification of class types which correspond to syntactic roles at the phrase level and whose instances contain a collection of Words which together represent a linguistic phrase. As with Word types, every type of phrase that can be tagged by the part-of-speech tagger is represented.

Figure 5. Phrase

The LASI prototype algorithm binds word and phrase types together based on their syntactic relationships via a state-machine-derived logic flow. Words and phrases will be bound together based on their Word or Phrase type, described above, and on how they relate to one another within phrases, paragraphs, and the document. The weighting algorithm will assign each word a weight based on its part-of-speech, its frequency count, and the number of times and ways it is referenced. The LASI prototype will focus on subject, object, and attributive binding.
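To make the idea of state-machine binding concrete, the sketch below walks one tagged clause and binds a subject, a verb, and an object. It is a deliberately simplified Python illustration under stated assumptions, not the LASI algorithm: the state names and the function `bind_clause` are hypothetical, input is assumed to be a list of `(word, Penn Treebank tag)` pairs, and attributive binding is not covered.

```python
def bind_clause(tagged):
    """Bind subject, verb, and object in one clause of (word, tag) pairs.

    Three states: SEEK_SUBJECT -> SEEK_VERB -> SEEK_OBJECT.
    Any role that is never found stays None.
    """
    state = "SEEK_SUBJECT"
    binding = {"subject": None, "verb": None, "object": None}
    for word, tag in tagged:
        is_noun = tag.startswith("NN")
        is_verb = tag.startswith("VB")
        if state == "SEEK_SUBJECT" and is_noun:
            binding["subject"] = word
            state = "SEEK_VERB"
        elif state == "SEEK_VERB" and is_noun:
            # In a noun compound, keep the last noun before the verb.
            binding["subject"] = word
        elif state == "SEEK_VERB" and is_verb:
            binding["verb"] = word
            state = "SEEK_OBJECT"
        elif state == "SEEK_OBJECT" and is_noun:
            binding["object"] = word
            break
    return binding
```

Given the clause "The teacher grades the papers" tagged as DT, NN, VBZ, DT, NNS, this sketch binds "teacher" as subject, "grades" as verb, and "papers" as object.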

2.2  Prototype Functional Description

The major functional components are shown in Figure 1. A user will interact with the LASI GUI and create a new project using documents that are of the correct file type and stripped of all graphics. The user will need to fill out all of the information required to create a new project. These actions will result in a new project being created and the document converter being called. When documents are added to a project in the GUI, the document converter takes the DOC and DOCX files and converts each one to an XML file. Once a document is in XML, it is converted to raw text that can be used by the parts-of-speech tagger.

The user is then taken to the document preview, where they can either remove or add documents. Once analysis has begun, SharpNLP will embed part-of-speech tags into the text from each document. The tagged file is then passed on to the tagged file reader, which assigns each word and phrase a Word or Phrase type corresponding to the part of speech given by the tagger.
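The reading step can be illustrated with a short Python sketch. The exact layout of the tagged file is not given here, so the sketch assumes whitespace-separated tokens of the form `word/TAG`, a common convention for part-of-speech taggers; the function name `read_tagged` is hypothetical.

```python
def read_tagged(text):
    """Split tagged text into (word, tag) pairs.

    Assumes each token looks like 'word/TAG'. The tag is taken after the
    LAST slash so that words containing a slash (e.g. '1/2/CD') survive.
    Tokens without both a word and a tag are skipped.
    """
    pairs = []
    for token in text.split():
        word, separator, tag = token.rpartition("/")
        if not separator or not word or not tag:
            continue
        pairs.append((word, tag))
    return pairs
```

Each resulting pair could then be handed to whatever routine constructs the matching Word type.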

Once the word and phrase binding is finished, the algorithm will begin weighting the words based on their frequency and the number of times they are referenced. The weighting metrics for each word will be based on a raw frequency as well as a relative frequency. The raw frequency is based on a simple word count, the number of times that the word was used in a particular manner, and a frequency count for synonyms of that word. The relative frequency will be based on the subject, verb, and object relationships between words as well as where a word is located in a document. Each of the word weights will then be passed on to the GUI Results View for the user to view.
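A rough Python sketch of how a raw count and a role-based adjustment might combine into a word weight is shown below. The bonus values, the additive formula, and the names `ROLE_BONUS` and `word_weights` are assumptions made purely for illustration; the actual weighting formula is not specified in this document, and synonym counts and document position are omitted.

```python
from collections import Counter

# Hypothetical bonuses for each syntactic role a word is bound to.
ROLE_BONUS = {"subject": 2.0, "object": 1.5, "verb": 1.0}


def word_weights(words, roles):
    """Combine raw frequency with role-based bonuses.

    words: every word occurrence in the document, in order.
    roles: (word, role) pairs produced by the binding step.
    Returns (raw_counts, weights).
    """
    raw_counts = Counter(words)
    # Start each weight at the raw frequency.
    weights = {word: float(count) for word, count in raw_counts.items()}
    # Add a bonus each time the word appears in a bound role.
    for word, role in roles:
        if word in weights:
            weights[word] += ROLE_BONUS.get(role, 0.0)
    return raw_counts, weights
```

With this sketch, a word that appears twice and is bound once as a subject would receive a weight of 4.0, which the Results View could display alongside its raw count of 2.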