Web-scale Semantic Social Mash-Ups with Provenance

1. Problem Statement:

As the Web grows ever larger and more data becomes available on it, everyday users want free and unrestricted access to combine data from multiple sources, often in order to discover information about a particular entity, such as a person, place, or organization. For example, a person may want to know the mobile phone number of Dan Connolly, and whether or not they share a social connection on LinkedIn or some other social networking site that would help them find a job. This information is spread across multiple sites reachable through keyword-based search engines, so it is up to the user not only to find the needle in the haystack, but to integrate this knowledge themselves. An alternative approach would have the user identify the entity they want information about, and then let a program find and integrate the data across multiple websites. Websites that make this type of information available do so through specialized APIs, but these have to be integrated on a service-by-service basis using editors such as Microsoft Popfly. Companies like Google are releasing APIs such as OpenSocial that claim to be interoperable Web standards, but in practice are not, working only under restricted conditions with a limited number of data sources. The problem is that each web service, such as LinkedIn, has its own “walled garden” of data that is incompatible with, and not easily “mashed up” with, other sources of data.

One solution to this problem is to base the mash-ups on semantics and open Web standards: give overt semantics to standardized data and then perform the “mash-up” based on these semantics. Many attempts to add semantics to data seek to do so in an open-ended manner, by giving semantics to at least sizable fragments of natural language. Given the unreliability of these methods and the inherent ambiguity of natural language, a far less ambitious but potentially Web-scalable and more practical methodology is to take advantage of common data formats that already have a clear, if informal, meaning associated with them, such as business cards, calendars, social networks, and item reviews. In this case, the user should be able to enter the name of the entity and the desired data about it in terms of a common data format (such as “all business card data for Dan Connolly”), and the mash-up will try to retrieve and “fill in the data” for the requested format, relying on and storing data using open Web standards such as OpenID for identity and Friend-of-a-Friend (FOAF) for social networks.

This approach will be based on the Semantic Web's Resource Description Framework (RDF), a W3C standard for metadata and data integration, and can scale to the Web since each component of the “mash-up” will be given a distinct URI to serve as a “globally unique foreign key” during integration. In this manner, any data that can be mapped to a common RDF vocabulary based on already widely-deployed data formats such as vCard can be “mashed up” or integrated. While much of the data on the Web lacks any semantics, and attempts to add them have been viewed as too complex for users, with the spread of microformats an estimated 500 million web pages have had semantics for these common data formats explicitly added to them (http://microformats.org/). Furthermore, most of the APIs and structured HTML associated with “Web 2.0” applications have an agreed-upon semantics that can be mapped to vocabularies such as iCal and vCard. In this manner, instead of trying to “boot-strap” semantics first, we take advantage of the semantics already available in large quantities on the Web, and can, if necessary, supplement them with techniques from information retrieval and natural language processing such as named-entity recognition and even dependency parsing (Marshall, 2003).
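
To make the “globally unique foreign key” idea concrete, the sketch below (assuming the Python rdflib library) shows how two sources describing the same entity can be merged once both are expressed in RDF: because both use the same URI for the entity, integration reduces to graph union. The URIs, property choices, and data are illustrative assumptions, not real sources.

```python
# A minimal sketch, assuming rdflib: merging business-card and social-network
# data about one entity. The shared URI acts as the "globally unique foreign
# key", so the merge is simply parsing both sources into one graph.
from rdflib import Graph

business_card_source = """
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
<http://example.org/people/dan-connolly>
    vcard:fn "Dan Connolly" ;
    vcard:hasTelephone <tel:+1-555-000-0000> .
"""

social_network_source = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/dan-connolly>
    foaf:knows <http://example.org/people/harry-halpin> .
"""

# Parse each source into the same graph; statements about the same URI line up
# automatically, yielding the integrated "mash-up".
mashup = Graph()
mashup.parse(data=business_card_source, format="turtle")
mashup.parse(data=social_network_source, format="turtle")

for subject, predicate, obj in mashup:
    print(subject, predicate, obj)
```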

However, another question that arises is the source and quality of the data. Some data sources, such as LinkedIn, will be of high quality but not easily exportable due to a lack of APIs. MySpace, as another example, uses invalid HTML that makes it difficult even for scrapers to convert its pages to RDF. Other data sources, like Wikipedia, may require some natural language processing (or spidering of the structured data Wikipedia provides in its “infoboxes”), a process fraught with error. Without some form of pre-processing, it is difficult for ordinary users to create trustworthy mash-ups from this data.

Our solution is to track the provenance of all the data in the mash-up, which includes not only the source of the data but also whatever processing is done to it, with each step of processing tracked in a step-by-step fashion, like the steps of a mathematical proof. This allows users to see where the data came from, and so “follow their nose” to the source of the data in the mash-up, as well as to information about the tools used to process the data, and so take into account as much of the contextual elasticity of the semantics as possible by annotating them explicitly as temporally-dated steps in the proof. If a user finds an error in the integration, they should be able to correct it by simply removing the offending source or component from the “mash-up.” By using a functional framework with tight ties to a formal logic via the Curry-Howard Isomorphism (Wadler, 1989), provenance-tracking can be built into the very fabric of the mash-up itself, allowing “provenance for free” with no additional work by mash-up creators. The provenance lets other users decide whether they trust the results and correct errors, allowing ordinary users, not only experts, to create data with semantics and share it with others. This framework lets users comment on and correct other people's mash-ups, with these changes also tracked via provenance information attached as “proofs” of the data, allowing the “wisdom of crowds” to be applied to mash-up data in a principled way. We believe that future integration of such work with graphical interfaces such as Popfly would allow users to create mash-ups that combine their social networking and personal data with the vast amount of data in their spreadsheets and documents using open Web standards, a much more productive method for adding semantics to the Web than relying on experts or an API dominated by a single company.
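
As an illustration of the “provenance for free” idea, the following sketch wraps ordinary processing components so that every application automatically appends a time-stamped proof step recording which component produced which output from which input. All names here (Traced, step, the toy components) are hypothetical stand-ins for the eventual N3 Logic-based framework, not its actual API.

```python
# A minimal sketch of provenance tracking built into function composition.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, List

@dataclass
class Traced:
    value: Any                                        # the data flowing through the pipeline
    proof: List[dict] = field(default_factory=list)   # one record per processing step

def step(component: Callable[[Any], Any], name: str) -> Callable[[Traced], Traced]:
    """Lift an ordinary component into one that records its own proof step."""
    def traced_component(data: Traced) -> Traced:
        result = component(data.value)
        record = {
            "component": name,
            "applied_at": datetime.now(timezone.utc).isoformat(),
            "input_summary": repr(data.value)[:80],
        }
        return Traced(result, data.proof + [record])
    return traced_component

# Hypothetical components; in the real framework these would be HTML Tidy,
# microformat extractors, named-entity recognizers, and so on.
tidy_html = step(lambda html: html.strip(), "html-tidy")
extract_name = step(lambda html: {"fn": "Dan Connolly"}, "hcard-extractor")

result = extract_name(tidy_html(Traced("  <div class='vcard'>Dan Connolly</div>  ")))
print(result.value)   # the mashed-up data
print(result.proof)   # the step-by-step provenance, read like a proof trace
```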

2. Expected Outcome

The expected outcome is a proof-of-concept of using the Semantic Web as a transport and integration format for powering Web-scale “mash-ups” from heterogeneous sources of data, including Microsoft Office documents, using a logical and functional framework that builds provenance into the very process of extraction and integration. We will also demonstrate how these mash-ups can be inserted into HTML by showing how this functional framework can be embedded directly into the DOM tree. More concretely, the project will deliver a theory-based deliverable, a practical deliverable that can be used in demonstrations, and guidelines for using our work with Microsoft Office documents.

A formal semantics for N3 Logic and a stabilized N3 syntax: Currently, relatively little data (approximately 1 billion instances) is available in a form the Semantic Web can use, especially given its Web-scale goals. This can be explained in part by the confusing serialization of the abstract RDF model into XML as RDF/XML, which almost all users and developers find unreadable and overly complex. The main informal alternative to RDF/XML is a compact, more readable syntax for RDF called “N3.” However, even this grammar has never been formally defined and so has fragmented into MIT N3, Turtle, N-Triples, and a fragment of the SPARQL syntax. N3 Logic extends RDF by adding variables and quantification in order to provide querying and reasoning capabilities in RDF. Despite being used by MIT in their work on Policy-Aware Data Mining and even within the Cleveland Clinic to manage patient records, N3 Logic has yet to be given a precisely defined formal semantics (Berners-Lee, 2007). Once it has a formal semantics and a normalized syntax, we will map N3 Logic to a functional framework in a principled manner, allowing it to run a large number of components as functions (such as HTML Tidy, or named-entity and geo-tagging web services), where each step of the process automatically produces a step in a proof attached to the output produced by the component. To be published as an academic paper.
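
As a sketch of how an N3 Logic rule might map onto the functional framework, the example below (using rdflib) treats a rule with variables as a function over an RDF graph: applying it yields derived triples, each paired with a justification consisting of the rule and the variable bindings that matched. The rule, URIs, and helper names are assumptions for illustration, not the final mapping.

```python
# A minimal sketch: an N3 Logic-style rule realized as a function whose
# applications carry their own justifications (proof steps).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")

# N3-style rule, informally:  { ?x vcard:fn ?name } => { ?x rdfs:label ?name }
def fn_implies_label(graph: Graph):
    """Apply the rule; return derived triples paired with their justifications."""
    derivations = []
    for person, name in graph.subject_objects(VCARD.fn):
        conclusion = (person, RDFS.label, name)
        justification = {
            "rule": "{ ?x vcard:fn ?name } => { ?x rdfs:label ?name }",
            "bindings": {"?x": person, "?name": name},
        }
        derivations.append((conclusion, justification))
    return derivations

data = Graph()
data.add((URIRef("http://example.org/people/dan-connolly"), VCARD.fn, Literal("Dan Connolly")))

for triple, proof in fn_implies_label(data):
    data.add(triple)       # assert the conclusion
    print(triple, proof)   # and keep its proof step alongside it
```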

A Large-Scale Proof-of-Concept of Semantic Mash-ups with Provenance: This framework will then be demonstrated using a diverse set of heterogeneous sources of data. The structured data-set will be created by selecting named entities and locations from Microsoft Live Search queries, and will be supplemented with unstructured “Web in the wild” data by spidering the click-through results of the selected queries. This data will vary in the amount of structure already inherent in it. Some of the data will already be structured as RDF gathered from the Linked Open Data project (http://linkeddata.org/), and this data will have a high degree of structure. Another source of data will be RSS feeds and microformat-enabled pages that contain structured data whose semantics can be extracted automatically via a mechanism like GRDDL, which lets a vocabulary author provide their own transform to RDF in a self-describing manner for XML and HTML documents (sketched below); this data will be of varying quality and only partially structured, and so will have to be run through multiple components (such as HTML Tidy) of varying reliability in order to extract the semantics. Lastly, we will also try to merge both high-quality natural language data from Wikipedia and less reliable natural language data, using a pipeline of natural language processing tools including part-of-speech taggers, named-entity recognition, dependency parsers, and geo-taggers. Any data from the “Web at large” garnered through the click-through records will likely have no structure and so require more pre-processing than the other data sources. The provenance of exactly which components have processed the data is as important as where precisely the data came from. This large-scale demonstration on real data will be created using N3 Logic and its functional equivalent, showing how the framework can tackle the problems of processing data from a wide variety of heterogeneous sources while tracking and optimizing via provenance information attached as proofs. The results will be evaluated with a user study comparing it to traditional search engines, and released as open source to run via web services on live data.
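
The sketch below illustrates the GRDDL step referred to above: a source document is paired with an XSLT transform that turns it into RDF/XML, which is then loaded into the mash-up graph. It assumes the Python lxml and rdflib libraries, and the toy XML and stylesheet are simplified stand-ins for a real microformat page and its published GRDDL transform.

```python
# A minimal sketch of a GRDDL-style transformation: XML in, RDF out.
from lxml import etree
from rdflib import Graph

source_doc = etree.XML("""
<card id="http://example.org/people/dan-connolly">
  <fn>Dan Connolly</fn>
</card>
""")

grddl_transform = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:vcard="http://www.w3.org/2006/vcard/ns#">
  <xsl:template match="/card">
    <rdf:RDF>
      <rdf:Description rdf:about="{@id}">
        <vcard:fn><xsl:value-of select="fn"/></vcard:fn>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>
</xsl:stylesheet>
""")

# Apply the transform, then parse the resulting RDF/XML into an rdflib graph.
rdf_xml = str(etree.XSLT(grddl_transform)(source_doc))
graph = Graph().parse(data=rdf_xml, format="xml")
for triple in graph:
    print(triple)
```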

Guidelines for Integrating Microsoft Office Documents into the Mash-Ups: As the majority of the world's digital knowledge is stored in Microsoft Office documents, and Microsoft Office is capable of XML-based output, it should be possible to create a transformation from this XML to RDF in order to integrate the data into a semantic mash-up. A set of guidelines, made publicly available on the Web, will be produced, including any ideas for possible changes to the format. Furthermore, these transformations will be capable of being integrated into the proof-of-concept framework, so that the demonstration can feature data from Microsoft Excel being mashed up with data from sources such as Facebook and Wikipedia.
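
As a first indication of what these guidelines will cover, the sketch below turns spreadsheet-style XML into RDF so that it can join the mash-up. The row/cell shape is a simplified, hypothetical stand-in for real Microsoft Office XML output (the guidelines would address the actual SpreadsheetML markup), and the vCard property choices are illustrative.

```python
# A minimal sketch: rows of a spreadsheet become RDF resources with minted URIs.
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, URIRef

VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")

spreadsheet_xml = """
<rows>
  <row><cell>Dan Connolly</cell><cell>+1-555-000-0000</cell></row>
  <row><cell>Harry Halpin</cell><cell>+44-555-000-0000</cell></row>
</rows>
"""

graph = Graph()
for i, row in enumerate(ET.fromstring(spreadsheet_xml).findall("row")):
    name, phone = (cell.text for cell in row.findall("cell"))
    person = URIRef(f"http://example.org/contacts/{i}")   # minted URI for the row
    graph.add((person, VCARD.fn, Literal(name)))
    graph.add((person, VCARD.hasTelephone, Literal(phone)))

print(graph.serialize(format="turtle"))
```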

3. Schedule:

Jan-March: Create the formal semantics for N3 Logic needed to track provenance. If possible, show its correspondence to a functional framework via a Curry-Howard Isomorphism. At the same time, gather the data-set needed for the demonstration by processing the Live Search data-set and spidering click-through results, Wikipedia, Semantic Web, and microformat-based data. Also, create and disseminate a Web-based survey to determine which other sources to concentrate on.

April-June: Implement the functional provenance-tracking mash-up framework based on the formal semantics created in the previous step. Convert the previously collected data-set to RDF. Produce the first academic publication, on the formal semantics of proof-based provenance tracking on the Semantic Web.

July-September: Produce the proof-of-concept demonstration of how a Semantic Web-enhanced functional framework with rules can create “mash-ups” from the data-set, with an evaluation. Produce the second academic publication, detailing the practical use of the Semantic Web as a transport layer for “mash-ups.”

Oct-November: Set up web services that allow users to dynamically integrate new data with our data-set. Run experiments to see how adding new data to the data-set affects performance.

December: Investigate the integration of data from Microsoft Office into the semantically-enabled data-set, and produce a series of guidelines on how current Microsoft Office documents can be integrated into Semantic Web “mash-ups.” Project complete.

4. Use of Funds:

Half-time pay for Harry Halpin: $29,250. Note that Harry Halpin is a postgraduate researcher whose income consists entirely of research grants. If funded through this proposal, he will devote at least half of his time to the project.

10%-time pay for Henry Thompson: $10,000. This funding allows Henry Thompson to participate substantially in the project.

Development Server Costs: $1,820. This is the standard cost of a high-performance development server needed for developing, unit-testing, and model-testing the mash-up framework.

Web Services Costs: $1,340. This is the cost of the virtual servers needed to host the web services for named-entity recognition, dependency parsing, tokenization, and GRDDL-transformations needed for live demonstrations.

Evaluation Costs: $1,100. The cost of paying each evaluator to evaluate the results of the proof of concept.