NLP has been used to ground temporal and semantic information in video processing. Video annotations mark actions [verbs], tools [objects], locations, and the begin and end times of each action.
Two main processes were performed to automatically extract semantic information from the texts accompanying videos: (1) syntactic parsing of the texts, which may be closed captions, transcripts, or the output of automatic speech recognition; parsers and entity extractors are run on these textual data; (2) semantic processing to determine the semantic relatedness of word relations, such as verb-object and object-instrument. We used statistical measures of lexical collocation and lexical similarity and built co-occurrence matrices to feed action recognition. We focus here on semantic extraction, which uses the web and other knowledge sources to develop a semantic background in the form of a co-occurrence matrix. This provides the likelihood of a tool and an action occurring at the same time in a video. The following table shows the domain actions from our seed data and the tools accompanying these actions.
               coloring   cutting   drawing   gluing   painting   placing
brush              0          0         0        1         8         0
glue               0          0         0       20         0         0
scissors           0         38         0        0         0         0
Writing tool      12          0        42        0         0         0
Table 1. Co-occurrence matrix extracted from training data – “Writing tool” represents the sum of “pencil”, “pen”, and “crayon”.
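As an illustration, counts like those in Table 1 can be accumulated directly from the (action, tool) pairs in the video annotations. The sketch below is ours, not the paper's pipeline: the annotations list is hypothetical example data, and pencil, pen, and crayon are collapsed into the aggregate "Writing tool" row as the caption describes.

from collections import Counter

# Hypothetical seed annotations: (action, tool) pairs taken from the
# video labels. Real data would be parsed from the annotation files.
annotations = [
    ("cutting", "scissors"), ("cutting", "scissors"),
    ("gluing", "glue"), ("drawing", "pencil"),
    ("painting", "brush"), ("drawing", "crayon"),
]

# Individual writing implements are collapsed into one aggregate row,
# as in the Table 1 caption.
WRITING_TOOLS = {"pencil", "pen", "crayon"}

counts = Counter()
for action, tool in annotations:
    row = "Writing tool" if tool in WRITING_TOOLS else tool
    counts[(row, action)] += 1

print(counts[("scissors", "cutting")])  # a raw count, as in Table 1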
Extracting co-occurrence matrices using a relatedness measure.
To create co-occurrence matrices, we experimented with three different sources of domain knowledge: Wikipedia enriched with WordNet, ConceptNet, and web search results. Each of these sources has its own strengths and weaknesses, and they can be used in conjunction with one another in the global model that integrates vision and text features. We report here on the web co-occurrence matrix, where search results are extracted from the Yahoo search engine along with a relatedness measure. We use a semantic distance measure called the Normalized Google Distance (NGD), proposed by Cilibrasi and Vitanyi [2007]. The NGD is a relatedness measure based on the Normalized Compression Distance (NCD), modified so that the number of results returned by Google stands in for the compressor in NCD. We use Yahoo instead of Google because the Yahoo API is more flexible and the Google API did not provide the number of search results returned; the implementation remains the same. To find the semantic distance between two terms x and y, we use the Normalized Google Distance equation:
$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x),\, \log f(y)\}}$$
where f(x) is the number of search results for query x, f(x, y) is the number of search results for the conjoint query x and y, and N is the total number of pages indexed by the search engine. Roughly speaking, this returns a value between 0 and 1, although the distance is unbounded and can exceed 1 in practice (as the values in Table 3 do). The lower the number, the more related the two queries are, with two identical queries having a distance of 0. We calculate the NGD for each action-tool pair and enter it into a matrix to form our co-occurrence matrix.
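A minimal sketch of this computation follows. The hit_count function is a placeholder for the search API call (the paper used the Yahoo API, so a real implementation would wire it to that or a comparable service), and N is an assumed constant for the index size, which in practice must be estimated. Pairs with no co-occurrence (f(x, y) = 0) are given an effectively infinite distance here, matching the smoothing of NA cells described under Table 3.

import math

N = 5e10  # assumed number of indexed pages; must be estimated in practice

def hit_count(query: str) -> int:
    """Placeholder for a search-engine API call that returns the number
    of results for `query` (the paper used the Yahoo search API)."""
    raise NotImplementedError("replace with a real search API call")

def ngd(x: str, y: str, n: float = N) -> float:
    """Normalized Google Distance between terms x and y
    (Cilibrasi and Vitanyi [2007])."""
    fx, fy = hit_count(x), hit_count(y)
    fxy = hit_count(f"{x} {y}")  # conjoint query: pages with both terms
    if min(fx, fy, fxy) == 0:
        # No co-occurrence: smooth to a high distance (Table 3's NA cells).
        return float("inf")
    numerator = max(math.log(fx), math.log(fy)) - math.log(fxy)
    denominator = math.log(n) - min(math.log(fx), math.log(fy))
    return numerator / denominator

# Fill the action-tool matrix pairwise (runs once hit_count is wired up).
actions = ["coloring", "cutting", "drawing", "gluing", "painting", "placing"]
tools = ["brush", "glue", "scissors", "pencil"]
web_matrix = {(tool, act): ngd(tool, act) for tool in tools for act in actions}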
               coloring   cutting   drawing   gluing   painting   placing
brush            2.51       2.11      2.40      NA       1.85       NA
glue             2.51       2.51      2.51     1.20      2.44       NA
scissors         2.47       1.76      2.36      NA       2.68       NA
Writing tool     2.12       3.51      1.72      NA       2.08       NA
Table 3. Web Matrix – Distance values from the Normalized Google Distance with a specified domain and pattern matching. Lower values indicate that two terms are more related. NA values indicate little or no co-occurrence and were smoothed to a high distance value in the model. “Writing tool” represents the average of “pencil”, “marker”, “pen”, and “crayon”.
Our modified NGD algorithm has uses beyond constructing a relatedness matrix. For example, Wikipedia mining may return a large number of results, some of which are not related to the particular domain of interest. We can use our distance algorithm to prune results that are not specific to the domain by using domain scaling and pattern matching, as sketched below.
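The text does not spell out the scaling procedure, so the following is only a sketch of one plausible reading: each query is conjoined with a domain anchor term so the counts reflect in-domain usage, and mined candidates farther than a threshold from a target term are discarded. The anchor term, the threshold, and the reuse of the ngd function from the earlier sketch are all assumptions.

DOMAIN = "arts and crafts"  # assumed domain anchor term
THRESHOLD = 2.0             # assumed cutoff; would be tuned on held-out data

def domain_scaled_ngd(x: str, y: str, domain: str = DOMAIN) -> float:
    """NGD over domain-restricted queries: conjoining each query with
    the domain term scales the counts toward in-domain usage."""
    return ngd(f"{x} {domain}", f"{y} {domain}")

def prune(candidates: list[str], target: str) -> list[str]:
    """Drop mined candidates (e.g., from Wikipedia) that are too distant
    from `target` under the domain-scaled measure."""
    return [c for c in candidates if domain_scaled_ngd(c, target) <= THRESHOLD]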