Building and Interrogating Knowledge Graphs from Text

Much of the promise of wide-coverage parsing for NLP applications such as interrogating knowledge graphs remains unfulfilled.

The main obstacle lies in the semantic representations that such parsers deliver, which are invariably highly language- and form-dependent. Researchers have tried since the early '70s to build a less form-dependent computable natural language semantics by hand, and have invariably failed.

The approach advocated in the present project uses machine reading with wide-coverage parsers on large amounts of text to build entailment graphs. This is done by finding type-consistent patterns of directional implication in traditional semantic relations over the same n-tuples of named-entity arguments. For example, for authors X and literary works Y, if we read about ``X writing Y'', we are likely to also read elsewhere about ``X's Y'' and ``X being the author of Y''. This procedure uses directional similarity measures such as Weeds Precision, computed over the number of pairs X,Y, to yield a noisy disjoint graph of potential directional entailments, to each of which a confidence probability can be assigned.
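
As a minimal sketch of the directional measure mentioned above, the following Python computes Weeds Precision between two relations, treating each (X, Y) argument pair as a feature weighted by its raw count. The toy observations, the relation names and the helper names (`collect_pairs`, `weeds_precision`) are illustrative assumptions, not the project's data or code, and a real system would also restrict the comparison to arguments of matching types.

```python
from collections import Counter

def collect_pairs(observations):
    """Group (relation, X, Y) triples into per-relation counts of (X, Y) pairs."""
    feats = {}
    for rel, x, y in observations:
        feats.setdefault(rel, Counter())[(x, y)] += 1
    return feats

def weeds_precision(p_feats, q_feats):
    """Share of p's argument-pair weight that is also seen with q.

    A high value is evidence for "p entails q"; the measure is asymmetric.
    """
    total = sum(p_feats.values())
    if total == 0:
        return 0.0
    shared = sum(w for pair, w in p_feats.items() if pair in q_feats)
    return shared / total

# Hypothetical toy observations, as if extracted by a parser.
observations = [
    ("write", "Austen", "Emma"),
    ("write", "Dickens", "Bleak House"),
    ("be_author_of", "Austen", "Emma"),
    ("be_author_of", "Dickens", "Bleak House"),
    ("be_author_of", "Orwell", "1984"),
]
feats = collect_pairs(observations)
forward = weeds_precision(feats["write"], feats["be_author_of"])
backward = weeds_precision(feats["be_author_of"], feats["write"])
print(forward, backward)
```

On this toy data every pair seen with `write` is also seen with `be_author_of`, but not the reverse, so the two directions receive different scores.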

These still-noisy probabilities are then made into an all-or-none entailment graph, exploiting the fact that true entailments are closed under transitivity. (That is, if we believe on the basis of strong local evidence that A entails B, and that B entails C, then we must believe A entails C, however weak the local evidence is.) This process yields a set of graphs of hard entailments, in which cliques such as the above can be collapsed to a single paraphrase relation. Since identifying form-independent entailments depends only on the types of the named entities, the relations in the graph may even be extracted from text in more than one language (Lewis and Steedman 2013a,b).
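
The effect of transitivity can be illustrated with a deliberately simplified sketch: threshold the local scores, take the transitive closure of the surviving edges, and group mutually entailing relations into paraphrase clusters. This is an assumption made for illustration only; the text above does not specify the procedure, and a global method would weigh all scores jointly rather than apply a fixed cut-off. The scores, the threshold and the function names are hypothetical.

```python
def transitive_closure(nodes, edges):
    """Warshall-style closure: reach[a] is the set of relations a entails."""
    reach = {n: set() for n in nodes}
    for a, b in edges:
        reach[a].add(b)
    for k in nodes:
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach

def paraphrase_clusters(nodes, reach):
    """Group relations that entail each other in both directions."""
    clusters, assigned = [], set()
    for a in nodes:
        if a in assigned:
            continue
        cluster = {a} | {b for b in reach[a] if a in reach[b]}
        assigned |= cluster
        clusters.append(frozenset(cluster))
    return clusters

# Hypothetical local confidence scores for "p entails q".
scores = {
    ("be_author_of", "write"): 0.90,
    ("write", "be_author_of"): 0.85,
    ("write", "work_on"): 0.80,
    ("be_author_of", "work_on"): 0.20,  # weak local evidence
}
THRESHOLD = 0.5  # assumed cut-off
nodes = sorted({r for pair in scores for r in pair})
edges = [pair for pair, s in scores.items() if s >= THRESHOLD]

reach = transitive_closure(nodes, edges)
clusters = paraphrase_clusters(nodes, reach)
print(reach["be_author_of"], clusters)
```

Here `be_author_of` ends up entailing `work_on` by transitivity although its local score was below the threshold, and `write` and `be_author_of` fall into one cluster that could carry a single relation label.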

The original form-specific semantics in the parser can then be replaced by a form- and language-independent semantics, in which linguistic forms which are paraphrases are represented by the same relation label, unique to the clique concerned, and directional entailments are represented by conjunctions of such relations. Representing entailments as conjunctions has the advantage of being immediately compatible with the logical operators of traditional logical semantics, such as negation and quantification (Lewis and Steedman, 2014). The papers cited above provide a proof of concept for the method outlined here, and improve empirically on other methods, including vector-based ones, on standard entailment tasks.

The ultimate goal of the present proposal is to leverage this form-independent semantics to create novel large knowledge graphs from unlabeled text, in which the nodes are the named entities, and the arcs the form-independent relation-identifiers. Harrison and Clark (2009) have shown that a very old technique using ``spreading activation'' actually works to limit the otherwise exponential growth in costs of updating and querying the very large knowledge graphs that it is possible to build with modern computing machinery. Related techniques have been proposed by Lao et al. (2012). Among other datasets, we are investigating the Huawei product manuals and FAQ pages in this connection, working with Yantao Jia at Huawei.
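
The idea behind spreading activation can be sketched as follows: activation starts at the query entities, decays at each hop, and stops propagating once it falls below a threshold, so that only a bounded neighbourhood of the graph is visited. This is a generic illustration and not the algorithm of the cited work; the graph, the relation labels, the decay factor and the threshold are all assumptions.

```python
from collections import deque

def spread_activation(graph, seeds, decay=0.5, threshold=0.2):
    """Propagate decaying activation from seed nodes over labelled arcs.

    graph: node -> list of (relation_label, neighbour)
    seeds: node -> initial activation
    A node keeps the highest activation that reaches it.
    """
    activation = dict(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        out = activation[node] * decay
        if out < threshold:
            continue  # too weak to spread further
        for _relation, neighbour in graph.get(node, []):
            if out > activation.get(neighbour, 0.0):
                activation[neighbour] = out
                queue.append(neighbour)
    return activation

# Hypothetical knowledge graph with form-independent relation labels.
graph = {
    "Austen": [("REL_author_of", "Emma"), ("REL_born_in", "Steventon")],
    "Emma": [("REL_published_by", "John Murray")],
    "John Murray": [("REL_based_in", "London")],
}
activation = spread_activation(graph, {"Austen": 1.0})
print(activation)
```

With these settings the node `London`, three hops from the seed, is never activated, which shows how the threshold limits the part of the graph that a query or update has to touch.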