.. _ie-graph-construction: IE Graph Construction ===================== The information extraction graph (IE Graph) aims to extract the structural information to represent the high-level information among natural sentences, e.g., text-based documents. We divide this process into two three basic steps: 1) coreference resolution, 2) constructing IE relations, 3) graph construction. The figure below is an example for IE graph construction which contains both the Co-reference process and the Open Information Extraction process. .. image:: ../imgs/IE_graph_v1.jpg Coreference resolution is the basic procedure for information extraction task which aims to find expressions that refer to the same entities in the text sequence. For example, the noun "James" and pronouns "He" may refer to the same object (person) in sentence "James is on shop. He buys eggs." In this step, all pronouns in the raw paragraph will be replaced with the corresponding nouns. The code below shows the co-reference resolution process. We first use processor such as ``Standford CoreNLP`` to obtain the co-reference dict of the raw text input. Then, we replace the ``pron strs`` in the raw input with the ``original entity str`` to get the resolved text. .. code:: # Do coreference resolution on the whole 'raw_text_data' coref_json = nlp_processor.annotate(raw_text_data.strip(), properties=props_coref) from .utils import CORENLP_TIMEOUT_SIGNATURE if CORENLP_TIMEOUT_SIGNATURE in coref_json: raise TimeoutError('Coref-CoreNLP timed out at input: \n{}\n This item will be skipped. ' 'Please check the input or change the timeout threshold.'.format(raw_text_data)) coref_dict = json.loads(coref_json) # Extract and preserve necessary parsing results from coref_dict['sentences'] # sent_dict['tokenWords']: list of tokens in a sentence sentences = [] for sent in coref_dict['sentences']: sent_dict = {} sent_dict['sentNum'] = sent['index'] # start from 0 sent_dict['tokens'] = sent['tokens'] sent_dict['tokenWords'] = [token['word'] for token in sent['tokens']] sent_dict['sentText'] = ' '.join(sent_dict['tokenWords']) sentences.append(sent_dict) for k, v in coref_dict['corefs'].items(): # v is a list of dict, each dict contains a str # v[0] contains 'original entity str' # v[1:] contain 'pron strs' refers to 'original entity str' ent_text = v[0]['text'] # 'original entity str' if ',' in ent_text: # cut the 'original entity str' if it is too long ent_text = ent_text.split(',')[0].strip() ent_sentNum = v[0]['sentNum'] - 1 # the sentNum 'original entity str' appears in ent_startIndex = v[0]['startIndex'] - 1 # the startIndex 'original entity str' appears in ent_endIndex = v[0]['endIndex'] - 1 # the endIndex 'original entity str' appears in for pron in v[1:]: pron_text = pron['text'] # 'pron strs' if ent_text == pron_text or v[0]['text'] == pron_text: continue pron_sentNum = pron['sentNum'] - 1 # the sentNum 'pron str' appears in pron_startIndex = pron['startIndex'] - 1 pron_endIndex = pron['endIndex'] - 1 # replace 'pron str' with 'original entity str' sentences[pron_sentNum]['tokenWords'][pron_startIndex] = ent_text for rm_idx in range(pron_startIndex+1, pron_endIndex): sentences[pron_sentNum]['tokenWords'][rm_idx] = "" # build resolved text for sent_id, _ in enumerate(sentences): sentences[sent_id]['tokenWords'] = list(filter(lambda a: a != "", sentences[sent_id]['tokenWords'])) sentences[sent_id]['resolvedText'] = ' '.join(sentences[sent_id]['tokenWords']) To construct an IE graph, the second step is to extract the triples from the resolved text, which could be completed by leveraging some well-known information extraction systems (i.e. OpenIE). After this step, we can obtain a list of triples, and a triple can be denoted as :math:`(n_i, r_{i, j}, n_j)`. .. code:: # use OpenIE to extract triples from resolvedText all_sent_triples = {} for sent in sentences: resolved_sent = sent['resolvedText'] openie_json = nlp_processor.annotate(resolved_sent.strip(), properties=props_openie) if CORENLP_TIMEOUT_SIGNATURE in openie_json: raise TimeoutError('OpenIE-CoreNLP timed out at input: \n{}\n This item will be skipped. ' 'Please check the input or change the timeout threshold.'.format(raw_text_data)) openie_dict = json.loads(openie_json) for triple_dict in openie_dict['sentences'][0]['openie']: sbj = triple_dict['subject'] rel = triple_dict['relation'] if rel in ['was', 'is', 'were', 'are']: continue obj = triple_dict['object'] # If two triples have the same subject and relation, # only preserve the one has longer object if sbj+''+rel not in all_sent_triples.keys(): all_sent_triples[sbj+''+rel] = [sbj, rel, obj] else: if len(obj)>len(all_sent_triples[sbj+''+rel][2]): all_sent_triples[sbj + '' + rel] = [sbj, rel, obj] all_sent_triples_list = list(all_sent_triples.values()) # triples extracted from all sentences How to use ---------- For example, we can construct the IE graph given a raw textual input: .. code:: python from graph4nlp.pytorch.modules.graph_construction.ie_graph_construction import IEBasedGraphConstruction from stanfordcorenlp import StanfordCoreNLP raw_data = ('James is on shop. He buys eggs.') nlp_parser = StanfordCoreNLP('http://localhost', port=9000, timeout=300000) props_coref = { 'annotators': 'tokenize, ssplit, pos, lemma, ner, parse, coref', "tokenize.options": "splitHyphenated=true,normalizeParentheses=true,normalizeOtherBrackets=true", "tokenize.whitespace": False, 'ssplit.isOneSentence': False, 'outputFormat': 'json' } props_openie = { 'annotators': 'tokenize, ssplit, pos, ner, parse, openie', "tokenize.options": "splitHyphenated=true,normalizeParentheses=true,normalizeOtherBrackets=true", "tokenize.whitespace": False, 'ssplit.isOneSentence': False, 'outputFormat': 'json', "openie.triple.strict": "true" } processor_args = [props_coref, props_openie] graphdata = IEBasedGraphConstruction.topology(raw_data, nlp_parser, processor_args=processor_args, merge_strategy=None, edge_strategy=None) Merge strategy ^^^^^^^^^^^^^^ Since the ie graph is only constructed for sentences individually, we provide options to construct one graph for the paragraph consisting of multiple sentences. Currently, we support the following options: 1. ``None``. Do not add additional nodes and edges and the original subgraphs may not be connected. 2. ``global``. It means all subjects in extracted triples are connected by a "GLOBAL_NODE" using a "global" edge. 3. ``user_define``. We suggest users to define their merge strategy by overriding the ``_graph_connect``. .. code:: python from graph4nlp.pytorch.modules.graph_construction.ie_graph_construction import IEBasedGraphConstruction class NewIEBasedGraphConstruction(IEBasedGraphConstruction): def _graph_connect(cls, triple_list, merge_strategy=None): ... Edge Strategy ^^^^^^^^^^^^^ There are various dependency relations for ie graph nodes. Currently, we support the following options: 1. ``None``. It means we will not add additional edges. 2. ``as_node``. We will view each edge as a graph node and construct the bipartite graph. For example, if there is an edge whose type is :math:`k` between node :math:`i` and node :math:`j`, we will insert a node :math:`k` into the graph and link node :math:`(i, k)` and :math:`(k, j)`.