graph4nlp.data

The Graph4NLP library uses the class GraphData as the representation for structured data (graphs). GraphData supports basic graph operations, such as adding nodes and edges. GraphData also supports attaching features, which are in tensor form, and attributes, which are of arbitrary form, to the corresponding nodes or edges. Batching operations are also supported by GraphData.
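
A minimal sketch of this workflow is shown below. The import path follows the fully qualified class names used in this reference; the feature name 'x' and the attribute key 'token' are illustrative choices.

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
# Tensor-typed data goes into node_features; arbitrary data goes into node_attributes.
>>> g.node_features['x'] = torch.rand((3, 16))
>>> g.node_attributes[0]['token'] = 'hello'
>>> g.get_node_num(), g.get_edge_num()
(3, 2)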

Graph Representation

class graph4nlp.data.data.GraphData(src=None, device: str = None)

Represent a single graph with additional attributes.

Attributes
batch_edge_features

Edge version of self.batch_node_features

batch_node_features

Get a view of the batched (padded) version of the node features.

edge_attributes

Get the edge attributes in a list.

edge_features

Get all the edge features in a dictionary.

edges

Return an edge view of the edges and the corresponding data

node_attributes

Access node attribute dictionary

node_features

Access and modify node feature vectors (tensor).

nodes

Return a node view through which the user can access the features and attributes.

split_edge_features
split_node_features

Methods

add_edge(src, tgt)

Add one edge to the graph.

add_edges(src, tgt)

Add a bunch of edges to the graph.

add_nodes(node_num)

Add a number of nodes to the graph.

adj_matrix([batch_view, post_processing_fn])

Returns the adjacency matrix of the graph.

copy_batch_info(batch)

Copy all the information related to the batching.

edge_ids(src, tgt)

Convert the given endpoints to edge indices.

from_dense_adj(adj)

Construct a graph from a dense (2-D NxN) adjacency matrix with the edge weights represented by the value of the matrix entries.

from_dgl(dgl_g)

Build the graph from dgl.DGLGraph

from_graphdata(src)

Build a clone from a source GraphData

from_scipy_sparse_matrix(adj)

Construct a graph from a sparse adjacency matrix with the edge weights represented by the value of the matrix entries.

get_all_edges()

Get all the edges in the graph

get_edge_feature(edges)

Get the feature of the given edges.

get_edge_feature_names()

Get all the names of edge features

get_edge_num()

Get the number of edges in the graph

get_node_attrs(nodes)

Get the attributes of the given nodes.

get_node_features(nodes)

Get the node feature dictionary of the nodes

get_node_num()

Get the number of nodes in the graph.

node_feature_names()

Get the names of node features.

remove_all_edges()

Remove all the edges and the corresponding features and attributes in GraphData.

set_edge_feature(edges, new_data)

Set edge feature

set_node_features(nodes, new_data)

Set the features of the nodes with the given new_data.

sparse_adj([batch_view])

Return the scipy.sparse.coo_matrix form of the adjacency matrix.

split_features(input_tensor[, type])

Convert a tensor from [N, *] to [B, N_max, *] with zero padding according to the batch information stored in the graph.

to(device)

Move the GraphData object to different devices (cpu, gpu, etc.).

to_dgl()

Convert to dgl.DGLGraph. Note that there will be some information loss when calling this function.

add_edge(src: int, tgt: int)

Add one edge to the graph.

Parameters
src: int

Source node index

tgt: int

Target node index

Raises
ValueError

If one of the endpoints of the edge doesn’t exist in the graph.

add_edges(src: Union[int, List[int]], tgt: Union[int, List[int]]) → None

Add a bunch of edges to the graph.

Parameters
src: int or list

Source node indices

tgt: int or list

Target node indices

Raises
ValueError

If the lengths of src and tgt don’t match or one of the lists is empty.

add_nodes(node_num: int)

Add a number of nodes to the graph.

Parameters

node_num (int) – The number of nodes to be added

adj_matrix(batch_view: bool = False, post_processing_fn: Callable = None) → torch.Tensor

Returns the adjacency matrix of the graph. Returns a 2D tensor if it is a single graph and a 3D tensor if it is a batched graph, with the matrices padded with 0 (B x N x N)

Parameters
batch_view: bool

Whether to return a batched view of the adjacency matrix (3D if True) or not (2D).

post_processing_fn: function

A callback function which takes a binary adjacency matrix (2D) and does some post-processing on it. The return of this function should also be an N x N matrix.

Returns
torch.Tensor:

The adjacency matrix (N x N if batch_view=False and B x N x N if batch_view=True).
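
Examples

A small sketch, assuming the import path used elsewhere in this reference (only the shape is checked here, since the entries depend on which edges exist):

>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
# A single (non-batched) graph yields a 2D N x N tensor.
>>> g.adj_matrix().shape
torch.Size([3, 3])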

property batch_edge_features

Edge version of self.batch_node_features

Returns
BatchEdgeFeatView
property batch_node_features

Get a view of the batched (padded) version of the node features. Shape: (B, N, D).

Returns
BatchNodeFeatView
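
Examples

A sketch of the padded view, using to_batch (documented later in this module) to build a batched graph:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2)
>>> g2.add_nodes(3)
>>> g1.node_features['x'] = torch.rand((2, 8))
>>> g2.node_features['x'] = torch.rand((3, 8))
>>> batch = to_batch([g1, g2])
# B = 2 graphs, N = 3 (largest node count), D = 8; shorter graphs are zero-padded.
>>> batch.batch_node_features['x'].shape
torch.Size([2, 3, 8])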

copy_batch_info(batch: Any) → None

Copy all the information related to the batching.

Parameters
batch: Any

The source batch from which the information comes.

Returns
None
property edge_attributes

Get the edge attributes in a list.

Returns
list

A list of dictionaries. Each dictionary represents all the attributes on the corresponding edge.

property edge_features

Get all the edge features in a dictionary.

Returns
dict

Edge features, with the keys being the feature names and the values being the corresponding tensors.

edge_ids(src: Union[int, List[int]], tgt: Union[int, List[int]]) → List[Any]

Convert the given endpoints to edge indices.

Parameters
src: int or list

The index of source node(s).

tgt: int or list

The index of target node(s).

Returns
list

The index of corresponding edges.

Raises
TypeError

If the parameters are of wrong types.

EdgeNotFoundException

If the edge is not in the graph.
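
Examples

A small sketch, assuming edges are indexed in insertion order:

>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
# The edge 0 -> 1 was added first, so it has index 0.
>>> g.edge_ids(0, 1)
[0]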

property edges

Return an edge view of the edges and the corresponding data

Returns
edges: EdgeView
from_dense_adj(adj: torch.Tensor)

Construct a graph from a dense (2-D NxN) adjacency matrix with the edge weights represented by the value of the matrix entries.

Parameters
adj: torch.Tensor

The tensor representing the adjacency matrix.

Returns
self
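
Examples

A sketch: each nonzero entry (i, j) of the matrix becomes an edge i -> j whose weight is the entry's value:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> adj = torch.tensor([[0.0, 0.5],
...                     [0.0, 0.0]])
>>> g = GraphData().from_dense_adj(adj)
>>> g.get_all_edges()
[(0, 1)]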
from_dgl(dgl_g: dgl.heterograph.DGLHeteroGraph)

Build the graph from dgl.DGLGraph

Parameters
dgl_g: dgl.DGLGraph

The source graph

from_graphdata(src: Any)

Build a clone from a source GraphData

from_scipy_sparse_matrix(adj: scipy.sparse.coo.coo_matrix)

Construct a graph from a sparse adjacency matrix with the edge weights represented by the value of the matrix entries.

Parameters
adj: scipy.sparse.coo_matrix

The object representing the sparse adjacency matrix.

Returns
self
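
Examples

The sparse counterpart of the from_dense_adj example above:

>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> from graph4nlp.pytorch.data.data import GraphData
# A 2 x 2 sparse matrix with a single nonzero entry at (0, 1).
>>> adj = coo_matrix((np.array([0.5]), (np.array([0]), np.array([1]))), shape=(2, 2))
>>> g = GraphData().from_scipy_sparse_matrix(adj)
>>> g.get_all_edges()
[(0, 1)]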
get_all_edges() → List[Tuple[int, int]]

Get all the edges in the graph

Returns
edges: list

List of edges. Each edge is in the shape of the endpoint tuple (src, dst).

get_edge_feature(edges: List[int]) → Dict[str, torch.Tensor]

Get the feature of the given edges.

Parameters
edges: list

Edge indices

Returns
dict

The dictionary containing all relevant features.

get_edge_feature_names()

Get all the names of edge features

get_edge_num() → int

Get the number of edges in the graph

Returns
num_edges: int

The number of edges

get_node_attrs(nodes: Union[int, slice]) → List[Any]

Get the attributes of the given nodes.

Parameters
nodes: int or slice

The given node index

Returns
list

The list of attribute dictionaries of the given nodes.

get_node_features(nodes: Union[int, slice]) → Dict[str, torch.Tensor]

Get the node feature dictionary of the nodes

Parameters
nodes: int or slice

The nodes to be accessed

Returns
node_features: dict

The reference dict of the actual tensor

get_node_num() → int

Get the number of nodes in the graph.

Returns
num_nodes: int

The number of nodes in the graph.

property node_attributes

Access node attribute dictionary

Returns
node_attributes: list

The list of node attributes

node_feature_names() → List[str]

Get the names of node features.

Returns
List[str]

The collection of feature names.

property node_features

Access and modify node feature vectors (tensor). This property can be accessed in a dict-of-dict fashion, with the order being [name][index]. ‘name’ indicates the name of the feature vector. ‘index’ selects the specific nodes to be accessed. When accessed independently, returns the feature dictionary with the format {name: tensor}.

Returns
NodeFeatView

Examples

>>> g = GraphData()
>>> g.add_nodes(10)
>>> import torch
>>> g.node_features['x'] = torch.rand((10, 10))
>>> g.node_features['x'][0]
tensor([0.1036, 0.6757, 0.4702, 0.8938, 0.6337, 0.3290,
        0.6739, 0.1091, 0.7996, 0.0586])
property nodes

Return a node view through which the user can access the features and attributes.

A NodeView object provides a high-level view of the underlying storage of the features and supports both query and modification to the original storage.

Returns
node: NodeView

The node view

remove_all_edges()

Remove all the edges and the corresponding features and attributes in GraphData.

Returns
None

Examples

>>> g = GraphData()
>>> g.add_nodes(10)
>>> g.add_edges(list(range(0, 9, 1)), list(range(1, 10, 1)))
>>> import torch
# Add some feature tensors to the edges
>>> g.edge_features['random'] = torch.rand((9, 1024, 1024))
# Remove all edges and the corresponding data. The tensor memory is freed now.
>>> g.remove_all_edges()
set_edge_feature(edges: Union[int, slice, List[int]], new_data: Dict[str, torch.Tensor])

Set edge feature

Parameters
edges: int or list or slice

Edge indices

new_data: dict

New data

Raises
SizeMismatchException

If the size of the new features does not match the number of edges

set_node_features(nodes: Union[int, slice], new_data: Dict[str, torch.Tensor]) → None

Set the features of the nodes with the given new_data.

Parameters
nodes: int or slice

The nodes involved

new_data: dict

The new data to write. Key indicates feature name and value indicates the actual value.

Raises
SizeMismatchException

If the size of the new features does not match the node number
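
Examples

A sketch of overwriting a slice of an existing feature:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(4)
>>> g.node_features['x'] = torch.zeros((4, 8))
# Overwrite the rows of nodes 0 and 1 only.
>>> g.set_node_features(slice(0, 2), {'x': torch.ones((2, 8))})
>>> g.node_features['x'][0]
tensor([1., 1., 1., 1., 1., 1., 1., 1.])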

sparse_adj(batch_view: bool = False) → Union[torch.Tensor, List[torch.Tensor]]

Return the scipy.sparse.coo_matrix form of the adjacency matrix.

Parameters
batch_view: bool

Whether to return the split view of the adjacency matrix. Returns a list of COO matrices if True.

Returns
torch.Tensor or list of torch.Tensor
split_features(input_tensor: torch.Tensor, type: str = 'node') → torch.Tensor

Convert a tensor from [N, *] to [B, N_max, *] with zero padding according to the batch information stored in the graph.

Parameters
input_tensor: torch.Tensor

The original tensor to be split.

type: str

‘node’ or ‘edge’. Indicates the source of batch information.

Returns
torch.Tensor

The split tensor.
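
Examples

A sketch using to_batch (documented below) to provide the batch information:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2)
>>> g2.add_nodes(3)
>>> batch = to_batch([g1, g2])
# A flat tensor with one row per node (5 in total) becomes (B, N_max, *) with zero padding.
>>> flat = torch.rand((5, 8))
>>> batch.split_features(flat, type='node').shape
torch.Size([2, 3, 8])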
to(device: str)

Move the GraphData object to different devices (cpu, gpu, etc.). The usage of this method is similar to that of torch.Tensor and dgl.DGLGraph.

Parameters
device: str

The target device.

Returns
self
to_dgl() → dgl.heterograph.DGLHeteroGraph

Convert to dgl.DGLGraph. Note that there will be some information loss when calling this function; e.g., the batch-related information will not be copied to the DGLGraph, since it is only intended for computation.

Returns
g: dgl.DGLGraph

The converted dgl.DGLGraph

graph4nlp.data.data.to_batch(graphs: List[graph4nlp.pytorch.data.data.GraphData] = None) → graph4nlp.pytorch.data.data.GraphData

Convert a list of GraphData to a large graph (a batch).

Parameters
graphs: list of GraphData

The list of GraphData to be batched

Returns
GraphData

The large graph containing all the graphs in the batch.
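
Examples

>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2)
>>> g2.add_nodes(3)
>>> batch = to_batch([g1, g2])
# The result is itself a GraphData containing all the nodes of its members.
>>> batch.get_node_num()
5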

Base Dataset Class

class graph4nlp.data.dataset.Dataset(root, topology_builder, topology_subdir, tokenizer=<function word_tokenize>, lower_case=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, target_pretrained_word_emb_name=None, target_pretrained_word_emb_url=None, pretrained_word_emb_cache_dir='.vector_cache/', max_word_vocab_size=None, min_word_vocab_freq=1, use_val_for_vocab=False, seed=1234, thread_number=4, port=9000, timeout=15000, for_inference=False, reused_vocab_model=None, **kwargs)

Base class for datasets.

The dataset is organized in a two-layer index style. Direct access to the dataset object, e.g. Dataset[1], is first resolved through the internal index list, and the resulting index is then used to access the actual data. This design makes sampling easy.

Parameters
root: str

The root directory path where the dataset is stored.

Examples

Suppose we have a Dataset containing 5 data items [‘a’, ‘b’, ‘c’, ‘d’, ‘e’]. The indices of the 5 elements in the list are correspondingly [0, 1, 2, 3, 4]. Suppose the dataset is shuffled, which shuffles the internal index list so that it becomes [2, 3, 1, 4, 0]. An access to the dataset, Dataset[2], will then first look up indices[2], which is 1, and use that index to access the actual data, returning the data item ‘b’. From the user’s perspective, the 3rd ([2]) element of the dataset has been shuffled and is no longer ‘c’.
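
The following plain-Python sketch illustrates the same two-layer lookup; it is an analogy, not the library's actual implementation:

>>> data = ['a', 'b', 'c', 'd', 'e']
>>> indices = [2, 3, 1, 4, 0]  # the internal index list after shuffling
>>> def getitem(i):
...     return data[indices[i]]
...
>>> getitem(2)
'b'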

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

To be implemented in task-specific dataset base class.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_topology(data_items)

Build graph topology for each item in the dataset. The generated graph is bound to the graph attribute of the DataItem.

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

abstract static collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

abstract download()

Download the raw data from the Internet.

abstract parse_file(file_path)

To be implemented in task-specific dataset base class.

property raw_dir

The directory where the raw data is stored.

property raw_file_paths

The paths to raw files.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data). The raw data file should be organized as the format defined in self.parse_file() method.

This function calls self.parse_file() repeatedly, passing the file paths in self.raw_file_names one at a time.

This function builds self.data, which is a dict of {int (index): DataItem}, where the key represents the index of the DataItem w.r.t. the whole dataset.

This function also builds the self.split_ids dictionary, whose keys correspond to those of self.raw_file_names defined by the user, indicating the indices of each subset (e.g. train, val and test).

abstract vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

Task Level Dataset Base Class

class graph4nlp.data.dataset.Text2TextDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)

The dataset for text-to-text applications.

Parameters
  • graph_name (str) –

    The name of graph construction method. E.g., “dependency”. Note that if it is in the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

    1. topology_builder

    2. static_or_dynamic

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • root_dir (str, default=None) – The path of dataset.

  • topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.

  • topology_subdir (str) – The directory name of processed path.

  • static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’)

  • dynamic_init_graph_name (str, default=None) –

    The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is in the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameters are set by default and users can’t modify them:

    1. dynamic_init_topology_builder

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.

  • dynamic_init_topology_aux_args (None,) – TBD.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2TextDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For Text2TextDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )

DataItem:

input_text=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.Text2TreeDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)
The dataset for text-to-tree applications.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and convert it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

process_data_items

register_datapipe_as_function

register_function

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2TreeDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For Text2TreeDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )

DataItem: input_text=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

class graph4nlp.data.dataset.Text2LabelDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, **kwargs)

The dataset for text-to-label applications.

Parameters
  • graph_name (str) –

    The name of graph construction method. E.g., “dependency”. Note that if it is in the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

    1. topology_builder

    2. static_or_dynamic

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • root_dir (str, default=None) – The path of dataset.

  • topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.

  • topology_subdir (str) – The directory name of processed path.

  • static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’)

  • dynamic_init_graph_name (str, default=None) –

    The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is in the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameters are set by default and users can’t modify them:

    1. dynamic_init_topology_builder

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.

  • dynamic_init_topology_aux_args (None,) – TBD.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2LabelDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For Text2LabelDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: How far is it from Denver to Aspen ? NUM

DataItem: input_text=”How far is it from Denver to Aspen ?”, output_label=”NUM”

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.DoubleText2TextDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)

The dataset for double-text-to-text applications.

Parameters
  • graph_name (str) –

    The name of graph construction method. E.g., “dependency”. Note that if it is in the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

    1. topology_builder

    2. static_or_dynamic

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • root_dir (str, default=None) – The path of dataset.

  • topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.

  • topology_subdir (str) – The directory name of processed path.

  • static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’)

  • dynamic_init_graph_name (str, default=None) –

    The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is in the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameters are set by default and users can’t modify them:

    1. dynamic_init_topology_builder

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.

  • dynamic_init_topology_aux_args (None,) – TBD.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.DoubleText2TextDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For DoubleText2TextDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t"). # TODO: update example

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )

DataItem:

input_text=”list job use languageid0”, input_text2=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.SequenceLabelingDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, tag_types: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, **kwargs)
The dataset for sequence labeling applications.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.SequenceLabelingDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset. For SequenceLabelingDataset, the input file should contain lines of tokens, each line representing one record, with the token in the first column and its tag in the last column.

Examples

EU I-ORG
rejects O
German I-MISC

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.KGCompletionDataset(root_dir: str = None, topology_builder=None, topology_subdir: str = None, **kwargs)
The dataset for knowledge graph completion applications.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_topology(data_items)

Build graph topology for each item in the dataset. The generated graph is bound to the graph attribute of the DataItem.

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.KGCompletionDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For KGCompletionDataset, the format of the input file should contain lines of input, each line representing one record of data.

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: {“e1”: “person100”, “e2”: “None”, “rel”: “term6”, “rel_eval”: “None”, “e2_multi1”: “person90 person80 person59 person82 person63 person77 person85 person83 person56”, “e2_multi2”: “None”}

DataItem: e1=”person100”, e2=”None”, rel=”term6”, …

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data). The raw data file should be organized as the format defined in self.parse_file() method.

This function calls self.parse_file() repeatedly, passing the file paths in self.raw_file_names one at a time.

This function builds self.data, which is a dict of {int (index): DataItem}, where the key represents the index of the DataItem w.r.t. the whole dataset.

This function also builds the self.split_ids dictionary, whose keys correspond to those of self.raw_file_names defined by the user, indicating the indices of each subset (e.g. train, val and test).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.
