graph4nlp.data

The Graph4NLP library uses the class GraphData as the representation for structured data (graphs). GraphData supports basic graph operations, such as adding nodes and edges. GraphData also supports attaching features, which are in tensor form, and attributes, which are of arbitrary form, to the corresponding nodes or edges. Batching operations are also supported by GraphData.
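
A minimal sketch of this workflow is shown below. The import path follows the fully qualified class names used in this reference; the feature name 'x' and the attribute key 'token' are illustrative choices.

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
# Tensor-typed data goes into node_features; arbitrary data goes into node_attributes.
>>> g.node_features['x'] = torch.rand((3, 16))
>>> g.node_attributes[0]['token'] = 'hello'
>>> g.get_node_num(), g.get_edge_num()
(3, 2)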

Graph Representation

class graph4nlp.data.data.GraphData(src=None, device: str = None)

Represent a single graph with additional attributes.

Attributes
batch_edge_features

Edge version of self.batch_node_features

batch_node_features

Get a view of the batched (padded) version of the node features.

edge_attributes

Get the edge attributes in a list.

edge_features

Get all the edge features in a dictionary.

edges

Return an edge view of the edges and the corresponding data

node_attributes

Access node attribute dictionary

node_features

Access and modify node feature vectors (tensor).

nodes

Return a node view through which the user can access the features and attributes.

split_edge_features
split_node_features

Methods

add_edge(src, tgt)

Add one edge to the graph.

add_edges(src, tgt)

Add a bunch of edges to the graph.

add_nodes(node_num)

Add a number of nodes to the graph.

adj_matrix([batch_view, post_processing_fn])

Returns the adjacency matrix of the graph.

copy_batch_info(batch)

Copy all the information related to the batching.

edge_ids(src, tgt)

Convert the given endpoints to edge indices.

from_dense_adj(adj)

Construct a graph from a dense (2-D NxN) adjacency matrix with the edge weights represented by the value of the matrix entries.

from_dgl(dgl_g)

Build the graph from dgl.DGLGraph

from_graphdata(src)

Build a clone from a source GraphData

from_scipy_sparse_matrix(adj)

Construct a graph from a sparse adjacency matrix with the edge weights represented by the value of the matrix entries.

get_all_edges()

Get all the edges in the graph

get_edge_feature(edges)

Get the feature of the given edges.

get_edge_feature_names()

Get all the names of edge features

get_edge_num()

Get the number of edges in the graph

get_node_attrs(nodes)

Get the attributes of the given nodes.

get_node_features(nodes)

Get the node feature dictionary of the nodes

get_node_num()

Get the number of nodes in the graph.

node_feature_names()

Get the names of node features.

remove_all_edges()

Remove all the edges and the corresponding features and attributes in GraphData.

set_edge_feature(edges, new_data)

Set edge feature

set_node_features(nodes, new_data)

Set the features of the nodes with the given new_data.

sparse_adj([batch_view])

Return the scipy.sparse.coo_matrix form of the adjacency matrix.

split_features(input_tensor[, type])

Convert a tensor from [N, *] to [B, N_max, *] with zero padding according to the batch information stored in the graph.

to(device)

Move the GraphData object to different devices (cpu, gpu, etc.).

to_dgl()

Convert to dgl.DGLGraph. Note that there will be some information loss when calling this function.

add_edge(src: int, tgt: int)

Add one edge to the graph.

Parameters
src: int

Source node index

tgt: int

Target node index

Raises
ValueError

If one of the endpoints of the edge doesn’t exist in the graph.

add_edges(src: Union[int, List[int]], tgt: Union[int, List[int]]) → None

Add a bunch of edges to the graph.

Parameters
src: int or list

Source node indices

tgt: int or list

Target node indices

Raises
ValueError

If the lengths of src and tgt don’t match or one of the lists is empty.

add_nodes(node_num: int)

Add a number of nodes to the graph.

Parameters

node_num (int) – The number of nodes to be added

adj_matrix(batch_view: bool = False, post_processing_fn: Callable = None) → torch.Tensor

Returns the adjacency matrix of the graph. Returns a 2D tensor if it is a single graph and a 3D tensor if it is a batched graph, with the matrices padded with 0 (B x N x N)

Parameters
batch_view: bool

Whether to return a batched view of the adjacency matrix (3D if True) or not (2D).

post_processing_fn: function

A callback function which takes a binary adjacency matrix (2D) and does some post-processing on it. The return of this function should also be an N x N matrix.

Returns
torch.Tensor:

The adjacency matrix (N x N if batch_view=False and B x N x N if batch_view=True).
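
Examples

A small sketch, assuming the import path used elsewhere in this reference (only the shape is checked here, since the entries depend on which edges exist):

>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
# A single (non-batched) graph yields a 2D N x N tensor.
>>> g.adj_matrix().shape
torch.Size([3, 3])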

property batch_edge_features

Edge version of self.batch_node_features

Returns
BatchEdgeFeatView
property batch_node_features

Get a view of the batched (padded) version of the node features. Shape: (B, N, D).

Returns
BatchNodeFeatView
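
Examples

A sketch of the padded view, using to_batch (documented later in this module) to build a batched graph:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2)
>>> g2.add_nodes(3)
>>> g1.node_features['x'] = torch.rand((2, 8))
>>> g2.node_features['x'] = torch.rand((3, 8))
>>> batch = to_batch([g1, g2])
# B = 2 graphs, N = 3 (largest node count), D = 8; shorter graphs are zero-padded.
>>> batch.batch_node_features['x'].shape
torch.Size([2, 3, 8])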

copy_batch_info(batch: Any) → None

Copy all the information related to the batching.

Parameters
batch: Any

The source batch from which the information comes.

Returns
None
property edge_attributes

Get the edge attributes in a list.

Returns
list

A list of dictionaries. Each dictionary represents all the attributes on the corresponding edge.

property edge_features

Get all the edge features in a dictionary.

Returns
dict

Edge features, with the keys being the feature names and the values being the corresponding tensors.

edge_ids(src: Union[int, List[int]], tgt: Union[int, List[int]]) → List[Any]

Convert the given endpoints to edge indices.

Parameters
src: int or list

The index of source node(s).

tgt: int or list

The index of target node(s).

Returns
list

The index of corresponding edges.

Raises
TypeError

If the parameters are of wrong types.

EdgeNotFoundException

If the edge is not in the graph.
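
Examples

A small sketch, assuming edges are indexed in insertion order:

>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
# The edge 0 -> 1 was added first, so it has index 0.
>>> g.edge_ids(0, 1)
[0]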

property edges

Return an edge view of the edges and the corresponding data

Returns
edges: EdgeView
from_dense_adj(adj: torch.Tensor)

Construct a graph from a dense (2-D NxN) adjacency matrix with the edge weights represented by the value of the matrix entries.

Parameters
adj: torch.Tensor

The tensor representing the adjacency matrix.

Returns
self
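
Examples

A sketch: each nonzero entry (i, j) of the matrix becomes an edge i -> j whose weight is the entry's value:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> adj = torch.tensor([[0.0, 0.5],
...                     [0.0, 0.0]])
>>> g = GraphData().from_dense_adj(adj)
>>> g.get_all_edges()
[(0, 1)]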
from_dgl(dgl_g: dgl.heterograph.DGLHeteroGraph)

Build the graph from dgl.DGLGraph

Parameters
dgl_g: dgl.DGLGraph

The source graph

from_graphdata(src: Any)

Build a clone from a source GraphData

from_scipy_sparse_matrix(adj: scipy.sparse.coo.coo_matrix)

Construct a graph from a sparse adjacency matrix with the edge weights represented by the value of the matrix entries.

Parameters
adj: scipy.sparse.coo_matrix

The object representing the sparse adjacency matrix.

Returns
self
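
Examples

The sparse counterpart of the from_dense_adj example above:

>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> from graph4nlp.pytorch.data.data import GraphData
# A 2 x 2 sparse matrix with a single nonzero entry at (0, 1).
>>> adj = coo_matrix((np.array([0.5]), (np.array([0]), np.array([1]))), shape=(2, 2))
>>> g = GraphData().from_scipy_sparse_matrix(adj)
>>> g.get_all_edges()
[(0, 1)]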
get_all_edges() → List[Tuple[int, int]]

Get all the edges in the graph

Returns
edges: list

List of edges. Each edge is in the shape of the endpoint tuple (src, dst).

get_edge_feature(edges: List[int]) → Dict[str, torch.Tensor]

Get the feature of the given edges.

Parameters
edges: list

Edge indices

Returns
dict

The dictionary containing all relevant features.

get_edge_feature_names()

Get all the names of edge features

get_edge_num() → int

Get the number of edges in the graph

Returns
num_edges: int

The number of edges

get_node_attrs(nodes: Union[int, slice]) → List[Any]

Get the attributes of the given nodes.

Parameters
nodes: int or slice

The given node index

Returns
list

The list of attribute dictionaries of the given nodes.

get_node_features(nodes: Union[int, slice]) → Dict[str, torch.Tensor]

Get the node feature dictionary of the nodes

Parameters
nodes: int or slice

The nodes to be accessed

Returns
node_features: dict

The reference dict of the actual tensor

get_node_num() → int

Get the number of nodes in the graph.

Returns
num_nodes: int

The number of nodes in the graph.

property node_attributes

Access node attribute dictionary

Returns
node_attributes: list

The list of node attributes

node_feature_names() → List[str]

Get the names of node features.

Returns
List[str]

The collection of feature names.

property node_features

Access and modify node feature vectors (tensor). This property can be accessed in a dict-of-dict fashion, with the order being [name][index]. ‘name’ indicates the name of the feature vector. ‘index’ selects the specific nodes to be accessed. When accessed independently, returns the feature dictionary with the format {name: tensor}.

Returns
NodeFeatView

Examples

>>> g = GraphData()
>>> g.add_nodes(10)
>>> import torch
>>> g.node_features['x'] = torch.rand((10, 10))
>>> g.node_features['x'][0]
tensor([0.1036, 0.6757, 0.4702, 0.8938, 0.6337, 0.3290,
        0.6739, 0.1091, 0.7996, 0.0586])
property nodes

Return a node view through which the user can access the features and attributes.

A NodeView object provides a high-level view of the underlying storage of the features and supports both query and modification to the original storage.

Returns
node: NodeView

The node view

remove_all_edges()

Remove all the edges and the corresponding features and attributes in GraphData.

Returns
None

Examples

>>> g = GraphData()
>>> g.add_nodes(10)
>>> g.add_edges(list(range(0, 9, 1)), list(range(1, 10, 1)))
>>> import torch
# Add some feature tensors to the edges
>>> g.edge_features['random'] = torch.rand((9, 1024, 1024))
# Remove all edges and the corresponding data. The tensor memory is freed now.
>>> g.remove_all_edges()
set_edge_feature(edges: Union[int, slice, List[int]], new_data: Dict[str, torch.Tensor])

Set edge feature

Parameters
edges: int or list or slice

Edge indices

new_data: dict

New data

Raises
SizeMismatchException

If the size of the new features does not match the number of edges

set_node_features(nodes: Union[int, slice], new_data: Dict[str, torch.Tensor]) → None

Set the features of the nodes with the given new_data.

Parameters
nodes: int or slice

The nodes involved

new_data: dict

The new data to write. Key indicates feature name and value indicates the actual value.

Raises
SizeMismatchException

If the size of the new features does not match the node number
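
Examples

A sketch of overwriting a slice of an existing feature:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(4)
>>> g.node_features['x'] = torch.zeros((4, 8))
# Overwrite the rows of nodes 0 and 1 only.
>>> g.set_node_features(slice(0, 2), {'x': torch.ones((2, 8))})
>>> g.node_features['x'][0]
tensor([1., 1., 1., 1., 1., 1., 1., 1.])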

sparse_adj(batch_view: bool = False) → Union[torch.Tensor, List[torch.Tensor]]

Return the scipy.sparse.coo_matrix form of the adjacency matrix.

Parameters
batch_view: bool

Whether to return the split view of the adjacency matrix. Returns a list of COO matrices if True.

Returns
torch.Tensor or list of torch.Tensor
split_features(input_tensor: torch.Tensor, type: str = 'node') → torch.Tensor

Convert a tensor from [N, *] to [B, N_max, *] with zero padding according to the batch information stored in the graph.

Parameters
input_tensor: torch.Tensor

The original tensor to be split.

type: str

‘node’ or ‘edge’. Indicates the source of batch information.

Returns
torch.Tensor

The split tensor.
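
Examples

A sketch using to_batch (documented below) to provide the batch information:

>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2)
>>> g2.add_nodes(3)
>>> batch = to_batch([g1, g2])
# A flat tensor with one row per node (5 in total) becomes (B, N_max, *) with zero padding.
>>> flat = torch.rand((5, 8))
>>> batch.split_features(flat, type='node').shape
torch.Size([2, 3, 8])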
to(device: str)

Move the GraphData object to different devices (cpu, gpu, etc.). The usage of this method is similar to that of torch.Tensor and dgl.DGLGraph.

Parameters
device: str

The target device.

Returns
self
to_dgl() → dgl.heterograph.DGLHeteroGraph

Convert to dgl.DGLGraph. Note that there will be some information loss when calling this function; e.g., the batch-related information will not be copied to the DGLGraph, since it is only intended for computation.

Returns
g: dgl.DGLGraph

The converted dgl.DGLGraph

graph4nlp.data.data.to_batch(graphs: List[graph4nlp.pytorch.data.data.GraphData] = None) → graph4nlp.pytorch.data.data.GraphData

Convert a list of GraphData to a large graph (a batch).

Parameters
graphs: list of GraphData

The list of GraphData to be batched

Returns
GraphData

The large graph containing all the graphs in the batch.
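
Examples

>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2)
>>> g2.add_nodes(3)
>>> batch = to_batch([g1, g2])
# The result is itself a GraphData containing all the nodes of its members.
>>> batch.get_node_num()
5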

Base Dataset Class

class graph4nlp.data.dataset.Dataset(root, topology_builder, topology_subdir, tokenizer=<function word_tokenize>, lower_case=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, target_pretrained_word_emb_name=None, target_pretrained_word_emb_url=None, pretrained_word_emb_cache_dir='.vector_cache/', max_word_vocab_size=None, min_word_vocab_freq=1, use_val_for_vocab=False, seed=1234, thread_number=4, port=9000, timeout=15000, for_inference=False, reused_vocab_model=None, **kwargs)

Base class for datasets.

The dataset is organized in a two-layer index style. Direct access to the dataset object, e.g. Dataset[1], is first resolved through the internal index list, and the resulting index is then used to access the actual data. This design makes sampling easy.

Parameters
root: str

The root directory path where the dataset is stored.

Examples

Suppose we have a Dataset containing 5 data items [‘a’, ‘b’, ‘c’, ‘d’, ‘e’]. The indices of the 5 elements in the list are correspondingly [0, 1, 2, 3, 4]. Suppose the dataset is shuffled, which shuffles the internal index list so that it becomes [2, 3, 1, 4, 0]. An access to the dataset, Dataset[2], will then first look up indices[2], which is 1, and use that index to access the actual data, returning the data item ‘b’. From the user’s perspective, the 3rd ([2]) element of the dataset has been shuffled and is no longer ‘c’.
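
The following plain-Python sketch illustrates the same two-layer lookup; it is an analogy, not the library's actual implementation:

>>> data = ['a', 'b', 'c', 'd', 'e']
>>> indices = [2, 3, 1, 4, 0]  # the internal index list after shuffling
>>> def getitem(i):
...     return data[indices[i]]
...
>>> getitem(2)
'b'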

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

To be implemented in task-specific dataset base class.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_topology(data_items)

Build graph topology for each item in the dataset. The generated graph is bound to the graph attribute of the DataItem.

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

abstract static collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

abstract download()

Download the raw data from the Internet.

abstract parse_file(file_path)

To be implemented in task-specific dataset base class.

property raw_dir

The directory where the raw data is stored.

property raw_file_paths

The paths to raw files.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data). The raw data file should be organized as the format defined in self.parse_file() method.

This function calls self.parse_file() repeatedly, passing the file paths in self.raw_file_names one at a time.

This function builds self.data, which is a dict of {int (index): DataItem}, where the key represents the index of the DataItem w.r.t. the whole dataset.

This function also builds the self.split_ids dictionary, whose keys correspond to those of self.raw_file_names defined by the user, indicating the indices of each subset (e.g. train, val and test).

abstract vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

Task Level Dataset Base Class

class graph4nlp.data.dataset.Text2TextDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)

The dataset for text-to-text applications.

Parameters
  • graph_name (str) –

    The name of graph construction method. E.g., “dependency”. Note that if it is in the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

    1. topology_builder

    2. static_or_dynamic

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • root_dir (str, default=None) – The path of dataset.

  • topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.

  • topology_subdir (str) – The directory name of processed path.

  • static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’)

  • dynamic_init_graph_name (str, default=None) –

    The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is in the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameters are set by default and users can’t modify them:

    1. dynamic_init_topology_builder

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.

  • dynamic_init_topology_aux_args (None,) – TBD.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2TextDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For Text2TextDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )

DataItem:

input_text=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.Text2TreeDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)
The dataset for text-to-tree applications.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and convert it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

process_data_items

register_datapipe_as_function

register_function

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2TreeDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For Text2TreeDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )

DataItem: input_text=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

class graph4nlp.data.dataset.Text2LabelDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, **kwargs)

The dataset for text-to-label applications.

Parameters
  • graph_name (str) –

    The name of graph construction method. E.g., “dependency”. Note that if it is in the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

    1. topology_builder

    2. static_or_dynamic

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • root_dir (str, default=None) – The path of dataset.

  • topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.

  • topology_subdir (str) – The directory name of processed path.

  • static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’)

  • dynamic_init_graph_name (str, default=None) –

    The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is in the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameters are set by default and users can’t modify them:

    1. dynamic_init_topology_builder

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.

  • dynamic_init_topology_aux_args (None,) – TBD.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2LabelDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For Text2LabelDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: How far is it from Denver to Aspen ? NUM

DataItem: input_text=”How far is it from Denver to Aspen ?”, output_label=”NUM”

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.DoubleText2TextDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)

The dataset for double-text-to-text applications.

Parameters
  • graph_name (str) –

    The name of graph construction method. E.g., “dependency”. Note that if it is in the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

    1. topology_builder

    2. static_or_dynamic

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • root_dir (str, default=None) – The path of dataset.

  • topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.

  • topology_subdir (str) – The directory name of processed path.

  • static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’)

  • dynamic_init_graph_name (str, default=None) –

    The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is in the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameters are set by default and users can’t modify them:

    1. dynamic_init_topology_builder

    If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.

  • dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.

  • dynamic_init_topology_aux_args (None,) – TBD.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.DoubleText2TextDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For DoubleText2TextDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t"). # TODO: update example

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )

DataItem:

input_text=”list job use languageid0”, input_text2=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.SequenceLabelingDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, tag_types: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, **kwargs)
The dataset for sequence labeling applications.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.SequenceLabelingDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset. For SequenceLabelingDataset, the input file should contain lines of tokens, each line representing one record, with the token in the first column and its tag in the last column.

Examples

EU I-ORG
rejects O
German I-MISC

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

class graph4nlp.data.dataset.KGCompletionDataset(root_dir: str = None, topology_builder=None, topology_subdir: str = None, **kwargs)
The dataset for knowledge graph completion applications.

Attributes
processed_dir
processed_file_names
processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names
raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Takes a list of data and converts it to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

build_topology(data_items)

Build graph topology for each item in the dataset. The generated graph is bound to the graph attribute of the DataItem.

build_vocab()

Build the vocabulary. If self.use_val_for_vocab is True, use both training set and validation set for building the vocabulary. Otherwise only the training set is used.

static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.KGCompletionDataItem'>])

Takes a list of data and converts it to a batch of data.

parse_file(file_path) → list

Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.

For KGCompletionDataset, the format of the input file should contain lines of input, each line representing one record of data.

Parameters
file_path: str

The path of the input file.

Returns
list

The indices of data items in the file w.r.t. the whole dataset.

Examples

input: {“e1”: “person100”, “e2”: “None”, “rel”: “term6”, “rel_eval”: “None”, “e2_multi1”: “person90 person80 person59 person82 person63 person77 person85 person83 person56”, “e2_multi2”: “None”}

DataItem: e1=”person100”, e2=”None”, rel=”term6”, …

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data). The raw data file should be organized as the format defined in self.parse_file() method.

This function calls self.parse_file() repeatedly, passing the file paths in self.raw_file_names one at a time.

This function builds self.data, which is a dict of {int (index): DataItem}, where the key represents the index of the DataItem w.r.t. the whole dataset.

This function also builds the self.split_ids dictionary, whose keys correspond to those of self.raw_file_names defined by the user, indicating the indices of each subset (e.g. train, val and test).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.
