graph4nlp.data¶
The Graph4NLP library uses the class GraphData as the representation for structured data (graphs). GraphData supports basic graph operations, including adding nodes and edges. It also supports attaching features (in tensor form) and attributes (of arbitrary form) to the corresponding nodes or edges. Batching operations are also supported by GraphData.
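For example, a minimal usage sketch (the feature name 'node_feat' and the attribute key 'token' are illustrative, not fixed API names):
>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)                           # nodes 0, 1, 2
>>> g.add_edges([0, 1], [1, 2])              # edges 0 -> 1 and 1 -> 2
>>> g.node_features['node_feat'] = torch.rand(3, 16)   # tensor-form features
>>> g.node_attributes[0]['token'] = 'hello'  # arbitrary attribute; assumes each node's attributes form a dict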
Graph Representation¶
- class graph4nlp.data.data.GraphData(src=None, device: str = None)¶
Represent a single graph with additional attributes.
- Attributes
batch_edge_features: Edge version of self.batch_node_features.
batch_node_features: Get a view of the batched (padded) version of the node features.
edge_attributes: Get the edge attributes in a list.
edge_features: Get all the edge features in a dictionary.
edges: Return an edge view of the edges and the corresponding data.
node_attributes: Access the node attribute dictionaries.
node_features: Access and modify node feature vectors (tensor).
nodes: Return a node view through which the user can access the features and attributes.
split_edge_features
split_node_features
Methods
add_edge(src, tgt): Add one edge to the graph.
add_edges(src, tgt): Add a bunch of edges to the graph.
add_nodes(node_num): Add a number of nodes to the graph.
adj_matrix([batch_view, post_processing_fn]): Return the adjacency matrix of the graph.
copy_batch_info(batch): Copy all the information related to batching.
edge_ids(src, tgt): Convert the given endpoints to edge indices.
from_dense_adj(adj): Construct a graph from a dense (2D, N x N) adjacency matrix, with the edge weights given by the values of the matrix entries.
from_dgl(dgl_g): Build the graph from a dgl.DGLGraph.
from_graphdata(src): Build a clone from a source GraphData.
from_scipy_sparse_matrix(adj): Construct a graph from a sparse adjacency matrix, with the edge weights given by the values of the matrix entries.
get_all_edges(): Get all the edges in the graph.
get_edge_feature(edges): Get the features of the given edges.
get_edge_feature_names(): Get all the names of edge features.
get_edge_num(): Get the number of edges in the graph.
get_node_attrs(nodes): Get the attributes of the given nodes.
get_node_features(nodes): Get the node feature dictionary of the given nodes.
get_node_num(): Get the number of nodes in the graph.
node_feature_names(): Get the names of node features.
remove_all_edges(): Remove all the edges and the corresponding features and attributes in GraphData.
set_edge_feature(edges, new_data): Set edge features.
set_node_features(nodes, new_data): Set the features of the nodes with the given new_data.
sparse_adj([batch_view]): Return the scipy.sparse.coo_matrix form of the adjacency matrix.
split_features(input_tensor[, type]): Convert a tensor from [N, *] to [B, N_max, *] with zero padding according to the batch information stored in the graph.
to(device): Move the GraphData object to a different device (cpu, gpu, etc.).
to_dgl(): Convert to dgl.DGLGraph; note that some information (e.g. batch-related information) is lost in the conversion.
- add_edge(src: int, tgt: int)¶
Add one edge to the graph.
- Parameters
- src: int
Source node index.
- tgt: int
Target node index.
- Raises
- ValueError
If one of the endpoints of the edge doesn’t exist in the graph.
- add_edges(src: Union[int, List[int]], tgt: Union[int, List[int]]) → None¶
Add a bunch of edges to the graph.
- Parameters
- src: int or list
Source node indices.
- tgt: int or list
Target node indices.
- Raises
- ValueError
If the lengths of src and tgt don’t match, or if one of the lists is empty.
- add_nodes(node_num: int)¶
Add a number of nodes to the graph.
- Parameters
node_num (int) – The number of nodes to be added.
- adj_matrix(batch_view: bool = False, post_processing_fn: Callable = None) → torch.Tensor¶
Return the adjacency matrix of the graph: a 2D tensor for a single graph, or a 3D tensor (B x N x N, padded with 0) for a batched graph.
- Parameters
- batch_view: bool
Whether to return a batched view of the adjacency matrix (3D, True) or not (2D, False).
- post_processing_fn: function
A callback function which takes a binary adjacency matrix (2D) and does some post-processing on it. The return of this function should also be N x N.
- Returns
- torch.Tensor
The adjacency matrix (N x N if batch_view=False, B x N x N if batch_view=True).
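A short sketch of typical use (the symmetrizing callback is illustrative):
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
>>> adj = g.adj_matrix()                # 3 x 3 tensor marking edges (0, 1) and (1, 2)
>>> adj_sym = g.adj_matrix(post_processing_fn=lambda a: a + a.t())  # still N x N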
- property batch_edge_features¶
Edge version of self.batch_node_features.
- Returns
- BatchEdgeFeatView
- property batch_node_features¶
Get a view of the batched (padded) version of the node features. Shape: (B, N, D).
- Returns
- BatchNodeFeatView
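A sketch of how this view relates to per-graph features (sizes are illustrative; this assumes the view is indexed by feature name, like node_features):
>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(2); g2.add_nodes(4)
>>> g1.node_features['x'] = torch.rand(2, 8)
>>> g2.node_features['x'] = torch.rand(4, 8)
>>> batch = to_batch([g1, g2])
>>> batch.batch_node_features['x'].shape   # (B, N_max, D), zero-padded
torch.Size([2, 4, 8])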
- copy_batch_info(batch: Any) → None¶
Copy all the information related to batching.
- Parameters
- batch: Any
The source batch from which the information comes.
- Returns
- None
- property edge_attributes¶
Get the edge attributes in a list.
- Returns
- list
A list of dictionaries, each representing all the attributes on the corresponding edge.
- property edge_features¶
Get all the edge features in a dictionary.
- Returns
- dict
Edge features, with the keys being the feature names and the values being the corresponding tensors.
- edge_ids(src: Union[int, List[int]], tgt: int) → List[Any]¶
Convert the given endpoints to edge indices.
- Parameters
- src: int or list
The index of source node(s).
- tgt: int or list
The index of target node(s).
- Returns
- list
The indices of the corresponding edges.
- Raises
- TypeError
If the parameters are of wrong types.
- EdgeNotFoundException
If the edge is not in the graph.
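For example (this assumes edges are indexed in insertion order, so the first edge added has index 0):
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
>>> g.edge_ids(0, 1)   # index of the edge 0 -> 1
[0]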
- property edges¶
Return an edge view of the edges and the corresponding data.
- Returns
- edges: EdgeView
- from_dense_adj(adj: torch.Tensor)¶
Construct a graph from a dense (2D, N x N) adjacency matrix, with the edge weights given by the values of the matrix entries.
- Parameters
- adj: torch.Tensor
The tensor representing the adjacency matrix.
- Returns
- self
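A minimal sketch (assuming the nodes are created from the matrix dimensions):
>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> adj = torch.tensor([[0.0, 0.5], [0.0, 0.0]])  # one weighted edge 0 -> 1
>>> g = GraphData().from_dense_adj(adj)           # returns self; weight 0.5 comes from the matrix entry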
- from_dgl(dgl_g: dgl.heterograph.DGLHeteroGraph)¶
Build the graph from a dgl.DGLGraph.
- Parameters
- dgl_g: dgl.DGLGraph
The source graph.
- from_graphdata(src: Any)¶
Build a clone from a source GraphData.
- from_scipy_sparse_matrix(adj: scipy.sparse.coo.coo_matrix)¶
Construct a graph from a sparse adjacency matrix, with the edge weights given by the values of the matrix entries.
- Parameters
- adj: scipy.sparse.coo_matrix
The object representing the sparse adjacency matrix.
- Returns
- self
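A minimal sketch:
>>> import scipy.sparse as sp
>>> from graph4nlp.pytorch.data.data import GraphData
>>> adj = sp.coo_matrix(([0.5, 1.0], ([0, 1], [1, 2])), shape=(3, 3))  # edges 0 -> 1 and 1 -> 2
>>> g = GraphData().from_scipy_sparse_matrix(adj)  # returns self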
- get_all_edges() → List[Tuple[int, int]]¶
Get all the edges in the graph.
- Returns
- edges: list
List of edges. Each edge is an endpoint tuple (src, dst).
- get_edge_feature(edges: List[int]) → Dict[str, torch.Tensor]¶
Get the features of the given edges.
- Parameters
- edges: list
Edge indices.
- Returns
- dict
The dictionary containing all relevant features.
- get_edge_feature_names()¶
Get all the names of edge features.
- get_edge_num() → int¶
Get the number of edges in the graph.
- Returns
- num_edges: int
The number of edges.
- get_node_attrs(nodes: Union[int, slice]) → List[Any]¶
Get the attributes of the given nodes.
- Parameters
- nodes: int or slice
The given node indices.
- Returns
- list
The node attribute dictionaries.
- get_node_features(nodes: Union[int, slice]) → Dict[str, torch.Tensor]¶
Get the node feature dictionary of the given nodes.
- Parameters
- nodes: int or slice
The nodes to be accessed.
- Returns
- node_features: dict
The reference dict of the actual tensors.
- get_node_num() → int¶
Get the number of nodes in the graph.
- Returns
- num_nodes: int
The number of nodes in the graph.
- property node_attributes¶
Access the node attribute dictionaries.
- Returns
- node_attributes: list
The list of node attributes.
- node_feature_names() → List[str]¶
Get the names of node features.
- Returns
- List[str]
The collection of feature names.
- property node_features¶
Access and modify node feature vectors (tensor). This property can be accessed in a dict-of-dict fashion, with the order being [name][index]. ‘name’ indicates the name of the feature vector; ‘index’ selects the specific nodes to be accessed. When accessed independently, it returns the feature dictionary in the format {name: tensor}.
- Returns
- NodeFeatView
Examples
>>> g = GraphData()
>>> g.add_nodes(10)
>>> import torch
>>> g.node_features['x'] = torch.rand((10, 10))
>>> g.node_features['x'][0]
torch.Tensor([0.1036, 0.6757, 0.4702, 0.8938, 0.6337, 0.3290, 0.6739, 0.1091, 0.7996, 0.0586])
- property nodes¶
Return a node view through which the user can access the features and attributes.
A NodeView object provides a high-level view of the underlying storage of the features and supports both query and modification of the original storage.
- Returns
- node: NodeView
The node view.
- remove_all_edges()¶
Remove all the edges and the corresponding features and attributes in GraphData.
- Returns
- None
Examples
>>> g = GraphData()
>>> g.add_nodes(10)
>>> g.add_edges(list(range(0, 9, 1)), list(range(1, 10, 1)))
>>> # Add some feature tensors to the edges
>>> g.edge_features['random'] = torch.rand((9, 1024, 1024))
>>> # Remove all edges and the corresponding data. The tensor memory is freed now.
>>> g.remove_all_edges()
- set_edge_feature(edges: Union[int, slice, List[int]], new_data: Dict[str, torch.Tensor])¶
Set edge features.
- Parameters
- edges: int or list or slice
Edge indices.
- new_data: dict
New data.
- Raises
- SizeMismatchException
If the size of the new features does not match the edge number.
- set_node_features(nodes: Union[int, slice], new_data: Dict[str, torch.Tensor]) → None¶
Set the features of the nodes with the given new_data.
- Parameters
- nodes: int or slice
The nodes involved.
- new_data: dict
The new data to write. Keys indicate feature names and values indicate the actual values.
- Raises
- SizeMismatchException
If the size of the new features does not match the node number.
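For example (the feature name 'x' is illustrative):
>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(10)
>>> g.set_node_features(slice(0, 10), {'x': torch.zeros(10, 4)})  # write features for all 10 nodes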
- sparse_adj(batch_view: bool = False) → Union[torch.Tensor, List[torch.Tensor]]¶
Return the scipy.sparse.coo_matrix form of the adjacency matrix.
- Parameters
- batch_view: bool
Whether to return the split view of the adjacency matrix. Return a list of COO matrices if True.
- Returns
- torch.Tensor or list of torch.Tensor
- split_features(input_tensor: torch.Tensor, type: str = 'node') → torch.Tensor¶
Convert a tensor from [N, *] to [B, N_max, *] with zero padding according to the batch information stored in the graph.
- Parameters
- input_tensor: torch.Tensor
The original tensor to be split.
- type: str
‘node’ or ‘edge’. Indicates the source of the batch information.
- Returns
- torch.Tensor
The split tensor.
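A sketch on a batched graph (sizes are illustrative):
>>> import torch
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> g1, g2 = GraphData(), GraphData()
>>> g1.add_nodes(3); g2.add_nodes(5)
>>> batch = to_batch([g1, g2])
>>> flat = torch.rand(8, 16)                       # [N, *] with N = 3 + 5
>>> batch.split_features(flat, type='node').shape  # [B, N_max, *], zero-padded
torch.Size([2, 5, 16])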
- to(device: str)¶
Move the GraphData object to a different device (cpu, gpu, etc.). The usage of this method is similar to that of torch.Tensor and dgl.DGLGraph.
- Parameters
- device: str
The target device.
- Returns
- self
- to_dgl() → dgl.heterograph.DGLHeteroGraph¶
Convert to dgl.DGLGraph. Note that there will be some information loss when calling this function: e.g., the batch-related information will not be copied to the DGLGraph, since it is only intended for computation.
- Returns
- g: dgl.DGLGraph
The converted dgl.DGLGraph.
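A round-trip sketch:
>>> from graph4nlp.pytorch.data.data import GraphData
>>> g = GraphData()
>>> g.add_nodes(3)
>>> g.add_edges([0, 1], [1, 2])
>>> dgl_g = g.to_dgl()    # dgl.DGLGraph for computation
>>> g2 = GraphData()
>>> g2.from_dgl(dgl_g)    # build a GraphData back from the DGLGraph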
- graph4nlp.data.data.to_batch(graphs: List[graph4nlp.pytorch.data.data.GraphData] = None) → graph4nlp.pytorch.data.data.GraphData¶
Convert a list of GraphData to a large graph (a batch).
- Parameters
- graphs: list of GraphData
The list of GraphData to be batched.
- Returns
- GraphData
The large graph containing all the graphs in the batch.
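A minimal sketch:
>>> from graph4nlp.pytorch.data.data import GraphData, to_batch
>>> graphs = []
>>> for n in (2, 3, 4):
...     g = GraphData()
...     g.add_nodes(n)
...     graphs.append(g)
>>> batch = to_batch(graphs)
>>> batch.get_node_num()   # all nodes gathered in one large graph
9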
- class graph4nlp.data.dataset.Dataset(root, topology_builder, topology_subdir, tokenizer=<function word_tokenize>, lower_case=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, target_pretrained_word_emb_name=None, target_pretrained_word_emb_url=None, pretrained_word_emb_cache_dir='.vector_cache/', max_word_vocab_size=None, min_word_vocab_freq=1, use_val_for_vocab=False, seed=1234, thread_number=4, port=9000, timeout=15000, for_inference=False, reused_vocab_model=None, **kwargs)¶
Base class for datasets.
The dataset is organized in a two-layer index style. Direct access to the dataset object, e.g. Dataset[1], is first converted into an access to the internal index list, which is then used to access the actual data. This design makes sampling easy.
- Parameters
- root: str
The root directory path where the dataset is stored.
Examples
Suppose we have a Dataset containing 5 data items [‘a’, ‘b’, ‘c’, ‘d’, ‘e’], whose indices in the list are [0, 1, 2, 3, 4]. Suppose the dataset is shuffled, which shuffles the internal index list so that it becomes [2, 3, 1, 4, 0]. An access to Dataset[2] will first access indices[2], which is 1, and then use that index to access the actual data, returning the data item ‘b’. To the user, the 3rd ([2]) element of the dataset got shuffled and is no longer ‘c’.
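The indexing scheme can be sketched in plain Python (an illustration, not the library’s actual implementation):
>>> data = ['a', 'b', 'c', 'd', 'e']   # actual data items
>>> indices = [2, 3, 1, 4, 0]          # internal index list after shuffling
>>> data[indices[2]]                   # Dataset[2] goes through the index list
'b'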
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): To be implemented in the task-specific dataset base class.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
process_data_items
register_datapipe_as_function
register_function
- build_topology(data_items)¶
Build graph topology for each item in the dataset. The generated graph is bound to the graph attribute of the DataItem.
- build_vocab()¶
Build the vocabulary. If self.use_val_for_vocab is True, use both the training set and the validation set for building the vocabulary. Otherwise only the training set is used.
- abstract static collate_fn(data_list)¶
Take a list of data and convert it to a batch of data.
- abstract download()¶
Download the raw data from the Internet.
- abstract parse_file(file_path)¶
To be implemented in the task-specific dataset base class.
- property raw_dir¶
The directory where the raw data is stored.
- property raw_file_paths¶
The paths to raw files.
- read_raw_data()¶
Read raw data from the disk and put it in a dictionary (self.data). The raw data files should be organized in the format defined by the self.parse_file() method.
This function calls self.parse_file() repeatedly, passing the file paths in self.raw_file_names one at a time.
This function builds self.data, which is a dict of {int (index): DataItem}, where the index represents the position of the DataItem w.r.t. the whole dataset.
This function also builds the self.split_ids dictionary, whose keys correspond to those of self.raw_file_names defined by the user, indicating the indices of each subset (e.g. train, val and test).
- abstract vectorization(data_items)¶
Convert tokens to indices which can be processed by downstream models.
- class graph4nlp.data.dataset.Text2TextDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)¶
The dataset for text-to-text applications.
- Parameters
graph_name (str) – The name of the graph construction method, e.g. “dependency”. Note that if it is one of the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:
- topology_builder
- static_or_dynamic
If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.
root_dir (str, default=None) – The path of the dataset.
topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.
topology_subdir (str) – The directory name of the processed path.
static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’).
dynamic_init_graph_name (str, default=None) – The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is one of the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameter is set by default and users can’t modify it:
- dynamic_init_topology_builder
If you need to customize your graph construction method, you should rename the graph_name and set the parameter above.
dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.
dynamic_init_topology_aux_args (None,) – TBD.
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): Read and parse the file specified by file_path.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
process_data_items
register_datapipe_as_function
register_function
- static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2TextDataItem'>])¶
Take a list of data and convert it to a batch of data.
- parse_file(file_path) → list¶
Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.
For Text2TextDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").
- Parameters
- file_path: str
The path of the input file.
- Returns
- list
The indices of data items in the file w.r.t. the whole dataset.
Examples
input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )
DataItem: input_text=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”
- vectorization(data_items)¶
Convert tokens to indices which can be processed by downstream models.
- class graph4nlp.data.dataset.Text2TreeDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)¶
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): Read and parse the file specified by file_path.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): For the tree decoder we also need to vectorize the tree output.
process_data_items
register_datapipe_as_function
register_function
- build_vocab()¶
Build the vocabulary. If self.use_val_for_vocab is True, use both the training set and the validation set for building the vocabulary. Otherwise only the training set is used.
- static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2TreeDataItem'>])¶
Take a list of data and convert it to a batch of data.
- parse_file(file_path) → list¶
Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.
For Text2TreeDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").
- Parameters
- file_path: str
The path of the input file.
- Returns
- list
The indices of data items in the file w.r.t. the whole dataset.
Examples
input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )
DataItem: input_text=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”
- vectorization(data_items)¶
For the tree decoder we also need to vectorize the tree output.
- class graph4nlp.data.dataset.Text2LabelDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, **kwargs)¶
The dataset for text-to-label applications.
- Parameters
graph_name (str) – The name of the graph construction method, e.g. “dependency”. Note that if it is one of the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:
- topology_builder
- static_or_dynamic
If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.
root_dir (str, default=None) – The path of the dataset.
topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.
topology_subdir (str) – The directory name of the processed path.
static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’).
dynamic_init_graph_name (str, default=None) – The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is one of the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameter is set by default and users can’t modify it:
- dynamic_init_topology_builder
If you need to customize your graph construction method, you should rename the graph_name and set the parameter above.
dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.
dynamic_init_topology_aux_args (None,) – TBD.
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): Read and parse the file specified by file_path.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
process_data_items
register_datapipe_as_function
register_function
- build_vocab()¶
Build the vocabulary. If self.use_val_for_vocab is True, use both the training set and the validation set for building the vocabulary. Otherwise only the training set is used.
- static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.Text2LabelDataItem'>])¶
Take a list of data and convert it to a batch of data.
- parse_file(file_path) → list¶
Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.
For Text2LabelDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t").
- Parameters
- file_path: str
The path of the input file.
- Returns
- list
The indices of data items in the file w.r.t. the whole dataset.
Examples
input: How far is it from Denver to Aspen ? NUM
DataItem: input_text=”How far is it from Denver to Aspen ?”, output_label=”NUM”
- vectorization(data_items)¶
Convert tokens to indices which can be processed by downstream models.
- class graph4nlp.data.dataset.DoubleText2TextDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, dynamic_init_topology_aux_args=None, share_vocab=True, **kwargs)¶
The dataset for double-text-to-text applications.
- Parameters
graph_name (str) – The name of the graph construction method, e.g. “dependency”. Note that if it is one of the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:
- topology_builder
- static_or_dynamic
If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.
root_dir (str, default=None) – The path of the dataset.
topology_builder (Union[StaticGraphConstructionBase, DynamicGraphConstructionBase], default=None) – The graph construction class.
topology_subdir (str) – The directory name of the processed path.
static_or_dynamic (str, default='static') – The graph type. Expected in (‘static’, ‘dynamic’).
dynamic_init_graph_name (str, default=None) – The graph name of the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is one of the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameter is set by default and users can’t modify it:
- dynamic_init_topology_builder
If you need to customize your graph construction method, you should rename the graph_name and set the parameter above.
dynamic_init_topology_builder (StaticGraphConstructionBase) – The graph construction class.
dynamic_init_topology_aux_args (None,) – TBD.
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): Read and parse the file specified by file_path.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
process_data_items
register_datapipe_as_function
register_function
- static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.DoubleText2TextDataItem'>])¶
Take a list of data and convert it to a batch of data.
- parse_file(file_path) → list¶
Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.
For DoubleText2TextDataset, the input file should contain lines of input, each line representing one record of data. The input and output are separated by a tab ("\t"). # TODO: update example
- Parameters
- file_path: str
The path of the input file.
- Returns
- list
The indices of data items in the file w.r.t. the whole dataset.
Examples
input: list job use languageid0 job ( ANS ) , language ( ANS , languageid0 )
DataItem: input_text=”list job use languageid0”, input_text2=”list job use languageid0”, output_text=”job ( ANS ) , language ( ANS , languageid0 )”
- vectorization(data_items)¶
Convert tokens to indices which can be processed by downstream models.
- class graph4nlp.data.dataset.SequenceLabelingDataset(graph_name: str, root_dir: str = None, static_or_dynamic: str = None, topology_builder: Union[graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase, graph4nlp.pytorch.modules.graph_construction.base.DynamicGraphConstructionBase] = <class 'graph4nlp.pytorch.modules.graph_construction.dependency_graph_construction.DependencyBasedGraphConstruction'>, topology_subdir: str = None, tag_types: str = None, dynamic_init_graph_name: str = None, dynamic_init_topology_builder: graph4nlp.pytorch.modules.graph_construction.base.StaticGraphConstructionBase = None, **kwargs)¶
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): Read and parse the file specified by file_path.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
process_data_items
register_datapipe_as_function
register_function
- build_vocab()¶
Build the vocabulary. If self.use_val_for_vocab is True, use both the training set and the validation set for building the vocabulary. Otherwise only the training set is used.
- static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.SequenceLabelingDataItem'>])¶
Take a list of data and convert it to a batch of data.
- parse_file(file_path) → list¶
Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.
For SequenceLabelingDataset, the input file should contain lines of tokens, each line representing one record, with the token in the first column and its tag in the last column.
Examples
“EU I-ORG
rejects O
German I-MISC”
- vectorization(data_items)¶
Convert tokens to indices which can be processed by downstream models.
- class graph4nlp.data.dataset.KGCompletionDataset(root_dir: str = None, topology_builder=None, topology_subdir: str = None, **kwargs)¶
- Attributes
- processed_dir
- processed_file_names
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names
- raw_file_paths: The paths to raw files.
Methods
build_topology(data_items): Build graph topology for each item in the dataset.
build_vocab(): Build the vocabulary.
collate_fn(data_list): Take a list of data and convert it to a batch of data.
download(): Download the raw data from the Internet.
parse_file(file_path): Read and parse the file specified by file_path.
read_raw_data(): Read raw data from the disk and put it in a dictionary (self.data).
vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
process_data_items
register_datapipe_as_function
register_function
- build_topology(data_items)¶
Build graph topology for each item in the dataset. The generated graph is bound to the graph attribute of the DataItem.
- build_vocab()¶
Build the vocabulary. If self.use_val_for_vocab is True, use both the training set and the validation set for building the vocabulary. Otherwise only the training set is used.
- static collate_fn(data_list: [<class 'graph4nlp.pytorch.data.dataset.KGCompletionDataItem'>])¶
Take a list of data and convert it to a batch of data.
- parse_file(file_path) → list¶
Read and parse the file specified by file_path. The file format is specified by each individual task-specific base class. Returns all the indices of data items in this file w.r.t. the whole dataset.
For KGCompletionDataset, the input file should contain lines of input, each line representing one record of data.
- Parameters
- file_path: str
The path of the input file.
- Returns
- list
The indices of data items in the file w.r.t. the whole dataset.
Examples
input: {"e1": "person100", "e2": "None", "rel": "term6", "rel_eval": "None", "e2_multi1": "person90 person80 person59 person82 person63 person77 person85 person83 person56", "e2_multi2": "None"}
DataItem: e1="person100", e2="None", rel="term6", …
- read_raw_data()¶
Read raw data from the disk and put it in a dictionary (self.data). The raw data files should be organized in the format defined by the self.parse_file() method.
This function calls self.parse_file() repeatedly, passing the file paths in self.raw_file_names one at a time.
This function builds self.data, which is a dict of {int (index): DataItem}, where the index represents the position of the DataItem w.r.t. the whole dataset.
This function also builds the self.split_ids dictionary, whose keys correspond to those of self.raw_file_names defined by the user, indicating the indices of each subset (e.g. train, val and test).
- vectorization(data_items)¶
Convert tokens to indices which can be processed by downstream models.