graph4nlp.datasets
The graph4nlp.datasets module contains various common datasets implemented on top of graph4nlp.data.dataset.
All Datasets
class graph4nlp.datasets.JobsDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, seed=None, word_emb_size=300, share_vocab=True, lower_case=True, thread_number=1, port=9000, for_inference=None, reused_vocab_model=None)

Parameters
- root_dir: str
  The root directory of the dataset.
- graph_name: str
  The name of the graph construction method, e.g., "dependency". Note that if it is one of the provided graph names (i.e., "dependency", "constituency", "ie", "node_emb", "node_emb_refine"), the following parameters are set by default and users can't modify them: topology_builder, static_or_dynamic. If you need to customize your graph construction method, you should rename graph_name and set the parameters above yourself.
- topology_builder: GraphConstructionBase, default=None
  The graph construction class.
- topology_subdir: str
  The name of the subdirectory where the processed data is stored.
- static_or_dynamic: str, default='static'
  The graph type. Expected values: 'static', 'dynamic'.
- edge_strategy: str, default=None
  The edge strategy. Expected values: None, 'homogeneous', 'as_node'. If set to None, it defaults to 'homogeneous'.
- merge_strategy: str, default='tailhead'
  The strategy used to merge subgraphs. Expected values: None, 'tailhead', 'user_define'. If set to None, it defaults to 'tailhead'.
- share_vocab: bool, default=True
  Whether to share the input vocabulary with the output vocabulary.
- dynamic_init_graph_name: str, default=None
  The graph name of the initial graph. Expected values: None, "line", "dependency", "constituency". Note that if it is one of the provided graph names (i.e., "line", "dependency", "constituency"), the following parameter is set by default and users can't modify it: dynamic_init_topology_builder. If you need to customize your initial graph construction method, you should rename dynamic_init_graph_name and set the parameter above yourself.
- dynamic_init_topology_builder: GraphConstructionBase
  The initial graph construction class.
- dynamic_init_topology_aux_args: default=None
  TBD.
Attributes
- processed_dir
- processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
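A minimal construction sketch, assuming a local copy of the Jobs data; the root_dir and topology_subdir values below are hypothetical, while the keyword names come from the signature above:

from graph4nlp.datasets import JobsDataset

# Hypothetical paths: root_dir should contain the raw 'train'/'test'
# split files described under raw_file_names.
jobs = JobsDataset(
    root_dir="data/jobs",                # hypothetical dataset root
    topology_subdir="dependency_graph",  # hypothetical processed-data subdirectory
    graph_name="dependency",             # one of the provided graph construction methods
)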
class graph4nlp.datasets.JobsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_jobs>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  For the tree decoder we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
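A minimal construction sketch for the tree-decoder variant; the paths are hypothetical and the keyword values mirror the defaults in the signature above:

from graph4nlp.datasets import JobsDatasetForTree

jobs_tree = JobsDatasetForTree(
    root_dir="data/jobs",                # hypothetical dataset root
    topology_subdir="dependency_graph",  # hypothetical processed-data subdirectory
    graph_name="dependency",
    share_vocab=True,                    # share input/output vocabularies
    enc_emb_size=300,                    # encoder embedding size
    dec_emb_size=300,                    # decoder embedding size
)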
class graph4nlp.datasets.GeoDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_geo>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  For the tree decoder we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
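A minimal construction sketch; the paths are hypothetical, and the semantics of val_split_ratio are an assumption based on the parameter name (a fraction of the training data held out for validation):

from graph4nlp.datasets import GeoDatasetForTree

geo_tree = GeoDatasetForTree(
    root_dir="data/geo",                 # hypothetical dataset root
    topology_subdir="dependency_graph",  # hypothetical processed-data subdirectory
    graph_name="dependency",
    val_split_ratio=0.1,                 # assumed: fraction of training data held out for validation
)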
class graph4nlp.datasets.KinshipDataset(root_dir=None, topology_subdir='kgc', word_emb_size=300, **kwargs)

Attributes
- processed_dir
- processed_file_names
  At least 2 reserved keys should be filled: 'vocab' and 'data'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 2 reserved keys should be filled: 'vocab' and 'data'.

property raw_dir
  The directory where the raw data is stored.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
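A minimal construction sketch; the root_dir path is hypothetical, while 'kgc' is the default topology_subdir from the signature above:

from graph4nlp.datasets import KinshipDataset

kinship = KinshipDataset(
    root_dir="data/kinship",  # hypothetical dataset root
    topology_subdir="kgc",    # default processed-data subdirectory
    word_emb_size=300,
)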
class graph4nlp.datasets.MawpsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_mawps>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  For the tree decoder we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
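A minimal construction sketch; the paths are hypothetical. The for_inference/reused_vocab_model pairing below is an assumption about the intent of those parameters (reusing a training-time vocabulary at inference time), and vocab_model is a hypothetical attribute name:

from graph4nlp.datasets import MawpsDatasetForTree

# Build the training dataset first.
mawps_train = MawpsDatasetForTree(
    root_dir="data/mawps",               # hypothetical dataset root
    topology_subdir="dependency_graph",  # hypothetical processed-data subdirectory
    graph_name="dependency",
)

# Assumed usage: at inference time, reuse the vocabulary built during training.
mawps_infer = MawpsDatasetForTree(
    root_dir="data/mawps",
    topology_subdir="dependency_graph",
    graph_name="dependency",
    for_inference=True,
    reused_vocab_model=mawps_train.vocab_model,  # hypothetical attribute holding the built vocabulary
)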
class graph4nlp.datasets.SQuADDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, share_vocab=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)

Attributes
- processed_dir
- processed_file_names
  At least 2 reserved keys should be filled: 'vocab' and 'data'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 2 reserved keys should be filled: 'vocab' and 'data'.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
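A minimal construction sketch using a dynamic graph; "node_emb" is one of the provided dynamic graph names documented in this module, the paths are hypothetical, and pairing it with dynamic_init_graph_name="dependency" is an assumption about typical usage:

from graph4nlp.datasets import SQuADDataset

squad = SQuADDataset(
    root_dir="data/squad",                 # hypothetical dataset root
    topology_subdir="node_emb_graph",      # hypothetical processed-data subdirectory
    graph_name="node_emb",                 # dynamic graph construction
    dynamic_init_graph_name="dependency",  # initial graph used to seed dynamic construction
)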
class graph4nlp.datasets.TrecDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)

Attributes
- processed_dir
- processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'label'.
- processed_file_paths
- raw_dir
  The directory where the raw data is stored.
- raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
- raw_file_paths
  The paths to the raw files.
Methods
- build_topology(data_items)
  Build the graph topology for each item in the dataset.
- build_vocab()
  Build the vocabulary.
- collate_fn(data_list)
  Take a list of data items and convert them into a batch.
- download()
  Download the raw data from the Internet.
- parse_file(file_path)
  Read and parse the file specified by file_path.
- read_raw_data()
  Read the raw data from disk and put it in a dictionary (self.data).
- vectorization(data_items)
  Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
  Download the raw data from the Internet.

property processed_file_names
  At least 3 reserved keys should be filled: 'vocab', 'data' and 'label'.

property raw_file_names
  3 reserved keys: 'train', 'val' (optional), 'test'. They represent the dataset splits.
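A minimal construction sketch for this text-classification dataset; the paths are hypothetical, while the keyword names come from the signature above:

from graph4nlp.datasets import TrecDataset

trec = TrecDataset(
    root_dir="data/trec",                # hypothetical dataset root
    topology_subdir="dependency_graph",  # hypothetical processed-data subdirectory
    graph_name="dependency",
    min_word_vocab_freq=1,               # keep all tokens seen at least once
)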