graph4nlp.datasets
The graph4nlp.datasets module contains various common datasets implemented based on graph4nlp.data.dataset.

All Datasets
class graph4nlp.datasets.JobsDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, seed=None, word_emb_size=300, share_vocab=True, lower_case=True, thread_number=1, port=9000, for_inference=None, reused_vocab_model=None)

Parameters
- root_dir: str
The path of the dataset.
- graph_name: str
The name of the graph construction method, e.g., 'dependency'. Note that if it is one of the provided graph names (i.e., 'dependency', 'constituency', 'ie', 'node_emb', 'node_emb_refine'), the following parameters are set by default and users can't modify them:
topology_builder, static_or_dynamic
If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.
- topology_builder: GraphConstructionBase, default=None
The graph construction class.
- topology_subdir: str
The directory name of the processed path.
- static_or_dynamic: str, default='static'
The graph type. Expected in ('static', 'dynamic').
- edge_strategy: str, default=None
The edge strategy. Expected in (None, 'homogeneous', 'as_node'). If set to None, it defaults to 'homogeneous'.
- merge_strategy: str, default='tailhead'
The strategy to merge sub-graphs. Expected in (None, 'tailhead', 'user_define'). If set to None, it defaults to 'tailhead'.
- share_vocab: bool, default=True
Whether to share the input vocabulary with the output vocabulary.
- dynamic_init_graph_name: str, default=None
The graph name of the initial graph. Expected in (None, 'line', 'dependency', 'constituency'). Note that if it is one of the provided graph names (i.e., 'line', 'dependency', 'constituency'), the following parameter is set by default and users can't modify it:
dynamic_init_topology_builder
If you need to customize your graph construction method, you should rename the dynamic_init_graph_name and set the parameter above.
- dynamic_init_topology_builder: GraphConstructionBase
The graph construction class.
- dynamic_init_topology_aux_args: None
TBD.
Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
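A minimal construction sketch for reference. The root_dir and topology_subdir values below are illustrative placeholders, not required names; all other arguments keep their documented defaults::

    from graph4nlp.datasets import JobsDataset

    # Build the Jobs dataset with static dependency-graph construction.
    # "data/jobs" and "dependency_graph" are hypothetical names; point them
    # at your own raw-data directory and processed-output subdirectory.
    jobs = JobsDataset(
        root_dir="data/jobs",
        topology_subdir="dependency_graph",
        graph_name="dependency",  # a provided graph name, so topology_builder
                                  # and static_or_dynamic are set automatically
    )

Because 'dependency' is one of the provided graph names, topology_builder and static_or_dynamic are fixed by default; pass a custom graph_name together with your own GraphConstructionBase subclass to override them.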
class graph4nlp.datasets.JobsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_jobs>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): For the tree decoder, we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
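For tree-decoder training, a similar sketch applies. The tree-specific arguments shown simply restate the documented defaults, and the paths are again illustrative::

    from graph4nlp.datasets import JobsDatasetForTree

    # Hypothetical paths; the embedding sizes are the documented defaults.
    jobs_tree = JobsDatasetForTree(
        root_dir="data/jobs",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        share_vocab=True,   # share input and output vocabularies
        enc_emb_size=300,   # encoder word-embedding size
        dec_emb_size=300,   # decoder word-embedding size
    )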
class graph4nlp.datasets.GeoDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_geo>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): For the tree decoder, we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
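A sketch for the Geo dataset, this time showing the vocabulary-control arguments. Reading val_split_ratio as the fraction of training data held out for validation is an assumption based on the parameter name; the paths are illustrative::

    from graph4nlp.datasets import GeoDatasetForTree

    geo = GeoDatasetForTree(
        root_dir="data/geo",                 # hypothetical path
        topology_subdir="dependency_graph",  # hypothetical subdirectory
        graph_name="dependency",
        val_split_ratio=0.1,         # assumed: hold out 10% of train as 'val'
        min_word_vocab_freq=1,       # keep words appearing at least once
        max_word_vocab_size=100000,  # documented default vocabulary cap
    )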
class graph4nlp.datasets.KinshipDataset(root_dir=None, topology_subdir='kgc', word_emb_size=300, **kwargs)

Attributes
- processed_dir
- processed_file_names: At least 2 reserved keys should be filled: 'vocab' and 'data'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 2 reserved keys should be filled: 'vocab' and 'data'.

property raw_dir
The directory where the raw data is stored.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
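KinshipDataset has a much smaller constructor. A sketch; root_dir is an illustrative path, and 'kgc' is the documented default subdirectory::

    from graph4nlp.datasets import KinshipDataset

    kinship = KinshipDataset(
        root_dir="data/kinship",  # hypothetical path to the Kinship data
        topology_subdir="kgc",    # default processed subdirectory
        word_emb_size=300,        # documented default embedding size
    )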
class graph4nlp.datasets.MawpsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_mawps>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): For the tree decoder, we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
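A MAWPS sketch showing the pretrained-embedding arguments; '6B' is the documented default embedding name, and the paths are illustrative::

    from graph4nlp.datasets import MawpsDatasetForTree

    mawps = MawpsDatasetForTree(
        root_dir="data/mawps",               # hypothetical path
        topology_subdir="dependency_graph",  # hypothetical subdirectory
        graph_name="dependency",
        pretrained_word_emb_name="6B",       # documented default
        word_emb_size=300,
    )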
class graph4nlp.datasets.SQuADDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, share_vocab=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)

Attributes
- processed_dir
- processed_file_names: At least 2 reserved keys should be filled: 'vocab' and 'data'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 2 reserved keys should be filled: 'vocab' and 'data'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
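SQuADDataset can be paired with dynamic graph construction. The sketch below assumes the graph_name conventions documented for JobsDataset above apply here as well: 'node_emb' is a provided dynamic graph name (so static_or_dynamic is set by default), and dynamic_init_graph_name chooses the initial topology. Paths are illustrative::

    from graph4nlp.datasets import SQuADDataset

    squad = SQuADDataset(
        root_dir="data/squad",                 # hypothetical path
        topology_subdir="node_emb_graph",      # hypothetical subdirectory
        graph_name="node_emb",                 # provided dynamic graph name
        dynamic_init_graph_name="dependency",  # initial topology for refinement
    )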
class graph4nlp.datasets.TrecDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'label'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'label'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
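Finally, for_inference and reused_vocab_model let an inference-time dataset reuse a vocabulary built during training. A sketch; treating vocab_model as the attribute holding the built vocabulary is an assumption (see graph4nlp.data.dataset), as are the paths::

    from graph4nlp.datasets import TrecDataset

    # Training-time dataset: builds and caches the vocabulary.
    trec_train = TrecDataset(
        root_dir="data/trec",                # hypothetical path
        topology_subdir="dependency_graph",  # hypothetical subdirectory
        graph_name="dependency",
    )

    # Inference-time dataset: reuse the training vocabulary instead of rebuilding.
    trec_infer = TrecDataset(
        root_dir="data/trec",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        for_inference=True,
        reused_vocab_model=trec_train.vocab_model,  # assumed attribute name
    )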