graph4nlp.datasets
The graph4nlp.datasets module contains various common datasets implemented based on graph4nlp.data.dataset.

All Datasets
class graph4nlp.datasets.JobsDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, seed=None, word_emb_size=300, share_vocab=True, lower_case=True, thread_number=1, port=9000, for_inference=None, reused_vocab_model=None)

Parameters
- root_dir: str
The path of the dataset.
- graph_name: str
The name of the graph construction method, e.g., 'dependency'. Note that if it is one of the provided graph names (i.e., 'dependency', 'constituency', 'ie', 'node_emb', 'node_emb_refine'), the following parameters are set by default and users can't modify them:
topology_builder, static_or_dynamic
If you need to customize your graph construction method, you should rename the graph_name and set the parameters above.
- topology_builder: GraphConstructionBase, default=None
The graph construction class.
- topology_subdir: str
The directory name of the processed path.
- static_or_dynamic: str, default='static'
The graph type. Expected in ('static', 'dynamic').
- edge_strategy: str, default=None
The edge strategy. Expected in (None, 'homogeneous', 'as_node'). If set to None, it defaults to 'homogeneous'.
- merge_strategy: str, default='tailhead'
The strategy to merge sub-graphs. Expected in (None, 'tailhead', 'user_define'). If set to None, it defaults to 'tailhead'.
- share_vocab: bool, default=True
Whether to share the input vocabulary with the output vocabulary.
- dynamic_init_graph_name: str, default=None
The graph name of the initial graph. Expected in (None, 'line', 'dependency', 'constituency'). Note that if it is one of the provided graph names (i.e., 'line', 'dependency', 'constituency'), the following parameter is set by default and users can't modify it:
dynamic_init_topology_builder
If you need to customize your graph construction method, you should rename the dynamic_init_graph_name and set the parameter above.
- dynamic_init_topology_builder: GraphConstructionBase
The graph construction class.
- dynamic_init_topology_aux_args: None
TBD.
Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
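A minimal construction sketch for reference. The root_dir and topology_subdir values below are illustrative placeholders, not required names; all other arguments keep their documented defaults::

    from graph4nlp.datasets import JobsDataset

    # Build the Jobs dataset with static dependency-graph construction.
    # "data/jobs" and "dependency_graph" are hypothetical names; point them
    # at your own raw-data directory and processed-output subdirectory.
    jobs = JobsDataset(
        root_dir="data/jobs",
        topology_subdir="dependency_graph",
        graph_name="dependency",  # a provided graph name, so topology_builder
                                  # and static_or_dynamic are set automatically
    )

Because 'dependency' is one of the provided graph names, topology_builder and static_or_dynamic are fixed by default; pass a custom graph_name together with your own GraphConstructionBase subclass to override them.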
class graph4nlp.datasets.JobsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_jobs>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): For the tree decoder, we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
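For tree-decoder training, a similar sketch applies. The tree-specific arguments shown simply restate the documented defaults, and the paths are again illustrative::

    from graph4nlp.datasets import JobsDatasetForTree

    # Hypothetical paths; the embedding sizes are the documented defaults.
    jobs_tree = JobsDatasetForTree(
        root_dir="data/jobs",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        share_vocab=True,   # share input and output vocabularies
        enc_emb_size=300,   # encoder word-embedding size
        dec_emb_size=300,   # decoder word-embedding size
    )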
class graph4nlp.datasets.GeoDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_geo>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): For the tree decoder, we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
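A sketch for the Geo dataset, this time showing the vocabulary-control arguments. Reading val_split_ratio as the fraction of training data held out for validation is an assumption based on the parameter name; the paths are illustrative::

    from graph4nlp.datasets import GeoDatasetForTree

    geo = GeoDatasetForTree(
        root_dir="data/geo",                 # hypothetical path
        topology_subdir="dependency_graph",  # hypothetical subdirectory
        graph_name="dependency",
        val_split_ratio=0.1,         # assumed: hold out 10% of train as 'val'
        min_word_vocab_freq=1,       # keep words appearing at least once
        max_word_vocab_size=100000,  # documented default vocabulary cap
    )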
class graph4nlp.datasets.KinshipDataset(root_dir=None, topology_subdir='kgc', word_emb_size=300, **kwargs)

Attributes
- processed_dir
- processed_file_names: At least 2 reserved keys should be filled: 'vocab' and 'data'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 2 reserved keys should be filled: 'vocab' and 'data'.

property raw_dir
The directory where the raw data is stored.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
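KinshipDataset has a much smaller constructor. A sketch; root_dir is an illustrative path, and 'kgc' is the documented default subdirectory::

    from graph4nlp.datasets import KinshipDataset

    kinship = KinshipDataset(
        root_dir="data/kinship",  # hypothetical path to the Kinship data
        topology_subdir="kgc",    # default processed subdirectory
        word_emb_size=300,        # documented default embedding size
    )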
class graph4nlp.datasets.MawpsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_mawps>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): For the tree decoder, we also need to vectorize the tree output.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'split_ids'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
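A MAWPS sketch showing the pretrained-embedding arguments; '6B' is the documented default embedding name, and the paths are illustrative::

    from graph4nlp.datasets import MawpsDatasetForTree

    mawps = MawpsDatasetForTree(
        root_dir="data/mawps",               # hypothetical path
        topology_subdir="dependency_graph",  # hypothetical subdirectory
        graph_name="dependency",
        pretrained_word_emb_name="6B",       # documented default
        word_emb_size=300,
    )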
class graph4nlp.datasets.SQuADDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, share_vocab=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)

Attributes
- processed_dir
- processed_file_names: At least 2 reserved keys should be filled: 'vocab' and 'data'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 2 reserved keys should be filled: 'vocab' and 'data'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
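SQuADDataset can be paired with dynamic graph construction. The sketch below assumes the graph_name conventions documented for JobsDataset above apply here as well: 'node_emb' is a provided dynamic graph name (so static_or_dynamic is set by default), and dynamic_init_graph_name chooses the initial topology. Paths are illustrative::

    from graph4nlp.datasets import SQuADDataset

    squad = SQuADDataset(
        root_dir="data/squad",                 # hypothetical path
        topology_subdir="node_emb_graph",      # hypothetical subdirectory
        graph_name="node_emb",                 # provided dynamic graph name
        dynamic_init_graph_name="dependency",  # initial topology for refinement
    )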
class graph4nlp.datasets.TrecDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)

Attributes
- processed_dir
- processed_file_names: At least 3 reserved keys should be filled: 'vocab', 'data' and 'label'.
- processed_file_paths
- raw_dir: The directory where the raw data is stored.
- raw_file_names: 3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
- raw_file_paths: The paths to raw files.
Methods
- build_topology(data_items): Build graph topology for each item in the dataset.
- build_vocab(): Build the vocabulary.
- collate_fn(data_list): Takes a list of data and converts it to a batch of data.
- download(): Download the raw data from the Internet.
- parse_file(file_path): Read and parse the file specified by file_path.
- read_raw_data(): Read raw data from the disk and put them in a dictionary (self.data).
- vectorization(data_items): Convert tokens to indices which can be processed by downstream models.
- process_data_items
- register_datapipe_as_function
- register_function
download()
Download the raw data from the Internet.

property processed_file_names
At least 3 reserved keys should be filled: 'vocab', 'data' and 'label'.

property raw_file_names
3 reserved keys: 'train', 'val' (optional), 'test'. Represents the split of the dataset.
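Finally, for_inference and reused_vocab_model let an inference-time dataset reuse a vocabulary built during training. A sketch; treating vocab_model as the attribute holding the built vocabulary is an assumption (see graph4nlp.data.dataset), as are the paths::

    from graph4nlp.datasets import TrecDataset

    # Training-time dataset: builds and caches the vocabulary.
    trec_train = TrecDataset(
        root_dir="data/trec",                # hypothetical path
        topology_subdir="dependency_graph",  # hypothetical subdirectory
        graph_name="dependency",
    )

    # Inference-time dataset: reuse the training vocabulary instead of rebuilding.
    trec_infer = TrecDataset(
        root_dir="data/trec",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        for_inference=True,
        reused_vocab_model=trec_train.vocab_model,  # assumed attribute name
    )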