graph4nlp.datasets

The graph4nlp.datasets module contains various common datasets implemented on top of graph4nlp.data.dataset.

All Datasets

class graph4nlp.datasets.JobsDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, seed=None, word_emb_size=300, share_vocab=True, lower_case=True, thread_number=1, port=9000, for_inference=None, reused_vocab_model=None)
Parameters
root_dir: str

The path of the dataset.

graph_name: str

The name of the graph construction method, e.g., “dependency”. Note that if it is one of the provided graph names (i.e., “dependency”, “constituency”, “ie”, “node_emb”, “node_emb_refine”), the following parameters are set by default and users can’t modify them:

  1. topology_builder

  2. static_or_dynamic

If you need to customize your graph construction method, you should use a new graph_name and set the parameters above yourself.

topology_builder: GraphConstructionBase, default=None

The graph construction class.

topology_subdir: str

The name of the subdirectory where the processed data is stored.

static_or_dynamic: str, default=’static’

The graph type. Expected in (‘static’, ‘dynamic’).

edge_strategy: str, default=None

The edge strategy. Expected in (None, ‘homogeneous’, ‘as_node’). If set to None, ‘homogeneous’ will be used.

merge_strategy: str, default=’tailhead’

The strategy to merge sub-graphs. Expected in (None, ‘tailhead’, ‘user_define’). If set to None, ‘tailhead’ will be used.

share_vocab: bool, default=True

Whether to share the input vocabulary with the output vocabulary.

dynamic_init_graph_name: str, default=None

The graph construction method used for the initial graph. Expected in (None, “line”, “dependency”, “constituency”). Note that if it is one of the provided graph names (i.e., “line”, “dependency”, “constituency”), the following parameter is set by default and users can’t modify it:

  1. dynamic_init_topology_builder

If you need to customize the initial graph construction method, you should use a new dynamic_init_graph_name and set the parameter above yourself.

dynamic_init_topology_builder: GraphConstructionBase, default=None

The graph construction class used to build the initial graph.

dynamic_init_topology_aux_args: default=None

TBD.
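
A minimal construction sketch, using only the parameters documented above (the paths below are hypothetical); with a provided graph_name such as “dependency”, topology_builder and static_or_dynamic are filled in by default:

    from graph4nlp.datasets import JobsDataset

    # Hypothetical paths; a provided graph_name such as "dependency"
    # sets topology_builder and static_or_dynamic automatically.
    dataset = JobsDataset(
        root_dir="data/jobs",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        merge_strategy="tailhead",
        edge_strategy="homogeneous",
        share_vocab=True,
        word_emb_size=300,
    )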

Attributes
processed_dir
processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.
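
As a usage sketch for collate_fn, the dataset can be batched with a standard PyTorch DataLoader (the dataset.train attribute is an assumption from the wider library, not documented on this page):

    from torch.utils.data import DataLoader

    # dataset.train is assumed to hold the training data items;
    # collate_fn converts a list of items into a single batch.
    train_loader = DataLoader(
        dataset.train,
        batch_size=32,
        shuffle=True,
        collate_fn=dataset.collate_fn,
    )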

class graph4nlp.datasets.JobsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_jobs>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)
Attributes
processed_dir
processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.
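
A minimal sketch for the tree variant (paths hypothetical); compared with JobsDataset, it takes separate encoder/decoder embedding sizes and defaults to a dataset-specific tokenizer (tokenize_jobs):

    from graph4nlp.datasets import JobsDatasetForTree

    # Hypothetical paths; tokenizer defaults to tokenize_jobs.
    tree_dataset = JobsDatasetForTree(
        root_dir="data/jobs",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        enc_emb_size=300,
        dec_emb_size=300,
        min_word_vocab_freq=1,
        share_vocab=True,
    )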

class graph4nlp.datasets.GeoDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_geo>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)
Attributes
processed_dir
processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

class graph4nlp.datasets.KinshipDataset(root_dir=None, topology_subdir='kgc', word_emb_size=300, **kwargs)
Attributes
processed_dir
processed_file_names

At least 2 reserved keys should be filled: ‘vocab’ and ‘data’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 2 reserved keys should be filled: ‘vocab’ and ‘data’.

property raw_dir

The directory where the raw data is stored.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.
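
A minimal sketch for the knowledge-graph-completion dataset (path hypothetical); only a root directory is needed, with topology_subdir defaulting to ‘kgc’:

    from graph4nlp.datasets import KinshipDataset

    # Hypothetical path; topology_subdir defaults to "kgc".
    kg_dataset = KinshipDataset(
        root_dir="data/kinship",
        word_emb_size=300,
    )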

class graph4nlp.datasets.MawpsDatasetForTree(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, merge_strategy='tailhead', edge_strategy=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='6B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, val_split_ratio=0, word_emb_size=300, share_vocab=True, enc_emb_size=300, dec_emb_size=300, min_word_vocab_freq=1, tokenizer=<function tokenize_mawps>, max_word_vocab_size=100000, for_inference=False, reused_vocab_model=None)
Attributes
processed_dir
processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

For the tree decoder, we also need to vectorize the tree output.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘split_ids’.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.
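
A sketch of the inference-time pattern suggested by the signature (paths hypothetical; the vocab_model attribute is an assumption from the wider library, not documented on this page): a vocabulary built at training time can be reused via for_inference and reused_vocab_model.

    from graph4nlp.datasets import MawpsDatasetForTree

    # Training-time dataset builds the vocabulary (hypothetical paths).
    train_dataset = MawpsDatasetForTree(
        root_dir="data/mawps",
        topology_subdir="dependency_graph",
        graph_name="dependency",
    )

    # Inference-time dataset reuses it; vocab_model is an assumed
    # attribute, not documented on this page.
    inference_dataset = MawpsDatasetForTree(
        root_dir="data/mawps",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        for_inference=True,
        reused_vocab_model=train_dataset.vocab_model,
    )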

class graph4nlp.datasets.SQuADDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, share_vocab=True, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)
Attributes
processed_dir
processed_file_names

At least 2 reserved keys should be filled: ‘vocab’ and ‘data’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 2 reserved keys should be filled: ‘vocab’ and ‘data’.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.
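
A minimal sketch (paths hypothetical); per the signature, the default tokenizer is a whitespace RegexpTokenizer, replicated explicitly here, and ‘840B’ pretrained embeddings are the default:

    from nltk.tokenize import RegexpTokenizer
    from graph4nlp.datasets import SQuADDataset

    # Replicates the default whitespace tokenizer from the signature.
    tokenizer = RegexpTokenizer(" ", gaps=True, discard_empty=True)

    # Hypothetical paths.
    squad = SQuADDataset(
        root_dir="data/squad",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        pretrained_word_emb_name="840B",
        tokenizer=tokenizer.tokenize,
        word_emb_size=300,
    )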

class graph4nlp.datasets.TrecDataset(root_dir, topology_subdir, graph_name, static_or_dynamic='static', topology_builder=None, dynamic_init_graph_name=None, dynamic_init_topology_builder=None, dynamic_init_topology_aux_args=None, pretrained_word_emb_name='840B', pretrained_word_emb_url=None, pretrained_word_emb_cache_dir=None, max_word_vocab_size=None, min_word_vocab_freq=1, tokenizer=<bound method RegexpTokenizer.tokenize of RegexpTokenizer(pattern=' ', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)>, word_emb_size=None, **kwargs)
Attributes
processed_dir
processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘label’.

processed_file_paths
raw_dir

The directory where the raw data is stored.

raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.

raw_file_paths

The paths to raw files.

Methods

build_topology(data_items)

Build graph topology for each item in the dataset.

build_vocab()

Build the vocabulary.

collate_fn(data_list)

Take a list of data items and convert them to a batch of data.

download()

Download the raw data from the Internet.

parse_file(file_path)

Read and parse the file specified by file_path.

read_raw_data()

Read raw data from the disk and put them in a dictionary (self.data).

vectorization(data_items)

Convert tokens to indices which can be processed by downstream models.

process_data_items

register_datapipe_as_function

register_function

download()

Download the raw data from the Internet.

property processed_file_names

At least 3 reserved keys should be filled: ‘vocab’, ‘data’ and ‘label’.

property raw_file_names

3 reserved keys: ‘train’, ‘val’ (optional), ‘test’. These represent the splits of the dataset.
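
A minimal sketch for the text-classification dataset (paths hypothetical), showing the vocabulary-control parameters max_word_vocab_size and min_word_vocab_freq:

    from graph4nlp.datasets import TrecDataset

    # Hypothetical paths; cap the vocabulary size and drop rare words.
    trec = TrecDataset(
        root_dir="data/trec",
        topology_subdir="dependency_graph",
        graph_name="dependency",
        max_word_vocab_size=50000,
        min_word_vocab_freq=2,
        word_emb_size=300,
    )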