Data Loader

DataLoader is a required layer of any Pipeline. It provides methods for loading data from raw text files and preprocessing that data by applying a TextProcessor layer. As mentioned before, the default format of a text corpus is as follows:

The polarity comes first, and the text is on the following line. Samples are separated by \n\n.

# train.txt
polarity_01
sentence_01

polarity_02
sentence_02

...

# test.txt
polarity_01
sentence_01

polarity_02
sentence_02

...

data_loader = DataLoader(text_processor=text_processor)
data = data_loader(train='train.txt', test='test.txt')

You can set your own delimiter (the separator between polarity and text) and line_separator (the separator between samples). For instance, with delimiter='\t' and line_separator='\n', your data should look like this:

# train.txt
polarity_01     sentence_01
polarity_02     sentence_02
...

# test.txt
polarity_01     sentence_01
polarity_02     sentence_02
...

data_loader = DataLoader(text_processor=text_processor, delimiter='\t', line_separator='\n')
data = data_loader(train='train.txt', test='test.txt')
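
To make the role of the two separators concrete, here is a minimal sketch of how a raw file could be split into (polarity, text) pairs. The helper parse_samples is hypothetical and not part of sentivi; the library's actual parsing may differ, so treat this only as an illustration of the two formats described above.

```python
from typing import List, Tuple


def parse_samples(raw: str, delimiter: str = '\n',
                  line_separator: str = '\n\n') -> List[Tuple[str, str]]:
    """Hypothetical helper (not part of sentivi): split raw text into (polarity, text) pairs."""
    samples = []
    for chunk in raw.split(line_separator):
        chunk = chunk.strip('\n')
        if not chunk:
            continue  # skip empty chunks, e.g. produced by a trailing separator
        # Split only on the first delimiter so the text itself may contain it
        polarity, text = chunk.split(delimiter, 1)
        samples.append((polarity.strip(), text.strip()))
    return samples


# Default format: polarity and text on separate lines, samples separated by a blank line
default_raw = "polarity_01\nsentence_01\n\npolarity_02\nsentence_02\n"
# Custom format: tab between polarity and text, one sample per line
tabbed_raw = "polarity_01\tsentence_01\npolarity_02\tsentence_02\n"

default_samples = parse_samples(default_raw)
tabbed_samples = parse_samples(tabbed_raw, delimiter='\t', line_separator='\n')
```

Both calls yield the same list of pairs, which shows that the two file layouts carry the same information and differ only in their separators.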

DataLoader will return a sentivi.data.data_loader.Corpus instance when executed.

class sentivi.data.DataLoader(delimiter: Optional[str] = '\n', line_separator: Optional[str] = '\n\n', n_grams: Optional[int] = 1, text_processor: Optional[sentivi.text_processor.TextProcessor] = None, max_length: Optional[int] = 256, mode: Optional[str] = 'sentivi')

DataLoader is a subclass of DataLayer.

__init__(delimiter: Optional[str] = '\n', line_separator: Optional[str] = '\n\n', n_grams: Optional[int] = 1, text_processor: Optional[sentivi.text_processor.TextProcessor] = None, max_length: Optional[int] = 256, mode: Optional[str] = 'sentivi')

Parameters

delimiter – separator between polarity and text
line_separator – separator between samples
n_grams – n-gram size used to split the text; for a TextEncoder such as word2vec or transformer, n_grams should be 1
text_processor – sentivi.text_processor.TextProcessor instance
max_length – maximum length of input text
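
As an illustration of what n_grams controls, the sketch below builds word-level n-grams from a whitespace-tokenized sentence. The function word_n_grams and the underscore join are assumptions made for this example; they are not part of sentivi, and the library's internal tokenization may differ.

```python
from typing import List


def word_n_grams(text: str, n: int = 1) -> List[str]:
    """Hypothetical helper (not part of sentivi): word-level n-grams of a sentence."""
    if n < 1:
        raise ValueError('n must be >= 1')
    words = text.split()
    # Slide a window of n words over the sentence and join each window with '_'
    return ['_'.join(words[i:i + n]) for i in range(len(words) - n + 1)]


unigrams = word_n_grams('the food was great', n=1)
bigrams = word_n_grams('the food was great', n=2)
```

With n=1 every token is a single word, which is the setting the parameter description recommends for encoders such as word2vec or transformer.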

forward(*args, **kwargs)

Execute the data loading pipeline

Parameters

args – arbitrary arguments
kwargs – arbitrary keyword arguments

Returns

loaded data

Return type

sentivi.data.data_loader.Corpus

Corpus