Data Loader
DataLoader is a required layer of any Pipeline. It provides several methods for loading data from raw text files and preprocessing that data by applying a TextProcessor layer. As mentioned before, the default format of a text corpus is as follows:
The polarity comes first, and the text is on the following line. Training samples are separated by \n\n (a blank line).
# train.txt
polarity_01
sentence_01

polarity_02
sentence_02

...
# test.txt
polarity_01
sentence_01

polarity_02
sentence_02

...
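from sentivi.data import DataLoader
# text_processor is a sentivi.text_processor.TextProcessor instance created beforehand
data_loader = DataLoader(text_processor=text_processor)
data = data_loader(train='train.txt', test='test.txt')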
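To make the default layout concrete, here is a minimal sketch in plain Python of how a corpus in this format could be split into (polarity, text) pairs. It is not sentivi's implementation: parse_corpus is a hypothetical helper written only for illustration, and the sample sentences are made up.

```python
from typing import List, Tuple


def parse_corpus(raw: str, delimiter: str = "\n",
                 line_separator: str = "\n\n") -> List[Tuple[str, str]]:
    """Split raw corpus text into (polarity, text) pairs.

    Hypothetical helper for illustration only; not part of sentivi.
    """
    samples = []
    for chunk in raw.split(line_separator):
        chunk = chunk.strip("\n")  # drop stray newlines around a sample
        if not chunk:
            continue  # skip empty chunks, e.g. after a trailing separator
        # Split only on the first delimiter so the text itself may contain it
        polarity, sep, text = chunk.partition(delimiter)
        if not sep:
            raise ValueError(f"sample has no delimiter: {chunk!r}")
        samples.append((polarity.strip(), text.strip()))
    return samples


raw = "positive\nGreat phone, fast delivery\n\nnegative\nBattery died after a week\n"
pairs = parse_corpus(raw)
print(pairs)
```

The same helper handles a custom layout when it is given a different delimiter and line_separator, which mirrors the two DataLoader arguments of the same names.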
You can set your own delimiter (the separator between polarity and text) and line_separator (the separator between samples). For instance, if delimiter='\t' and line_separator='\n', your data should be:
# train.txt
polarity_01 sentence_01
polarity_02 sentence_02
...
# test.txt
polarity_01 sentence_01
polarity_02 sentence_02
...
data_loader = DataLoader(text_processor=text_processor, delimiter='\t', line_separator='\n')
data = data_loader(train='train.txt', test='test.txt')
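If your samples live in memory rather than in a file, a small script can write them out in the tab-delimited layout. The sketch below is an assumption-laden illustration, not part of sentivi: write_corpus is a hypothetical helper, and collapsing whitespace inside the text is one possible way to keep a sentence from containing the delimiter or the line separator.

```python
import os
import tempfile
from typing import Iterable, Tuple


def write_corpus(path: str, samples: Iterable[Tuple[str, str]],
                 delimiter: str = "\t", line_separator: str = "\n") -> None:
    """Write (polarity, text) pairs in a delimiter-separated layout.

    Hypothetical helper for illustration only; not part of sentivi.
    """
    # newline="" keeps the separators exactly as written on every platform
    with open(path, "w", encoding="utf-8", newline="") as f:
        for polarity, text in samples:
            # Collapse tabs and newlines so the text cannot break the layout
            clean = " ".join(text.split())
            f.write(f"{polarity}{delimiter}{clean}{line_separator}")


samples = [
    ("positive", "Great phone,\tfast delivery"),
    ("negative", "Battery died\nafter a week"),
]
path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_corpus(path, samples)

with open(path, encoding="utf-8", newline="") as f:
    content = f.read()
print(repr(content))
```

A file written this way should match a loader configured with delimiter='\t' and line_separator='\n'.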
DataLoader will return a sentivi.data.data_loader.Corpus instance when executed.
class sentivi.data.DataLoader(delimiter: Optional[str] = '\n', line_separator: Optional[str] = '\n\n', n_grams: Optional[int] = 1, text_processor: Optional[sentivi.text_processor.TextProcessor] = None, max_length: Optional[int] = 256)
DataLoader is a subclass of DataLayer.
__init__(delimiter: Optional[str] = '\n', line_separator: Optional[str] = '\n\n', n_grams: Optional[int] = 1, text_processor: Optional[sentivi.text_processor.TextProcessor] = None, max_length: Optional[int] = 256)
- Parameters
delimiter – separator between polarity and text
line_separator – separator between samples
n_grams – n-gram size used to split the text; for a TextEncoder such as word2vec or transformer, n_grams should be 1
text_processor – sentivi.text_processor.TextProcessor instance
max_length – maximum length of input text
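As a rough illustration of what an n-gram split means, the sketch below builds word n-grams from a whitespace-tokenized sentence. It is a generic example under stated assumptions: word_ngrams is a hypothetical helper, whitespace tokenization is assumed, and sentivi's own splitting may differ in tokenization and output form.

```python
from typing import List


def word_ngrams(text: str, n: int = 1) -> List[str]:
    """Return the word n-grams of `text` as space-joined strings.

    Hypothetical helper for illustration only; not part of sentivi.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()  # assumption: simple whitespace tokenization
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


unigrams = word_ngrams("the battery life is great", n=1)
bigrams = word_ngrams("the battery life is great", n=2)
print(unigrams)
print(bigrams)
```

With n=1 the result is just the individual tokens, which is the form that token-level encoders generally expect.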
forward(*args, **kwargs)
Execute the data loading pipeline.
- Parameters
args – arbitrary arguments
kwargs – arbitrary keyword arguments
- Returns
loaded data
- Return type
-