Data Loader

DataLoader is a required layer of any Pipeline. It provides several methods for loading data from raw text files and preprocesses that data by applying a TextProcessor layer. As mentioned before, the default format of a text corpus is as follows:

The polarity comes first, and the text is on the following line. Samples are separated by \n\n.

# train.txt
polarity_01
sentence_01

polarity_02
sentence_02

...

# test.txt
polarity_01
sentence_01

polarity_02
sentence_02

...

data_loader = DataLoader(text_processor=text_processor)
data = data_loader(train='train.txt', test='test.txt')
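
To make the default layout concrete, the following sketch splits a raw string into (polarity, text) pairs using only the standard library. parse_samples and the sample sentences are hypothetical, written for illustration only; they are not part of sentivi, and DataLoader may parse files differently internally.

```python
from typing import List, Tuple


def parse_samples(raw: str, delimiter: str = '\n',
                  line_separator: str = '\n\n') -> List[Tuple[str, str]]:
    """Hypothetical helper (not part of sentivi): split raw text into (polarity, text) pairs."""
    samples = []
    for block in raw.strip().split(line_separator):
        if not block.strip():
            continue  # skip empty blocks, e.g. from extra blank lines
        # Split on the first delimiter only, so the text may contain it again.
        polarity, text = block.split(delimiter, 1)
        samples.append((polarity.strip(), text.strip()))
    return samples


# Two samples in the default layout: polarity line, text line, blank line.
raw = "positive\nGreat phone, fast delivery\n\nnegative\nBattery died after a week\n"
pairs = parse_samples(raw)
print(pairs)
```

A block that contains no delimiter raises a ValueError in this sketch, which is a simple way to surface malformed samples early.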

You can set your own delimiter (the separator between polarity and text) and line_separator (the separator between samples). For instance, if delimiter='\t' and line_separator='\n', your data should look like this:

# train.txt
polarity_01     sentence_01
polarity_02     sentence_02
...

# test.txt
polarity_01     sentence_01
polarity_02     sentence_02
...

data_loader = DataLoader(text_processor=text_processor, delimiter='\t', line_separator='\n')
data = data_loader(train='train.txt', test='test.txt')
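
If your samples live in memory, a file in this tab-separated layout can be produced with a few lines of standard-library Python. This is a minimal sketch: the sample data is made up, and a temporary directory stands in for wherever train.txt would really be stored.

```python
import os
import tempfile

# Hypothetical in-memory samples as (polarity, text) pairs.
samples = [("positive", "Great phone, fast delivery"),
           ("negative", "Battery died after a week")]

delimiter, line_separator = '\t', '\n'

# One sample per line: polarity, the delimiter, then the text.
content = line_separator.join(
    f"{polarity}{delimiter}{text}" for polarity, text in samples
)

# A temporary directory stands in for the real data location.
with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, 'train.txt')
    with open(train_path, 'w', encoding='utf-8') as f:
        f.write(content)
    # Read the file back to check the layout.
    with open(train_path, encoding='utf-8') as f:
        lines = f.read().split(line_separator)

print(lines)
```

With this layout the text itself must not contain the delimiter or the line separator; otherwise a sample would be split in the wrong place.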

DataLoader will return a sentivi.data.data_loader.Corpus instance when executed.

class sentivi.data.DataLoader(delimiter: Optional[str] = '\n', line_separator: Optional[str] = '\n\n', n_grams: Optional[int] = 1, text_processor: Optional[sentivi.text_processor.TextProcessor] = None, max_length: Optional[int] = 256, mode: Optional[str] = 'sentivi')

DataLoader is a subclass of DataLayer.

__init__(delimiter: Optional[str] = '\n', line_separator: Optional[str] = '\n\n', n_grams: Optional[int] = 1, text_processor: Optional[sentivi.text_processor.TextProcessor] = None, max_length: Optional[int] = 256, mode: Optional[str] = 'sentivi')
Parameters
  • delimiter – separator between polarity and text

  • line_separator – separator between samples

  • n_grams – size of the n-grams used to split the text; for a TextEncoder such as word2vec or transformer, n_grams should be 1

  • text_processor – sentivi.text_processor.TextProcessor instance

  • max_length – maximum length of input text
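
As an illustration of what n_grams controls, the sketch below builds word-level n-grams from a sentence. word_ngrams is a hypothetical helper, not a sentivi function; splitting on whitespace and joining tokens with a space are assumptions, and the library may tokenize differently.

```python
from typing import List


def word_ngrams(text: str, n: int = 1) -> List[str]:
    """Hypothetical helper (not part of sentivi): word-level n-grams of a sentence."""
    if n < 1:
        raise ValueError("n must be >= 1")
    tokens = text.split()  # assumption: whitespace tokenization
    # Slide a window of n tokens over the sentence.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the battery life is great"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
print(unigrams)
print(bigrams)
```

With n=1 every token stands alone, which matches the recommendation above for encoders that expect single words.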

forward(*args, **kwargs)

Execute the data loading pipeline

Parameters
  • args – arbitrary arguments

  • kwargs – arbitrary keyword arguments

Returns

loaded data

Return type

sentivi.data.data_loader.Corpus