Text Encoder

TextEncoder is a class that receives pre-processed data from sentivi.data.DataLoader; its responsibility is to provide appropriately encoded data to the respective classifiers.

text_encoder = TextEncoder('one-hot') # ['one-hot', 'word2vec', 'bow', 'tf-idf', 'transformer']
One-hot Encoding

The simplest encoding type in TextEncoder: each token is represented as a one-hot vector whose single non-zero entry marks the look-up index of that token in the corpus vocabulary. For example, with vocab = ['I', 'am', 'a', 'student']:

  • one-hot('I') = [1, 0, 0, 0]

  • one-hot('am') = [0, 1, 0, 0]

  • one-hot('a') = [0, 0, 1, 0]

  • one-hot('student') = [0, 0, 0, 1]
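The mapping above can be sketched in plain numpy (a minimal illustration, independent of sentivi's internals; the one_hot helper name here is hypothetical):

```python
import numpy as np

def one_hot(token, vocab):
    # Zero vector with a single 1 at the token's vocabulary index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(token)] = 1
    return vec

vocab = ['I', 'am', 'a', 'student']
print(one_hot('am', vocab))  # [0 1 0 0]
```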

Bag-of-Words

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words, and a measure of the presence of those words. More detail: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
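A minimal sketch of the idea, assuming whitespace tokenization (the bag_of_words helper is illustrative, not sentivi's API):

```python
from collections import Counter

import numpy as np

def bag_of_words(text, vocab):
    # Count how often each vocabulary word occurs in the tokenized text.
    counts = Counter(text.split())
    return np.array([counts[word] for word in vocab])

vocab = ['I', 'am', 'a', 'student']
print(bag_of_words('I am a student and I am happy', vocab))  # [2 2 1 1]
```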

Term Frequency - Inverse Document Frequency

tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf-idf variant implemented in TextEncoder is the logarithmically scaled version. More detail: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
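As an illustration of a logarithmically scaled variant (tf = log(1 + count), idf = log(N / df)); the exact scaling used by TextEncoder may differ slightly:

```python
import math

def tf_idf(term, doc, docs):
    # Logarithmically scaled term frequency within one document.
    tf = math.log(1 + doc.count(term))
    # Inverse document frequency over the whole corpus.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c']]
print(tf_idf('b', docs[0], docs))  # 0.0, since 'b' occurs in every document
```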

Word2Vec

Word2vec (Mikolov et al., 2013) is a method to efficiently create word embeddings using distributed representations. This implementation uses gensim and requires the model_path argument at the initialization stage. For Vietnamese, a pre-trained word2vec model can be downloaded from https://github.com/sonvx/word2vecVN

text_encoder = TextEncoder(encode_type='word2vec', model_path='./pretrained/wiki.vi.model.bin.gz')
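Sentence-level features are commonly derived by averaging token embeddings; below is a toy sketch with a stand-in dictionary instead of a real gensim model (the names are illustrative, and sentivi's internals may differ):

```python
import numpy as np

# Toy stand-in for a loaded word2vec model: token -> embedding vector.
toy_model = {'tôi': np.array([0.1, 0.3]), 'vui': np.array([0.5, 0.7])}

def sentence_embedding(tokens, model, dim=2):
    # Average embeddings of in-vocabulary tokens; OOV tokens are skipped.
    vectors = [model[t] for t in tokens if t in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(sentence_embedding(['tôi', 'vui', 'xxx'], toy_model))  # [0.3 0.5]
```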
Transformer

The transformer text encoder is equivalent to transformers.AutoTokenizer from https://huggingface.co/transformers

class sentivi.data.TextEncoder(encode_type: Optional[str] = None, model_path: Optional[str] = None)
__init__(encode_type: Optional[str] = None, model_path: Optional[str] = None)

Simple text encoding layer

Parameters
  • encode_type – one type in [‘one-hot’, ‘word2vec’, ‘bow’, ‘tf-idf’, ‘transformer’]

  • model_path – path to a pre-trained model, required for the word2vec option

bow(x, vocab, n_grams) → numpy.ndarray

Bag-of-Words encoder

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

Returns

Bag-of-Words vectors

Return type

numpy.ndarray
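The n_grams parameter, shared by the encoders in this class, controls how tokens are grouped before encoding. A rough illustration of word-level n-gram extraction (a sketch, not sentivi's actual code):

```python
def n_gram_tokens(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(n_gram_tokens(['I', 'am', 'a', 'student'], 2))  # ['I am', 'am a', 'a student']
```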

forward(x: Optional[sentivi.data.data_loader.Corpus], *args, **kwargs)

Execute text encoder pipeline

Parameters
  • x – sentivi.data.data_loader.Corpus instance

  • args – arbitrary arguments

  • kwargs – arbitrary keyword arguments

Returns

Training and Test batch encoding

Return type

Tuple, Tuple

one_hot(x, vocab, n_grams) → numpy.ndarray

Convert corpus into batch of one-hot vectors.

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

Returns

one-hot vectors

Return type

numpy.ndarray

predict(x, vocab, n_grams, *args, **kwargs)

Encode text for prediction purposes

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

  • args – arbitrary arguments

  • kwargs – arbitrary keyword arguments

Returns

encoded text

Return type

transformers.BatchEncoding

tf_idf(x, vocab, n_grams) → numpy.ndarray

Simple TF-IDF vectors

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

Returns

encoded vectors

Return type

numpy.ndarray

transformer_tokenizer(tokenizer, x)

Transformer tokenizer and encoder

Parameters
  • tokenizer – transformers.AutoTokenizer

  • x – list of texts

Returns

encoded vectors

Return type

BatchEncoding

word2vec(x, n_grams) → numpy.ndarray

word2vec embedding

Parameters
  • x – list of texts

  • n_grams – n-grams parameters

Returns

encoded vectors

Return type

numpy.ndarray