Text Encoder

TextEncoder is a class that receives pre-processed data from sentivi.data.DataLoader; its responsibility is to provide appropriately encoded data to the respective classifiers.

text_encoder = TextEncoder('one-hot') # ['one-hot', 'word2vec', 'bow', 'tf-idf', 'transformer']
One-hot Encoding

The simplest encoding type in TextEncoder: each token is represented as a one-hot vector whose single non-zero entry marks the look-up index of that token in the corpus vocabulary. For example, with vocab = ['I', 'am', 'a', 'student']:

  • one-hot('I') = [1, 0, 0, 0]

  • one-hot('am') = [0, 1, 0, 0]

  • one-hot('a') = [0, 0, 1, 0]

  • one-hot('student') = [0, 0, 0, 1]
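The mapping above can be sketched in plain numpy (a minimal illustration, independent of sentivi's internals; the one_hot helper name here is hypothetical):

```python
import numpy as np

def one_hot(token, vocab):
    # Zero vector with a single 1 at the token's vocabulary index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(token)] = 1
    return vec

vocab = ['I', 'am', 'a', 'student']
print(one_hot('am', vocab))  # [0 1 0 0]
```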

Bag-of-Words

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words, and a measure of the presence of those words. More detail: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
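A minimal sketch of the idea, assuming whitespace tokenization (the bag_of_words helper is illustrative, not sentivi's API):

```python
from collections import Counter

import numpy as np

def bag_of_words(text, vocab):
    # Count how often each vocabulary word occurs in the tokenized text.
    counts = Counter(text.split())
    return np.array([counts[word] for word in vocab])

vocab = ['I', 'am', 'a', 'student']
print(bag_of_words('I am a student and I am happy', vocab))  # [2 2 1 1]
```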

Term Frequency - Inverse Document Frequency

tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf-idf variant implemented in TextEncoder is the logarithmically scaled version. More detail: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
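As an illustration of a logarithmically scaled variant (tf = log(1 + count), idf = log(N / df)); the exact scaling used by TextEncoder may differ slightly:

```python
import math

def tf_idf(term, doc, docs):
    # Logarithmically scaled term frequency within one document.
    tf = math.log(1 + doc.count(term))
    # Inverse document frequency over the whole corpus.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c']]
print(tf_idf('b', docs[0], docs))  # 0.0, since 'b' occurs in every document
```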

Word2Vec

Word2vec (Mikolov et al., 2013) is a method to efficiently create word embeddings using distributed representations. This implementation uses gensim and requires the model_path argument at the initialization stage. For Vietnamese, a pre-trained word2vec model can be downloaded from https://github.com/sonvx/word2vecVN

text_encoder = TextEncoder(encode_type='word2vec', model_path='./pretrained/wiki.vi.model.bin.gz')
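Sentence-level features are commonly derived by averaging token embeddings; below is a toy sketch with a stand-in dictionary instead of a real gensim model (the names are illustrative, and sentivi's internals may differ):

```python
import numpy as np

# Toy stand-in for a loaded word2vec model: token -> embedding vector.
toy_model = {'tôi': np.array([0.1, 0.3]), 'vui': np.array([0.5, 0.7])}

def sentence_embedding(tokens, model, dim=2):
    # Average embeddings of in-vocabulary tokens; OOV tokens are skipped.
    vectors = [model[t] for t in tokens if t in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(sentence_embedding(['tôi', 'vui', 'xxx'], toy_model))  # [0.3 0.5]
```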
Transformer

The transformer text encoder is equivalent to transformers.AutoTokenizer from https://huggingface.co/transformers

class sentivi.data.TextEncoder(encode_type: Optional[str] = None, model_path: Optional[str] = None)
__init__(encode_type: Optional[str] = None, model_path: Optional[str] = None)

Simple text encoding layer

Parameters
  • encode_type – one type in [‘one-hot’, ‘word2vec’, ‘bow’, ‘tf-idf’, ‘transformer’]

  • model_path – path to a pre-trained model, required for the word2vec option

bow(x, vocab, n_grams) → numpy.ndarray

Bag-of-Words encoder

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

Returns

Bag-of-Words vectors

Return type

numpy.ndarray
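The n_grams parameter, shared by the encoders in this class, controls how tokens are grouped before encoding. A rough illustration of word-level n-gram extraction (a sketch, not sentivi's actual code):

```python
def n_gram_tokens(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(n_gram_tokens(['I', 'am', 'a', 'student'], 2))  # ['I am', 'am a', 'a student']
```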

forward(x: Optional[sentivi.data.data_loader.Corpus], *args, **kwargs)

Execute text encoder pipeline

Parameters
  • x – sentivi.data.data_loader.Corpus instance

  • args – arbitrary arguments

  • kwargs – arbitrary keyword arguments

Returns

Training and Test batch encoding

Return type

Tuple, Tuple

one_hot(x, vocab, n_grams) → numpy.ndarray

Convert corpus into batch of one-hot vectors.

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

Returns

one-hot vectors

Return type

numpy.ndarray

predict(x, vocab, n_grams, *args, **kwargs)

Encode text for prediction purposes

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

  • args – arbitrary arguments

  • kwargs – arbitrary keyword arguments

Returns

encoded text

Return type

transformers.BatchEncoding

tf_idf(x, vocab, n_grams) → numpy.ndarray

Simple TF-IDF vectors

Parameters
  • x – list of texts

  • vocab – corpus vocabulary

  • n_grams – n-grams parameters

Returns

encoded vectors

Return type

numpy.ndarray

transformer_tokenizer(tokenizer, x)

Transformer tokenizer and encoder

Parameters
  • tokenizer – transformers.AutoTokenizer

  • x – list of texts

Returns

encoded vectors

Return type

BatchEncoding

word2vec(x, n_grams) → numpy.ndarray

word2vec embedding

Parameters
  • x – list of texts

  • n_grams – n-grams parameters

Returns

encoded vectors

Return type

numpy.ndarray