Text Encoder¶
TextEncoder is a class that receives pre-processed data from sentivi.data.DataLoader; its responsibility is to provide appropriately encoded data to the respective classifiers.
text_encoder = TextEncoder('one-hot')  # ['one-hot', 'word2vec', 'bow', 'tf-idf', 'transformer']
- One-hot Encoding
The simplest encoding type of TextEncoder: each token is represented as a one-hot vector that indicates the look-up index of the given token in the corpus vocabulary. For example, vocab = ['I', 'am', 'a', 'student']:
one-hot('I') = [1, 0, 0, 0]
one-hot('am') = [0, 1, 0, 0]
one-hot('a') = [0, 0, 1, 0]
one-hot('student') = [0, 0, 0, 1]
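The example above can be sketched in plain Python (a minimal illustration; the actual TextEncoder builds its vocabulary from the corpus and also handles n-grams and batching):

```python
# Minimal one-hot sketch: each token maps to a vector with a single 1
# at its vocabulary index. Illustrative only, not sentivi's implementation.
vocab = ['I', 'am', 'a', 'student']

def one_hot(token, vocab):
    vector = [0] * len(vocab)
    vector[vocab.index(token)] = 1
    return vector

print(one_hot('I', vocab))        # [1, 0, 0, 0]
print(one_hot('student', vocab))  # [0, 0, 0, 1]
```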
- Bag-of-Words
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words, and a measure of the presence of those words. More detail: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
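A bag-of-words vector can be sketched as a vector of token counts over the vocabulary (a toy illustration, not sentivi's exact implementation, which also applies n-grams and returns numpy arrays):

```python
from collections import Counter

# Toy bag-of-words: count how often each vocabulary word occurs in a text.
def bag_of_words(text, vocab):
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vocab = ['I', 'am', 'a', 'student']
print(bag_of_words('I am a student and I am happy', vocab))  # [2, 2, 1, 1]
```

Unlike one-hot encoding, the resulting vector describes a whole document rather than a single token, so word order is discarded and only occurrence counts survive.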
- Term Frequency - Inverse Document Frequency
tf-idf, or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The tf-idf variant implemented in TextEncoder is the logarithmically scaled version. More detail: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Word2Vec
Word2vec (Mikolov et al., 2013) is a method to efficiently create word embeddings using distributed representations. This implementation uses gensim and requires the model_path argument at initialization. For Vietnamese, a word2vec model can be downloaded from https://github.com/sonvx/word2vecVN
text_encoder = TextEncoder(encode_type='word2vec', model_path='./pretrained/wiki.vi.model.bin.gz')
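How a word2vec encoder turns a token sequence into fixed-size vectors can be sketched with a toy embedding table; the real encoder loads a pretrained gensim model from model_path, and the vectors and averaging scheme below are purely illustrative:

```python
# Toy word-vector table standing in for a pretrained gensim model
# (real usage: gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)).
toy_vectors = {
    'I':       [0.1, 0.2],
    'am':      [0.3, 0.0],
    'student': [0.2, 0.4],
}
EMBEDDING_DIM = 2  # toy dimensionality; pretrained models are typically 100-400

def sentence_vector(tokens, vectors):
    """Average the vectors of known tokens; skip out-of-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * EMBEDDING_DIM
    return [sum(v[i] for v in known) / len(known) for i in range(EMBEDDING_DIM)]

print(sentence_vector(['I', 'am', 'student'], toy_vectors))
```

The key difference from one-hot and bag-of-words is that word2vec vectors are dense and pretrained, so semantically similar words end up close together in the embedding space.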
- Transformer
The transformer text encoder is equivalent to transformers.AutoTokenizer from https://huggingface.co/transformers
-
class sentivi.data.TextEncoder(encode_type: Optional[str] = None, model_path: Optional[str] = None)¶
-
__init__(encode_type: Optional[str] = None, model_path: Optional[str] = None)¶ Simple text encoder layer
- Parameters
encode_type – one type in [‘one-hot’, ‘word2vec’, ‘bow’, ‘tf-idf’, ‘transformer’]
model_path – model is required for word2vec option
-
bow(x, vocab, n_grams) → numpy.ndarray¶ Bag-of-Words encoder
- Parameters
x – list of texts
vocab – corpus vocabulary
n_grams – n-grams parameters
- Returns
Bag-of-Words vectors
- Return type
numpy.ndarray
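The n_grams parameter used throughout these methods controls whether the vocabulary units are single tokens or longer spans. A minimal n-gram extraction helper (hypothetical, for illustration only; sentivi's internal splitting may differ):

```python
def extract_ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) of a token list."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['I', 'am', 'a', 'student']
print(extract_ngrams(tokens, 1))  # ['I', 'am', 'a', 'student']
print(extract_ngrams(tokens, 2))  # ['I am', 'am a', 'a student']
```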
-
forward(x: Optional[sentivi.data.data_loader.Corpus], *args, **kwargs)¶ Execute the text encoder pipeline
- Parameters
x – sentivi.data.data_loader.Corpus instance
args – arbitrary arguments
kwargs – arbitrary keyword arguments
- Returns
Training and Test batch encoding
- Return type
Tuple, Tuple
-
one_hot(x, vocab, n_grams) → numpy.ndarray¶ Convert corpus into a batch of one-hot vectors.
- Parameters
x – list of texts
vocab – corpus vocabulary
n_grams – n-grams parameters
- Returns
one-hot vectors
- Return type
numpy.ndarray
-
predict(x, vocab, n_grams, *args, **kwargs)¶ Encode text for prediction purposes
- Parameters
x – list of text
vocab – corpus vocabulary
n_grams – n-grams parameters
args – arbitrary arguments
kwargs – arbitrary keyword arguments
- Returns
encoded text
- Return type
transformers.BatchEncoding
-
tf_idf(x, vocab, n_grams) → numpy.ndarray¶ Simple TF-IDF vectors
- Parameters
x – list of texts
vocab – corpus vocabulary
n_grams – n-grams parameters
- Returns
encoded vectors
- Return type
numpy.ndarray
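The logarithmically scaled tf-idf mentioned above can be sketched as follows; the exact scaling sentivi applies may differ, and this follows the common log-scaled definition from the Wikipedia article referenced earlier:

```python
import math

# Log-scaled tf-idf sketch: tf = log(1 + raw count),
# idf = log(N / number of documents containing the term).
def tf_idf(term, doc_tokens, corpus):
    tf = math.log(1 + doc_tokens.count(term))
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [['I', 'am', 'a', 'student'], ['I', 'am', 'happy']]
# 'student' appears in 1 of 2 documents, so it gets a non-zero idf;
# 'am' appears in all documents, so its idf (and tf-idf) is zero.
print(tf_idf('student', corpus[0], corpus))
print(tf_idf('am', corpus[0], corpus))
```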
-
transformer_tokenizer(tokenizer, x)¶ Transformer tokenizer and encoder
- Parameters
tokenizer – transformers.AutoTokenizer
x – list of texts
- Returns
encoded vectors
- Return type
BatchEncoding
-
word2vec(x, n_grams) → numpy.ndarray¶ word2vec embedding
- Parameters
x – list of texts
n_grams – n-grams parameters
- Returns
encoded vectors
- Return type
numpy.ndarray
-