Text Processor

Sentivi provides a simple text processor layer based on regular expressions. A TextProcessor must be defined as an attribute of the DataLoader layer; it is a required parameter.

A TextProcessor can be initialized with a list of pre-built methods as follows:

from sentivi.text_processor import TextProcessor

text_processor = TextProcessor(methods=['remove_punctuation', 'word_segmentation'])

# or add methods sequentially
text_processor = TextProcessor()
text_processor.remove_punctuation()
text_processor.word_segmentation()

text_processor('Trường đại học,   Tôn Đức Thắng, Hồ; Chí Minh.')

Result:

Trường đại_học Tôn_Đức_Thắng Hồ_Chí_Minh
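To make the regex-based idea concrete, the punctuation-removal step alone can be approximated with two substitutions. This is a sketch under stated assumptions, not Sentivi's code: the function name remove_punctuation_sketch is hypothetical, only ASCII punctuation is handled, and word segmentation (which relies on PyVi) is not reproduced, so words are not joined with underscores here.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
_PUNCT = re.compile('[' + re.escape(string.punctuation) + ']')


def remove_punctuation_sketch(text: str) -> str:
    """Hypothetical helper: drop ASCII punctuation, then tidy whitespace."""
    # Replace each punctuation mark with a space so adjacent words stay apart.
    without_punct = _PUNCT.sub(' ', text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r'\s+', ' ', without_punct).strip()


cleaned = remove_punctuation_sketch('Trường đại học,   Tôn Đức Thắng, Hồ; Chí Minh.')
print(cleaned)  # Trường đại học Tôn Đức Thắng Hồ Chí Minh
```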

You can also add more regex patterns:

text_processor.add_pattern(r'[0-9]', '')
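Because add_pattern is described as equivalent to re.sub(), the effect of the pattern above can be checked in isolation with the standard library. The snippet below is only a sketch of the substitution itself and does not call Sentivi; the sample string is made up for illustration.

```python
import re

# Same pattern and replacement as in the add_pattern call above:
# every ASCII digit is replaced by the empty string.
sample = 'room 204, floor 2'
cleaned = re.sub(r'[0-9]', '', sample)
print(repr(cleaned))  # 'room , floor '
```

Note that the spaces around the removed digits are left in place, so a follow-up whitespace cleanup step may be wanted.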

Or you can add your own method; a user-defined method should be a lambda function.

text_processor.add_method(lambda x: x.strip())
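One way to picture how patterns and user-defined methods combine is as an ordered list of callables, each applied to the output of the previous one. The sketch below is a stand-in written for illustration, not Sentivi's implementation; the class name MiniProcessor and its internals are hypothetical.

```python
import re
from typing import Callable, List


class MiniProcessor:
    """Hypothetical stand-in: applies registered steps in insertion order."""

    def __init__(self) -> None:
        self.steps: List[Callable[[str], str]] = []

    def add_pattern(self, pattern: str, replace_text: str) -> None:
        # Wrap re.sub so a regex rule becomes an ordinary text -> text step.
        compiled = re.compile(pattern)
        self.steps.append(lambda text: compiled.sub(replace_text, text))

    def add_method(self, method: Callable[[str], str]) -> None:
        # Any callable that takes and returns a string can be a step.
        self.steps.append(method)

    def __call__(self, text: str) -> str:
        # Feed the output of each step into the next one.
        for step in self.steps:
            text = step(text)
        return text


processor = MiniProcessor()
processor.add_pattern(r'[0-9]', '')
processor.add_method(lambda x: x.strip())
result = processor('  abc123  ')
print(repr(result))  # 'abc'
```

Order matters in this design: stripping before removing digits would give the same result here, but steps that depend on each other (for example, lowercasing before a case-sensitive pattern) would not commute.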

Split n-grams example:

TextProcessor.n_gram_split('bài tập phân tích cảm xúc', 3)
['bài tập phân', 'tập phân tích', 'phân tích cảm', 'tích cảm xúc']
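The output above is consistent with a sliding window over whitespace-separated words. A minimal sketch of that idea follows; the function name n_gram_split_sketch is hypothetical, and the real method may treat edge cases (such as inputs with fewer than n words) differently.

```python
from typing import List


def n_gram_split_sketch(text: str, n: int) -> List[str]:
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    # The range is empty when the text has fewer than n words, giving [].
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


trigrams = n_gram_split_sketch('bài tập phân tích cảm xúc', 3)
print(len(trigrams))  # 4
```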
class sentivi.text_processor.TextProcessor(methods: Optional[list] = None)

A simple text processor based on regex

__init__(methods: Optional[list] = None)

Initialize TextProcessor instance

Parameters

methods – list of text preprocessing methods to be applied, for example: ['remove_punctuation', 'word_segmentation']

add_method(method)

Add your method into TextProcessor

Parameters

method – Lambda function

Returns

add_pattern(pattern, replace_text)

It is equivalent to re.sub()

Parameters
  • pattern – regex pattern

  • replace_text – replacement text

Returns

capitalize()

It is equivalent to str.upper()

Returns

capitalize_first()

Capitalize first letter of a given text

Returns

lower()

Convert text to lowercase

Returns

static n_gram_split(_x, _n_grams)

Split text into n-grams

Parameters
  • _x – Input text

  • _n_grams – number of words in each n-gram

Returns

List of n-grams

Return type

List

remove_punctuation()

Remove punctuation from the given text

Returns

word_segmentation()

Uses PyVi to tokenize Vietnamese text. Note that this feature is intended only for Vietnamese text analysis.

Returns