Text Processor

Sentivi provides a simple text processor layer based on regular expressions. A TextProcessor must be defined as an attribute of the DataLoader layer; it is a required parameter.

Pre-built methods can be enabled as follows:

from sentivi.text_processor import TextProcessor

text_processor = TextProcessor(methods=['remove_punctuation', 'word_segmentation'])

# or add methods sequentially
text_processor = TextProcessor()
text_processor.remove_punctuation()
text_processor.word_segmentation()

text_processor('Trường đại học,   Tôn Đức Thắng, Hồ; Chí Minh.')

Result:

Trường đại_học Tôn_Đức_Thắng Hồ_Chí_Minh

You can also add more regex patterns:

text_processor.add_pattern(r'[0-9]', '')
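
Because add_pattern is documented as equivalent to re.sub(), the effect of a pattern can be previewed with the standard library before it is added to a processor. The snippet below is only a sketch of that idea, not Sentivi code; the helper name strip_digits and the sample string are made up for illustration.

```python
import re


def strip_digits(text: str) -> str:
    # Same pattern and replacement as in the add_pattern call above:
    # every ASCII digit is replaced by the empty string.
    return re.sub(r'[0-9]', '', text)


print(strip_digits('abc123 def45'))  # prints "abc def"
```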

You can also add your own method. A user-defined method should be a lambda function:

text_processor.add_method(lambda x: x.strip())
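
To show how a chain of user-defined methods might behave, here is a small stand-in written in plain Python. It assumes that methods run in the order they were added and that each one receives the output of the previous one; this page does not confirm that for TextProcessor. MiniPipeline is a hypothetical class, not part of Sentivi.

```python
from typing import Callable, List


class MiniPipeline:
    """Hypothetical stand-in for illustration; not Sentivi's TextProcessor."""

    def __init__(self) -> None:
        self.steps: List[Callable[[str], str]] = []

    def add_method(self, method: Callable[[str], str]) -> None:
        # Store the callable; it runs later, in insertion order.
        self.steps.append(method)

    def __call__(self, text: str) -> str:
        # Feed the output of each step into the next one.
        for step in self.steps:
            text = step(text)
        return text


pipeline = MiniPipeline()
pipeline.add_method(lambda x: x.strip())
pipeline.add_method(lambda x: x.lower())
print(pipeline('  Hello World  '))  # prints "hello world"
```

Each lambda here takes one string and returns one string, so any step can be reordered or removed without touching the others.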

Example of splitting text into n-grams (the argument is passed positionally because the documented parameter name is _n_grams):

TextProcessor.n_gram_split('bài tập phân tích cảm xúc', 3)

['bài tập phân', 'tập phân tích', 'phân tích cảm', 'tích cảm xúc']
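
The sliding-window idea behind this output can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the assumption that words are separated by whitespace; it is not the library's source, and split_n_grams is a hypothetical name.

```python
from typing import List


def split_n_grams(text: str, n: int) -> List[str]:
    if n < 1:
        raise ValueError('n must be at least 1')
    # Split on whitespace, then join every run of n consecutive words.
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(split_n_grams('bài tập phân tích cảm xúc', 3))
```

A text with fewer than n words yields an empty list in this sketch.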

class sentivi.text_processor.TextProcessor(methods: Optional[list] = None)

A simple text processor based on regular expressions

__init__(methods: Optional[list] = None)

Initialize a TextProcessor instance

Parameters: methods – list of names of the text preprocessing methods to apply, for example: ['remove_punctuation', 'word_segmentation']

add_method(method)

Add a user-defined method to the TextProcessor

Parameters: method – Lambda function
Returns

add_pattern(pattern, replace_text)

Add a regex substitution. It is equivalent to re.sub()

Parameters

pattern – regex pattern
replace_text – replacement text

Returns

capitalize()

It is equivalent to str.upper()

Returns

capitalize_first()

Capitalize the first letter of the given text

Returns
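
As an illustration of how capitalize() and capitalize_first() differ, here is a plain-Python comparison. The str.upper() line follows the description of capitalize() above. The first-letter variant is an assumption about the behaviour: it changes only the first character and leaves the rest untouched. capitalize_first_letter is a hypothetical helper, not a Sentivi function.

```python
def capitalize_first_letter(text: str) -> str:
    # Assumed behaviour: upper-case the first character only and leave
    # the remainder unchanged (unlike str.capitalize(), which also
    # lower-cases the rest of the string).
    return text[:1].upper() + text[1:]


sample = 'sentivi NLP toolkit'
print(sample.upper())                   # prints "SENTIVI NLP TOOLKIT"
print(capitalize_first_letter(sample))  # prints "Sentivi NLP toolkit"
print(sample.capitalize())              # prints "Sentivi nlp toolkit"
```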

lower()

Convert text to lowercase

Returns

static n_gram_split(_x, _n_grams)

Split text into n-grams

Parameters

_x – Input text
_n_grams – number of words per n-gram

Returns

List of n-grams

Return type

List

remove_punctuation()

Remove punctuation from the given text

Returns
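
A rough standard-library approximation of remove_punctuation() is sketched below. This page does not show the regex Sentivi uses, so both the pattern and the whitespace collapsing are assumptions, chosen to resemble the example at the top of the page. strip_punctuation is a hypothetical name.

```python
import re
import unicodedata


def strip_punctuation(text: str) -> str:
    # Compose accented letters (NFC) so each one is matched by \w.
    text = unicodedata.normalize('NFC', text)
    # Assumption: "punctuation" means anything that is neither a word
    # character nor whitespace.
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse the runs of whitespace left behind.
    return re.sub(r'\s+', ' ', text).strip()


print(strip_punctuation('Trường đại học,   Tôn Đức Thắng, Hồ; Chí Minh.'))
```

Note that \w also matches the underscore, so this sketch keeps underscores such as those produced by word segmentation.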

word_segmentation()

Use PyVi to tokenize Vietnamese text. Note that this feature is intended only for Vietnamese text analysis.

Returns