Text Processor
Sentivi provides a simple text processor layer based on regular expressions. A TextProcessor must be defined as an attribute of the DataLoader layer; it is a required parameter.
A TextProcessor can be initialized with a list of pre-built methods as follows:
text_processor = TextProcessor(methods=['remove_punctuation', 'word_segmentation'])
# or add methods sequentially
text_processor = TextProcessor()
text_processor.remove_punctuation()
text_processor.word_segmentation()
text_processor('Trường đại học, Tôn Đức Thắng, Hồ; Chí Minh.')
Result:
Trường đại_học Tôn_Đức_Thắng Hồ_Chí_Minh
You can also add more regex patterns:
text_processor.add_pattern(r'[0-9]', '')
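Because add_pattern is documented as equivalent to re.sub(), the effect of the pattern above can be previewed with the standard library alone. The following is a minimal sketch that does not call Sentivi itself; the sample string is made up for illustration.

```python
import re

# Hypothetical sample text, not taken from the Sentivi documentation.
sample = 'Phòng 101, tầng 3'

# Same pattern and replacement as the add_pattern call above:
# every ASCII digit is replaced with the empty string.
without_digits = re.sub(r'[0-9]', '', sample)
print(without_digits)
```

Note that removing digits leaves the surrounding spaces and punctuation in place, so a stripping or punctuation-removal step may still be needed afterwards.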
Or you can add your own method. A user-defined method should be a lambda function.
text_processor.add_method(lambda x: x.strip())
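To illustrate how sequentially added patterns and methods can combine, here is a minimal sketch of a processor that stores callables and applies them in registration order. MiniProcessor is a hypothetical stand-in written for this example; it is an assumption about the general technique, not Sentivi's actual implementation.

```python
import re
from typing import Callable, List


class MiniProcessor:
    """Illustrative stand-in, NOT Sentivi's TextProcessor."""

    def __init__(self) -> None:
        self._steps: List[Callable[[str], str]] = []

    def add_pattern(self, pattern: str, replace_text: str) -> None:
        # Wrap re.sub in a callable so it runs like any other step.
        self._steps.append(lambda x: re.sub(pattern, replace_text, x))

    def add_method(self, method: Callable[[str], str]) -> None:
        # Any str -> str callable can be registered as a step.
        self._steps.append(method)

    def __call__(self, text: str) -> str:
        # Apply each step in the order it was registered.
        for step in self._steps:
            text = step(text)
        return text


processor = MiniProcessor()
processor.add_pattern(r'[0-9]', '')
processor.add_method(lambda x: x.strip())
print(processor('  abc123  '))
```

In a design like this the order of registration matters: a step that strips whitespace only sees the text produced by the steps added before it.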
Split n-grams example:
TextProcessor.n_gram_split('bài tập phân tích cảm xúc', 3)
['bài tập phân', 'tập phân tích', 'phân tích cảm', 'tích cảm xúc']
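The output above is consistent with a sliding window over whitespace-separated words. The following sketch reproduces it with plain Python; n_gram_split_sketch is a hypothetical helper, and Sentivi's own n_gram_split may behave differently on edge cases.

```python
from typing import List


def n_gram_split_sketch(text: str, n: int) -> List[str]:
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    # The window start i ranges over all positions with n words remaining.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


trigrams = n_gram_split_sketch('bài tập phân tích cảm xúc', 3)
print(trigrams)
```

With this approach, a text containing fewer than n words yields an empty list.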
-
class sentivi.text_processor.TextProcessor(methods: Optional[list] = None)
A simple text processor based on regex
-
__init__(methods: Optional[list] = None)
Initialize a TextProcessor instance
- Parameters
methods – list of text preprocessing methods to apply, for example: ['remove_punctuation', 'word_segmentation']
-
add_method(method)
Add your method to TextProcessor
- Parameters
method – Lambda function
- Returns
-
add_pattern(pattern, replace_text)
Equivalent to re.sub()
- Parameters
pattern – regex pattern
replace_text – replacement text
- Returns
-
capitalize()
Equivalent to str.upper()
- Returns
-
capitalize_first()
Capitalize the first letter of the given text
- Returns
-
lower()
Convert the text to lowercase
- Returns
-
static n_gram_split(_x, _n_grams)
Split text into n-grams
- Parameters
_x – Input text
_n_grams – number of words per n-gram
- Returns
List of n-grams
- Return type
List
-
remove_punctuation()
Remove punctuation from the given text
- Returns
-
word_segmentation()
Uses PyVi to tokenize Vietnamese text. Note that this feature is intended only for Vietnamese text analysis.
- Returns
-