Chinese text segmentation

Chinese text segmentation, also known as Chinese word segmentation, is a foundational technology for natural language processing and natural language understanding, especially for search engines, speech synthesis (text-to-speech systems), automatic speech recognition, and machine translation. The process is also called tokenisation (text-to-token conversion), because raw text usually contains numbers, names, abbreviations, symbols, punctuation and so on, which must be converted into meaningful words. Many segmentation strategies and approaches have been tried to optimise the tokenised result, but the final results are still very poor today. We have identified several important factors that resolve ambiguity in Chinese sentences and make it possible to separate mixed Chinese text into words with very high precision. Based on them, we established a sample computable model of natural language processing, which has been integrated as the infrastructure of our embedded machine translation system, currently under development. The sample sentential semantic model and engine can also be applied easily to other natural languages.

The structure of Chinese sentences is very flexible. The accuracy of Chinese word segmentation depends heavily on context, and a single sentence can often be segmented in several different ways, each with a different meaning. Any good word segmentation algorithm must therefore be able to handle this complexity. For example:

我|方可|答應|你的|要求。 (Only then can I agree to your request.)
我方|可|答應|你的|要求。 (Our side can agree to your request.)
...
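
The ambiguity above can be reproduced with a minimal sketch that lists every way to split a sentence into words from a lexicon. This is only an illustration of the problem, not the method used by the engine: the function name `segmentations` and the seven-word toy lexicon are assumptions made for this example, and the sentence is given without its final full stop.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon, assumed for illustration only.
LEXICON = {"我", "我方", "方可", "可", "答應", "你的", "要求"}

for words in segmentations("我方可答應你的要求", LEXICON):
    print("|".join(words))
```

With this lexicon the sketch finds exactly the two readings shown above. A lexicon alone cannot choose between them, which is why the surrounding context matters.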

We developed a prototype engine designed to have an extremely small footprint. It requires very little memory and storage, so it can be used easily in deeply embedded systems. The precompiled binary is about 190 KB, and about 370 KB with its data included. The engine can equally be used in server environments, including concurrent tasks, machine clusters, and cloud services.

The engine includes a small prototype morphological and semantic analyser, which lets it separate unknown words and unknown names of things, places, people and organisations more accurately. This makes the engine independent of any external dictionary or corpus while keeping the kernel very small. It also provides an interface that lets the end user add external words manually.

The engine is portable and cross-platform. It is written from scratch in ANSI C with no external dependencies, so it can be ported easily to any platform, bound to different programming languages, and run anywhere. The machine translation system based on it is under continuous development. It is designed for deeply embedded systems and is completely different from Google's translation service, which is based on a server-side architecture. The project needs your help, donations or investment to turn it into a final product.

Demonstration

$./bamboo -s "我們可以建議貴方在報價上定下限額嗎?"
我們|可以|建議|貴方|在|報價|上|定|下限額|嗎|?

$./bamboo

 /_)_  _ _  /_ _  _ 
/_)/_|/ / //_//_//_/

Bamboo v0.2
Copyright (c) 2015 sevenuc.com.
Usage: bamboo [options] sentence

Options:
  -k convert mandarin characters to kanji.
  -s separate sentence into words.
  sentence: should be quoted.
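
Since the -s option prints the words separated by "|", the output is easy to consume from a script. The following is a minimal sketch, assuming the binary sits in the current directory, writes a single UTF-8 line to standard output, and that the sentence itself contains no "|" character. The helper names `parse_segments` and `segment` are hypothetical and are not part of bamboo.

```python
import subprocess


def parse_segments(line):
    """Split one line of pipe-delimited output into a list of words."""
    return [word for word in line.strip().split("|") if word]


def segment(sentence, binary="./bamboo"):
    """Run `bamboo -s` on a sentence and return the words it prints.

    Assumes the binary path and UTF-8 output; adjust both as needed.
    """
    done = subprocess.run(
        [binary, "-s", sentence],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_segments(done.stdout)
```

Passing the sentence as its own list element avoids shell quoting problems, so the caller does not need to quote it.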
