Bamboo - Tiny Chinese word segmentation engine

text segmentation, artificial intelligence, morphological analyser, machine translation.

Chinese text segmentation, also known as Chinese word segmentation, is a foundational technology for natural language processing and natural language understanding, especially for search engines and machine translation. Many segmentation strategies and optimisation approaches exist, but all of them still produce poor final results. We have identified almost all of the important factors that directly affect how Chinese text is segmented into words for machine translation. We invented a brand-new approach that resolves the ambiguity in Chinese sentences and separates mixed Chinese text into words with very high precision and without loss of performance. We also established a computable model of natural language processing, which has been integrated as the infrastructure of our machine translation system. The sentential semantic model and engine can also be easily applied to other human languages.
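To make the ambiguity problem concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. It is not Bamboo's algorithm, which is not described on this page. The function names (fmm_segment, longest_match, utf8_len) and the word lists are hypothetical, and the text is assumed to be UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 character starting at s (s must not be empty).
 * Invalid lead bytes count as one byte; a truncated sequence stops at
 * the string terminator. */
static size_t utf8_len(const char *s)
{
    unsigned char c = (unsigned char)s[0];
    size_t n = 1, i;
    if (c >= 0xF0) n = 4;
    else if (c >= 0xE0) n = 3;
    else if (c >= 0xC0) n = 2;
    for (i = 1; i < n; i++) {
        if (s[i] == '\0') return i;
    }
    return n;
}

/* Byte length of the longest dictionary word that is a prefix of s,
 * or 0 if no word matches. */
static size_t longest_match(const char *s, const char *const *words,
                            size_t nwords)
{
    size_t best = 0, i;
    for (i = 0; i < nwords; i++) {
        size_t len = strlen(words[i]);
        if (len > best && strncmp(s, words[i], len) == 0) best = len;
    }
    return best;
}

/* Forward maximum matching: at each position take the longest dictionary
 * word, or a single character when nothing matches. Tokens are written
 * to out separated by single spaces.
 * Returns 0 on success, -1 if out is too small. */
int fmm_segment(const char *text, const char *const *words, size_t nwords,
                char *out, size_t outsize)
{
    size_t pos = 0, used = 0;
    assert(text != NULL && out != NULL);
    if (outsize == 0) return -1;
    while (text[pos] != '\0') {
        size_t len = longest_match(text + pos, words, nwords);
        size_t sep = (used > 0) ? 1 : 0;
        if (len == 0) len = utf8_len(text + pos);
        if (used + sep + len + 1 > outsize) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, text + pos, len);
        used += len;
        pos += len;
    }
    out[used] = '\0';
    return 0;
}
```

With the hypothetical word list 研究, 研究生, 生命, 起源, calling fmm_segment on 研究生命起源 yields 研究生 命 起源, because the greedy match takes 研究生 first. The intended reading is 研究 生命 起源. A purely dictionary-driven baseline cannot tell the two apart, which is the kind of ambiguity a segmentation engine has to resolve.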

A very powerful morphological and semantic analyser is built into the engine. It can accurately separate unknown words, including unknown names of things, places, people and organisations. This feature makes the engine independent of any external dictionary or corpus while keeping the engine kernel very small. An interface is also provided that allows the end user to manually append external words.

The segmentation engine is designed to be extremely small and requires very little memory and storage, so it can easily be used in embedded systems. The precompiled binary is about 190 KB, and up to about 370 KB including data. The engine can also easily be used in server environments, including concurrent execution, machine clusters and cloud services.

The engine is portable and cross-platform. It is written from scratch in ANSI C with no external dependencies, so it can be ported to any platform and programming language and run anywhere. The machine translation system based on it is under continuous development. It is designed for deeply embedded environments and is completely different from Google's translation service, which is based on a server-side architecture. The project needs your help, donations or investment to keep improving. A brand-new programming language and computer system based on the engine are also in the initial stage of development and likewise need your enthusiasm and help.

Do not doubt the ability of the engine because of its extremely small size. This page provides a demo app that lets you input Chinese sentences and get the separated words generated by the engine. Try it and leave a comment.

Demonstration

 bamboo_0.1.tar.gz (376KB)
md5sum: fbc93db02f9013391393222492c291c1
shasum (SHA-256): 87c44fa48eb23eca55645970522c468e578591f524c289fac316dc9610cbdd0c

A standalone executable binary. The program is a command-line tool that can segment Chinese text into words and translate Mandarin to Kanji or vice versa. The demo app is just a simple prototype with limited features; the release version for production environments has more functions, higher accuracy and higher performance.


References

If you are interested in artificial intelligence, you may also enjoy Chinese Chess, an ancient strategy game that dates back more than 2000 years. It embodies profound ancient oriental philosophy and wisdom, which makes it intellectually challenging and gives it a style unlike any game you have played. You can keep challenging yourself with it for a lifetime, and it is suitable for both children and parents. Chinese Chess for Beginners explains the basic rules of the game clearly and in detail so that you can start playing right away. You can download the app from the App Store and have fun.

Links

  1. Tiny Cantonese TTS engine.
  2. Translation between Cantonese and Mandarin.
  3. Text segmentation on Wikipedia.
  4. Natural language processing on Wikipedia.
  5. Natural language understanding on Wikipedia.