Coding rules of Microsoft pinyin input method

The Microsoft Pinyin input method belongs to the third generation of Chinese character coding input methods in China. Foundational work on Chinese input began in August 1974 with a project also known as "Project 748". One of its main achievements was the Chinese character frequency table, which for the first time provided essential basic data for Chinese character information processing. After statistical analysis of this frequency table and other word lists, the National Bureau of Standards issued the first national standard in the field of Chinese character information processing, Basic Set of Chinese Character Coding for Information Interchange (GB 2312-80). This was an epoch-making and far-reaching standard in the history of Chinese character information processing in China.

In 1981, Wuhan University and others published statistics on the frequency of character roots (components) in the Xinhua Dictionary. In 1984, the National Character Reform Commission and Wuhan University published the results of a dynamic statistical analysis of the strokes, components and structures of the Chinese characters covered by Cihai.

In 1985, the National Character Reform Commission and Shanxi University published the results of a statistical analysis of personal surnames.

In 1986, the Beijing Institute of Aeronautics and Astronautics and Xinhua News Agency published new computer-generated statistics on the frequency of use and circulation of Chinese characters, based on a large corpus. Beijing Normal University, Shanghai Jiaotong University, the Beijing Language Institute and others each published statistics on the usage frequency of modern Chinese vocabulary based on large-scale corpora. The Institute of Computational Linguistics at Peking University also built a comprehensive, informative and convenient Chinese corpus and word attribute database centered on Chinese grammar, which played an important role in advancing keyboard input technology for Chinese character coding.

The relevant national standards include GB 13000.1 (UCS) for information technology, GB 18030 "Extension of Basic Set of Chinese Character Coded Character Set for Information Technology Information Interchange" and GB 18031. GB 15834 "Usage of Punctuation Symbols" and GB/T 18220-2000 "General Requirements for Chinese Character Input on Information Technology General Keyboard" will be published soon. The specifications issued by the Language Committee include GF 3001 "Information Processing GB 13000.1 Chinese Character Components Specification", GF 3002 "GB 13000.1 Character Set Chinese Character Stroke Order Specification" and GF 3003 "General Keyboard for Information Processing".

The following traces the development from the first-generation input methods to Microsoft Pinyin:

The first generation of Chinese character coding input method

In 1983, the Sixth Institute of the Ministry of Electronics Industry officially released CC-DOS, the first Chinese disk operating system, a milestone in the history of Chinese information processing. CC-DOS is an extension and modification of PC-DOS. The widely used version 2.1 offered four input methods: simplified spelling, prefix and suffix code, quick code and position code. Together they covered the main classes of input method (phonetic code, shape code, phonetic-shape code and digital code) and played a pioneering role in popularizing computer applications in China.

At that time, the most widely used input methods were simplified spelling and prefix and suffix code. Simplified spelling is a pure phonetic code: a pinyin scheme between full spelling and double spelling that compresses finals of three or more letters. Prefix and suffix code is a pure shape code with 97 components divided into 52 categories. The mapping from components to keyboard letters follows few rules, so the memorization burden is heavy. When encoding, the prefix and the suffix each take only one component, and for variant components that are not listed, users have to guess the key. Neither method supports association or phrases, and both produce many duplicate codes. As a result, selecting and page turning are very frequent during input, and the eyes must constantly scan the prompt line to find the required character among numerous duplicates. This is tiring and slow, and touch typing is out of the question.

Quick code is built by compressing pinyin and appending extra codes, which disperses duplicate codes to some extent. Because its encoding follows no discernible rules, it never came into real use. The position code requires memorizing the numeric codes of all 6763 Chinese characters and of the symbols, so almost no one used it except to enter punctuation marks, for which no other input method existed at that time.
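To make the position code concrete, here is a minimal Python sketch. It assumes the position code is the GB 2312 row-cell (区位) code, in which a character is identified by a two-digit row and a two-digit cell, and the two bytes of its GB 2312 encoding are obtained by adding 0xA0 to each. The function names are illustrative only.

```python
def decode_position_code(code: str) -> str:
    """Turn a four-digit row-cell code such as "1601" into its character."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("a position code is exactly four decimal digits")
    row, cell = int(code[:2]), int(code[2:])
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # GB 2312 stores the row and the cell as two bytes, each offset by 0xA0.
    return bytes([0xA0 + row, 0xA0 + cell]).decode("gb2312")


def encode_position_code(char: str) -> str:
    """Inverse mapping: the four-digit row-cell code of one GB 2312 character."""
    high, low = char.encode("gb2312")
    return f"{high - 0xA0:02d}{low - 0xA0:02d}"


print(decode_position_code("1601"))
print(encode_position_code("中"))
```

The sketch shows why the method was impractical for everyday typing: nothing in the four digits hints at the sound or shape of the character, so every code has to be memorized or looked up.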

Another early Chinese character coding input method was the telegraph code, a four-digit fixed-length scheme. Its code words range from 0000 to 9999 and can represent 10,000 characters (including Chinese characters, letters and symbols). Telegraph codes have no duplicate codes, but the coding has little regularity and is hard to memorize. It was ported to the computer unchanged to serve people in the post and telecommunications departments who already knew it, and it was of no use to ordinary users.

In 1986, Lenovo Group launched the Lenovo Hanka (Chinese character card) and the Lenovo Chinese Character Environment at the same time, and association was used for the first time to speed up Chinese character input. No input method supported phrases at that time, so association technology was refreshing. The input process changed from code -> page -> select -> code ... to code -> select -> select ..., and many later input methods adopted the technique. By the standards of modern Chinese character coding input technology, however, association has two fatal weaknesses. First, if the character to be entered next cannot form a phrase with the characters already entered, the association fails. Second, selecting from association candidates requires too much human-computer interaction: although the average code length is shortened, the actual input speed decreases.

In short, the first generation of Chinese character coding input methods had the following characteristics. They ran in the DOS environment and worked one character at a time. A dedicated prompt line at the bottom of the screen displayed a large number of duplicate-code characters, which led to frequent page turning and selecting. Number keys selected among the duplicates, and ALT+number keys were used to select again from the duplicates shown in the prompt line. Even very common punctuation marks had to be entered with position codes, which was very inconvenient. Association improved input efficiency, but only to a limited extent. Switching between input methods (including switching to English) used the compound function keys ALT+Fn (F1-F12). Full-width and half-width modes were supported, but Chinese punctuation was not. Phrase input and custom phrases were not supported.

The second generation Chinese character coding input method

In 1986, Stone Company, in cooperation with Mitsui & Co., Ltd., launched the Stone MS-2400 Chinese electronic typewriter, which marked the arrival of the professional electronic typing era in China. As Stone typewriters came into wide use, the Wubi (five-stroke) input method bundled with them was the first to spread, and the dual-tone input method invented by Liu Weimin, also bundled with Stone typewriters, was widely used as well.

Wubi (Wubi Zixing, the five-stroke input method) is the most typical pure shape code of the component class. In Wubi, components are usually called roots, and the scheme uses 130 of them. The basic roots are divided into five categories according to strokes, corresponding to five areas on the general keyboard. Each category is divided into five groups, and each group corresponds to one keyboard letter. Within a character, the relationship between roots is one of four types: single, scattered, connected and crossed. When splitting a character, the principle is "take the largest root first, respect intuition, prefer connected over crossed, and prefer scattered over connected". Wubi divides characters into three types (key-name characters, characters that are themselves roots, and all other characters), each with its own coding rules. In addition, character codes have first-, second- and third-level simplified codes, formed from the first one, two and three letters of the full code respectively. Wubi divides phrases into three types: two-character, three-character and multi-character. A two-character phrase is coded by taking the first two roots of each character in order. A three-character phrase takes the first root of each of the first two characters and the first two roots of the last character. A multi-character phrase takes the first root of the first, second, third and last characters in order.
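The three phrase rules can be summarized in a short Python sketch. It assumes each character's root sequence is already available as a string of key letters; the letters used below are placeholders, not real Wubi codes, and the function name is illustrative.

```python
def wubi_phrase_code(char_codes: list[str]) -> str:
    """Build a four-key phrase code from per-character root-key strings."""
    n = len(char_codes)
    if n < 2:
        raise ValueError("a phrase has at least two characters")
    if n == 2:
        # First two roots of each character.
        return char_codes[0][:2] + char_codes[1][:2]
    if n == 3:
        # First root of the first two characters, first two roots of the last.
        return char_codes[0][:1] + char_codes[1][:1] + char_codes[2][:2]
    # Four or more: first root of the first, second, third and last characters.
    return "".join(code[:1] for code in (*char_codes[:3], char_codes[-1]))


# Placeholder root strings, one per character.
print(wubi_phrase_code(["abcd", "efgh"]))
print(wubi_phrase_code(["abcd", "efgh", "ijkl"]))
print(wubi_phrase_code(["abcd", "efgh", "ijkl", "mnop", "qrst"]))
```

Whatever the phrase length, the result has the same four-key length as a full single-character code, which is what lets phrases share the code space with single characters.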

Wubi accepts very complicated coding rules in exchange for a low duplicate-code rate within the GB 2312-80 character set. With mandatory simplified codes, the duplicate-code rate can be reduced further. Phrase codes are placed in the coding space left unused by character codes, so that characters and phrases are coded together. As long as the number of phrases is small, duplicates are unlikely. Standard Wubi has no word-creation function. These characteristics matched the needs of professional typing and were an important reason for Wubi's popularity in that era.

Although Wubi achieved great success in the market, its problems cannot be ignored. First, Wubi is hard to learn and easy to forget. Besides the complicated coding rules, there are many exceptions to remember. It is very common for Wubi typists to get stuck on a commonly used character while typing and to have to switch temporarily to a pinyin input method. Second, Wubi scales poorly. When the character set grows from GB 2312-80 to GBK and GB 18030, or when the number of phrases increases, a large number of duplicates appear among the four-key codes, and Wubi loses its advantage of a low duplicate-code rate. Wubi automatically commits a four-key code to the screen when it has no duplicates. More duplicates at four keys force typists to take their eyes off the manuscript to confirm their input, which reduces input speed. Finally, the fatal weakness of Wubi is its poor conformance to standards. Zhang Xiaocun and others criticized this strongly: "Wubi violates the norms of the language and script. Its division of Chinese characters is highly arbitrary, which has a negative impact on the basic cultural quality of the people. Its effect on standardized Chinese character education grows in proportion to its scope of application [20]."

The dual-tone input method is an ingenious pure phonetic code [23][25]. Its most distinctive feature is "defining a character by a word, with reverse association", which eases the problem of too many duplicate single characters in pure phonetic codes. Because two-character words are numerous, one can always find a two-character word whose first character is the one wanted. If that word is the first candidate in the prompt line, the selection key can be omitted; otherwise a number key is needed. If the whole two-character word is wanted, pressing the space bar enters the second character as well. With double spelling plus this technique, the average number of keystrokes per common character can be as low as 2.5, which largely avoids the scanning, page turning and selecting of the traditional pinyin-plus-association mode. For three- and four-character words, the initial of each character is entered as the code, followed by a space if necessary. For unfamiliar characters, entering "\" calls up "handwriting simulation". Custom phrases are supported, but online word creation is not: to create words, an external text editor must be used to enter the codes and the corresponding phrases in a defined format.

The dual-tone input method was a great advance in the history of pinyin-based Chinese character input and was welcomed by many non-professional typists at the time. However, it also has serious shortcomings, and almost no one uses it anymore. First, although its input efficiency is much higher than that of traditional pinyin, it still lags considerably behind later sentence-level pinyin input methods such as Intelligent ABC. In addition, when defining a character by a word, many characters can be defined by several words while others are hard to find a word for, and users often feel at a loss. The dual-tone input method does provide many other ways to solve such problems, for example six auxiliary rules for entering surnames such as Deng, Guo and Yao, but it is not easy to remember these methods or to judge when to use them. Because words can only be created offline, customizing phrases is inconvenient.

The third generation Chinese character coding input method

By the end of the 1990s, microcomputer prices had fallen further, storage and processing capabilities had grown, the Windows graphical operating system had become widespread and the Internet was on the rise. User interfaces became very friendly, and microcomputers entered ordinary households and primary and secondary schools in China on a large scale, bringing about the mass adoption of microcomputers in China.

The popularity of the microcomputer makes typing a basic skill for everyone, just like writing Chinese characters; needing someone else to type for you is effectively a sign of illiteracy, just like needing someone else to write for you. This has created a large group of non-professional typists, while typing as a profession is rapidly disappearing. Ordinary users "compose as they type", which is completely different from the "touch typing" mode of professional typists. Touch typing requires the operator to look at the screen as little as possible, and the feedback provided by the input method is needed only occasionally, when the operator cannot touch type. An operator who composes while typing always looks at the screen, so the way the input method presents feedback and the amount of feedback have a great impact on input. The Windows graphical operating system makes a richer man-machine interface possible and can meet the need for diverse feedback.

The powerful storage and processing capabilities of modern microcomputers provide the material basis for new storage-intensive and processing-intensive input methods. Input method programs are no longer confined to the 64 KB of resident memory available in the DOS era. Gigahertz-class processing speed makes complex intelligent algorithms practical. Hard disk capacity has grown from megabytes to gigabytes, and disk access is much faster than in the DOS era. Storing a huge lexicon on the hard disk and searching it quickly is no longer a problem.

With computer education widespread in primary and secondary schools, students began to learn typing at an early age. This raised the question of how Chinese character coding relates to language education. The minimum requirement is that Chinese character coding must not conflict with language knowledge. Ideally, the coded input of Chinese characters and the learning of language knowledge should reinforce each other.

Against this background, the third generation of Chinese character coding input methods came into being. Its guiding principles are: conform to standards, be easy to learn, be easy to use, and preserve input speed as far as possible. During this period, research on intelligent pinyin input methods reached one peak after another, and there also appeared pure shape codes using strokes or stroke pairs as input units, as well as phonetic-shape codes based on initials and strokes (or stroke pairs) [29]-[48].

(1) Intelligent Pinyin Input Method

According to its realization principle, intelligent pinyin input method can be divided into four types: based on understanding, based on pragmatic statistics, based on template matching and based on context.

Intelligent ABC is currently the most widely used pinyin input method on the Windows operating system. It is called quasi-sentence-level because it converts words and phrases rather than whole sentences. Its most distinctive feature is that customizing phrases and adjusting the order of duplicate-code candidates are very convenient. Users simply type as they think, with no need for manual word segmentation: the system segments the input automatically, word by word, from front to back. When no phrase matches, the system falls back to single-character mode and displays the duplicate-code characters for the user to choose from; once the user's selections form a new word, the system remembers it. When the segmentation is wrong or the word offered is not the one wanted, the user can correct it, and the system remembers the correction as well. After a long period of use by the same user, the system gradually adapts to that user's habits and input becomes comfortable.
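Front-to-back automatic segmentation can be illustrated with forward maximum matching, a common textbook technique. The source does not say which algorithm Intelligent ABC actually uses, so the Python sketch below is only an assumption-laden illustration; the lexicon and the function name are hypothetical.

```python
def segment_forward(syllables, lexicon, max_len=4):
    """Split a pinyin syllable list into words, scanning front to back.

    At each position the longest lexicon entry is taken; if nothing
    matches, the single syllable is emitted on its own (the fallback
    to single-character mode).
    """
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            chunk = tuple(syllables[i:i + n])
            if n == 1 or chunk in lexicon:
                words.append(chunk)
                i += n
                break
    return words


# Hypothetical lexicon of known words, keyed by their syllables.
lexicon = {("zhong", "guo"), ("ren", "min")}
print(segment_forward(["zhong", "guo", "ren", "min", "hao"], lexicon))
```

Remembering a user's new word, as described above, would amount to adding another tuple to the lexicon so that later input is segmented with it.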

Intelligent ABC also leaves much room for improvement. First, full spelling, short spelling and mixed spelling can be used freely, and the user can trigger phonetic-to-character conversion at any time. There are too many options for users to know which is best. This looks flexible, but it actually hands the task of optimizing input to the user. Most users are not experts in this field and cannot do this well, which leads to many detours or to poor, inefficient input habits. Second, the accuracy of phonetic-to-character conversion is not high and the converted text often has to be corrected, so the input speed is unsatisfactory; even with double spelling it is less efficient than Natural Code (Ziranma).

Microsoft Pinyin is a truly intelligent input method with sentence-level phonetic-to-character conversion, and it is the product of many years of Microsoft research in natural language processing. Because it uses pinyin as the input code, users can pick it up without special study or training. Microsoft Pinyin converts whole sentences at a time: users can type entire sentences continuously without manual word segmentation or candidate selection, which keeps the user's train of thought unbroken and greatly improves input efficiency. Its man-machine interface is distinctive. The composition window can be embedded at the insertion cursor of the text being edited, which reduces how often the user's gaze has to move and greatly improves ease of use. It converts key by key and shows the result immediately, so users do not have to decide when to convert. There is no limit on the length of the code a user can enter: when the input exceeds the system's internal limit or a full stop is entered, the system commits the conversion automatically, so that users can keep typing without interruption. Because it draws on a wide context, Microsoft Pinyin can achieve high conversion accuracy. By default, Microsoft Pinyin does not accept short spelling or mixed spelling, which steers users toward good input habits.
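Why wide context helps can be shown with a toy whole-sentence converter. The source does not describe Microsoft Pinyin's internal algorithm; the Python sketch below uses a generic dynamic-programming (Viterbi-style) search over a bigram model, and every score and candidate list in it is made up for illustration.

```python
def convert_sentence(syllables, candidates, unigram, bigram):
    """Pick the best character sequence for a whole pinyin sentence.

    best[c] holds (score, path) for the best path ending in candidate c.
    Scores are log-probability-like numbers; higher is better.
    """
    best = {c: (unigram.get(c, -10.0), [c]) for c in candidates[syllables[0]]}
    for syl in syllables[1:]:
        nxt = {}
        for c in candidates[syl]:
            score, path = max(
                (s + unigram.get(c, -10.0) + bigram.get((p, c), -5.0), pth)
                for p, (s, pth) in best.items()
            )
            nxt[c] = (score, path + [c])
        best = nxt
    return "".join(max(best.values())[1])


# Hypothetical toy model: two syllables, two candidates each.
candidates = {"shi": ["是", "事"], "qing": ["情", "请"]}
unigram = {"是": -1.0, "事": -2.0, "情": -2.0, "请": -1.5}
bigram = {("事", "情"): -0.5}

print(convert_sentence(["shi", "qing"], candidates, unigram, bigram))
```

Taken one character at a time, the most frequent candidates in this toy model are 是 and 请; scored as a pair, the common word 事情 wins. That is the kind of gain a sentence-level method obtains from context.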

Microsoft Pinyin has problems of its own. First, editing within a sentence is cumbersome and inefficient when the code was mistyped or the conversion is wrong. Second, during key-by-key conversion, content that had already been converted correctly is often changed to something wrong, so the user has to keep checking the converted text, which is tiring when there is a lot of it. In addition, Microsoft Pinyin provides neither a way to speed up the input of words nor a way to enter characters whose pronunciation is unknown, so it is an incomplete input method.

(2) Input Method Based on Strokes (or Stroke Pairs) and/or Initials

Input method research can benefit greatly from the two simplest features of Chinese characters: strokes and initials [12]. However, strokes are generally divided into only five types, and so few types inevitably lengthen the codes and slow down input. How to shorten the code length and improve input efficiency is therefore the key to the success of this kind of input method.

The double-stroke code developed by Fujian Double Stroke Code Software Development Co., Ltd. is a pure shape code based on strokes. To overcome the problem of too few stroke types, it introduces a new stroke type, "cross", which raises the number of stroke types to six. When coding, strokes are taken two at a time in order to form stroke pairs. There are 36 different stroke pairs in total, and each is entered with a key in the corresponding key area of the keyboard. The double-stroke code also stipulates that the sickness radical (疒), the mouth radical (口), the hand radical (扌) and the sun radical (日) are each coded as a whole. According to how their parts are combined, Chinese characters are divided into three basic structures: left-right, top-bottom and composite. Whatever its structure, every character is coded with four codes. Phrases are entered as follows: for a two-character phrase, enter the first two codes of each character; for a three-character phrase, enter the first and last codes of the first two characters; for a phrase of four or more characters, enter the first code of the first, second, third and last characters.
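The figure of 36 stroke pairs follows directly from six stroke types taken two at a time in order (6 x 6). A minimal Python sketch, assuming the five conventional types are horizontal, vertical, left-falling, dot and fold; the stroke names and the function name are illustrative, and the real key assignment is not reproduced here.

```python
from itertools import product

# Five conventional stroke types plus the added "cross" type.
STROKE_TYPES = ("horizontal", "vertical", "left-falling", "dot", "fold", "cross")

# Every ordered pair of stroke types is a distinct stroke pair.
STROKE_PAIRS = list(product(STROKE_TYPES, repeat=2))


def to_stroke_pairs(strokes):
    """Group a stroke sequence two at a time, in writing order.

    A trailing odd stroke is returned as a one-element tuple.
    """
    return [tuple(strokes[i:i + 2]) for i in range(0, len(strokes), 2)]


print(len(STROKE_PAIRS))
print(to_stroke_pairs(["horizontal", "vertical", "left-falling", "dot"]))
```

With 36 pairs, one code key carries the information of two strokes, which is how the scheme shortens codes without asking the user to memorize components.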

The advantages of the double-stroke code are as follows. Compared with traditional component-based input methods, there is much less to memorize. With stroke pairs and 36 coding keys, the average code length is also quite short. If single strokes are keyed instead of stroke pairs, the double-stroke code can easily be ported to a numeric keypad. Its disadvantages are equally obvious. For a stroke-based input method, its code-taking and coding rules are complex and have many exceptions, so it is still hard to learn. The number keys in the top row are used for coding, which is awkward to type and conflicts with ordinary digit entry, reducing the actual input speed.

The two-stroke input method invented by Mr. Chen Jinsong is one of the widely used input methods at present. It is based on initials and strokes, or on strokes alone. Its 30 code elements are distributed over six areas of the general keyboard: five double-stroke areas and one single-stroke area. The area is determined by the second stroke of a stroke pair, or by the single stroke, in the order horizontal, vertical, left-falling, dot and fold. However, the 10 keys assigned to radicals have to be memorized. The two-stroke input method divides Chinese characters into single-component characters and compound characters according to their structure. When a character is entered, the first code is the first letter of its pinyin and the following codes are strokes, up to a maximum of four codes. If a character yields fewer than four codes, all of them are taken; where a stroke pair cannot be formed, the single stroke is taken. Single-component characters are not split: the first code is the pinyin initial, and the following codes are stroke codes in stroke order, up to four codes in total. Compound characters are split into two halves, the first half and the second half, following the rules of stroke order. The first code is the pinyin initial, the second code is the first and second strokes of the first half, the third code is the first and second strokes of the second half, and the fourth code is the third and fourth strokes of the second half. The coding rules for phrases are: two-character phrases take the first two codes of each character, three-character phrases take the first two codes of the first character and the first two codes of the last two characters, and phrases of four or more characters take the first three codes and the last two codes.
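The rule for single-component characters (pinyin initial first, then stroke pairs, at most four codes, a single stroke where no pair can be formed) can be sketched in Python as follows. The codes are kept symbolic: the actual key assigned to each stroke pair is not given in the text and is not guessed here. The function name and stroke names are illustrative.

```python
def two_stroke_single_code(pinyin, strokes, max_codes=4):
    """Symbolic code of a single-component character.

    Code 1 is the first pinyin letter; each further code is a stroke
    pair in stroke order, or a lone stroke when no pair can be formed.
    """
    codes = [pinyin[0]]
    for i in range(0, len(strokes), 2):
        if len(codes) == max_codes:
            break
        codes.append(tuple(strokes[i:i + 2]))
    return codes


# A three-stroke character pronounced "da": one pair plus one lone stroke.
print(two_stroke_single_code("da", ["horizontal", "left-falling", "dot"]))
```

A character with many strokes is cut off after the third stroke pair, since the initial already occupies the first of the four codes.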

The advantages of the two-stroke input method are as follows. Its coding rules are simpler than those of the double-stroke code, and it uses only 30 code elements. Encoding both the initial and the strokes of a character makes characters with the same code easier to tell apart and improves input efficiency. Characters whose pronunciation is unknown can still be entered purely by shape. If single strokes are keyed instead of stroke pairs, it can also be ported to a numeric keypad very easily. However, the two-stroke input method still has problems: because it uses stroke pairs and radicals and requires single-component and compound characters to be coded differently, it remains hard to learn and use.

(3) Numeric Keyboard Coding Input Method

So far, the number of mobile phones in China has exceeded 300 million, and the output value of mobile phone short messages has exceeded 5 billion yuan. The number of mobile phone users has exceeded that of PC users, and the number of people who use mobile phones to input Chinese characters far exceeds that of people who use general keyboards to input Chinese characters.

At present, the mobile phone input method market in mainland China, Hong Kong and Taiwan is monopolized by T9 Pinyin and T9 Stroke from Tegic of the United States, the eZiText (Word Energy) stroke input method from Zi Corporation of Canada, and iTap from Motorola. These foreign numeric-keypad input methods, however, are not satisfactory. Take stroke input as an example: iTap uses 9 strokes, eZiText uses 8, and T9 uses 5. The same stroke may sit on different keys on different phones, and the input speed is not ideal.

To end the awkward situation in which foreign input methods monopolize the mobile phone market in China and mobile phone input methods are not standardized, the first China Mobile Phone Chinese Input Competition and Chinese Character Digital Input Technology Application Summit Forum was held in the Great Hall of the People for three days from November 21, 2004. It was sponsored by the Chinese Information Processing Society of China and organized by Golden Code Press (HK) Co., Ltd. Of the 32 teams, 23 took part in the Chinese character numeric code input competition on simulated mobile phones, and 9 took part in the Chinese character input competition on mobile phones. Besides the numeric coding schemes entered in the competition, the Five-Stroke Digital code of Mr. Wang Yongmin [49] and the Left-Right Digital Stroke code of Mr. Zheng [50] have also attracted considerable attention. The following introduces only the most widely used methods, T9 Pinyin and T9 Stroke, together with the Golden Code and the Popular Numeric Code, which won championships in the first competition.

In essence, T9 Pinyin applies the early general-keyboard technique of full spelling plus association. Its most important innovation is that it checks whether a key combination on the phone keypad can form a legal Mandarin syllable, which avoids the drawback of the traditional approach of pressing a key several times to enter one pinyin letter. However, when a key combination fits more than one legal syllable and the default syllable is not the one wanted, the user still has to select manually. In addition, the long full spellings, the need to press the 1 key to enter selection mode, and the excessive human-computer interaction caused by association all make T9 Pinyin very inefficient, and it is very hard to use for people whose Mandarin is poor.
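The syllable check can be sketched in Python. The letter layout below is the standard telephone keypad (2 = abc through 9 = wxyz); the small syllable list is only a sample, and the function names are illustrative rather than taken from T9.

```python
# Standard telephone keypad letters.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def to_keys(syllable):
    """Digit sequence produced by pressing each letter's key once."""
    return "".join(LETTER_TO_KEY[ch] for ch in syllable)


def legal_syllables(keys, syllable_table):
    """All legal syllables whose key sequence equals the keys pressed."""
    return sorted(s for s in syllable_table if to_keys(s) == keys)


# A small sample of legal Mandarin syllables (not the full table).
SAMPLE = {"zhong", "xi", "yi", "zi", "mi", "ni"}

print(legal_syllables("94664", SAMPLE))
print(legal_syllables("94", SAMPLE))
```

The first lookup is unambiguous, but the second matches several legal syllables, which is exactly the case in which the user still has to choose by hand.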

T9 Stroke divides strokes into five categories (horizontal, vertical, left-falling, dot and fold), represented by the keys 1, 2, 3, 4 and 5 respectively. A character is entered stroke by stroke in stroke order; candidates are shown several per screen, with high-frequency characters first. Up to 12 strokes can be entered, and association is supported. Because five keys stand for five strokes, no intelligent judgment of key combinations is needed as in T9 Pinyin, and the internal processing logic is very simple. T9 Stroke makes full use of the rich stroke information and of the short codes available in a variable-length scheme, and duplicate-code characters can be selected directly, so its actual input efficiency is higher than that of T9 Pinyin.
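The candidate logic described above (prefix match on the stroke keys, highest frequency first) is simple enough to sketch in a few lines of Python. The stroke codes below follow the five-category classification for four very simple characters; the frequency values are hypothetical, and the function name is illustrative.

```python
# Stroke-key codes: 1 horizontal, 2 vertical, 3 left-falling, 4 dot, 5 fold.
STROKE_CODES = {"一": "1", "十": "12", "大": "134", "人": "34"}

# Hypothetical relative frequencies, used only to order candidates.
FREQUENCY = {"一": 100, "人": 90, "大": 80, "十": 50}


def stroke_candidates(keys, per_screen=6):
    """Characters whose stroke code starts with the keys typed so far,
    ordered by descending frequency and cut to one screen."""
    matches = [ch for ch, code in STROKE_CODES.items() if code.startswith(keys)]
    matches.sort(key=lambda ch: -FREQUENCY[ch])
    return matches[:per_screen]


print(stroke_candidates("1"))
print(stroke_candidates("13"))
```

Because candidates appear after every keystroke, a frequent character can often be picked before all of its strokes are entered, which is the benefit of variable-length codes noted above.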

The Popular Numeric Code [51] uses all 10 digits to encode characters. Besides using 1, 2, 3, 4 and 5 for the five kinds of strokes, it uses 6, 7, 8, 9 and 0 for five kinds of components: intersection, insertion, eight, small and mouth. A single character takes five codes in stroke order (the first, second, third, fourth and last), or all of its codes if it has fewer than five; a phrase code is 6 digits long. The Popular Numeric Code uses quite a few stroke combinations as components, but because they are clearly classified they are easier to remember than those of many similar input methods, and the fine-grained coding rules reduce the duplicate-code rate, which made it stand out in the competition. It should be noted, however, that it uses many components and its coding rules are not simple, so it is still quite hard to learn.
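The code-taking rule for a single character (first four codes plus the last, or everything when there are fewer than five) amounts to a one-line slice. A minimal Python sketch, assuming the character has already been turned into its full digit sequence; the digit strings in the example are placeholders, not real codes, and the function name is illustrative.

```python
def take_single_char_code(full_digits: str) -> str:
    """Keep the 1st, 2nd, 3rd, 4th and last digit of a full stroke/component
    digit sequence; shorter sequences are used as they are."""
    if len(full_digits) <= 5:
        return full_digits
    return full_digits[:4] + full_digits[-1]


print(take_single_char_code("1234512"))  # placeholder seven-unit sequence
print(take_single_char_code("123"))      # placeholder three-unit sequence
```

Taking the last unit as well as the first four helps separate characters that begin with the same strokes but end differently.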

The Golden Code encodes characters with nine digits. Besides using 1, 2, 3, 4 and 5 for the five strokes, it uses 6, 7, 8 and 9 for four types of components: mouth, ten, eight and swish. When coding, a character is divided into a prefix and a suffix, and single-component and compound characters are also distinguished. When the prompt line is not empty, 0, * and # serve as selection keys. The most distinctive feature of the Golden Code is that whenever a coding digit, appended to the code entered so far, cannot form the code of any other character, that digit key can be used to select among the characters with the current code. This greatly increases the number of selection keys available and shortens the dynamic average code length. Combined with variable-length codes in which high-frequency characters appear first, there is basically no need to turn pages during input, which further improves input efficiency. However, there is no standard for dividing a character into prefix and suffix, so the division often varies from person to person. Dynamically reusing spare coding keys as selection keys also makes the position of the selection keys change too much, which increases the burden of human-computer interaction.
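The dynamic reuse of coding digits as selection keys can be sketched in Python: a digit is free for selection when appending it to the current input cannot lead to any code in the codebook. The codebook below is a made-up toy, and the function name is illustrative, not part of the Golden Code itself.

```python
def free_selection_digits(current, codebook, digits="123456789"):
    """Digits that cannot extend the current input toward any known code,
    and can therefore double as selection keys at this moment."""
    return [
        d for d in digits
        if not any(code.startswith(current + d) for code in codebook)
    ]


# Toy codebook: with "12" typed, only "3" can still extend the code.
codebook = {"12", "123", "15"}
print(free_selection_digits("12", codebook))
```

The set of free digits changes with every keystroke, which is the source of both the efficiency gain and the shifting selection-key positions criticized above.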

These are some of the input methods I have become familiar with since I began working with computers. I hope they are of some help to you.
