Phrase structure rules

1. Introduction

Natural language processing (NLP) has a history of half a century, dating back to the machine translation and artificial intelligence research of the 1950s. Over that time the academic community has proposed many important theories and methods and achieved rich results. The author believes there have been three landmark contributions in this field over the past two decades: (1) complex feature sets and unification grammar; (2) lexicalism in linguistic research; (3) corpus methods and statistical language models. These three achievements will continue to have a profound influence on linguistics, computational linguistics, and NLP research. To better appreciate their significance, two related facts are introduced first.

2. Two facts

2.1 Fact one - Phrase structure grammar cannot effectively describe natural language

In natural language processing, to recognize the syntactic structure of an input sentence, the words of the sentence must first be segmented one by one; a dictionary lookup then assigns each word an appropriate part of speech; next, syntactic rules identify the syntactic constituents of the sentence one by one, such as noun phrases, verb phrases, and clauses; then the syntactic function of each phrase (subject, predicate, object, etc.) and its semantic role are determined; and finally a meaning representation of the sentence, such as a logical semantic expression, is obtained. This is the entire process of syntactic analysis.

The first fact to be noted in this article is this: phrase structure grammar (PSG) cannot effectively describe natural language. PSG occupies an important position in Chomsky's linguistic theory and plays a pivotal role in the syntactic description of natural language. But it has some fundamental weaknesses, chiefly that it relies on single tags such as part-of-speech and phrase classes, and therefore cannot effectively specify and explain structural ambiguity in natural language.

Consider the "V+N" combination in Chinese. If we tag words such as "attack, commission, investigate" as verbs (V) and words such as "strength, method, piracy, Party A" as nouns (N), and agree that "strength of the crackdown" and "method of entrustment" are noun phrases (NP) while "combat piracy" and "entrust Party A" are verb phrases (VP), then two syntactic rules with different meanings result:

(1) NP → V N

(2) VP → V N

In short, when a computer observes an adjacent "V+N" part-of-speech sequence in a text, it cannot determine whether the pair constitutes an NP or a VP. We call such ambiguity "phrase-type ambiguity". For example:

· The company is recruiting [sales V personnel N]NP.

· The earth is constantly [changing V shape N]VP.
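The point can be made concrete in a small sketch (a hypothetical rule table, not from the original): given only the part-of-speech sequence, both left-hand sides of rules (1) and (2) match, which is exactly the phrase-type ambiguity described above.

```python
# Toy grammar fragment: rules (1) and (2) share the right-hand side "V N",
# so a tagger that sees only parts of speech cannot choose between them.
RULES = {
    "NP": [("V", "N")],  # rule (1), e.g. "[sales V personnel N]NP"
    "VP": [("V", "N")],  # rule (2), e.g. "[changing V shape N]VP"
}

def matching_phrase_types(pos_sequence):
    """Return every left-hand-side label whose right-hand side
    matches the given part-of-speech sequence."""
    seq = tuple(pos_sequence)
    return sorted(lhs for lhs, rhs_list in RULES.items() if seq in rhs_list)

print(matching_phrase_types(["V", "N"]))  # -> ['NP', 'VP']
```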

Next, let us look at the "N+V" combination, which also produces rule pairs with phrase-type ambiguity.

For example:

(3) NP → N V    Examples: market research; political influence.

(4) S → N V     Examples: prices rise; the situation is stable.

The tag S represents a clause.

Moreover, when the machine observes an adjacent "N+V" part-of-speech sequence, sometimes it cannot even determine whether the two words belong to the same phrase. In other words, an "N+V" sequence may form a noun phrase NP or a clause S, or the two words may not belong to the same phrase at all. This kind of ambiguity is called "phrase-boundary ambiguity".

The following are two related examples:

· China's [railway N construction V]NP is developing rapidly.

· [China's railway N]NP is being constructed very quickly.

In the former example, "railway construction" forms an NP; in the latter, the two adjacent words belong to two different phrases. This suffices to show that PSG based on single tags cannot fully describe the syntactic ambiguity phenomena of natural language. Here are some more such examples:

(5) NP → V N1 de N2

(6) VP → V N1 de N2

Here de represents the structural particle "of". For example, "[peel V the apple N]VP de knife" ("the knife for peeling the apple") is an NP, while "peel V [the apple N de skin N]NP" ("peel the apple's skin") is a VP. These rules involve both phrase-type ambiguity and phrase-boundary ambiguity: the two adjacent words "peel V apple N" may form a VP, or may fall into two adjacent phrases.

(7) NP → P N1 de N2

(8) PP → P N1 de N2

In these rules, P and PP represent a preposition and a prepositional phrase, respectively. For example, "[to P Shanghai N]PP de impression" ("the impression of Shanghai") is an NP, while "to P [Shanghai N de students N]NP" ("to the students of Shanghai") is a PP. The adjacent words "to P Shanghai N" may form one PP, or may fall into two different phrases.

(9) NP → NumP N1 de N2

Here NumP represents a quantity phrase. Although rule (9) always yields an NP, it can represent two different structural meanings:

(9a) NumP [N1 de N2]NP    e.g.: five [staff of the company]NP

(9b) [NumP N1]NP de N2    e.g.: [five companies]NP de staff

(10) NP → N1 N2 N3

Rule (10) also yields an NP, but depending on whether "N1+N2" or "N2+N3" combines first, there are two different structures and meanings:

(10a) [N1 N2]NP N3    e.g.: [Modern Chinese]NP Dictionary

(10b) N1 [N2 N3]NP    e.g.: New Edition [Chinese Dictionary]NP

The first fact discussed above can be summarized as follows:

· Because their binding force is insufficient, single-tag PSG rules cannot fully resolve ambiguities of phrase type and phrase boundary. In mathematical terms, a PSG rule is a necessary but not a sufficient condition: when the machine judges, from a part-of-speech sequence matching the right-hand side of a rule, whether it forms a phrase and which phrase it forms, some uncertainty remains.

· Reconstructing the grammatical system of natural language with complex feature sets and lexicalist methods has been the most important effort of the worldwide linguistic community over the past two decades.

2.2 Fact 2 - The coverage of phrase structure rules is limited

Investigation of large-scale corpora shows that the distribution of phrase rules in a language is consistent with Zipf's law. Zipf was a statistician and linguist. He proposed that if we take some linguistic unit (letters or words, say), record as F the frequency with which each unit occurs in a corpus, and assign each unit an integer rank R in descending order of frequency, then the product of R and F is approximately a constant: R × F ≈ const.
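The rank-frequency check can be sketched as follows (toy text, illustrative only; a real corpus is needed for the law to show clearly):

```python
from collections import Counter

def rank_frequency(text):
    """Count word frequencies F, assign ranks R in descending order of
    frequency, and return (R, word, F, R*F) tuples; Zipf's law predicts
    R*F to be roughly constant on a large corpus."""
    freqs = Counter(text.lower().split())
    ranked = sorted(freqs.items(), key=lambda item: -item[1])
    return [(rank, word, f, rank * f)
            for rank, (word, f) in enumerate(ranked, start=1)]

table = rank_frequency("the cat saw the dog and the dog saw a cat")
for rank, word, f, product in table[:3]:
    print(rank, word, f, product)
```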

For a word sequence W = w1 w2 … wn, its probability can be decomposed by the chain rule:

P(W) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 w2 … w(n-1))    (1)

In the formula, P(w1) is the probability that the first word w1 occurs, and P(w2|w1) is the conditional probability that the second word w2 occurs given w1, and so on. It is easy to see that to predict the occurrence probability of the word wn, the occurrence of all the words preceding it must be known; computationally this is far too complex. If we assume, as an approximation, that the occurrence probability of any word wi depends only on the word immediately preceding it, the computation is greatly simplified. This yields the bigram model, obtained from equation (1):

P(W) ≈ P(w1) ∏[i=2,…,n] P(wi|w(i-1))    (2)

In the formula, ∏[i=2,…,n] P(wi|w(i-1)) denotes the product of the conditional probabilities.

It is important to point out that these probability parameters can all be estimated from large-scale corpora. For example, the bigram probability

P(wi|w(i-1)) ≈ count(w(i-1) wi) / count(w(i-1))    (3)

where count(…) denotes the number of times the given word sequence occurs in the whole corpus.

If the total number of word tokens in the corpus is N, the occurrence probability of any word wi can be estimated as

P(wi) ≈ count(wi) / N    (4)
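The bigram model above, together with its count-based estimates, can be put into a short sketch (toy corpus; a `<s>` sentence-start marker is used as a common convenience so that the first word is also conditioned; no smoothing is applied, so unseen bigrams get probability zero):

```python
from collections import Counter

def train_bigram(sentences):
    """Collect the unigram and bigram counts used in the count estimates."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """Equation (3): P(w | w_prev) ~ count(w_prev w) / count(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def sentence_prob(sentence, unigrams, bigrams):
    """Equation (2): P(W) ~ product of the bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob(prev, cur, unigrams, bigrams)
    return p

uni, bi = train_bigram(["prices rise", "prices fall", "the situation is stable"])
print(sentence_prob("prices rise", uni, bi))  # P(prices|<s>) * P(rise|prices)
```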

Similarly, if we assume, as an approximation, that the occurrence of any word wi depends only on the two words immediately preceding it, we obtain the trigram model:

P(W) ≈ P(w1) P(w2|w1) ∏[i=3,…,n] P(wi|w(i-2) w(i-1))    (5)

The statistical language model method is a bit like weather forecasting: the large-scale corpus used to estimate the probability parameters is like the meteorological records accumulated in a region over many years, and using a trigram model to predict is like forecasting today's weather from the weather of the previous two days. Weather forecasts, of course, cannot be 100% correct; that too is a characteristic of probabilistic and statistical methods.

3.3.1 Speech recognition

As an alternative to keyboard input of Chinese characters, speech recognition is increasingly favored by the information industry; the so-called dictation machine is one such product. According to reports, China's mobile phone users already exceed 100 million. As mobile phones and personal digital assistants (PDAs) become widespread, and especially once these portable devices can access the Internet wirelessly, most users would rather enter short text messages by speech recognition or a handwriting pad than by a tiny keyboard.

In fact, the speech recognition task can be regarded as maximizing the following conditional probability:

W* = argmax_W P(W | speech signal)
   = argmax_W P(speech signal | W) P(W) / P(speech signal)
   = argmax_W P(speech signal | W) P(W)    (6)

The mathematical symbol argmax_W in the formula means: over the different candidate word sequences W, compute the value of the conditional probability P(W | speech signal) and select the word sequence W* for which it is largest; this is the recognition result the computer chooses. In other words, by the computation in equation (6), the computer finds the word string W* best suited to the current input speech signal.

The second line of equation (6) results from applying Bayes' rule, because the conditional probability P(speech signal | W) is relatively easy to estimate. The denominator P(speech signal) is a constant for the given speech signal and does not affect the maximization, so it can be dropped from the formula. In the third line, P(W) is the statistical language model discussed above, generally the trigram model of equation (5); P(speech signal | W) is called the acoustic model.
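The third line of equation (6) can be sketched as a scoring loop over candidate word sequences (all probabilities here are hypothetical stand-ins for a real acoustic model and language model):

```python
def decode(candidates, acoustic_prob, lm_prob):
    """Equation (6): W* = argmax_W P(speech signal | W) * P(W).
    acoustic_prob[W] stands in for the acoustic model P(signal | W);
    lm_prob[W] stands in for the language model P(W)."""
    return max(candidates, key=lambda W: acoustic_prob[W] * lm_prob[W])

# Hypothetical near-homophone candidates with made-up probabilities:
acoustic = {"recognize speech": 0.6, "wreck a nice beach": 0.7}
lm       = {"recognize speech": 0.30, "wreck a nice beach": 0.02}
best = decode(list(acoustic), acoustic, lm)
print(best)  # the language model outweighs the slightly better acoustic score
```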

Readers may by now have realized that the pinyin-to-character conversion task in Chinese pinyin input methods is accomplished in the same way, and that both tasks use the same Chinese language model (i.e., a bigram or trigram model).

The dictation machine products currently on the market and the Microsoft Pinyin Input Method (version 3.0) are all implemented with word trigram models, with almost no syntactic-semantic analysis. According to comparable evaluation results, a pinyin-to-character conversion system implemented with a trigram model achieves an error rate about 50% lower than that of other products.

3.3.2 Part-of-Speech Tagging

About 14% of the word types in a vocabulary have more than one part of speech; in a corpus, about 30% of the word tokens have more than one part of speech. Part-of-speech tagging is therefore the task of resolving part-of-speech ambiguity for every word in a text through contextual constraints. Historically there have been two notable automatic part-of-speech tagging systems: TAGGIT (1971), which used context-sensitive rules, and CLAWS (1987), which applied a bigram model over parts of speech (see Garside et al. 1989). Both systems tagged one million words of unrestricted English text. The results showed that the tagging accuracy of the CLAWS system, based on a statistical language model, was far higher than that of the rule-based TAGGIT system.

Please see the comparison in the table below:

System name        TAGGIT (1971)      CLAWS (1987)
Number of tags     86                 133
Method             3,000 CSG rules    Hidden Markov model
Tagging accuracy   77%                96%
Test corpus        Brown              LOB

Let C and W denote the part-of-speech tag sequence and the word sequence, respectively. The part-of-speech tagging problem can then be regarded as maximizing the following conditional probability:

C* = argmax_C P(C | W)
   = argmax_C P(W | C) P(C) / P(W)
   ≈ argmax_C ∏[i=1,…,n] P(wi | ci) P(ci | c(i-1))    (7)

Here P(C | W) is the conditional probability that the tag sequence C appears given the input word sequence W. The symbol argmax_C means that, by examining the different candidate tag sequences C, we find the tag sequence C* that maximizes this conditional probability; C* is then taken as the part-of-speech tagging result for W.

The second line of the formula results from applying Bayes' rule. Since the denominator P(W) is a constant for the given W and does not affect the maximization, it can be removed from the formula. The formula is then approximated in two steps. First, an independence assumption is introduced: the occurrence probability of any word wi is taken to depend only on the part-of-speech tag ci of the current word and to be independent of the surrounding (contextual) tags. The lexical probabilities can then be computed as

P(W | C) ≈ ∏[i=1,…,n] P(wi | ci)    (8)

Second, a bigram assumption is adopted: the occurrence probability of any tag ci is taken to depend only on the tag c(i-1) immediately preceding it. Then

P(C) ≈ P(c1) ∏[i=2,…,n] P(ci | c(i-1))    (9)

P(ci | c(i-1)) is the transition probability between part-of-speech tags, also called a part-of-speech bigram model.

Both probability parameters can be estimated from a corpus annotated with part-of-speech tags:

P(wi | ci) ≈ count(wi, ci) / count(ci)    (10)

P(ci | c(i-1)) ≈ count(c(i-1) ci) / count(c(i-1))    (11)

According to literature reports, with statistical language model methods the accuracy of part-of-speech tagging for both Chinese and English can reach about 96% (Bai Shuanhu 1992).
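Equations (7)-(11) together define a simple hidden-Markov tagger. A brute-force sketch follows (the probability tables are invented for illustration; a real system estimates them from a tagged corpus via (10)-(11) and uses dynamic programming rather than enumeration):

```python
from itertools import product

# Hypothetical lexical probabilities P(w|c), cf. equation (10):
LEX = {("time", "N"): 0.9, ("time", "V"): 0.1,
       ("flies", "N"): 0.4, ("flies", "V"): 0.6}
# Hypothetical tag-transition probabilities P(c|c_prev), cf. equation (11),
# with "<s>" as a sentence-start tag:
TRANS = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
         ("N", "N"): 0.4, ("N", "V"): 0.6,
         ("V", "N"): 0.8, ("V", "V"): 0.2}

def pos_tag(words, tagset):
    """Equation (7): enumerate every candidate tag sequence C and return
    the one maximizing the product of P(w_i|c_i) * P(c_i|c_{i-1})."""
    best_tags, best_score = None, -1.0
    for tags in product(tagset, repeat=len(words)):
        score, prev = 1.0, "<s>"
        for w, c in zip(words, tags):
            score *= LEX.get((w, c), 0.0) * TRANS.get((prev, c), 0.0)
            prev = c
        if score > best_score:
            best_tags, best_score = list(tags), score
    return best_tags

print(pos_tag(["time", "flies"], ["N", "V"]))  # -> ['N', 'V']
```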

3.3.3 Attachment ambiguity of prepositional phrases (PP)

In English, whether a prepositional phrase attaches to the preceding noun or to the preceding verb is a common structural ambiguity in sentence analysis. The example below illustrates how the corpus method solves this problem and what accuracy the method can reach.

Example: Pierre Vinken, 61 years old, joined the board as a nonexecutive director.

Let V, N1, and N2 denote the head words of the verb phrase, the object phrase, and the prepositional object, respectively, and let A=1 denote noun attachment and A=0 verb attachment. The PP attachment problem for the example sentence can then be represented as:

(A=0, V=joined, N1=board, P=as, N2=director)

From a corpus with syntactic annotation (also called a treebank), the following quadruple probability Pr is estimated:

Pr = P(A=1 | V=v, N1=n1, P=p, N2=n2)    (12)

The algorithm for deciding PP attachment in an input sentence is as follows:

If Pr(1 | v, n1, p, n2) ≥ 0.5,
then judge the PP to attach to n1;
otherwise judge the PP to attach to v.
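The decision rule can be sketched directly (the probability table below is hypothetical; Collins & Brooks additionally back off to smaller tuples and smooth the estimates when the full quadruple is unseen in the treebank):

```python
def pp_attachment(v, n1, p, n2, noun_attach_prob):
    """Attach the PP to the noun n1 iff P(A=1 | v, n1, p, n2) >= 0.5,
    otherwise attach it to the verb v, following the rule above.
    noun_attach_prob maps (v, n1, p, n2) to an estimate from a treebank;
    unseen tuples default to 0.5 (noun attachment) here for simplicity."""
    pr = noun_attach_prob.get((v, n1, p, n2), 0.5)
    return "noun" if pr >= 0.5 else "verb"

# Hypothetical treebank estimate for the example sentence:
estimates = {("joined", "board", "as", "director"): 0.15}
print(pp_attachment("joined", "board", "as", "director", estimates))  # -> verb
```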

The corpus used in the experiments of Collins & Brooks (1995) was the Wall Street Journal (WSJ) treebank annotated at the University of Pennsylvania, comprising a training set of 20,801 quadruples and a test set of 3,097 quadruples. They analyzed the upper and lower bounds on the accuracy of automatic PP attachment as follows:

Always choosing noun attachment (i.e., A≡1)                     59.0%
Considering only the most common attachment of preposition p    72.2%
Three experts judging from the four head words only             88.2%
Three experts judging from the whole sentence                   93.2%

Clearly, the lower bound on automatic accuracy is 72.2%, since the machine should do no worse than simply choosing the most common attachment for the preposition p in the sentence; the upper bound is 88.2%, since the machine cannot be expected to outperform three experts judging from the same four head words.

The paper reports that of the 3,097 test quadruples, the system judged 2,606 correctly, an average accuracy of 84.1%. Compared with the 88.2% upper bound, this is a rather good result.

4. Conclusion

The efforts of linguists, whether through complex feature sets and unification grammar or through lexicalist methods, have made significant contributions within the original, so-called rationalist framework. The lexicalist approach deserves particular admiration, because it not only proposes a finer-grained representation of linguistic knowledge but also embodies the new idea that linguistic knowledge can be developed and accumulated incrementally. Especially noteworthy is that corpus and statistical methods have played a major role in building many of these lexical resources, a welcome beginning for the integration of empiricist and rationalist methods. The author believes that corpus methods and statistical language models are the mainstream of current natural language processing technology, and their practical value has been demonstrated in many application systems. Research on statistical language models, especially the statistical modeling of structured objects, still has broad room for development.

References:

Aarts, Jan & Willem Meijs (eds.). 1990. Corpus Linguistics: Theory and Practice. Amsterdam: Rodopi.

Collins, M. and J. Brooks. 1995. Prepositional phrase attachment through a backed-off model. In Proceedings of the 3rd Workshop on Very Large Corpora. Cambridge, Mass.

Garside, R., G. Leech and G. Sampson (eds.). 1989. The Computational Analysis of English: A Corpus-Based Approach. London: Longman.

Hudson, R. A. 1991. English Word Grammar. Cambridge, Mass.: Basil Blackwell.

Bai Shuanhu. 1992. Research on an Automatic Chinese Part-of-Speech Tagging System. Master's thesis, Department of Computer Science and Technology, Tsinghua University.

Dong Zhendong and Dong Qiang. 1997. HowNet. Language Application, No. 3.

Yu Shiwen et al. 1998. Detailed Explanation of the Modern Chinese Grammar Information Dictionary. Beijing: Tsinghua University Press.