Computers and Language
Perspectives in Computational Linguistics
Computational linguists study natural languages, such as English and Japanese, rather than computer languages, such as Fortran, Snobol, C++, or Java. The field of computational linguistics has two aims:
The technological. To enable computers to be used as aids in analyzing and processing natural language.
The psychological. To understand, by analogy with computers, more about how people process natural language.
From the technological perspective, there are, broadly speaking, three uses for natural language in computer applications:
Natural language interfaces to software. For example, demonstration systems have been built that let a user with a microphone ask for information about commercial airline flights--a kind of automated travel agent.
Document retrieval and information extraction from written text. For example, a computer system could scan newspaper articles or some other class of texts, looking for information about events of a particular type and entering into a database who did what to whom, and when and where.
Machine translation. Computer systems today can produce rough translations of texts from one language, say, Japanese, to another language, such as English.
Computational linguists adopting the psychological perspective hypothesize that at some abstract level, the brain is a kind of biological computer, and that an adequate answer to how people understand and generate language must be in terms formal and precise enough to be modeled by a computer.
Problems in Computational Linguistics
From both perspectives, a computational linguist will try to develop a set of rules and procedures, e.g. to recognize the syntactic structure of sentences or to resolve the references of pronouns.
One of the most significant problems in processing natural language is the problem of ambiguity. In
(1) I saw the man in the park with the telescope.
it is unclear whether I, the man, or the park has the telescope. If you are told by a fire inspector,
(2) There's a pile of inflammable trash next to your car. You are going to have to get rid of it.
whether you interpret the word 'it' as referring to the pile of trash or to the car will result in dramatic differences in the action you take. Ambiguities like these are pervasive in spoken utterances and written texts. Most ambiguities escape our notice because we are very good at resolving them using our knowledge of the world and of the context. But computer systems do not have much knowledge of the world and do not do a good job of making use of the context.
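The attachment ambiguity in sentence (1) can be made concrete by counting parses with a chart parser. The following is a minimal sketch, not from the article: the toy grammar and lexicon are invented for illustration, and the parser is a standard CKY recognizer extended to count distinct derivations.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (invented for illustration).
# Binary rules are keyed by the pair of child categories.
binary = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # PP attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # PP attaches to a noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}
lexical = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "park": ["N"], "telescope": ["N"],
    "in": ["P"], "with": ["P"],
}

def count_parses(words):
    n = len(words)
    # chart[i][j][A] = number of distinct derivations of category A
    # over the span words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a in lexical[w]:
            chart[i][i + 1][a] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for b, cb in chart[i][k].items():
                    for c, cc in chart[k][j].items():
                        for a in binary.get((b, c), []):
                            chart[i][j][a] += cb * cc
    return chart[0][n]["S"]

sentence = "I saw the man in the park with the telescope".split()
print(count_parses(sentence))  # 5 distinct parses under this toy grammar
```

Even this small grammar yields five complete parses of sentence (1), one for each way of attaching the two prepositional phrases; real sentences with more modifiers multiply the ambiguity combinatorially.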
Approaches to Ambiguity
Efforts to solve the problem of ambiguity have focused on two potential solutions: knowledge-based and statistical.
In the knowledge-based approach, the system developers must encode a great deal of knowledge about the world and develop procedures to use it in determining the sense of texts. For the second example above, they would have to encode facts about the relative value of trash and cars, about the close connection between the concepts of 'trash' and 'getting rid of', about the concern of fire inspectors for things that are inflammable, and so on. The advantage of this approach is that it is more like the way people process language and thus more likely to be successful in the long run. The disadvantages are that the effort required to encode the necessary world knowledge is enormous, and that known procedures for using the knowledge are very inefficient.
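A highly simplified sketch of the knowledge-based idea, using the fire-inspector example: the fact base and the selection procedure below are invented for illustration, standing in for the enormous hand-encoded knowledge a real system would need.

```python
# Hand-encoded world facts (a hypothetical, toy fact base).
facts = {
    ("trash", "typically_discarded"): True,
    ("car", "typically_discarded"): False,
}

def resolve_it(verb_phrase, candidates):
    # Resolve a pronoun by preferring the candidate whose encoded facts
    # fit the verb phrase; "get rid of" is taken to select things that
    # are typically discarded.
    if verb_phrase == "get rid of":
        scored = [(facts.get((c, "typically_discarded"), False), c)
                  for c in candidates]
        scored.sort(reverse=True)
        return scored[0][1]
    return candidates[0]

print(resolve_it("get rid of", ["car", "trash"]))  # -> "trash"
```

The sketch resolves 'it' correctly for sentence (2), but only because exactly the right facts were encoded in advance; scaling this up to open text is the enormous effort the paragraph above describes.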
In the statistical approach, a large corpus of annotated data is required. The system developers then write procedures that compute the most likely resolutions of the ambiguities, given the words or word classes and other easily determined conditions. For example, one might collect Word-Preposition-Noun triples and learn that the triple <saw, with, telescope> is more frequent in the corpus than the triples <man, with, telescope> and <park, with, telescope>. The advantages of this approach are that, once an annotated corpus is available, training can proceed automatically, and the resulting procedures are reasonably efficient. The disadvantages are that the required annotated corpora are often very expensive to create and that the methods will yield the wrong analyses where the correct interpretation requires awareness of subtle contextual factors.
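The triple-counting idea can be sketched in a few lines. The tiny "annotated corpus" below is invented for illustration; a real system would draw on many thousands of disambiguated (head, preposition, noun) triples.

```python
from collections import Counter

# A toy stand-in for an annotated corpus of disambiguated triples.
corpus_triples = [
    ("saw", "with", "telescope"),
    ("saw", "with", "telescope"),
    ("saw", "with", "binoculars"),
    ("man", "with", "hat"),
    ("park", "with", "fountain"),
]
counts = Counter(corpus_triples)

def attach(prep, noun, candidate_heads):
    # Attach the prepositional phrase to the head whose triple
    # occurs most frequently in the corpus.
    return max(candidate_heads, key=lambda h: counts[(h, prep, noun)])

print(attach("with", "telescope", ["saw", "man", "park"]))  # -> "saw"
```

Because <saw, with, telescope> outnumbers the competing triples, the procedure attaches 'with the telescope' to the verb; with no contextual knowledge at all, it would fail on a discourse in which the man really does hold the telescope.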
Further Reading
Allen, James F. 1995. Natural language understanding, 2nd edn. Benjamin/Cummings. - The most comprehensive textbook on computational linguistics.
Computational Linguistics, Vol. 19, No. 1, March 1993: Special issue on using large corpora: I. - A good recent collection on statistical approaches to natural language processing.
Grosz, Barbara J., Karen Sparck Jones, and Bonnie Lynn Webber (eds). 1986. Readings in natural language processing. Santa Monica, CA: Morgan Kaufmann. - A good collection of early papers in the field.
Pereira, Fernando C. N., and Barbara J. Grosz (eds). Artificial Intelligence, Vol. 63, Nos. 1-2: Special volume on natural language processing. - A good recent collection of papers, primarily in the knowledge-based approach to natural language processing.
The principal organization for computational linguistics is the Association for Computational Linguistics.