Read the text and discuss its basic ideas relying on analyzing the modifiers of 2.2.12 as well as the questions and vocabulary notes given below.
THE PROPER PLACE OF MEN AND MACHINES IN LANGUAGE TECHNOLOGY
Computer-aided speech recognition has been in the research mainstream for the last few decades. “The proper place of men and machines in language translation”, that is, the right distribution of labor between the human translators and the computer-assisted translation system, is one of the key problems under investigation.
There is a statement attributed to Fred Jelinek: “Every time I fire a linguist the results of speech recognition go up”, i.e. explicit linguistic knowledge is dispensable. This sentiment is related to a paradigm shift that happened in computational linguistics at the beginning of the 1990s: with more and more data available and with the advance in the methods of machine learning, more approaches switched from careful encoding of linguistic phenomena to finding statistical correlations in texts (either annotated or raw). The vast majority of publications at major conferences on computational linguistics belong to this paradigm and present a fairly radical stance. The approach implies that it is redundant to encode linguistic knowledge explicitly; a completely automatic machine learning procedure can quickly produce a fast and reliable NLP component which rivals (and in some cases exceeds) the performance of hard-coded linguistic rules requiring the efforts of many person-months (if not years). Hence, the efforts of linguists need to be spent on creating data rather than writing rules. Thus, in this approach the human efforts are invested into creating annotated corpora, representing data and designing machine learning algorithms, while the machine is able to learn the links between the data. In the end, linguistic knowledge is induced from annotated corpora rather than explicitly hand-crafted by linguists. In a similar way, development of corpora is possible without manual selection of texts from a range of sources. It can be facilitated by crawling or using the API of a search engine to collect texts and then automatically annotating the texts with respect to their domains and genres.
The automatically induced rules also do not take the form of hard constraints, separating the possible from the impossible, but rather that of graded constraints, distinguishing the more probable from the less probable. This makes the automatically acquired models more robust to noise.
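The contrast between hard and graded constraints can be illustrated with a minimal Python sketch. The toy data and the names hard_rule and graded_model are invented for illustration; they are not taken from the article.

```python
from collections import Counter

# Toy annotated data, invented for illustration: (word, tag) pairs.
annotated = [
    ("known", "PARTICIPLE"), ("known", "PARTICIPLE"),
    ("known", "PARTICIPLE"), ("known", "NOUN"),
]

def hard_rule(word):
    """Hard constraint: a hand-written rule that allows exactly one tag."""
    return {"known": "PARTICIPLE"}.get(word)

def graded_model(pairs):
    """Graded constraint: the relative frequency of each tag for each word."""
    counts = {}
    for word, tag in pairs:
        counts.setdefault(word, Counter())[tag] += 1
    return {
        word: {tag: n / sum(c.values()) for tag, n in c.items()}
        for word, c in counts.items()
    }

model = graded_model(annotated)
print(hard_rule("known"))  # the rule gives a single answer
print(model["known"])      # the model ranks both readings by probability
```

The rule excludes the noun reading altogether, whereas the induced model only marks it as less probable, so one unusual or noisy example shifts the probabilities slightly instead of breaking the analysis.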
Nevertheless, the approach in question needs to be taken with a pinch of salt. First, it was reasonably successful since in fact it implicitly utilized some information about the language. Second, data representation in terms of tag labeling is sufficiently simple and efficient, but a tag label lacks information about the internal structure of linguistic phenomena. For example, the agreement in case, number and gender may not be taken into account by the system processing noun phrases. If the set of training examples does not contain, say, a proper nominative singular masculine noun in this sequence, the tagger will fail to treat this sequence as a noun phrase. Another problem in using purely statistical methods is the reliance on patterns present in training data. Each training set has its own peculiarities, which do not necessarily match the peculiarities of the application domain. The accuracy of taggers trained on particular corpora varies dramatically on other text genres. Annotating texts in the application domain to obtain more training data is expensive, so the tools are often used in new domains without formal evaluation of their accuracy. To the best of our knowledge, this problem is partly addressed by new approaches to machine learning using domain adaptation, which uses a training corpus from the source domain (with available annotated data), a small number of annotated examples from the target domain and a large number of unlabelled examples from the target domain.
In addition to the known problem of unknowns in the domain mismatch, there is a problem of unknown knowns, namely when peculiarities inherent in the annotated set are not obvious, while machine learning is likely to emphasize them for making classification decisions. In the end, the system might achieve reasonably good accuracy on the held-out portion of the annotated set (since it is drawn from the same distribution), while that accuracy could be irrelevant outside of the annotated set. For example, in the field of automatic genre classification it has been shown that a large number of texts on a particular topic within a genre heading can considerably affect the decisions made by the classifier, e.g., by treating texts on hurricanes and taxation as belonging to FAQs (Frequently Asked Questions). At the same time, a classifier based on POS trigrams is much less successful, but it suffers less from the transfer from one annotation set to another.
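To see what POS trigrams look like as classification features, consider the following Python sketch. The tag sequence and the function name pos_trigrams are assumptions made for illustration; the article does not describe the feature extraction itself.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every sequence of three neighbouring part-of-speech tags."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# Invented tag sequence for a sentence such as
# "The big dog chased a small cat".
tags = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
features = pos_trigrams(tags)

for trigram, count in features.items():
    print(trigram, count)
```

Features of this kind contain no topic words such as “hurricane” or “taxation”, which is one plausible reason why a classifier built on them is less tied to the peculiarities of one annotated set.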
Finally, there are problems with correcting the results. An error produced by a rule-based tagger can be corrected by debugging, finding the incorrectly fired rule, modifying it and testing the performance again. A statistical model can be amended by modification of the learning parameters or by providing more data, but this is only indirectly related to the performance of the system in the case of an individual problem.
In either case, the main contribution of the survey analysis is two-fold. First, we reveal the advantages of the baseline for natural language processing using only statistical methods and minimal adjustment to the representation of source data. In spite of its minimalism, the baseline outperforms the majority of the rule-based systems. Second, the tools mentioned are available for linguistic research. This defines the entire pipeline, which starts with POS tagging of pre-tokenized texts, proceeds to lemmatization and ends with syntactic parsing.
From arXiv, Cornell University Library, https://arxiv.org/abs/1505.03132
TEXT CONCORDANCE
agreement in case, number and gender - согласование по падежу, числу и роду (грамм.)
annotate - аннотировать, добавлять краткие комментарии к тексту для объяснения тех или иных фрагментов (e.g. an annotated edition of ‘Othello’); отмечать, размечать
annotated corpus - размеченный корпус, аннотированный корпус
API (Application Program Interface) - интерфейс прикладных программ
application domain - область применения
automatically acquired models - автоматически полученные модели
corpus - собрание, совокупность, множество, коллекция, корпус; pl. corpuses or corpora
crawling - сканирование, «обход» серверов с помощью программного робота (crawler, краулера, «паука») с целью помещения доступных документов в базу данных поисковой системы
disambiguated - лишенный неоднозначности, со снятой неоднозначностью
domain mismatch - несоответствие областей
hard-coded linguistic rules - жестко кодируемые лингвистические правила
hard constraints - жесткие ограничения
held-out portion - установленная выборка, тестовая выборка
graded constraints - градиентные ограничения
lemma - инвариантная (базовая) форма слова, отвлеченная от грамматических вариаций (e.g. the verb “sing” or “to sing”, in abstraction from the varying word forms sing, sings, sang, sung, singing)
lemmatization - лемматизация - приведение слова к лемме - базовой инвариантной форме
lemmatize - приводить слово к базовой инвариантной форме - лемме
long-distance dependencies - отдаленные зависимости
machine learning - машинное обучение, обучение машины (модели, системы)
natural language processing (NLP) - обработка естественного языка
nominative singular masculine noun - существительное мужского рода единственного числа в именительном падеже
pipeline - конвейер, непрерывная последовательность, цепь
pre-tokenized text - предварительно разбитый (на предложения, слова) текст
problem of unknown knowns - проблема неизвестных известных
syntactic parsing - синтаксический разбор, ~ анализ
tag - тег, маркер, метка
tag label - маркер тега
tag labeling - разметка тегов
tagger - теггер (a piece of software that adds identifying or classifying tags to pieces of text or data)
tagging - разметка; part-of-speech (POS) tagging - частеречная разметка
training data - обучающие данные
training process - процесс обучения (модели)
training set - обучающая выборка
trigram - триграмма
DISCUSSION ISSUES
1) Fred Jelinek claimed that every time he fired a linguist the results of speech recognition went up.
What does the author mean by saying that? What is the author’s opinion of the usefulness of linguistic knowledge? Do you hold the same opinion on the issue at stake?
2) Why is computer-aided speech recognition in the mainstream of today’s research? Is this a recurrent theme in numerous scientific publications?
3) What are the two basic approaches to natural language computer processing described in the article? Which seems more efficient to you?
4) Does natural language computer processing correlate with any basics of human speech recognition? What kind of mental mechanisms prevail in human foreign language acquisition? Those based on hard-stated linguistic rules explicitly created by teachers? Those based on working through massive language corpora, both annotated and raw?
5) What are some basic procedures for designing a statistical language processing method?
Here is some info on the point.
Each procedure can be described in the following lines:
a) taking an annotated text corpus;
b) designing a simplified representation of annotations to convert the corpus into the format suitable for the learning tool to be used;
c) learning a model in several iterations to tune the learning parameters.
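Steps a) to c) can be sketched in Python as follows. This is a deliberately simple illustration, not the method used in the article: the corpus is invented, “learning” is reduced to counting the most frequent tag of each word, and min_count is a hypothetical learning parameter.

```python
from collections import Counter, defaultdict

# a) A tiny annotated corpus, invented for illustration: (word, full tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN.nom.sg"), ("barks", "VERB.pres")],
    [("the", "DET"), ("cat", "NOUN.nom.sg"), ("sleeps", "VERB.pres")],
    [("dogs", "NOUN.nom.pl"), ("chase", "VERB.pres"), ("cats", "NOUN.acc.pl")],
    [("a", "DET"), ("dog", "NOUN.nom.sg"), ("sleeps", "VERB.pres")],
]
train_set, held_out = corpus[:3], corpus[3:]

def simplify(sentence):
    # b) Simplified representation: lower-cased word, coarse part of speech only.
    return [(word.lower(), tag.split(".")[0]) for word, tag in sentence]

def train(sentences, min_count):
    # c) "Learning" here is just counting how often each word carries each tag.
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in simplify(sentence):
            counts[word][tag] += 1
    totals = Counter()
    for tag_counts in counts.values():
        totals.update(tag_counts)
    default = totals.most_common(1)[0][0]  # fallback tag for unknown words
    lexicon = {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
        if sum(tag_counts.values()) >= min_count  # ignore rarely seen words
    }
    return lexicon, default

def tag(words, lexicon, default):
    return [(w, lexicon.get(w.lower(), default)) for w in words]

def accuracy(sentences, lexicon, default):
    gold = [pair for s in sentences for pair in simplify(s)]
    predicted = tag([w for w, _ in gold], lexicon, default)
    correct = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    return correct / len(gold)

# Several training runs with different values of the learning parameter.
scores = {m: accuracy(held_out, *train(train_set, m)) for m in (1, 2)}
best_min_count = max(scores, key=scores.get)
print(scores, best_min_count)
```

In this sketch the word “a” never occurs in the training sentences, so it receives the fallback tag and is tagged wrongly; this is the reliance on training data in miniature. In practice the parameter would be tuned on a separate development set rather than on the held-out portion used for the final evaluation.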
“Thus, in this approach the human efforts are invested into creating annotated corpora, representing data and designing machine learning algorithms, while the machine is able to learn the links between the data. In the end, linguistic knowledge is induced from annotated corpora rather than explicitly hand-crafted by linguists.” (See the text above.)
Are there entirely statistical methods applied to building tools for language processing? Account for the difference between text annotation and explicit formulation of language rules. Which approach is closer to the natural process of human language acquisition?
6) According to computer experts, annotating texts implies tagging - assigning a label (tag) to each word concerned. Until the end of the 1980s this task had usually been performed by sets of carefully crafted rules for disambiguating contexts, e.g. for detecting contexts in which the form ‘known’ is a participle or a noun. Ken Church was one of the first researchers to show the possibility of abandoning the rules and relying exclusively on annotated data. In the frame of statistical approaches, tagging implies automatic derivation of decision trees or machine learning.
7) What kind of issues do researchers encounter with statistical approaches to natural language processing? What are the main advantages of using statistical methods for natural language processing?
Practice vocabulary.
A) Review the meanings:
concern
= feeling of worry: have / express / voice concern (about / over / at)
    cause concern / be a cause of concern
    growing / widespread concern
    a matter of concern
    important environmental concerns
    main / major concerns
    raise concerns
= something important: main / primary / major concern
= responsibility (singular): be one’s concern
= care (singular): with concern
    with genuine concern
= absence of responsibility or need to become involved: be no concern of somebody
B) Complete the sentences choosing a proper equivalent:
1) The Net advertisers’ … is how to make users click through to them.
A. some concern B. concerns C. primary concern D. a cause of concern
2) There is … (1) that half of every ad budget is regularly wasted. That raises important … (2) among advertisers.
(1) A. some concern B. concerns C. main concern D. raised concerns
(2) A. environmental concerns B. domestic concerns C. political concerns D. commercial concerns
3) My … is to gain my degree by the end of the course.
A. cause of concern B. matter of concern C. concerns D. major concern
4) A cognitive science that takes no account of mental states raises some important … about its status.
A. conceptual concerns B. growing concerns C. causing concerns D. primary concerns
5) The toxicity of various contaminants released into the environment is becoming … .
A. concerns B. causing concerns C. a matter of great concern D. a matter of some concern
6) The lawyers’ … is to protect the rights of law-abiding people in the first place.
A. a matter of great concern B. major concern C. concerns D. a cause for concern
7) If you do not submit your paper in time, it is … .
A. your concern B. your adviser’s concern C. somebody else’s concern D. no concern of yours
8) One’s private life is … members of the public.
A. a matter of great concern B. a cause of concern C. no concern of D. no concern
9) My colleague looked sympathetic and listened with … .
A. a cause of concern B. genuine concern C. primary concern D. a matter of great concern