An analysis of translational complexity in English-Norwegian parallel texts

Martha Thunes
University of Bergen

Abstract

This article presents an empirical study where translational complexity is related to a notion of computability. Samples of English-Norwegian parallel texts have been analysed in order to estimate to what extent the given translations could have been produced automatically, assuming a rule-based approach to machine translation. The study compares two text types, fiction and law text, in order to see how these differ with respect to the question of automatisation. The results of the investigation indicate that automatic translation tools may be helpful in the case of the law texts, and the study concurs with the view that the usefulness of such tools is limited with respect to fiction. Although the chosen empirical method was originally designed to be of relevance to rule-based translation, it is of interest also to contrastive language studies, and to translation research. The analysed data capture some aspects of the relationship between the two language systems English and Norwegian, as well as certain features of translated text as compared to original texts.

1. Introduction

The present contribution reports on an empirical study of translational correspondences identified manually in selected English-Norwegian parallel texts. The study is motivated by two main research questions. Firstly, to what extent is it possible to automatise, or compute, the actual translation relation found in the investigated parallel texts? Secondly, is there a difference in the degree of translational complexity between the two text types law and fiction which have been included in the empirical material? In order to answer these questions, the parallel texts are analysed into pairs of translationally corresponding units, primarily at clause level, and then the degree of translational complexity is measured in each such correspondence. It should be noted that the analysed target texts have been produced by human translators; this is not a study of output produced by automatic translation systems.

The analysis is carried out within a strictly product-oriented approach; aspects related to translation methods, or to the cognitive processes behind translation, will not be considered. The applied notion of translational complexity will be defined in terms of the amount and types of information needed when a specific translation is produced from a given source expression. Since this conception of translational complexity is related to linguistic information, the present investigation is seen as relevant to linguistic approaches to machine translation (MT), commonly known as rule-based MT. [1] It is now generally agreed that if a certain level of translation quality is wanted in the output of MT systems, it is necessary to include some processing of linguistic information, and this motivates the chosen analytical approach since it assumes a translation procedure operating on linguistic information sources.

Moreover, the empirical analysis has revealed certain aspects of the investigated text pairs which are of relevance also to translation theory, and to the contrastive study of the language pair English-Norwegian.

2. Basic notions

In this study, ‘automatisation’ is understood simply as the generation of translations with no human intervention. The investigation is not related to any particular implementation, translation algorithm, or type of system architecture, although, as mentioned in section 1, it assumes the rule-based approach to MT.

Rather, the intention is to discuss automatisation with reference to information about languages by relating it to an assumption concerning predictability in the translational relation. That is, we assume that there is a translational relation between the inventories of simple and complex linguistic signs in two languages which is predictable, and then also computable, from information about source and target language systems, and about how the languages correspond.

This means that a computable translation is linguistically predictable, i.e. predictable as one of possibly several alternative translations, and the basis for predicting it is the linguistic information coded in the source text, together with given, general information about the two languages and their interrelations. [2] It also means that non-computable translations cannot be predicted merely from these types of linguistic information, because non-computable translation tasks require access to additional information sources, such as various kinds of general or task-specific extra-linguistic information, or task-specific linguistic information from the context surrounding the source expression.

In order to answer the research questions given in section 1, a measurement of translational complexity is applied to the analysed texts. For this purpose, pairs of translationally corresponding linguistic units, primarily finite clauses, are identified as individual translation tasks, and ‘translational complexity’ is defined in the following way: in a given translation task, the degree of translational complexity is a factor determined by the amount and types of information needed to solve the task, as well as by the accessibility of these information sources, and the effort required when they are processed.

In the present approach, a scale of translational complexity is assumed, and, for analytical purposes, four main types of translational correspondences are identified on this scale. The four correspondence types are organised in a hierarchy, the correspondence type hierarchy, which reflects an increase in the degree of translational complexity. A dividing line between computable and non-computable translation tasks is drawn at a certain point on this scale of translational complexity (cf. section 3.2). The type hierarchy is presented in 3.1 with subsections.

3. Methodology

The method applied in this project involves a manual analysis of running parallel texts, and in this analysis, translationally corresponding linguistic units, or string pairs, are identified and classified according to the correspondence type hierarchy. This measures the degree of translational complexity in each string pair, viewed as an individual translation task. The chosen units of analysis will be presented in section 4.1.

3.1 The correspondence type hierarchy

The basic principles behind the type hierarchy were originally defined by Helge Dyvik, and they are implicit in the design of the experimental machine translation system PONS, documented in Dyvik (1990, 1995). A further development of his model was previously published in Thunes (1998), and the approach applied in this study is described in more detail in Thunes (2011).

In the following, the four main categories of the type hierarchy will be illustrated using examples of sentence pairs taken from a short story by the Norwegian author Bjørg Vik, and a published English translation.

3.1.1 Type 1

The least complex type of translational correspondence is referred to as type 1. An example is given in (1), where (1a) is the source sentence, and (1b) the target sentence:

(1a) Hun har vært en skjønnhet.
‘She has been a beauty.’
(1b) She has been a beauty.

The glossing of (1a) shows that the English target sentence corresponds word-by-word with the source sentence, and this is the characteristic of type 1. That is, in this category, the corresponding strings are pragmatically, semantically, and syntactically equivalent, down to the level of the sequence of word forms. Such correspondences are relatively infrequent in the language pair English-Norwegian. [3]

3.1.2 Type 2

In correspondences of type 2, there is also a very close match between the two strings, but there may be some formal differences. Firstly, the sequence of constituents may differ; cf. example (2):

(2a) Dessuten virket hun overlegen.
‘Also looked she haughty.’
(2b) She also looked haughty.

The glossing of (2a) illustrates the word order difference between the two strings. In the Norwegian sentence, there is subject-verb inversion: when a non-subject, such as the adverbial dessuten, appears sentence-initially, the verb-second restriction applies in Norwegian. In the English target sentence the subject comes first, and there is no inversion.

Secondly, there may be differences in the use of grammatical form words, as in example (3):

(3a) Leiligheten var ufattelig rotete.
‘Flat.def was unbelievably untidy.’
(3b) The flat was unbelievably untidy.

The point in example (3) is that there is no word form in (3a) matching the definite article in (3b), because Norwegian expresses the definite form of nouns by means of a suffix.

The criterion that defines type 2 correspondences is that every lexical word in the source string has a correspondent in the target string of the same lexical category and with the same syntactic function as the source word. This means that in type 2 correspondences, the two strings are pragmatically and semantically equivalent, and equivalent with respect to syntactic functions, even if there is at least one formal difference that makes the correspondence deviate from word-by-word translation. Type 2 is, like type 1, relatively infrequent in this language pair.

3.1.3 Type 3

In type 3 correspondences there is, as in types 1 and 2, pragmatic and semantic equivalence between source and target string, but there is not syntactic functional equivalence, because there is at least one structural difference violating equivalence between the two strings with respect to syntactic categories and functions. In the given language pair, type 3 seems to be more frequent than each of the two lower types. Type 3 can be illustrated by example (4):

(4a) Hildegun himlet lidende mot taket og svarte med uforskammet høflighet.
‘Hildegun rolled-eyes suffering towards ceiling.def and answered with brazen politeness.’
(4b) Hildegun rolled her eyes in suffering towards the ceiling and answered with brazen politeness.

In this string pair, the correspondence between the Norwegian verb phrase himlet and the English expression rolled her eyes violates syntactic functional equivalence, because himle is an intransitive verb, whereas rolled her eyes consists of a transitive verb phrase and a noun phrase functioning as direct object. Also, the Norwegian adverb phrase lidende corresponds with the English preposition phrase in suffering. Still, these two sentences correspond semantically.

3.1.4 Type 4

Finally, in type 4, the most complex correspondence type, there is no longer semantic equivalence between source and target string. There may be pragmatic equivalence, but not necessarily. In the present study, type 4 has turned out to be very important because it is the most frequent correspondence type in the analysed texts.

The defining characteristic of type 4 correspondences is that there is at least one linguistically non-predictable semantic deviation between source and target string. This can be illustrated by example (5):

(5a) Her kunne de snakke sammen uten å bli ropt inn for å gå i melkebutikken eller til bakeren.
‘Here could they talk together without to be called in for to go in milk-shop.def or to baker.def.’
(5b) They could talk here without being called in to go and buy milk or bread.

Without going into detail, it may be observed that the semantic difference between these sentences lies in the correspondence between the substrings for å gå i melkebutikken eller til bakeren and to go and buy milk or bread. These expressions do not denote the same activities, but it is inferrable from background information about the world that both activities can have the same result, i.e. the buying of milk or bread.

This illustrates what is involved in a linguistically non-predictable semantic deviation: the semantic difference between source and target expression — in the case of example (5), a difference in denotational properties — cannot be predicted on the basis of the information that is linguistically expressed in the source string, together with information about source and target languages, and about their interrelations. This means that in type 4 correspondences, additional information sources, such as information about general world knowledge, are needed in order to produce the particular target expression. In cases of this kind, there is normally one or more alternative translations which can be predicted from purely linguistic information sources, and which can be semantically equivalent to the original expression. With respect to (5), a linguistically predictable target expression could be to go to the milk shop or to the baker’s. That alternative is denotationally equivalent to the source expression, but it does not necessarily exhibit other properties that a translator may want to choose in a target text.

3.2 Comments on the classification model

Examples (1)–(5) show that the correspondence type hierarchy, as a classification model, reflects a gradual increase in linguistic divergence between source and target string, and the analysis of translational correspondences is based on the assumption that this increase is correlated with an increase in the degree of translational complexity. That is, a larger amount of information, and a greater processing effort, is required in order to solve translation tasks in correspondences of the higher types than in the lower types. Each correspondence type covers a class of translation tasks, and in the type hierarchy, the four classes are distinguished from each other on the basis of the amount and types of information necessary for solving translation tasks within each class. These matters are described in detail for each correspondence type in Thunes (2011), along with discussions of the accessibility of necessary information sources, and of required processing effort, within each type.

On the scale of translational complexity defined by the type hierarchy, the division between predictable and non-predictable translation is drawn between types 3 and 4. This means that correspondences of types 1, 2, and 3 together constitute the domain of linguistically predictable, or computable, translations, whereas type 4 correspondences belong to the non-predictable, or non-computable, domain, where semantic equivalence is not fulfilled.
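The ordering of the criteria in the type hierarchy can be summarised as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the three boolean judgements are taken to be supplied by the human analyst, the names Judgement, correspondence_type, and is_computable are hypothetical, and pragmatic equivalence is left out for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Manual judgements for one string pair (hypothetical field names)."""
    semantically_equivalent: bool   # no linguistically non-predictable semantic deviation
    same_syntactic_functions: bool  # lexical words match in category and syntactic function
    word_by_word: bool              # same sequence of corresponding word forms


def correspondence_type(j: Judgement) -> int:
    """Return the correspondence type (1-4), testing the strongest deviation first."""
    if not j.semantically_equivalent:
        return 4
    if not j.same_syntactic_functions:
        return 3
    if not j.word_by_word:
        return 2
    return 1


def is_computable(corr_type: int) -> bool:
    """Types 1-3 form the linguistically predictable (computable) domain."""
    return corr_type <= 3


# Judgements corresponding to examples (1), (2)/(3), (4), and (5) in section 3.1.
example_1 = Judgement(True, True, True)
example_2 = Judgement(True, True, False)
example_4 = Judgement(True, False, False)
example_5 = Judgement(False, False, False)
```

The sketch encodes only the ordering of the criteria; deciding whether each criterion holds in a given string pair remains a matter of linguistic analysis.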

A clear parallel to the increasing degree of complexity in the type hierarchy is found in Vinay & Darbelnet’s set of seven translation procedures, which they presented “in increasing order of difficulty”, ranging from the simplest method of translation to the most complex. [4] Although this is an interesting similarity, the present classification model is not related to Vinay & Darbelnet’s categorisation of methods. The correspondence type hierarchy is a product-oriented approach; it does not describe translation strategies, but correspondence relations between given source text units and their existing translations.

3.3 Related contributions

The type hierarchy is a fairly general classification model for translational correspondences, and it has been adopted by several other researchers in contrastive language studies. For the purpose of analysing word-order differences between English and Norwegian, Hasselgård (1996) employs a slightly modified version of the correspondence type hierarchy as defined by Dyvik (1993), and her approach is further developed in an English-Norwegian study of thematic structure (Hasselgård 1998). Elgemark (in progress) has adapted the analytical approach of Hasselgård (1998) to a contrastive study of clause-final constituents in English-Swedish. Modified versions of the correspondence type hierarchy as presented in Thunes (1998) are used by Tucunduva (2007), Silva (2008), and Azevedo (in print), all of which are studies where the model is applied for the purpose of analysing and describing translational correspondences in various types of English-Portuguese parallel texts.

Other related approaches to the analysis and description of translational correspondences in parallel texts are found in the works presented by Merkel (1999), Cyrus (2006), and Macken (2010).

4. Empirical investigation

The implementation of the present methodology involves manual compilation and classification of string pairs from parallel texts. The application of the type hierarchy requires a human, bilingually competent analyst, since the classification of the compiled correspondences demands a careful linguistic analysis of each string pair.

The parallel texts are analysed from beginning to end, as the human annotator identifies pairs of translationally related units, and the data are recorded by means of the software tool Text Pair Mapper, described in Dyvik (1993).

4.1 Units of analysis

The selection of units of analysis is influenced by the wish to make this study of translational complexity relevant to the field of machine translation, but without paying attention to specific algorithms for implementation. Since rule-based MT systems typically operate sentence by sentence, the finite clause is chosen as the basic unit of analysis. Another point motivating the choice is that in order to be of any use, an MT system must handle syntactic units at least as complex as those of the sentence level.

In this connection, ‘finite clause’ is understood simply as a syntactic unit containing a finite verb as its central element. Thus, occurrences of finite verbs are in practice the basis for the identification of analysis units. Whenever a word form of this category is encountered, the syntactic unit in which it fills the function of main or auxiliary predicate is identified as a unit of analysis.

Thus, in this study a limited set of syntactic constructions serve as units of analysis, and it has been an aim to define analysis units that can be identified on the basis of syntactic structure. Matrix sentences and finite subclauses are then typically recorded as units of analysis. Also, lexical phrases with one or more finite clauses as syntactic complement constitute another major syntactic type among the recorded data (cf. (6a) in section 5.1, and (7b) in section 5.3). In such cases the finite clause is not identified as an independent unit, because the entire phrase is normally a more natural unit to be solved by a translation task than the syntactic complement in isolation. Moreover, string pairs are extracted also when only one of the two strings conforms with the syntactic criteria that define analysis units.

Since syntactically dependent constructions like finite subclauses occur as units of analysis, the data include nested correspondences where a superordinate string pair contains one or more embedded string pairs. For example, if a finite subclause is embedded in a matrix sentence, as in When he came, we could leave (Norwegian: Da han kom, kunne vi dra), then two string pairs are extracted. One is the subclause and its match in the parallel text: [When he came,] – [Da han kom,]; the other is the matrix sentence and its correspondent: [[CP] we could leave]. – [[CP] kunne vi dra]. Square brackets here indicate the sentence units, and the syntactic category label CP represents the finite subclauses; cf. Thunes (2011: 201). In the superordinate string pair, the embedded correspondence is treated as a pair of opaque items, represented by their syntactic categories. [5]
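The treatment of embedded units as opaque items can be illustrated by a minimal Python sketch. The class Unit, the label S for the matrix sentence, and the pairing of embedded units in linear order are assumptions made for the example; in the study itself, the units are identified and aligned manually.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    label: str   # syntactic category, e.g. "S" or "CP" (illustrative labels)
    parts: list  # word forms (str) and embedded Unit objects, in surface order


def opaque_string(unit):
    """Render a unit, showing embedded units only as their category label."""
    return " ".join(f"[{p.label}]" if isinstance(p, Unit) else p for p in unit.parts)


def string_pairs(src, tgt):
    """Yield the superordinate string pair, then the embedded pairs.

    Assumption: embedded units correspond to each other in linear order.
    """
    yield opaque_string(src), opaque_string(tgt)
    src_embedded = [p for p in src.parts if isinstance(p, Unit)]
    tgt_embedded = [p for p in tgt.parts if isinstance(p, Unit)]
    for s, t in zip(src_embedded, tgt_embedded):
        yield from string_pairs(s, t)


english = Unit("S", [Unit("CP", ["When", "he", "came,"]), "we", "could", "leave"])
norwegian = Unit("S", [Unit("CP", ["Da", "han", "kom,"]), "kunne", "vi", "dra"])

pairs = list(string_pairs(english, norwegian))
# pairs[0] is the matrix pair with the subclauses reduced to [CP];
# pairs[1] is the embedded subclause pair.
```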

4.2 The texts

The analysed data are recorded from a selected set of English-Norwegian parallel texts. The texts were written and translated during the years 1979–1996. The corpus covers both directions of translation, and it includes two text types, fiction and law texts. Comparable amounts of data have been compiled for each of the text types and directions of translation. Table 1 gives an overview of text type, direction of translation, and numbers of running words for each of the text pairs.

Authors and texts                                                     Text type  Source lg.  Target lg.  No. of running words
Agreement on the European Economic Area, Articles 1–99                law text   English                 9,202
Avtale om Det europeiske økonomiske samarbeidsområde, artiklene 1–99                         Norwegian   8,015
Lov om petroleumsvirksomhet, §§1–65                                   law text   Norwegian               7,929
Act relating to petroleum activities, Sections 1–65                                          English     9,647
André Brink
The Wall of the Plague                                                fiction    English                 4,021
Pestens mur                                                                                  Norwegian   4,230
Doris Lessing
The Good Terrorist                                                    fiction    English                 4,008
Den gode terroristen                                                                         Norwegian   4,652
Erik Fosnes Hansen
Salme ved reisens slutt                                               fiction    Norwegian               4,022
Psalm at Journey’s End                                                                       English     4,395
Bjørg Vik
En håndfull lengsel                                                   fiction    Norwegian               4,010
Out of Season and Other Stories                                                              English     4,550
Total                                                                                                    68,681

Table 1. An overview of the analysed text pairs with respect to text type, direction of translation, and numbers of running words.

In the present study, law texts are chosen as a representative of restricted text types, and fiction as an example of a relatively unrestricted type.

The writing, interpretation, and translation of law text are all restricted by institutionalised norms belonging to the legal domain. Clarity, precision, and unambiguity are among the primary norms that govern the writing of law texts (Bhatia 2010: 38–39), and law text translation requires strictly that the legal content is the same in all language versions.

Fiction texts, on the other hand, are in no way as norm-governed as law texts are. A fiction text will to some extent be constrained by linguistic and stylistic norms, but its creation is determined by the individual choices of the author, and its reception by the subjective experiences of the readers. When fiction is translated, there may be other properties than the semantic content of the source text that are necessary to recreate in the target text, and which may motivate the choice of target expression.

The difference in restrictedness between the two text types is evident in several ways. Law texts have a rigid macrostructure; fiction does not. Whereas law texts exhibit limited inventories of, respectively, pragmatic functions and types of syntactic constructions, fiction texts are far more varied in these respects. [6]

4.3 How translational complexity is measured

In order to measure the degree of translational complexity in pieces of parallel texts, the classification model must be applied to running texts, without omitting any parts of them. Then, the distribution of the four correspondence types within a set of data provides a measurement of the degree of translational complexity in the parallel texts that the data are extracted from.

In the given language pair, the two least complex types (1–2) normally occur in pairs of short and syntactically simple strings of words, whereas pairs of longer and more complex strings tend to be of types 3–4. Thus, types 1 and 2 would appear to cover a disproportionately large amount of the analysed texts if the distribution of the main correspondence types were presented merely on the basis of the numbers of string pairs (cf. Table 2 in section 4.4). Hence, the proportions of text covered by the different correspondence types will be discussed in terms of the lengths of source and target text, respectively. For this purpose, the length of a recorded translational unit is counted as its number of word forms, and in the case of nested correspondences, the word forms in embedded strings are counted only once.
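The difference between counting string pairs and counting word forms can be made concrete with a short Python sketch. The function names own_length and shares are hypothetical, embedded units are represented simply as nested lists, and the three toy records are invented for illustration; they are not data from the study.

```python
from collections import Counter


def own_length(parts):
    """Count the word forms of a unit, skipping embedded units (nested lists).

    Embedded strings are counted in their own string pair, so each word form
    contributes to the total only once.
    """
    return sum(1 for p in parts if isinstance(p, str))


def shares(records):
    """Return percentage shares per correspondence type.

    records: iterable of (correspondence_type, length_in_word_forms).
    Result: (shares by number of string pairs, shares by text length).
    """
    by_count, by_length = Counter(), Counter()
    for corr_type, length in records:
        by_count[corr_type] += 1
        by_length[corr_type] += length
    n_pairs = sum(by_count.values())
    n_words = sum(by_length.values())
    return (
        {t: 100 * c / n_pairs for t, c in by_count.items()},
        {t: 100 * w / n_words for t, w in by_length.items()},
    )


# The matrix sentence "[CP] we could leave" has three word forms of its own.
matrix_length = own_length([["When", "he", "came,"], "we", "could", "leave"])

# Invented toy records: one short pair of type 1, one of type 2, one long pair of type 4.
toy_records = [(1, 3), (2, 4), (4, 13)]
count_shares, length_shares = shares(toy_records)
```

In the toy data each type accounts for a third of the string pairs, but type 4 covers 65% of the word forms, which is the kind of skew that motivates measuring by text length.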

The most important aspect shown by the complexity measurements of this study is the division between computable and non-computable correspondences, i.e. how large a proportion of the analysed texts is covered by, on the one hand, string pairs of types 1, 2, and 3, and, on the other hand, string pairs of type 4. This division is meant to show to what extent it can be expected that an ideal, rule-based MT system could simulate the given translations, if provided with a full description of the two languages and their interrelations. Notably, this is not an estimate of how much of the given source texts could be given some kind of linguistically predictable translation. Since English and Norwegian both belong to the Germanic language family, and are used in language communities which are, in cultural terms, not very far apart, the recorded data probably include only very few source expressions which have no linguistically predictable translation. [7] It should be emphasised, then, that this study tries to measure the proportion of predictable, and hence computable, translation within the specific, human-created target texts that have already been produced.

4.4 The results

Since the present investigation is based on hand-coded material, the data are of a relatively modest quantity (about 68,000 words). Their limited size prevents the detection of statistically significant results, and only tendencies may be observed within the recorded material. Hence, it is not possible to generalise about the degree of translational complexity in relation to the given language pair, to the two directions of translation within this pair, nor to the investigated text types. Still, on the basis of the recorded data, the results provide tentative answers to the research questions posed initially.

Concerning the automatisation issue, Table 2 shows the complexity measurement across the entire collection of correspondences. By calculating the average values of the percentages given for source and target text lengths, respectively, we find that more than half of the data are included in non-computable correspondences: string pairs of type 4 cover 55.2% of the compiled data, measured in word forms, whereas the computable types 1, 2, and 3 together cover as little as 44.8%. On the basis of this result, the conclusion is that with perfect information about source and target languages, an idealised rule-based MT system could have simulated less than half of the analysed text material.

Total results, all text pairs        Type 1   Type 2   Type 3   Type 4   All types
Number of string pairs                  601      272    1,347    2,219       4,439
Percentage of string pairs             13.5      6.1     30.4     50.0       100.0
Source text length (word forms)       1,906    1,642   12,179   19,263      34,990
Percentage of source text length        5.4      4.7     34.8     55.1       100.0
Target text length (word forms)       1,926    1,741   12,940   20,547      37,154
Percentage of target text length        5.2      4.7     34.8     55.3       100.0

Table 2. The global distribution of correspondence types in the investigated texts.
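As a check on the arithmetic, the length-based percentages in Table 2 and the averaged figures of 55.2% and 44.8% can be recomputed from the word-form counts. The Python sketch below uses only the counts given in the table; the variable and function names are arbitrary.

```python
# Word-form counts per correspondence type, copied from Table 2.
table2 = {
    "source": {1: 1906, 2: 1642, 3: 12179, 4: 19263},
    "target": {1: 1926, 2: 1741, 3: 12940, 4: 20547},
}


def percentages(counts):
    """Percentage of text length covered by each type, rounded to one decimal."""
    total = sum(counts.values())
    return {t: round(100 * n / total, 1) for t, n in counts.items()}


source_pct = percentages(table2["source"])
target_pct = percentages(table2["target"])

# Average of the source and target percentages for type 4 (non-computable),
# and the remainder covered by types 1-3 (computable).
non_computable = round((source_pct[4] + target_pct[4]) / 2, 1)
computable = round(100 - non_computable, 1)
```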

With respect to the text type issue, the results are summed up in Table 3, which shows that the proportion of computable correspondences is on average 50.2% in the law data, and 39.6% in fiction. That is, the degree of complexity is, on average, lower in the selected pairs of law texts than in those of fiction.

Proportions of...                                           in law text   in fiction   in all data
computable translational correspondences (types 1, 2, 3)          50.2%        39.6%         44.8%
non-computable translational correspondences (type 4)             49.8%        60.4%         55.2%

Table 3. Differences in translational complexity between the two text types.

5. Discussion

Although the degree of translational complexity is found to be lower in the law texts than in those of fiction, the results do not support a clear-cut contrast in which the fiction texts are unsuitable for automatic translation and the law texts suitable. Across the investigated material, translational complexity is found to be so high that fully automatic translation does not seem to be a fruitful option for any of the analysed text pairs, at least if human-quality output is aimed for. Actually, the results may seem too pessimistic in relation to the automatisation issue, since it is a fact that automatic translation tools are used, and with advantage, in particular for non-literary text types, because they do reduce the workload of manual translation.

Here it is relevant to mention that the analysed texts provide a problematic norm for automatic translation, because the human-created target texts represent an ideal for the end result, and not for the raw output of an MT application. The chosen standard is probably an unrealistic, and perhaps also unfair, goal for MT development because it is generally accepted that the use of machine translation requires post-editing. Moreover, high-quality translation without revision is uncommon also when the translator is human. Still, manually produced target texts have been used as a standard because evaluating the products of real systems has not been an objective, and because the complexity measurements in this study aim at showing to what extent we might assume that an ideal, rule-based system could simulate the given translations. That is, the analysis is intended to measure the degree of linguistic correspondence between originals and manual translations, as an indicator of possibilities for automatisation.

Concerning the text type issue, an expected result is to find a lower degree of translational complexity in law texts than in fiction texts. Several kinds of recurrent semantic deviations between translationally corresponding units have been observed among the recorded data, and, in general, these phenomena constitute the primary factor contributing to the frequency of non-computable correspondences. [8] Although cases of type 4 are not infrequent within the law data, instances of semantic deviations are far less common than among the fiction data. [9] This is in line with the high degree of restrictedness in the law texts.

5.1 Minimally non-computable correspondences

In order to discuss further whether it would be fruitful to apply automatic translation to the selected texts, it is interesting to consider the workload potentially involved in editing possible machine output. For this purpose, we can assume that an MT system would generate only linguistically predictable translations for the analysed source texts. This means that the recorded type 4 correspondences represent cases where the machine would produce target expressions conforming with the characteristics of one of the lower correspondence types, or possibly not generate linguistically well-formed output at all. At any rate, post-editing would be required in order to reach the gold standard represented by the human translation.

Of relevance here is the question whether string pairs identified as type 4 in the present study have been classified as such because of only one, or few, semantic deviations between source and target units. That is, if the semantic difference between two corresponding strings is small, then the major part of the correspondence would involve a linguistically predictable translation, and it might be unproblematic for a post-editor to correct that subpart of the machine output which does not meet the standard. If post-editing amounts to simple corrections of linguistic errors that are few and easy to spot, then what Jurafsky & Martin (2009: 931) describe as the edit cost of post-editing would be low.

Of particular interest to the question of potential amount of post-editing are non-computable correspondences with only one minimal semantic deviation between source and target string. Such cases may be described as minimally non-computable, and in correspondences of this kind it would probably be easy to revise an automatically generated target expression to the standard of manual translation. An example can be taken from the Norwegian Act relating to petroleum activities. The noun phrase given in (6a) contains a relative clause, and is translated into the expression shown in (6b):

(6a) de områder som er nevnt i tillatelsen
‘the areas which are mentioned in licence.def’
(6b) the areas mentioned in the licence

The only semantic deviation in this string pair is the presence vs. absence of grammatically expressed temporal information, and because of this, example (6) is a type 4 correspondence. Here it can be assumed that a rule-based translation system would produce the semantically equivalent target expression the areas which are mentioned in the licence, and a human post-editor might easily choose the non-finite alternative because he or she would know that this would be stylistically more appropriate in a law text.
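How small the required revision is in example (6) can be illustrated with a word-level edit distance. This is an assumption made for the sake of illustration: the Levenshtein distance over word forms is used here as a rough proxy for post-editing effort, and it is not presented as the edit cost measure of Jurafsky & Martin (2009). The function name word_edit_distance is hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over whitespace-separated word forms.

    Counts the minimum number of word insertions, deletions, and
    substitutions needed to turn the hypothesis into the reference.
    """
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Assumed linguistically predictable output vs. the published translation (6b).
predicted = "the areas which are mentioned in the licence"
published = "the areas mentioned in the licence"
distance = word_edit_distance(predicted, published)  # two deletions: "which", "are"
```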

The distribution of minimally non-computable correspondences among the recorded data again brings the issue of text type into focus, because such cases are far more frequent in the law texts than in the fiction texts. Within the law data, as many as 45.7% of the correspondences classified as type 4 are minimally non-computable, whereas among the fiction data, only 10.5% of the compiled type 4 correspondences are minimal ones (Thunes 2011: 342–343). This primarily reflects the fact that because law text is strongly norm-governed, semantic deviations between translationally corresponding units are far less frequent in the law texts than in those of fiction. [10] Moreover, it shows that the potential edit cost incurred by automatic translation would be considerably lower in the law texts than in the fiction texts.

Thus, on the basis of the data recorded in this study, the investigated pairs of law texts are tentatively regarded as representing a text type where machine translation may be helpful, if the effort required by post-editing is smaller than that of manual translation. In the case of the fiction texts, it seems clear that post-editing of automatically generated translations would be laborious and not cost effective.

5.2 The non-finite-finite pattern

The observations of minimally non-computable correspondences among the recorded data revealed a considerable number of translational links between English non-finite constructions and Norwegian finite clauses. This is compatible with the tendency that the use of non-finite constructions, such as ‑ing-clauses and ‑ed-clauses, is far more frequent in English than the use of syntactically congruent structures in Norwegian. For example, the various kinds of adverbial functions that may be realised by English ‑ing-clauses tend to be associated with finite subclauses in Norwegian. The regularity of such correspondences follows from information about the two language systems, primarily because finite and non-finite constructions may be associated with matching types of syntactic functions in the two languages. However, this is not an absolute regularity that excludes the speaker, or writer, from making choices between alternative expressions. Hence, an English non-finite construction will not always be translationally matched by a Norwegian finite subclause, or vice versa. Thunes (2011: 264) describes this phenomenon as the non-finite-finite pattern of English-Norwegian parallel texts. The pattern can be seen as created by an interplay between, on the one hand, extra-linguistic factors and, on the other hand, the structures of the two language systems. That is, the language systems determine which syntactic functions may be associated with the various kinds of finite and non-finite constructions, as well as the semantic contribution of those functions, but whether a finite or non-finite construction is chosen in a specific context may also be influenced by factors pertaining to language use (Thunes 2011: 264).

Among the recorded data, instances of the non-finite-finite pattern are found in two classes of translational correspondences. These are, firstly, string pairs where one of the units is a finite subclause, and the other is a non-finite construction, and, secondly, correspondences between complex lexical phrases where only one of the extracted units contains a finite subclause, and where the syntactic complement in the parallel unit is some kind of non-finite construction, as already shown by example (6) in section 5.1. Such correspondences are recorded because one of the units is, or contains, a finite clause (cf. section 4).

As mentioned in section 5, classes of recurrent semantic deviations between translationally corresponding units have been identified among the recorded data, and by far the most common of these are specification and despecification, i.e., cases where the target expression is either semantically more specific, or less specific, than the corresponding source expression (Thunes 2011: 331). Out of a total of 2,219 type 4 correspondences, specification has been identified in 918 string pairs, and despecification in 604 pairs (Thunes 2011: 340). Although both phenomena may occur in one and the same string pair, these figures show that a majority of the recorded type 4 correspondences involve either specification or despecification.
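The arithmetic behind these figures can be laid out as a brief sketch. It uses only the three counts cited above. The number of string pairs that show both phenomena is not stated in this section, so the sketch merely computes the bounds implied by the counts; the variable names are chosen here for illustration.

```python
# Counts cited from Thunes (2011: 340).
total_type4 = 2219
specification = 918
despecification = 604

# Share of all type 4 correspondences showing each phenomenon.
share_spec = specification / total_type4
share_despec = despecification / total_type4

# If no pair shows both phenomena, the two sets are disjoint (upper bound).
max_pairs = specification + despecification
# If every despecification pair also shows specification (lower bound).
min_pairs = max(specification, despecification)

# Smallest number of pairs that constitutes a majority of the type 4 cases.
majority_threshold = total_type4 // 2 + 1
# Largest overlap for which the combined set is still a majority.
max_overlap_for_majority = max_pairs - majority_threshold

print(f"specification:   {share_spec:.1%}")                # 41.4%
print(f"despecification: {share_despec:.1%}")              # 27.2%
print(f"upper bound:     {max_pairs / total_type4:.1%}")   # 68.6%
print(f"majority holds if at most {max_overlap_for_majority} pairs show both")  # 412
```

On these counts, the pairs involving at least one of the two phenomena number between 918 and 1,522, and they form a majority as long as no more than 412 pairs exhibit both.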

Because of these findings, it is interesting to find out to what extent instances of the two phenomena are caused by the non-finite-finite pattern alone, as in example (6), which is a case of semantic despecification by grammatical means, described by Thunes (2011: 347, 378) as grammatical despecification. Given that the finite expression contains temporal information not expressed in the corresponding non-finite construction, the pattern will induce specification in translation from English into Norwegian, and despecification in the opposite direction. Within the data compiled from English-to-Norwegian, occurrences of the non-finite-finite pattern caused 63.8% of the cases of specification found among the law data, but merely 10.4% of those recorded in the fiction texts (Thunes 2011: 388). As regards the Norwegian-to-English data, the pattern accounted for as much as 69.5% of the cases of despecification identified in the analysed pair of law texts, but only 11.7% of those found in the fiction texts (Thunes 2011: 388–389).
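The link between finiteness and the direction of the semantic deviation can be stated as a simple rule. The following sketch is an illustration only, not part of the analytical apparatus of the study: the function predicted_deviation is hypothetical, and it reduces each unit to a single feature, namely whether it is, or contains, a finite clause where the parallel unit has a non-finite construction.

```python
def predicted_deviation(source_is_finite: bool, target_is_finite: bool) -> str:
    """Deviation expected from the non-finite-finite pattern alone."""
    if source_is_finite == target_is_finite:
        # No finiteness mismatch, so the pattern itself adds no deviation.
        return "none"
    # A finite target adds grammatically expressed temporal information;
    # a non-finite target leaves it out.
    return "specification" if target_is_finite else "despecification"


# Example (6): Norwegian finite relative clause -> English -ed-clause.
print(predicted_deviation(source_is_finite=True, target_is_finite=False))
# Example (7) in section 5.3: English -ing-clause -> Norwegian finite relative clause.
print(predicted_deviation(source_is_finite=False, target_is_finite=True))
```

The rule captures only the grammatical contribution of the pattern. As noted above, whether a finite or non-finite construction is actually chosen in a given context may also depend on factors of language use.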

Thus, the data show, on the one hand, that instances of the non-finite-finite pattern have contributed noticeably to the frequencies of, respectively, specification and despecification, and, on the other hand, that the non-finite-finite pattern has left a much stronger imprint on the law data than on those extracted from fiction. Moreover, since as much as 45.7% of the type 4 correspondences identified in the law text pairs are minimally non-computable cases (cf. section 5.1), and since specification and despecification are the main types of recorded semantic deviations, the non-finite-finite pattern appears to be the most important factor that has created minimally non-computable correspondences in the analysed texts (Thunes 2011: 389).

The relatively strong effect of the pattern on the law data, as compared to the fiction data, must not be understood as evidence that while the pattern is frequent in the law texts, it occurs rarely in the fiction texts. Rather, the data have shown that in the fiction texts the pattern is only one among a large variety of factors creating semantic deviations between translationally corresponding units, and this can be seen as a reflection of the difference in degree of restrictedness between the two text types: the types of semantic deviations between source and target which may occur in high-quality literary translation would not conform with the norms of law translation.

5.3 Explicitation

The phenomenon categorised as specification in the present study is very close to what is described as explicitation in translation theory. According to the common understanding of this notion, explicitation means making explicit in the translation information which is only implicit in the original (Pym 2005: 30). This is, however, not quite the same as the present notion of specification, which pertains to cases where a certain piece of information contained in the target is not linguistically encoded in the source expression. Specification may be illustrated by a pair of noun phrases compiled from the EEA Agreement and its Norwegian translation:

(7a) an act corresponding to an EEC regulation
(7b) en rettsakt som tilsvarer en EØF-forordning
‘an act which corresponds-to an EEC-regulation’

The string pair in (7) is recorded because the Norwegian unit contains a finite subclause. Since the translation (7b) contains grammatically expressed temporal information not contained in the source unit (7a), there is a minimal semantic difference between the two strings, and this correspondence is classified as type 4, and as an instance of specification. However, a human translator would not consider translating the unit (7a) in isolation from its context. Most likely, the matrix sentence (8a) in which (7a) is embedded would be translated as one unit:

(8a) ... an act corresponding to an EEC regulation shall as such be made part of the internal legal order of the Contracting Parties; ...
(8b) ... en rettsakt som tilsvarer en EØF-forordning skal som sådan gjøres til del av avtalepartenes interne rettsorden;
‘An act which corresponds-to an EEC-regulation shall as such make.passive to part of contracting-parties.def.possessive internal legal-order.’

Present tense is expressed in sentence (8a), and with access to the information contained in the matrix sentence, the choice of the finite construction som tilsvarer in the Norwegian translation cannot be said to be a case of explicitation in the sense explained above. However, given the analytical framework of the present study, (7a) constitutes a translation task. [11] Since (7a) is taken from a law text, it could be argued that the information about present time is implicit, since we may assume that what is expressed in the law text holds simultaneously with the period of its application, which is, in a sense, its time of utterance (Thunes 2011: 377).

Examples (7)–(8) have shown that, within the recorded data, the question of whether specification caused by the non-finite-finite pattern falls within explicitation or not can be reduced to a matter of definition. Still, it is interesting to ask to what extent the identified instances of specification may reveal the level of explicitation in the text pairs translated from English into Norwegian. With respect to explicitation it is then relevant to focus on the tendency that translators make target texts more semantically precise than the originals in order to ensure that the recipient will interpret the target text correctly relative to what the translator judges to be the intended interpretation of the source text. Hence, the cases where specification is created by the regularity of the non-finite-finite pattern become less interesting in relation to explicitation. What matters more are correspondences where semantic specification is the result of a translator’s choice that has not been influenced so much by interrelations between the language systems.

Disregarding the occurrences where it is only the non-finite-finite pattern that has caused specification in English-to-Norwegian correspondences reveals that the EEA Agreement and its Norwegian translation constitute a text pair where the level of explicitation is relatively low (Thunes 2011: 391). This conforms with the fact that supranational law texts are subject to special constraints which work against explicitation because the primary translation norm is to ensure that the legal content is the same in all language versions (Cao 2007: 153, 2010: 88). With respect to the fiction texts translated from English into Norwegian, it turns out that the target texts exhibit varying levels of explicitation (Thunes 2011: 392). This reflects different degrees of faithfulness to the originals, and is not surprising, as the translation of fiction is in no way as norm-controlled as law translation is.

6. Conclusions

Concerning the automatisation issue, the results of this study indicate that in all of the analysed text pairs the degree of translational complexity is so high that automatic translation without human intervention would probably not be helpful, although there is considerable variation among the text pairs. With respect to the text type issue, the recorded data show that, on average, the degree of complexity is lower in the selected pairs of law texts than in the fiction texts. This is an expected result, since law text is characterised by a higher level of restrictedness than fiction is.

In relation to the complexity measurements, it is interesting to consider types of recurrent semantic deviations that have been identified within the non-computable translational correspondences. Such a comparison shows a clear difference between the two text types. Occurrences of semantic deviations between source and target units are far less frequent within the pairs of law texts than within those of fiction. Interestingly, in the law texts, almost half of the non-computable string pairs are cases where only one, minimal semantic difference has been identified. In the fiction texts, such minimally non-computable correspondences are relatively rare. This finding suggests that the potential edit cost of applying automatic translation to law text may be modest, and, in relation to fiction, it concurs with the common view that machine translation is not helpful.

The study of minimally non-computable correspondences shows that the chosen empirical method is of relevance also to matters other than the automatisation issue. The recorded data not only measure the degree of translational complexity in the chosen text pairs. They can also be used to study certain regularities in the relationship between the two languages English and Norwegian, such as translational correspondences between non-finite and finite constructions, and they may provide empirical facts about phenomena falling within translation studies, such as explicitation. Thus, the present project illustrates that the different fields of machine translation, contrastive language research, and translation theory have a common denominator in the study of translational correspondences. Above all, the empirical investigation carried out is a reminder that the task of translation presents certain challenges which are highly demanding, possibly too complex for machines, and indeed also non-trivial for humans.

7. Acknowledgements

I thank the numerous authors and translators who produced the investigated texts, and for assistance in gaining lawful access to the texts, I am grateful to the Norwegian Ministry of Foreign Affairs, the Norwegian Petroleum Directorate, and the English-Norwegian Parallel Corpus (ENPC) Project, in particular to Jarle Ebeling, Knut Hofland, and the late Stig Johansson. Warm thanks are also due to the Centre for Advanced Study at the Norwegian Academy of Science and Letters, where I spent one year in the initial stage of this project. I gratefully acknowledge useful comments from two anonymous reviewers on a previous version of this paper, and, finally, I am much indebted to Helge Dyvik for invaluable assistance, in particular for tailoring software to the recording and processing of empirical data.

Notes

[1] Rule-based MT is the classical approach to machine translation, where the translation procedure relies on information about source and target language and their interrelations. It contrasts with statistical MT (SMT), or modern machine translation, where translations are computed on the basis of statistical information about existing correspondences in large bodies of parallel texts. See Jurafsky and Martin (2009: 898).

[2] Cf. Dyvik (1998: 52) on the notion of linguistically predictable translation.

[3] Table 2 in section 4.4 presents the frequencies of the various correspondence types in this study. Similar results were found in Thunes (1998).

[4] The quotation is taken from Venuti (2000: 92), where an overview of the seven procedures is presented. Pages 31–42 of Vinay & Darbelnet (1995) are reprinted in Venuti (2000: 84–93).

[5] The syntactic criteria for identifying units of analysis are discussed further in chapter 4 in Thunes (2011).

[6] Thunes (2011: 275–288) provides a further discussion of text-typological differences between law and fiction.

[7] An example could be the Norwegian noun skiføre, found in Bjørg Vik’s text. This word has no match in English, and needs to be translated by a paraphrase, such as conditions for skiing.

[8] Cf. chapter 6 in Thunes (2011).

[9] Thunes (2011: 292–304) discusses several factors that may have contributed to the occurrence of type 4 cases in the investigated law texts.

[10] Chapter 6 in Thunes (2011) provides information on how the various types of semantic deviations are distributed within the collected data.

[11] Cf. the discussion of units of analysis in section 4.1.

References

Aijmer, Karin, Bengt Altenberg & Mats Johansson, eds. 1996. Languages in Contrast. Papers from a Symposium on Text-based Cross-linguistic Studies, Lund 4–5 March 1994. (=Lund Studies in English 88.) Lund: Lund University Press.

Azevedo, Flávia. In print. Investigating the Problem of Codifying Linguistic Knowledge in Two Translations of Shakespeare’s Sonnets: A Corpus-based Study. Ph.D. dissertation, Federal University of Santa Catarina, Florianópolis.

Bhatia, Vijay K. 2010. “Specification in legislative writing: accessibility, transparency, power and control”. In Coulthard & Johnson (eds.), 37–50.

Cao, Deborah. 2007. Translating Law. (=Topics in Translation 33.) Clevedon, Buffalo, & Toronto: Multilingual Matters.

Cao, Deborah. 2010. “Translating legal language”. In Coulthard & Johnson (eds.), 78–91.

Coulthard, Malcolm & Alison Johnson, eds. 2010. The Routledge Handbook of Forensic Linguistics. London & New York: Routledge.

Cyrus, Lea. 2006. “Building a Resource for Studying Translation Shifts”. In Proceedings of the Fifth International Conference on Linguistic Resources and Evaluation (LREC-2006), 1240–1245. Genoa, Italy.

Dyvik, Helge. 1990. The PONS Project: Features of a Translation System. (=Skriftserie fra Institutt for fonetikk og lingvistikk 39, B.) University of Bergen.

Dyvik, Helge. 1993. “Text Pair Mapper”. Ms., University of Bergen.

Dyvik, Helge. 1995. “Exploiting Structural Similarities in Machine Translation”. Computers and the Humanities 28: 225–234.

Dyvik, Helge. 1998. “A translational basis for semantics”. In Johansson & Oksefjell (eds.), 51–86.

Elgemark, Anna. In progress. To the Very End. A Study of N-Rhemes in English and Swedish Translations. Ph.D. dissertation, University of Gothenburg.

Hasselgård, Hilde. 1996. “Some methodological issues in a contrastive study of word order in English and Norwegian”. In Aijmer et al. (eds.), 113–126.

Hasselgård, Hilde. 1998. “Thematic structure in translation between English and Norwegian”. In Johansson & Oksefjell (eds.), 145–167.

Johansson, Stig & Signe Oksefjell, eds. 1998. Corpora and Cross-linguistic Research: Theory, Method, and Case Studies. (=Language and Computers: Studies in Practical Linguistics 24.) Amsterdam & Atlanta, Georgia: Rodopi.

Jurafsky, Daniel & James H. Martin. 2009. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second edition. Upper Saddle River, New Jersey: Pearson Education.

Károly, Krisztina & Ágota Fóris, eds. 2005. New Trends in Translation Studies. In Honour of Kinga Klaudy. Budapest: Akadémiai Kiadó.

Macken, Lieve. 2010. Sub-sentential Alignment of Translational Correspondences. Ph.D. dissertation. Antwerp: University Press Antwerp.

Merkel, Magnus. 1999. Understanding and Enhancing Translation by Parallel Text Processing. (=Linköping Studies in Science and Technology. Dissertation No. 607.) Linköping University.

Pym, Anthony. 2005. “Explaining Explicitation”. In Károly & Fóris (eds.), 29–43.

Silva, Norma Andrade da. 2008. Análise da tradução do item lexical evidence para o português com base em um corpus jurídico. Master’s thesis, Federal University of Santa Catarina, Florianópolis.

Thunes, Martha. 1998. “Classifying translational correspondences”. In Johansson & Oksefjell (eds.), 25–50.

Thunes, Martha. 2011. Complexity in Translation. An English-Norwegian Study of Two Text Types. Ph.D. dissertation, University of Bergen. 574 http://bora.uib.no/handle/1956/5179

Tucunduva, Camila de Andrade. 2007. Translating Completeness: A Corpus-based Approach. Master’s thesis, Federal University of Santa Catarina, Florianópolis.

Venuti, Lawrence, ed. 2000. The Translation Studies Reader. London & New York: Routledge.

Vik, Bjørg. 1979. En håndfull lengsel. Oslo: J. W. Cappelens Forlag.

Vik, Bjørg. 1979. Out of Season and Other Stories. Translated by David McDuff & Patrick Browne. London: Sinclair Browne.

Vinay, Jean-Paul & Jean Darbelnet. 1995. Comparative Stylistics of French and English: A Methodology for Translation. Translated and edited by Juan C. Sager & M.-J. Hamel. (=Benjamins Translation Library 11.) Amsterdam & Philadelphia: John Benjamins.