The story of TER: A fair and workable remuneration model for machine translation post-editing effort?

Repurposing Translation Edit Rate (TER)

A fair and workable remuneration model for machine translation post-editing (MTPE) has long been sought after. Translation Edit Rate (TER), although originally invented to measure machine translation engine quality, has recently been repurposed for measuring post-editing effort. In this post, you will find information and recommendations concerning TER. We start with what TER actually is, continue with how it is used today, and conclude with the caveats to keep in mind when implementing TER successfully.

TER Origins

The truth is that TER can hardly be considered a recent innovation, even though many translation buyers and large language service providers are only now starting to implement it. TER has actually been around for a while.

In 2006, an article entitled A Study of Translation Edit Rate with Targeted Human Annotation was published. The authors were Matthew Snover and Bonnie Dorr of the University of Maryland and Richard Schwartz, Linnea Micciulla and John Makhoul of BBN Technologies. In their pioneering article, TER was defined as a “measure for evaluating machine-translation output.” In essence, TER was created to answer the question of whether a machine translation (MT) output is of adequate quality. It did so by comparing the MT output to a human reference translation, calculating the minimum number of edits needed to make the MT output identical to the human reference.

Before We Calculate TER

TER is based on edit distance. The edit distance between two strings is the minimum number of editing operations needed to transform one string into the other. The editing operations (edits) are SUBSTITUTION, INSERTION, DELETION, and SHIFT.

The unit of measurement of edit distance calculation can be either a character or a word. For MTPE effort evaluation, the word is currently the preferred unit.

In MTPE, edit distance means the minimum number of edits that must be performed on the MT string to make it exactly match the post-edited string.

TER in MTPE is calculated by dividing this edit distance by the total number of units in the string. Later on, we discuss what should be used as the total number of units in the denominator – whether it should be based on the original machine-translated string or on the post-edited one.
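To make this concrete, below is a minimal Python sketch of word-level edit distance and the resulting TER. It is a simplification: it covers only substitutions, insertions, and deletions, while the shift operation requires a more involved search that dedicated tools (such as the authors’ original tercom program) implement. The function names are our own.

```python
def word_edit_distance(hyp: str, ref: str) -> int:
    """Minimum number of word-level substitutions, insertions and
    deletions needed to turn hyp into ref (shifts not modelled)."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits needed to turn the first i words of h
    # into the first j words of r
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i                      # delete every remaining word
    for j in range(len(r) + 1):
        dp[0][j] = j                      # insert every missing word
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[len(h)][len(r)]


def ter(hyp: str, ref: str) -> float:
    """Edit distance divided by the wordcount of the reference."""
    return word_edit_distance(hyp, ref) / len(ref.split())
```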

But first, let’s look at some history.

TER Calculation

As mentioned above, there are four types of edits: substitution, insertion, deletion, and shift. The minimum number of necessary edits was initially calculated at the character level, using Levenshtein distance. In their study, however, Snover, Dorr and colleagues introduced a Java program that calculates the minimum number of edits at the word level.

To measure the quality of MT output, they divided the number of edits by the number of words in the final correct reference sentence. For example:

Source: Das Kostenelement %2 zur Ressourcenzuordnung %1 wurde nicht gefunden.

MT Output: Elemento di costo %2 della risorsa %1 non trovato.

Human reference: Elemento di costo %2 dell’attribuzione risorse %1 non trovato.

You can see that 2 edits have been made (substitution of della with dell’attribuzione and of risorsa with risorse), and the number of words in the human reference sentence is 9. Thus, TER = 2/9 = 0.22, i.e. 22%. In other words, the machine hit seven words out of nine, which represents a 78% match in relation to the final correct sentence.
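Using the sketch above, the same arithmetic can be reproduced. Note that this naive whitespace tokenization keeps punctuation attached to words; a proper TER tokenizer treats punctuation as separate tokens, which would slightly change the counts.

```python
mt  = "Elemento di costo %2 della risorsa %1 non trovato."
ref = "Elemento di costo %2 dell’attribuzione risorse %1 non trovato."

print(word_edit_distance(mt, ref))  # 2 (della -> dell’attribuzione, risorsa -> risorse)
print(round(ter(mt, ref), 2))       # 0.22, i.e. 2 edits / 9 reference words
```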

Naturally, as a rule of thumb, the closer TER is to zero the better the quality of the machine output.

TER Today

At the 2021 GlobalinkNEXT virtual conference, Matt Hauser – Senior Vice President of TransPerfect, one of the biggest translation companies in the world – ranked TER among the top innovations that are changing, and will continue to change, the localization business in the near future.

The current TER calculation is still basically the same as the one introduced by Snover, Dorr and colleagues, and their “edits at the word level” approach has become the standard in measuring post-editing effort. The truth is that TER considers only the number of edits made, not the varying complexity of the individual post-editing steps (sometimes modifying a single verb in a sentence can take much longer than completely rewriting a long sentence). Nevertheless, TER has become the fairest model available for rewarding post-editing services. We have nothing better.

However, in order to fairly reflect the amount of post-editing work done, it is necessary to understand TER and calculate it correctly. While TER was originally used primarily to measure the quality of MT output, today it is used to measure the amount of work done by the post-editor – the exact opposite of what TER was created for.

Many users are not aware of the important difference between measuring post-editing effort by dividing the number of edits by the wordcount of the MT output sentence, and dividing it by the wordcount of the final post-edited sentence. The following paragraphs look at this in more detail.

From Machine Translation Output Quality Evaluation to a New Machine Translation Post-Editing Pricing Model

How TER influences the MTPE payment scheme

The most common payment scheme used for MTPE is based on a discount per source word. The MT discount is usually set as a flat rate for a certain type of content and agreed in advance – e.g. a 30% discount off the basic per-source-word translation rate for technical documentation, in a defined language combination. Consequently, it may happen that a particular MT output is of very poor quality and the post-editor ends up spending much more time on it than expected. In such a case, the post-editor may afterwards demand full payment rather than 70%. Or, conversely, it can happen that a machine translation output is very useful, and the post-editor saves much more time than planned.
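The arithmetic of this flat-discount scheme is simple, and all the costs are known up front. A sketch with purely illustrative numbers:

```python
# Flat-discount MTPE cost: everything is known before the project starts.
# All the numbers below are illustrative assumptions.
base_rate = 0.10          # agreed basic translation rate per source word
mt_discount = 0.30        # 30% flat MT discount for this content type
source_wordcount = 10_000

cost = source_wordcount * base_rate * (1 - mt_discount)
print(cost)               # 700.0 - fixed, regardless of actual MT quality
```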

TER has the potential to change this paradigm of pre-agreed flat-rate MT discounts. While with the pre-agreed MT discount model everyone knows from the very beginning what the final cost will be, with the TER methodology the real cost can be calculated only after the post-editing is completed – a logical downside we return to below. The MT output is compared with the final post-edited translation, TER is calculated for each sentence separately, and the machine translation post-editing discounts are then applied per sentence according to a pre-agreed TER pricing matrix.

TER Paradoxes

Let’s assume that we have agreed on a TER pricing matrix in which, among other bands, 20<x<=25 carries a 50% discount, 25<x<=30 a 30% discount, and x>40 no discount (i.e. the full basic rate).

And now we need to calculate payment for the following post-edited Russian sentence:

Source: Material: Neoprene

MT Output: Материал: неопрен

Post-edit: Материал: Неопрен

One edit (a substitution – the change to an initial capital letter) divided by the post-edit wordcount (two words) equals a TER of 0.5, or 50%. Strangely enough, this sentence thus falls into the x>40 category and will be paid at the full basic rate; no MT discount is applicable. In this case, post-editors seem to have a clear advantage, don’t they? Nevertheless, the TER calculated for the next sentence in the same text may turn this advantage upside down.

Some industry experts argue that since the post-editor works on the MT output, the calculated edits should be divided by the wordcount of the MT sentence, not by the wordcount of the final post-edited sentence. And we agree with them. To understand this proposed change, and to demonstrate what difference it makes to TER and to a payment calculated using the TER pricing matrix above, we can use the following German–French example.

Source: Diese Seite unterstützt Sie bei der Migration der Picklist-URL in Attributgruppenpositionen.

MT Output: Cette page vous aide à migrer l’URL de la liste de sélection dans les positions de groupe d’attributs.

Post-edit: Cette page vous assiste lors de la migration des URL de la liste de sélection dans des positions de groupe d’attributs.

1. Number of edits divided by the number of words (18) in the machine-translated sentence: TER = 5/18 = 0.28 (the 25<x<=30 category). The relevant discount is 30% of the full amount based on the source sentence wordcount.

2. Number of edits divided by the number of words (21) in the final post-edited sentence: TER = 5/21 = 0.24 (the 20<x<=25 category). The relevant discount is 50% of the full amount based on the source sentence wordcount.
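To make the difference explicit, here is a sketch of the per-sentence discount lookup. Only three bands of the pricing matrix are given in this post (20<x<=25, 25<x<=30, and x>40); the remaining bands below are placeholder assumptions.

```python
PRICING_MATRIX = [           # (upper bound of TER in %, MT discount)
    (10, 0.70),              # assumed
    (20, 0.60),              # assumed
    (25, 0.50),              # given in the text
    (30, 0.30),              # given in the text
    (40, 0.15),              # assumed
    (float("inf"), 0.00),    # x > 40: full basic rate (given in the text)
]

def discount_for(ter_percent: float) -> float:
    """Return the MT discount for a sentence with the given TER (in %)."""
    for upper, discount in PRICING_MATRIX:
        if ter_percent <= upper:
            return discount
    return 0.0               # unreachable: the last band is open-ended

edits, mt_words, pe_words = 5, 18, 21
print(discount_for(100 * edits / mt_words))  # TER 27.8 -> 0.3 (30% discount)
print(discount_for(100 * edits / pe_words))  # TER 23.8 -> 0.5 (50% discount)
print(discount_for(100 * 1 / 2))             # the Russian example: TER 50 -> 0.0, full rate
```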

As seen above, the difference can be quite significant. Therefore, it is important to understand the original purpose of TER vs. its current usage, and to update the math behind it accordingly.

Some top Fortune companies have adopted the original Snover and Dorr approach and divide the number of edits by the wordcount of the resulting post-edited segment, while others divide the number of edits by the MT segment wordcount. And some have solved the wordcount issue in their own way: eBay, for example, divides by either the MT wordcount or the post-edit wordcount, whichever is larger, so that the result always falls between 0 and 1, i.e. between 0 and 100%.*
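The three denominator conventions mentioned above can be captured in a single hypothetical function:

```python
def ter_percent(edits: int, mt_words: int, pe_words: int,
                denominator: str = "mt") -> float:
    """TER in %, with a choice of denominator: the MT wordcount,
    the post-edit wordcount, or (eBay-style) the larger of the two,
    which also keeps the result between 0 and 100."""
    n = {"mt": mt_words,
         "pe": pe_words,
         "max": max(mt_words, pe_words)}[denominator]
    return 100 * edits / n

print(ter_percent(5, 18, 21, "mt"))    # ~27.8
print(ter_percent(5, 18, 21, "pe"))    # ~23.8
print(ter_percent(5, 18, 21, "max"))   # ~23.8 (divides by the larger count)
```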

* You can find more details here.

What to Be Aware Of

Despite all the above-mentioned issues, the undeniable advantage of the TER-based approach is that it attempts to measure the actual work done per sentence. This impresses most post-editors, who clearly appreciate the benefit compared to the uncertainty of flat-discount MT models. However, where the flat-rate MT discount is ideal in terms of cost projection and budgeting (the base rate, source wordcount, and MT discount are all known in advance), TER is a complete disaster: until the project is finished, you cannot perform the TER calculations and therefore cannot know the costs, cannot issue purchase orders, and so on. This is one of the reasons why some companies have opted for a hybrid solution – one, however, not much welcomed by translation teams. These companies regularly perform a backward TER summarization once per longer period (e.g. per quarter) on the texts post-edited within that period. The TER results are averaged and used as the basis of a new flat-rate discount for the upcoming period.
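A sketch of that hybrid calculation, with purely illustrative segment data and reusing the discount_for lookup from the pricing-matrix sketch above:

```python
# (edits, MT wordcount) per post-edited segment from the past quarter
past_segments = [(2, 9), (5, 18), (0, 12), (1, 2), (3, 10)]   # illustrative

# Wordcount-weighted average TER over the whole period, in %
total_edits = sum(e for e, _ in past_segments)
total_words = sum(w for _, w in past_segments)
average_ter = 100 * total_edits / total_words

# The averaged TER becomes the flat discount for the upcoming period
new_flat_discount = discount_for(average_ter)
print(round(average_ter, 1), new_flat_discount)   # 21.6 -> 0.5
```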

Unfortunately, some post-editors may then once again be assigned MT output of much lower quality than the new average TER corresponds to, which can be quite disappointing.

Conclusion

1. Although TER can only be applied ex-post, i.e. it offers no foresight, it is currently the fairest method available for measuring and remunerating machine translation post-editing services.

2. TER is to be calculated per individual sentence to make full use of its benefits.

3. The original approach to calculating TER, as developed by Snover and Dorr, needs to be adjusted to reflect the usage shift from machine translation output quality evaluation to evaluation of real post-editing effort. Therefore, we recommend that the number of edits be divided by the MT sentence wordcount, not by the post-edited sentence wordcount.

4. Implementing TER into the processes of translation buyers is anything but easy and may require significant changes to an existing workflow.

5. The TER pricing matrix and the TER calculation formula can be complex to understand, which leaves space for a wide range of (perhaps sometimes wild) payment experimentation.

If you would like to experiment with TER yourself, this is a good place to start. You can also find various Java implementations of TER calculation on GitHub, for example. For the sake of our own efficiency, we have developed a tool that allows us and our clients to calculate TER very easily for both Asian (character-level) and European (word-level) languages. If you would like to implement TER into your processes as well, or to learn more, do get in touch – we can help with that. No strings attached.

Team exe
