Study Shows Impact of Clean Data and Consolidation on Statistical Machine Translation Quality

New study into automated language translation highlights the need for clean and normalized data when sharing data. In cooperation with the Translation Automation User Society (TAUS), Asia Online conducted an experiment to determine the optimum approaches for building statistical machine translation (SMT) engines with shared data. The findings indicate that significant improvements in machine translation quality can be achieved with smaller pools of shared, clean data.

"Making industrial-strength machine translation available on-demand," Kirti Vashee, Vice President of Sales, Americas and Europe for Asia Online

The study shows that clean and normalized training data is key to high-quality translation output when building SMT engines. Therefore, activities associated with normalization and data cleaning should be integrated into best practices. Other related research suggests that as little as 10% of the data being in an unclean format can have a negative impact on translation quality.

In early 2009, Asia Online, in cooperation with TAUS, conducted an extensive experiment to determine the optimum way to build statistical machine translation (SMT) engines with pooled data resources. Development of SMT engines has been hindered by the lack of sufficient training data, and this experiment provides guidance on best practices for using and sharing pooled data resources. The final analysis and report are now available for download.

For the purposes of the experiment, three TAUS member companies in the same industry domain provided sets of training data to develop SMT engines. Each company was a multinational software organization. Asia Online performed extensive analysis on the data and created a total of 29 separate SMT engines by combining the data from the three companies in various configurations. It then performed evaluations of the output quality of all 29 engines using the BLEU and the F-Measure metrics.
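As a rough illustration of the kind of automatic scoring mentioned above, the following minimal sketch computes a word-level F-measure between one engine output and one reference translation. The release does not describe how the study computed its F-Measure, so the function name `unigram_f_measure`, the whitespace tokenization, and the lowercasing are all assumptions made for this example.

```python
from collections import Counter


def unigram_f_measure(candidate: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall (illustrative only)."""
    # Assumption: lowercase, whitespace-separated tokens stand in for real tokenization.
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Scores from a function like this are only comparable across engines when every engine is evaluated on the same test sentences and references.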

According to Kirti Vashee, Asia Online's VP of Enterprise Translation Sales, the study is of particular importance to organizations that are looking for ways to constrain the costs associated with large localization projects and are considering data sharing as a means to accomplish this. "While SMT holds great promise in reducing the overall cost and increasing the throughput of globalization initiatives, the cost/complexity of creating an SMT engine has been a barrier, mainly because of the large volumes of training data that are needed. General wisdom holds that the more data available when training the engine, the higher the translation quality of the SMT engine. However, just how much data is needed and of what type had not been rigorously tested or understood. This study dispels some myths about what is really needed to build a quality SMT engine and provides some evidence for what does work."

"The study shows that clean and normalized training data is key to high-quality translation output when building SMT engines. Therefore, activities associated with normalization and data cleaning should be integrated into best practices. Other related research suggests that as little as 10% of the data being in an unclean format can have a negative impact on translation quality," said Mr. Vashee.

Key Finding 1: The best results are achieved by combining clean, mostly normalized sets of data. In general, un-normalized raw translation memory (TM) data produces lower-quality engines than clean, normalized data. Significantly more un-normalized data is required to reach the translation quality of engines built with much smaller amounts of clean, normalized data.

Put simply, this means that for most organizations, effort should be placed on preparing (and / or sharing) pools of consistent, clean normalized data rather than just attempting to locate large amounts of raw data.
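The release does not list the cleaning and normalization steps used in the study. As a hedged sketch of what preparing translation memory data can involve, the example below applies a few common steps to source/target segment pairs: Unicode normalization, removal of inline markup, whitespace collapsing, and dropping empty or duplicate pairs. The names `normalize_segment` and `clean_pairs` and the specific rules are assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumption: inline formatting appears as simple angle-bracket tags such as <b>...</b>.
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_segment(text: str) -> str:
    """Apply simple, illustrative normalization to one segment."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = TAG_RE.sub("", text)                 # remove inline markup tags
    return WHITESPACE_RE.sub(" ", text).strip() # collapse runs of whitespace


def clean_pairs(pairs):
    """Normalize (source, target) pairs, dropping empty and duplicate ones."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        pair = (normalize_segment(source), normalize_segment(target))
        if not pair[0] or not pair[1]:
            continue  # one side is empty after cleaning
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        cleaned.append(pair)
    return cleaned
```

Real translation memory cleanup typically goes further, for example by checking terminology consistency, which the findings below single out as important.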

Key Finding 2: The use of consolidated raw normalized data only improves engine quality when terminology is highly consistent across the shared data sets. Thus, a focus on terminological consistency is likely to yield significant benefits. This also means that there is much less measurable benefit to using consolidated data with un-normalized raw data. In fact, adding un-normalized raw data to a smaller pool of normalized clean data can negatively impact the final engine quality.

For organizations looking to share or pool data, this suggests that each contributing set of data needs to be analyzed, cleaned and normalized before meaningful benefits occur.

Key Finding 3: The addition of clean baseline datasets can improve the fluency of the final translations and fill gaps in vocabulary that were not in the training datasets.

"With smaller training datasets, such as those provided by the three TAUS members, there are frequently gaps in vocabulary. Significant quantities of data are required to provide broad coverage of vocabulary. The study provides evidence that when customer training data is combined with baseline data, even if it is from a different domain, a wider range of vocabulary is covered, thereby increasing translation quality." said Mr. Vashee.

In conclusion, the study reveals that the quality and type of training data matter. Small amounts of clean training data can produce better-quality engines than is possible with much larger amounts of raw translation memory data. However, data volume is also a key driver of quality, and large amounts of even moderately dirty data could, at some point, surpass systems built on very small amounts of clean data. But given the same amount of training data, clean normalized datasets will always provide higher-quality results.

The report, entitled "Study on the Impact of Data Consolidation and Sharing for Statistical Machine Translation," is available for download.

About Asia Online
Asia Online Pte. is an international web portal and machine translation technology company.

Asia Online's unique services enable people to transcend language as a barrier to knowledge by providing unrivaled access to the limitless store of English-language content on the Internet, in their language of choice.

Our mission is quite simple: to develop and apply solutions to unite communities across language barriers for ten Asian languages and more than twenty-three European languages. The company currently boasts support for more than 500 different language pairs, including Asian and European languages, which it offers both as a product and via its online enterprise translation portal. Translation services are targeted towards mass translators and the global language services industry.

Asia Online's primary focus is delivering huge amounts of content in local languages. In doing so, it has created a core technological infrastructure that enables massive translation projects to be undertaken. Asia Online works with language service providers and publishers, using this infrastructure to support continuous, real-time corrective improvements aimed at delivering machine translation quality that is second to none.

Formed in 2006, Asia Online is a privately owned company backed by a number of individual investors and institutional venture capital. It is headquartered in Singapore, with operational headquarters in Bangkok, Thailand, from where it conducts R&D and daily business operations. Asia Online currently employs more than 400 full- and part-time staff and is in the process of being incorporated in an additional 10 Asian countries.



Contact Author

Dion Wiggins