How AI researchers obtain the necessary training data

Large amounts of data are required to train artificial intelligence (AI) algorithms. Because suitable data are not always available in the right form, researchers resort to workarounds. At the Conference on Empirical Methods in Natural Language Processing (EMNLP) in early November, experts presented a wide range of results built on sophisticated concepts for data collection. Technology Review online reports on this in "Tricks for collecting data".

Microsoft researchers, for example, wanted better data for evaluating utterances in "code-mixed" language, that is, speech that alternates between two languages. "Spanglish", a mixture of Spanish and English, occurs frequently in the real world but rarely in written texts. So the researchers fed English texts into a Spanish translation engine and spliced parts of the result back into the original, giving them as much Spanglish as they wanted.
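The splicing idea can be sketched in a few lines. This is a minimal illustration, not Microsoft's actual pipeline: the translation step is mocked with a tiny hand-made glossary (a real system would call a machine-translation model), and spans of the translated output are swapped back into the English original at random.

```python
import random

def mock_translate_to_spanish(words):
    # Stand-in for a real MT engine (assumption: word-level glossary,
    # purely for illustration).
    glossary = {"i": "yo", "want": "quiero", "to": "a", "eat": "comer",
                "dinner": "la cena", "now": "ahora"}
    return [glossary.get(w.lower(), w) for w in words]

def make_code_mixed(sentence, swap_prob=0.5, seed=0):
    """Generate synthetic Spanglish: translate the sentence, then
    splice randomly chosen Spanish tokens back into the English original."""
    rng = random.Random(seed)
    en = sentence.split()
    es = mock_translate_to_spanish(en)
    mixed = [es[i] if rng.random() < swap_prob else en[i]
             for i in range(len(en))]
    return " ".join(mixed)

print(make_code_mixed("I want to eat dinner now"))
```

Varying `swap_prob` controls how much of each sentence ends up in the second language, so one English corpus yields arbitrarily many code-mixed variants.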

AI researchers at Google, on the other hand, tried to automatically break long sentences into several short sentences with the same meaning, making them easier to understand. As a data source they used Wikipedia: the editing history of the online encyclopedia contains plenty of examples of linguistic improvements in which one sentence was split into shorter ones with the same content. This evaluation yielded 60 times more examples of split sentences, with 90 times more words, than the previous reference datasets for this task. A machine learning model trained on the new data achieved 91 percent accuracy.
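The mining step over edit histories can be illustrated with a simple heuristic. This is a sketch under stated assumptions, not Google's actual extraction method: sentences are split naively on periods, and a pair of revisions is flagged when two consecutive short sentences in the new revision jointly cover most of the words of one longer sentence in the old revision.

```python
import re

def sentences(text):
    # Naive sentence splitter (assumption: '.' terminates every sentence).
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

def word_overlap(a, b):
    # Jaccard similarity over lowercased tokens.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def find_split_pairs(old_text, new_text, threshold=0.6):
    """Return (long_sentence, [short1, short2]) pairs where an edit
    appears to have split one sentence into two with the same content."""
    pairs = []
    new_sents = sentences(new_text)
    for old in sentences(old_text):
        for i in range(len(new_sents) - 1):
            merged = new_sents[i] + " " + new_sents[i + 1]
            # Accept if the two short sentences jointly cover the long one
            # and each is genuinely shorter than the original.
            if (word_overlap(old, merged) >= threshold
                    and len(new_sents[i]) < len(old)
                    and len(new_sents[i + 1]) < len(old)):
                pairs.append((old, [new_sents[i], new_sents[i + 1]]))
    return pairs

old = "The cat sat on the mat and the dog barked loudly."
new = "The cat sat on the mat. The dog barked loudly."
print(find_split_pairs(old, new))
```

Run at scale over millions of revision pairs, a filter of this kind produces the long-sentence/short-sentences training pairs the article describes.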
