On Language Detection: Classification x User Consumption

At Dailymotion, we aim to serve the right content to the right person at the right time. Except for funny cat videos, whose cuteness is independent of any language, we want to deliver intelligible content that the user will actually enjoy. But nowadays, many people speak more than one language fluently, so we can serve multilingual content to part of our audience. On our side, we end up with two linked issues concerning language detection:

  • a content-centric issue: understanding what language a video is in, and whether language is crucial to it (for example, language is not crucial when you watch a football match, but it can be when you watch a political debate),
  • a user-centric issue: knowing which language(s) a user can understand.

In this article, we will cover the content-centric issue and how we determine what language a video is in. Note that our primary goal here is to minimize false positives in language detection, that is, cases where a video is labeled with a language it is not actually in.

