
Fellowship: Machine Learning Specialist for Speech Recognition

  • WIPO, Geneva, Switzerland

The position is located in the Advanced Technology Applications Center (ATAC) of the Global Database Division, Global Infrastructure Sector. The Section is responsible for providing machine learning support to other database sections, to other WIPO Divisions and to external entities. The Section has developed its own machine learning tools for WIPO and for other international organizations: an automatic translation tool (WIPO Translate), image similarity for trademarks, and speech-to-text. ATAC wishes to continue exploring the applicability of machine learning to other activities. A first speech recognition application has been developed in-house, using open-source software such as Returnn and Kaldi.

The incumbent is responsible for exploring the use of speech recognition in the context of WIPO meetings. WIPO hosts many meetings (about 150 every year) and needs a written transcript of each conference (at least in English). Since 2011, the transcripts have been provided by an external company, and both the audio recordings and the transcripts have been archived. WIPO began exploring machine learning to make use of this archive as training data and to experiment with training its own speech recognition software. The proof of concept has been successful, and WIPO now needs to improve the tool (adding new languages, improving quality, adding new functionality, etc.).

WIPO has six official languages (Arabic, Spanish, English, French, Russian and Chinese). Some audio and transcripts are available in each of these languages; however, English is the most prominent, and the project will start with it. WIPO has a large amount of training data at its disposal (about 3,600 hours).

The incumbent will work on this new project: take responsibility for gathering the data (audio, transcripts and meta-data), explore various speech recognition techniques to apply to this data, define a roadmap and, if feasible, build a first running prototype.

The incumbent works under the direct supervision of the machine learning researcher, in close collaboration with the rest of the ATAC team. 

2. Duties and Responsibilities

The incumbent will perform the following principal duties:

(a) Collect all the data that can be useful for training a model:

  • Audio files: analyze current format(s), and propose a standard way to store and retrieve data

  • Transcripts: analyze current formats, define a standard for storing and retrieving the texts, and link the texts with the corresponding audio and meta-data

  • Meta-data: during conferences, some additional data is recorded (date, time, meeting, speaker’s country, etc.). This information might be valuable for the machine learning process

(b) Study the current state-of-the-art technology, evaluate the pros and cons of various methods, and define a strategy for experimenting with one or a few technologies in our context.

(c) Design a machine learning algorithm, carefully define the training data set, develop cleaning and filtering methods for the data, and perform quantitative and qualitative analysis. Note that the available transcripts of speeches might be of medium quality.

(d) Explore existing tools for aligning audio and transcripts (transcript synchronization).

(e) Collaborate with the conference division to define possible areas of application of the speech recognition tool (automatic transcription, automatic subtitling, indexing audio with the corresponding text in a search engine, etc.).

(f) If feasible, develop new proof-of-concept prototype(s) to demonstrate the applicability and pertinence of speech recognition (including for non-English languages).

(g) Advise the team (and the conference division stakeholders) on ways to enrich the collected data for machine learning. Explore the use of external resources that could improve the machine learning models (publicly available audio/text corpora).

(h) Study the feasibility of integrating automatic direct speech recognition (“live captioning”).

(i) Study the extension of the project towards applications in automatic direct speech translation (speech-to-translated-text).

(j) Improve the segmentation module for “live captioning”.

(k) Explore different techniques to perform speaker change detection / diarization.

(l) Participate in the integration of speech-to-text software into the WIPO environment (Arbor Media).

(m) Perform other related tasks as required.

For more information and to apply, please click here.
