LipNet

LipNet is a deep neural network for visual speech recognition (lipreading). It was created by University of Oxford researchers Yannis Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. The model, described in a November 2016 paper,[1] decodes text from video of the movement of a speaker's mouth. Traditional visual speech recognition approaches split the problem into two stages: designing or learning visual features, and then predicting the spoken text from those features. LipNet was the first end-to-end sentence-level lipreading model, learning spatiotemporal visual features and a sequence model simultaneously.[2] Automated lipreading has enormous practical potential, with applications such as improved hearing aids, supporting the recovery and wellbeing of critically ill patients,[3] and speech recognition in noisy environments,[4] implemented, for example, in Nvidia's autonomous vehicles.[5]
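As a rough illustration of this end-to-end design, the sketch below combines spatiotemporal (3D) convolutions with a bidirectional recurrent layer and a connectionist temporal classification (CTC) loss, which is the general structure reported in the paper. The layer sizes, frame count, and mouth-crop resolution used here are illustrative assumptions, not the published configuration.

import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    # Spatiotemporal (3D) convolutions extract visual features from the video;
    # a bidirectional GRU models the character sequence over time.
    def __init__(self, vocab_size=28):  # assumed: 26 letters + space + CTC blank
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.gru = nn.GRU(input_size=64 * 8 * 12, hidden_size=256,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, vocab_size)  # per-frame character scores

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        feats = self.frontend(video)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)
        return self.classifier(out)  # (batch, frames, vocab_size)

# CTC loss aligns the per-frame predictions to the target transcript during
# training, so no frame-level labels are needed. The input sizes below
# (75 frames of 64x96 mouth crops) are assumed for illustration.
model = LipNetSketch()
video = torch.randn(2, 3, 75, 64, 96)
log_probs = model(video).log_softmax(-1).permute(1, 0, 2)  # CTCLoss expects (T, B, C)
targets = torch.randint(1, 28, (2, 30), dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 75, dtype=torch.long),
                           torch.full((2,), 30, dtype=torch.long))
loss.backward()

Because the CTC objective handles the alignment between video frames and characters, the convolutional feature extractor and the recurrent sequence model can be trained jointly from sentence-level transcripts alone, which is what distinguishes this end-to-end approach from the earlier two-stage pipelines.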

Notes and References

  1. Assael, Yannis M.; Shillingford, Brendan; Whiteson, Shimon; de Freitas, Nando (2016). "LipNet: End-to-End Sentence-level Lipreading". arXiv:1611.01599 [cs.LG].
  2. "AI that lip-reads 'better than humans'". BBC News. 8 November 2016.
  3. "Home". Liopa (website).
  4. Vincent, James (7 November 2016). "Can deep learning help solve lip reading?". The Verge.
  5. Quach, Katyanna. "Revealed: How Nvidia's 'backseat driver' AI learned to read lips". The Register (www.theregister.com).