Open-source GazeGPT

Analyzing and Selecting Text-to-Speech, Vision-Language, and Speech-to-Text Models

June 21, 2024

Abstract

This literature review presents the analysis and selection of text-to-speech, vision-language, and speech-to-text models to replace the proprietary models used in GazeGPT. GazeGPT is a system that combines voice-enhanced smart glasses with eye tracking, functioning as a personal assistant aware of the user's gaze within a scene. Wearing a head-mounted eye tracker, users direct their gaze towards objects of interest and pose verbal questions. These questions, captured via a microphone in the headphones, are processed in combination with the gaze data to infer the user's intent. The scene camera integrated with the eye tracker captures the user's view, and the gaze data pinpoints the area of interest within it. This multimodal input, comprising the visual scene and the audio query, is processed by a vision-language model (VLM) to deliver precise, context-aware responses directly into the user's ears via speakers integrated into the glasses.

GazeGPT currently relies on proprietary models: ChatGPT as its VLM and ElevenLabs for its text-to-speech functionality. In line with the goal of democratizing AI and enabling free research, this literature review considers open-source models exclusively. It also aims to deepen the reader's understanding of these models by dissecting them into their fundamental components. The author believes this approach is valuable because new, often overly complex models emerge rapidly and are not always presented in an easily understandable manner; despite their complexity, these models rely on the same basic principles.

Readers should come away from this literature review with a comprehensive understanding of the history of deep learning, including the progression from untrainable to trainable multi-layer perceptrons, the initial divergence of the field into natural language processing and computer vision, and the subsequent convergence through the transformer architecture. The review also details the author's choices for the best text-to-speech, vision-language, and speech-to-text models, along with their inner workings, including those of the transformer architecture.
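
To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of one assistant turn. The functions `transcribe`, `vlm_answer`, and `synthesize` are hypothetical placeholders for whichever open-source speech-to-text, vision-language, and text-to-speech models the review selects; the gaze-based cropping uses Pillow with a fixed 512-pixel window. These are illustrative choices, not the actual GazeGPT implementation.

```python
# A minimal sketch of the GazeGPT pipeline described above.
# The three model calls are hypothetical placeholders to be replaced by the
# open-source STT, VLM, and TTS models selected in the literature review.
from PIL import Image


def transcribe(audio_path: str) -> str:
    """Speech-to-text placeholder: substitute the selected open-source STT model."""
    raise NotImplementedError


def vlm_answer(image: Image.Image, question: str) -> str:
    """Vision-language placeholder: substitute the selected open-source VLM."""
    raise NotImplementedError


def synthesize(text: str) -> bytes:
    """Text-to-speech placeholder: substitute the selected open-source TTS model."""
    raise NotImplementedError


def crop_around_gaze(scene: Image.Image, gaze_xy: tuple[int, int],
                     box_size: int = 512) -> Image.Image:
    """Crop a fixed-size region of the scene image centred on the gaze point."""
    x, y = gaze_xy
    half = box_size // 2
    left = max(0, min(x - half, scene.width - box_size))
    top = max(0, min(y - half, scene.height - box_size))
    return scene.crop((left, top, left + box_size, top + box_size))


def answer_gaze_query(scene_path: str, gaze_xy: tuple[int, int],
                      audio_path: str) -> bytes:
    """Run one assistant turn: STT -> gaze crop -> VLM -> TTS."""
    question = transcribe(audio_path)                           # 1. speech-to-text
    region = crop_around_gaze(Image.open(scene_path), gaze_xy)  # 2. gaze-based crop
    answer = vlm_answer(region, question)                       # 3. vision-language
    return synthesize(answer)                                   # 4. text-to-speech
```

Cropping around the gaze point before invoking the VLM is what grounds the answer in the object the user is actually looking at, rather than in the full scene.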

Literature review

The literature review can be found in the following downloadable PDF:

Code implementation

The code for this blog post is available here: