AI + Voice Cloning + Face Animation + NLP
An AI memory-preservation toolkit that brings old photos to life. Upload a photograph, feed it stories and voice recordings, and NostalgiQ generates a talking, animated version of that person, preserving their likeness, personality, and voice for future generations.
01 Overview
NostalgiQ combines face detection, voice cloning, talking head generation, and personality prediction into a single pipeline. Give it a photo and some text or audio, and it produces a video of that person speaking in their own voice with their own mannerisms. Built for families preserving the memory of loved ones.
02 The Pipeline
The system processes inputs through a multi-stage pipeline: face analysis extracts identity and features, text/audio is processed for voice synthesis, and a talking head model animates the face to match the speech.
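The three stages described above can be sketched as a simple orchestration in Python. The function and class names here are illustrative, not the project's actual API; each stage is a stub standing in for the real model calls.

```python
from dataclasses import dataclass

# Hypothetical stage interfaces -- names are illustrative stand-ins
# for the real face, voice, and animation modules.

@dataclass
class FaceProfile:
    identity: str
    crop_path: str

def analyze_face(photo_path: str) -> FaceProfile:
    """Stage 1: detect and crop the face, extract identity features."""
    return FaceProfile(identity="person_0", crop_path="crops/person_0.jpg")

def synthesize_speech(text: str, voice_id: str) -> str:
    """Stage 2: render the text as audio in the cloned voice."""
    return f"audio/{voice_id}.wav"

def animate_face(face: FaceProfile, audio_path: str) -> str:
    """Stage 3: drive the cropped face with the synthesized audio."""
    return f"videos/{face.identity}.mp4"

def run_pipeline(photo_path: str, text: str, voice_id: str) -> str:
    face = analyze_face(photo_path)
    audio = synthesize_speech(text, voice_id)
    return animate_face(face, audio)

video = run_pipeline("grandma.jpg", "Hello, dear.", "voice_abc")
```

The point of the sketch is the data flow: the face profile and the synthesized audio are produced independently, then joined by the animation stage.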
03 Core Modules
Face analysis: detects faces in photos and videos using InsightFace, DeepFace, and MediaPipe. Clusters identities across multiple images, estimates age, extracts facial landmarks, and generates scene descriptions with CLIP. Object detection via YOLOv8 and text extraction via EasyOCR provide additional context. Outputs cropped faces and a metadata.json with the full analysis.
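The exact schema of metadata.json isn't shown here, but a plausible shape follows from the outputs the module describes (crops, age estimate, landmarks, a CLIP scene caption, YOLO objects, OCR text). All field names below are assumptions for illustration:

```python
import json

# Illustrative metadata.json payload -- field names are assumptions
# inferred from the face module's described outputs.
metadata = {
    "faces": [
        {
            "identity": "person_0",
            "crop": "crops/person_0.jpg",
            "estimated_age": 62,
            "landmarks": "landmarks/person_0.npy",
        }
    ],
    "scene_description": "an old family photo in a garden",
    "objects": ["person", "bench"],
    "ocr_text": [],
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```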
Talking head generation: three engines are available. SadTalker runs locally and produces realistic talking head videos from a single image plus audio. The HeyGen API generates video in the cloud from a public image URL and text. The D-ID API serves as an alternative cloud option. Each takes a still photograph and produces a video of that person speaking.
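With three interchangeable backends, a natural design is a small dispatcher that puts them behind one interface. The class and method names below are illustrative, and each `render` body is a stub, not a real integration:

```python
# Minimal engine dispatcher -- a sketch of how the three backends
# could share one interface; names and signatures are assumptions.

class SadTalkerEngine:
    def render(self, image_path: str, audio_path: str) -> str:
        # Local inference: image file + audio file in, video file out.
        return "out/sadtalker.mp4"

class HeyGenEngine:
    def render(self, image_url: str, text: str) -> str:
        # Cloud API: public image URL + text in, hosted video URL out.
        return "https://example.com/heygen.mp4"

class DIDEngine:
    def render(self, image_url: str, audio_url: str) -> str:
        # Alternative cloud option with a similar contract.
        return "https://example.com/did.mp4"

ENGINES = {"sadtalker": SadTalkerEngine, "heygen": HeyGenEngine, "d-id": DIDEngine}

def get_engine(name: str):
    try:
        return ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown engine: {name}") from None

video = get_engine("sadtalker").render("crops/person_0.jpg", "audio/clone.wav")
```

Keeping the engines behind a common `render` call lets the rest of the pipeline stay ignorant of whether generation happens locally or in the cloud.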
Voice and language: ElevenLabs voice cloning creates a synthetic voice from audio samples. Whisper handles speech-to-text transcription. Gemini generates conversational responses in the style of the person, based on their writing samples and personality profile.
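For the speech-synthesis step, a minimal sketch of an ElevenLabs text-to-speech call can be built with the standard library. The endpoint path, `xi-api-key` header, and `model_id` value follow ElevenLabs' public REST API as I understand it, but should be verified against current documentation; the request is constructed here without being sent:

```python
import json
import urllib.request

def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request.

    Endpoint and field names follow ElevenLabs' public REST API at the
    time of writing; confirm against the official docs before use.
    """
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({
        "text": text,
        "model_id": "eleven_multilingual_v2",  # model choice is an assumption
    }).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_tts_request("voice_abc", "Hello from the past.", "sk-...")
```

Sending the request returns audio bytes in the cloned voice, which the talking head stage then consumes.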
1. Upload a photo: a photograph of the person. The face pipeline detects, crops, and analyzes the face automatically.
2. Provide voice or text: audio recordings for voice cloning, or text samples for personality prediction and speech synthesis.
3. Animate: SadTalker, HeyGen, or D-ID animates the face to match the synthesized speech. The result is a video of the person talking.
4. Converse: ask questions and receive responses in the person's voice and personality, powered by Gemini and the NLP personality model.
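The conversational step chains three of the modules above: transcribe the question, generate a personality-conditioned reply, then voice it. The sketch below stubs all three calls; the real Whisper, Gemini, and ElevenLabs integrations would replace the stand-in bodies:

```python
# Sketch of the interactive loop: question audio -> text (Whisper) ->
# personality-conditioned reply (Gemini) -> cloned-voice audio.
# All three functions are stubs; names and signatures are assumptions.

def transcribe(audio_path: str) -> str:
    """Whisper stand-in: speech-to-text."""
    return "What was your wedding day like?"

def generate_reply(question: str, persona: str) -> str:
    """Gemini stand-in: reply in the person's style."""
    return f"[{persona}] It rained, and we laughed the whole way home."

def speak(text: str, voice_id: str) -> str:
    """ElevenLabs stand-in: text to cloned-voice audio file."""
    return f"audio/reply_{voice_id}.wav"

def converse(audio_path: str, persona: str, voice_id: str) -> str:
    question = transcribe(audio_path)
    reply = generate_reply(question, persona)
    return speak(reply, voice_id)
```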
04 Technology Stack
Face analysis: InsightFace and DeepFace for detection and identity clustering. MediaPipe for landmark extraction. CLIP for scene understanding. YOLOv8 for object detection.
Voice cloning: the ElevenLabs API clones a voice from audio samples. The synthetic voice speaks new text while preserving the original tone, cadence, and character.
Talking head generation: SadTalker for local inference, HeyGen and D-ID for cloud-based video generation. Each takes a still image and audio to produce a realistic speaking video.
NLP: transformer-based personality prediction from text samples. Gemini generates responses matching the predicted personality. Whisper transcribes audio to text.
Frontend: React/TypeScript interface (App.tsx) for uploading photos, recording audio, and viewing the generated talking portraits. Clean, emotional UI design.
Backend: Python Flask backend orchestrating the pipeline. It manages model inference, API calls, and file processing, and serves the generated video output.
05 Skills