冯杨

Synced with the official GitHub repository, up to commits on Apr 18, 2025

a9c36c76e569107b5a39b3de8afd6e016b24d662
@@ -15,4 +15,8 @@ pretrained @@ -15,4 +15,8 @@ pretrained
15 *.mp4 15 *.mp4
16 .DS_Store 16 .DS_Store
17 workspace/log_ngp.txt 17 workspace/log_ngp.txt
18 -.idea  
  18 +.idea
  19 +
  20 +models/
  21 +*.log
  22 +dist
  1 +Real-time interactive streaming digital human with synchronized audio and video dialogue. It can essentially achieve commercial-grade results.
  2 +
  3 +[Effect of wav2lip](https://www.bilibili.com/video/BV1scwBeyELA/) | [Effect of ernerf](https://www.bilibili.com/video/BV1G1421z73r/) | [Effect of musetalk](https://www.bilibili.com/video/BV1gm421N7vQ/)
  4 +
  5 +## News
  6 +- December 8, 2024: Improved multi-concurrency; GPU memory no longer grows with the number of concurrent connections.
  7 +- December 21, 2024: Added model warm-up for wav2lip and musetalk to solve the problem of stuttering during the first inference. Thanks to [@heimaojinzhangyz](https://github.com/heimaojinzhangyz)
  8 +- December 28, 2024: Added the digital human model Ultralight-Digital-Human. Thanks to [@lijihua2017](https://github.com/lijihua2017)
  9 +- February 7, 2025: Added fish-speech TTS
  10 +- February 21, 2025: Added the open-source model wav2lip256. Thanks to @不蠢不蠢
  11 +- March 2, 2025: Added Tencent's speech synthesis service
  12 +- March 16, 2025: Supports Mac GPU inference. Thanks to [@GcsSloop](https://github.com/GcsSloop)
  13 +
  14 +## Features
  15 +1. Supports multiple digital human models: ernerf, musetalk, wav2lip, Ultralight-Digital-Human
  16 +2. Supports voice cloning
  17 +3. Supports interrupting the digital human while it is speaking
  18 +4. Supports full-body video stitching
  19 +5. Supports RTMP and WebRTC
  20 +6. Supports video arrangement: plays custom videos when not speaking
  21 +7. Supports multi-concurrency
  22 +
  23 +## 1. Installation
  24 +
  25 +Tested on Ubuntu 20.04, Python 3.10, PyTorch 1.12, and CUDA 11.3
  26 +
  27 +### 1.1 Install dependencies
  28 +
  29 +```bash
  30 +conda create -n nerfstream python=3.10
  31 +conda activate nerfstream
  32 +# If the cuda version is not 11.3 (confirm the version by running nvidia-smi), install the corresponding version of pytorch according to <https://pytorch.org/get-started/previous-versions/>
  33 +conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
  34 +pip install -r requirements.txt
  35 +# If you need to train the ernerf model, install the following libraries
  36 +# pip install "git+https://github.com/facebookresearch/pytorch3d.git"
  37 +# pip install tensorflow-gpu==2.8.0
  38 +# pip install --upgrade "protobuf<=3.20.1"
  39 +```
  40 +For common installation issues, see the [FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html).
  41 +For setting up the Linux CUDA environment, you can refer to this article: https://zhuanlan.zhihu.com/p/674972886
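To confirm the environment roughly matches the tested setup above, a minimal check (it assumes the nerfstream conda env created in 1.1 is active):

```bash
# On CUDA machines: confirm the driver and CUDA version that the GPU reports
nvidia-smi
# Confirm PyTorch is installed and whether it sees a CUDA device
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# On Apple Silicon (Mac GPU inference), check the MPS backend instead:
# python -c "import torch; print(torch.backends.mps.is_available())"
```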
  42 +
  43 +
  44 +## 2. Quick Start
  45 +- Download the models
  46 +Quark Cloud Disk <https://pan.quark.cn/s/83a750323ef0>
  47 +Google Drive <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing>
  48 +Copy wav2lip256.pth to the models folder of this project and rename it to wav2lip.pth.
  49 +Extract wav2lip256_avatar1.tar.gz and copy the entire folder into the data/avatars folder of this project (see the shell sketch after this list).
  50 +- Run
  51 +`python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1`
  52 +Open http://serverip:8010/webrtcapi.html in a browser. First click 'start' to play the digital human video; then enter any text in the text box and submit it. The digital human will speak that text.
  53 +<font color=red>The server side needs to open ports tcp:8010; udp:1-65536</font>
  54 +If you need to purchase a high-definition wav2lip model for commercial use, see this [link](https://livetalking-doc.readthedocs.io/zh-cn/latest/service.html#wav2lip).
  55 +
  56 +- Quick experience
  57 +<https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> Create an instance with this image to run it.
  58 +
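A minimal shell sketch of the model placement and launch steps above (it assumes wav2lip256.pth and wav2lip256_avatar1.tar.gz were downloaded into the project root; adjust paths to your own setup):

```bash
# Place the wav2lip model under models/, renamed as required
mkdir -p models data/avatars
cp wav2lip256.pth models/wav2lip.pth
# Unpack the avatar so that data/avatars/wav2lip256_avatar1/ exists
tar -xzf wav2lip256_avatar1.tar.gz -C data/avatars
# Start the server with the webrtc transport and the wav2lip model
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1
```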
  59 +If you cannot access Hugging Face, set the following before running:
  60 +```
  61 +export HF_ENDPOINT=https://hf-mirror.com
  62 +```
  63 +
  64 +
  65 +## 3. More Usage
  66 +Usage instructions: <https://livetalking-doc.readthedocs.io/en/latest>
  67 +
  68 +## 4. Docker Run
  69 +The previous installation steps are not required; just run it directly.
  70 +```
  71 +docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v
  72 +```
  73 +The code is in /root/metahuman-stream. First run git pull to fetch the latest code, then execute the commands from steps 2 and 3, as sketched below.
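A short sketch of those in-container steps (the run command is the same one used in step 2; substitute your own model and avatar arguments as needed):

```bash
# Inside the container started by the docker run command above
cd /root/metahuman-stream
git pull   # update to the latest code
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1
```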
  74 +
  75 +The following images are provided:
  76 +- autodl image: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>
  77 +[autodl Tutorial](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
  78 +- ucloud image: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>
  79 +Any port can be opened, and there is no need to deploy an additional SRS service.
  80 +[ucloud Tutorial](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
  81 +
  82 +
  83 +## 5. TODO
  84 +- [x] Added ChatGPT to enable digital human dialogue
  85 +- [x] Voice cloning
  86 +- [x] Replace the digital human with a video when it is silent
  87 +- [x] MuseTalk
  88 +- [x] Wav2Lip
  89 +- [x] Ultralight-Digital-Human
  90 +
  91 +---
  92 +If this project helps you, please give it a star. Anyone who is interested is also welcome to join in and help improve it.
  93 +* Knowledge Planet: https://t.zsxq.com/7NMyO, which collects high-quality FAQs, best practices, and troubleshooting notes.
  94 +* WeChat Official Account: Digital Human Technology
  95 +![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&from=appmsg)
1 -Real time interactive streaming digital human, realize audio video synchronous dialogue. It can basically achieve commercial effects. 1 +[English](./README-EN.md) | 中文版
2 实时交互流式数字人,实现音视频同步对话。基本可以达到商用效果 2 实时交互流式数字人,实现音视频同步对话。基本可以达到商用效果
  3 +[wav2lip效果](https://www.bilibili.com/video/BV1scwBeyELA/) | [ernerf效果](https://www.bilibili.com/video/BV1G1421z73r/) | [musetalk效果](https://www.bilibili.com/video/BV1gm421N7vQ/)
3 4
4 -[ernerf 效果](https://www.bilibili.com/video/BV1PM4m1y7Q2/) [musetalk 效果](https://www.bilibili.com/video/BV1gm421N7vQ/) [wav2lip 效果](https://www.bilibili.com/video/BV1Bw4m1e74P/)  
5 -  
6 -## 为避免与 3d 数字人混淆,原项目 metahuman-stream 改名为 livetalking,原有链接地址继续可用 5 +## 为避免与3d数字人混淆,原项目metahuman-stream改名为livetalking,原有链接地址继续可用
7 6
8 ## News 7 ## News
9 -  
10 - 2024.12.8 完善多并发,显存不随并发数增加 8 - 2024.12.8 完善多并发,显存不随并发数增加
11 -- 2024.12.21 添加 wav2lip、musetalk 模型预热,解决第一次推理卡顿问题。感谢@heimaojinzhangyz  
12 -- 2024.12.28 添加数字人模型 Ultralight-Digital-Human。 感谢@lijihua2017  
13 -- 2025.2.7 添加 fish-speech tts  
14 -- 2025.2.21 添加 wav2lip256 开源模型 感谢@不蠢不蠢 9 +- 2024.12.21 添加wav2lip、musetalk模型预热,解决第一次推理卡顿问题。感谢[@heimaojinzhangyz](https://github.com/heimaojinzhangyz)
  10 +- 2024.12.28 添加数字人模型Ultralight-Digital-Human。 感谢[@lijihua2017](https://github.com/lijihua2017)
  11 +- 2025.2.7 添加fish-speech tts
  12 +- 2025.2.21 添加wav2lip256开源模型 感谢@不蠢不蠢
15 - 2025.3.2 添加腾讯语音合成服务 13 - 2025.3.2 添加腾讯语音合成服务
  14 +- 2025.3.16 支持mac gpu推理,感谢[@GcsSloop](https://github.com/GcsSloop)
16 15
17 ## Features 16 ## Features
18 -  
19 1. 支持多种数字人模型: ernerf、musetalk、wav2lip、Ultralight-Digital-Human 17 1. 支持多种数字人模型: ernerf、musetalk、wav2lip、Ultralight-Digital-Human
20 2. 支持声音克隆 18 2. 支持声音克隆
21 3. 支持数字人说话被打断 19 3. 支持数字人说话被打断
22 4. 支持全身视频拼接 20 4. 支持全身视频拼接
23 -5. 支持 rtmp 和 webrtc 21 +5. 支持rtmp和webrtc
24 6. 支持视频编排:不说话时播放自定义视频 22 6. 支持视频编排:不说话时播放自定义视频
25 7. 支持多并发 23 7. 支持多并发
26 24
@@ -33,67 +31,61 @@ Tested on Ubuntu 20.04, Python3.10, Pytorch 1.12 and CUDA 11.3 @@ -33,67 +31,61 @@ Tested on Ubuntu 20.04, Python3.10, Pytorch 1.12 and CUDA 11.3
33 ```bash 31 ```bash
34 conda create -n nerfstream python=3.10 32 conda create -n nerfstream python=3.10
35 conda activate nerfstream 33 conda activate nerfstream
36 -#如果cuda版本不为11.3(运行nvidia-smi确认版本),根据<https://pytorch.org/get-started/previous-versions/>安装对应版本的pytorch 34 +#如果cuda版本不为11.3(运行nvidia-smi确认版本),根据<https://pytorch.org/get-started/previous-versions/>安装对应版本的pytorch
37 conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch 35 conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
38 pip install -r requirements.txt 36 pip install -r requirements.txt
39 #如果需要训练ernerf模型,安装下面的库 37 #如果需要训练ernerf模型,安装下面的库
40 # pip install "git+https://github.com/facebookresearch/pytorch3d.git" 38 # pip install "git+https://github.com/facebookresearch/pytorch3d.git"
41 # pip install tensorflow-gpu==2.8.0 39 # pip install tensorflow-gpu==2.8.0
42 # pip install --upgrade "protobuf<=3.20.1" 40 # pip install --upgrade "protobuf<=3.20.1"
43 -```  
44 - 41 +```
45 安装常见问题[FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html) 42 安装常见问题[FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html)
46 -linux cuda 环境搭建可以参考这篇文章 https://zhuanlan.zhihu.com/p/674972886 43 +linux cuda环境搭建可以参考这篇文章 https://zhuanlan.zhihu.com/p/674972886
47 44
48 -## 2. Quick Start  
49 45
  46 +## 2. Quick Start
50 - 下载模型 47 - 下载模型
51 - 百度云盘<https://pan.baidu.com/s/1yOsQ06-RIDTJd3HFCw4wtA> 密码: ltua 48 + 夸克云盘<https://pan.quark.cn/s/83a750323ef0>
52 GoogleDriver <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing> 49 GoogleDriver <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing>
53 - 将 wav2lip256.pth 拷到本项目的 models 下, 重命名为 wav2lip.pth;  
54 - 将 wav2lip256_avatar1.tar.gz 解压后整个文件夹拷到本项目的 data/avatars 下 50 + 将wav2lip256.pth拷到本项目的models下, 重命名为wav2lip.pth;
  51 + 将wav2lip256_avatar1.tar.gz解压后整个文件夹拷到本项目的data/avatars下
55 - 运行 52 - 运行
56 python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1 --preload 2 53 python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1 --preload 2
57 -  
58 使用 GPU 启动模特 3 号:python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar3 --preload 2 54 使用 GPU 启动模特 3 号:python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar3 --preload 2
59 - 用浏览器打开 http://serverip:8010/webrtcapi.html , 先点‘start',播放数字人视频;然后在文本框输入任意文字,提交。数字人播报该段文字 55 +
  56 +用浏览器打开http://serverip:8010/webrtcapi.html , 先点‘start',播放数字人视频;然后在文本框输入任意文字,提交。数字人播报该段文字
60 <font color=red>服务端需要开放端口 tcp:8010; udp:1-65536 </font> 57 <font color=red>服务端需要开放端口 tcp:8010; udp:1-65536 </font>
61 - 如果需要商用高清 wav2lip 模型,可以与我联系购买 58 + 如果需要商用高清wav2lip模型,[链接](https://livetalking-doc.readthedocs.io/zh-cn/latest/service.html#wav2lip)
62 59
63 - 快速体验 60 - 快速体验
64 <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> 用该镜像创建实例即可运行成功 61 <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> 用该镜像创建实例即可运行成功
65 62
66 -如果访问不了 huggingface,在运行前  
67 - 63 +如果访问不了huggingface,在运行前
68 ``` 64 ```
69 export HF_ENDPOINT=https://hf-mirror.com 65 export HF_ENDPOINT=https://hf-mirror.com
70 -``` 66 +```
71 67
72 -## 3. More Usage  
73 68
  69 +## 3. More Usage
74 使用说明: <https://livetalking-doc.readthedocs.io/> 70 使用说明: <https://livetalking-doc.readthedocs.io/>
75 71
76 ## 4. Docker Run 72 ## 4. Docker Run
77 -  
78 不需要前面的安装,直接运行。 73 不需要前面的安装,直接运行。
79 -  
80 ``` 74 ```
81 docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v 75 docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v
82 ``` 76 ```
83 -  
84 -代码在/root/metahuman-stream,先 git pull 拉一下最新代码,然后执行命令同第 2、3 步 77 +代码在/root/metahuman-stream,先git pull拉一下最新代码,然后执行命令同第2、3步
85 78
86 提供如下镜像 79 提供如下镜像
  80 +- autodl镜像: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>
  81 + [autodl教程](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
  82 +- ucloud镜像: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>
  83 + 可以开放任意端口,不需要另外部署srs服务.
  84 + [ucloud教程](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
87 85
88 -- autodl 镜像: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>  
89 - [autodl 教程](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)  
90 -- ucloud 镜像: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>  
91 - 可以开放任意端口,不需要另外部署 srs 服务.  
92 - [ucloud 教程](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)  
93 86
94 ## 5. TODO 87 ## 5. TODO
95 -  
96 -- [x] 添加 chatgpt 实现数字人对话 88 +- [x] 添加chatgpt实现数字人对话
97 - [x] 声音克隆 89 - [x] 声音克隆
98 - [x] 数字人静音时用一段视频代替 90 - [x] 数字人静音时用一段视频代替
99 - [x] MuseTalk 91 - [x] MuseTalk
@@ -101,9 +93,8 @@ docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/c @@ -101,9 +93,8 @@ docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/c
101 - [x] Ultralight-Digital-Human 93 - [x] Ultralight-Digital-Human
102 94
103 --- 95 ---
  96 +如果本项目对你有帮助,帮忙点个star。也欢迎感兴趣的朋友一起来完善该项目.
  97 +* 知识星球: https://t.zsxq.com/7NMyO 沉淀高质量常见问题、最佳实践经验、问题解答
  98 +* 微信公众号:数字人技术
  99 + ![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&from=appmsg)
104 100
105 -如果本项目对你有帮助,帮忙点个 star。也欢迎感兴趣的朋友一起来完善该项目.  
106 -  
107 -- 知识星球: https://t.zsxq.com/7NMyO 沉淀高质量常见问题、最佳实践经验、问题解答  
108 -- 微信公众号:数字人技术  
109 - ![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&from=appmsg)  
@@ -201,7 +201,7 @@ async def set_audiotype(request): @@ -201,7 +201,7 @@ async def set_audiotype(request):
201 params = await request.json() 201 params = await request.json()
202 202
203 sessionid = params.get('sessionid',0) 203 sessionid = params.get('sessionid',0)
204 - nerfreals[sessionid].set_curr_state(params['audiotype'],params['reinit']) 204 + nerfreals[sessionid].set_custom_state(params['audiotype'],params['reinit'])
205 205
206 return web.Response( 206 return web.Response(
207 content_type="application/json", 207 content_type="application/json",
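For context on the handler change above, a hedged sketch of how this endpoint might be exercised once the server is running (it assumes the handler is registered at /set_audiotype on the default port 8010; verify the actual route registration in app.py):

```bash
# Hypothetical call matching the params read by the handler above:
# switch session 0 to custom state 2 and reset its playback index (reinit=true).
curl -X POST http://serverip:8010/set_audiotype \
     -H 'Content-Type: application/json' \
     -d '{"sessionid": 0, "audiotype": 2, "reinit": true}'
```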
@@ -495,6 +495,8 @@ if __name__ == '__main__': @@ -495,6 +495,8 @@ if __name__ == '__main__':
495 elif opt.transport=='rtcpush': 495 elif opt.transport=='rtcpush':
496 pagename='rtcpushapi.html' 496 pagename='rtcpushapi.html'
497 logger.info('start http server; http://<serverip>:'+str(opt.listenport)+'/'+pagename) 497 logger.info('start http server; http://<serverip>:'+str(opt.listenport)+'/'+pagename)
  498 + logger.info('如果使用webrtc,推荐访问webrtc集成前端: http://<serverip>:'+str(opt.listenport)+'/dashboard.html')
  499 +
498 def run_server(runner): 500 def run_server(runner):
499 loop = asyncio.new_event_loop() 501 loop = asyncio.new_event_loop()
500 asyncio.set_event_loop(loop) 502 asyncio.set_event_loop(loop)
@@ -35,7 +35,7 @@ import soundfile as sf @@ -35,7 +35,7 @@ import soundfile as sf
35 import av 35 import av
36 from fractions import Fraction 36 from fractions import Fraction
37 37
38 -from ttsreal import EdgeTTS,VoitsTTS,XTTS,CosyVoiceTTS,FishTTS,TencentTTS 38 +from ttsreal import EdgeTTS,SovitsTTS,XTTS,CosyVoiceTTS,FishTTS,TencentTTS
39 from logger import logger 39 from logger import logger
40 40
41 from tqdm import tqdm 41 from tqdm import tqdm
@@ -57,7 +57,7 @@ class BaseReal: @@ -57,7 +57,7 @@ class BaseReal:
57 if opt.tts == "edgetts": 57 if opt.tts == "edgetts":
58 self.tts = EdgeTTS(opt,self) 58 self.tts = EdgeTTS(opt,self)
59 elif opt.tts == "gpt-sovits": 59 elif opt.tts == "gpt-sovits":
60 - self.tts = VoitsTTS(opt,self) 60 + self.tts = SovitsTTS(opt,self)
61 elif opt.tts == "xtts": 61 elif opt.tts == "xtts":
62 self.tts = XTTS(opt,self) 62 self.tts = XTTS(opt,self)
63 elif opt.tts == "cosyvoice": 63 elif opt.tts == "cosyvoice":
@@ -66,7 +66,7 @@ class BaseReal: @@ -66,7 +66,7 @@ class BaseReal:
66 self.tts = FishTTS(opt,self) 66 self.tts = FishTTS(opt,self)
67 elif opt.tts == "tencent": 67 elif opt.tts == "tencent":
68 self.tts = TencentTTS(opt,self) 68 self.tts = TencentTTS(opt,self)
69 - 69 +
70 self.speaking = False 70 self.speaking = False
71 71
72 self.recording = False 72 self.recording = False
@@ -84,11 +84,11 @@ class BaseReal: @@ -84,11 +84,11 @@ class BaseReal:
84 84
85 def put_msg_txt(self,msg,eventpoint=None): 85 def put_msg_txt(self,msg,eventpoint=None):
86 self.tts.put_msg_txt(msg,eventpoint) 86 self.tts.put_msg_txt(msg,eventpoint)
87 - 87 +
88 def put_audio_frame(self,audio_chunk,eventpoint=None): #16khz 20ms pcm 88 def put_audio_frame(self,audio_chunk,eventpoint=None): #16khz 20ms pcm
89 self.asr.put_audio_frame(audio_chunk,eventpoint) 89 self.asr.put_audio_frame(audio_chunk,eventpoint)
90 90
91 - def put_audio_file(self,filebyte): 91 + def put_audio_file(self,filebyte):
92 input_stream = BytesIO(filebyte) 92 input_stream = BytesIO(filebyte)
93 stream = self.__create_bytes_stream(input_stream) 93 stream = self.__create_bytes_stream(input_stream)
94 streamlen = stream.shape[0] 94 streamlen = stream.shape[0]
@@ -97,7 +97,7 @@ class BaseReal: @@ -97,7 +97,7 @@ class BaseReal:
97 self.put_audio_frame(stream[idx:idx+self.chunk]) 97 self.put_audio_frame(stream[idx:idx+self.chunk])
98 streamlen -= self.chunk 98 streamlen -= self.chunk
99 idx += self.chunk 99 idx += self.chunk
100 - 100 +
101 def __create_bytes_stream(self,byte_stream): 101 def __create_bytes_stream(self,byte_stream):
102 #byte_stream=BytesIO(buffer) 102 #byte_stream=BytesIO(buffer)
103 stream, sample_rate = sf.read(byte_stream) # [T*sample_rate,] float64 103 stream, sample_rate = sf.read(byte_stream) # [T*sample_rate,] float64
@@ -107,7 +107,7 @@ class BaseReal: @@ -107,7 +107,7 @@ class BaseReal:
107 if stream.ndim > 1: 107 if stream.ndim > 1:
108 logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.') 108 logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
109 stream = stream[:, 0] 109 stream = stream[:, 0]
110 - 110 +
111 if sample_rate != self.sample_rate and stream.shape[0]>0: 111 if sample_rate != self.sample_rate and stream.shape[0]>0:
112 logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.') 112 logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
113 stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate) 113 stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
@@ -120,7 +120,7 @@ class BaseReal: @@ -120,7 +120,7 @@ class BaseReal:
120 120
121 def is_speaking(self)->bool: 121 def is_speaking(self)->bool:
122 return self.speaking 122 return self.speaking
123 - 123 +
124 def __loadcustom(self): 124 def __loadcustom(self):
125 for item in self.opt.customopt: 125 for item in self.opt.customopt:
126 logger.info(item) 126 logger.info(item)
@@ -155,9 +155,9 @@ class BaseReal: @@ -155,9 +155,9 @@ class BaseReal:
155 '-s', "{}x{}".format(self.width, self.height), 155 '-s', "{}x{}".format(self.width, self.height),
156 '-r', str(25), 156 '-r', str(25),
157 '-i', '-', 157 '-i', '-',
158 - '-pix_fmt', 'yuv420p', 158 + '-pix_fmt', 'yuv420p',
159 '-vcodec', "h264", 159 '-vcodec', "h264",
160 - #'-f' , 'flv', 160 + #'-f' , 'flv',
161 f'temp{self.opt.sessionid}.mp4'] 161 f'temp{self.opt.sessionid}.mp4']
162 self._record_video_pipe = subprocess.Popen(command, shell=False, stdin=subprocess.PIPE) 162 self._record_video_pipe = subprocess.Popen(command, shell=False, stdin=subprocess.PIPE)
163 163
@@ -169,7 +169,7 @@ class BaseReal: @@ -169,7 +169,7 @@ class BaseReal:
169 '-ar', '16000', 169 '-ar', '16000',
170 '-i', '-', 170 '-i', '-',
171 '-acodec', 'aac', 171 '-acodec', 'aac',
172 - #'-f' , 'wav', 172 + #'-f' , 'wav',
173 f'temp{self.opt.sessionid}.aac'] 173 f'temp{self.opt.sessionid}.aac']
174 self._record_audio_pipe = subprocess.Popen(acommand, shell=False, stdin=subprocess.PIPE) 174 self._record_audio_pipe = subprocess.Popen(acommand, shell=False, stdin=subprocess.PIPE)
175 175
@@ -177,10 +177,10 @@ class BaseReal: @@ -177,10 +177,10 @@ class BaseReal:
177 # self.recordq_video.queue.clear() 177 # self.recordq_video.queue.clear()
178 # self.recordq_audio.queue.clear() 178 # self.recordq_audio.queue.clear()
179 # self.container = av.open(path, mode="w") 179 # self.container = av.open(path, mode="w")
180 - 180 +
181 # process_thread = Thread(target=self.record_frame, args=()) 181 # process_thread = Thread(target=self.record_frame, args=())
182 # process_thread.start() 182 # process_thread.start()
183 - 183 +
184 def record_video_data(self,image): 184 def record_video_data(self,image):
185 if self.width == 0: 185 if self.width == 0:
186 print("image.shape:",image.shape) 186 print("image.shape:",image.shape)
@@ -191,14 +191,14 @@ class BaseReal: @@ -191,14 +191,14 @@ class BaseReal:
191 def record_audio_data(self,frame): 191 def record_audio_data(self,frame):
192 if self.recording: 192 if self.recording:
193 self._record_audio_pipe.stdin.write(frame.tostring()) 193 self._record_audio_pipe.stdin.write(frame.tostring())
194 -  
195 - # def record_frame(self): 194 +
  195 + # def record_frame(self):
196 # videostream = self.container.add_stream("libx264", rate=25) 196 # videostream = self.container.add_stream("libx264", rate=25)
197 # videostream.codec_context.time_base = Fraction(1, 25) 197 # videostream.codec_context.time_base = Fraction(1, 25)
198 # audiostream = self.container.add_stream("aac") 198 # audiostream = self.container.add_stream("aac")
199 # audiostream.codec_context.time_base = Fraction(1, 16000) 199 # audiostream.codec_context.time_base = Fraction(1, 16000)
200 # init = True 200 # init = True
201 - # framenum = 0 201 + # framenum = 0
202 # while self.recording: 202 # while self.recording:
203 # try: 203 # try:
204 # videoframe = self.recordq_video.get(block=True, timeout=1) 204 # videoframe = self.recordq_video.get(block=True, timeout=1)
@@ -231,18 +231,18 @@ class BaseReal: @@ -231,18 +231,18 @@ class BaseReal:
231 # self.recordq_video.queue.clear() 231 # self.recordq_video.queue.clear()
232 # self.recordq_audio.queue.clear() 232 # self.recordq_audio.queue.clear()
233 # print('record thread stop') 233 # print('record thread stop')
234 - 234 +
235 def stop_recording(self): 235 def stop_recording(self):
236 """停止录制视频""" 236 """停止录制视频"""
237 if not self.recording: 237 if not self.recording:
238 return 238 return
239 - self.recording = False  
240 - self._record_video_pipe.stdin.close() #wait() 239 + self.recording = False
  240 + self._record_video_pipe.stdin.close() #wait()
241 self._record_video_pipe.wait() 241 self._record_video_pipe.wait()
242 self._record_audio_pipe.stdin.close() 242 self._record_audio_pipe.stdin.close()
243 self._record_audio_pipe.wait() 243 self._record_audio_pipe.wait()
244 cmd_combine_audio = f"ffmpeg -y -i temp{self.opt.sessionid}.aac -i temp{self.opt.sessionid}.mp4 -c:v copy -c:a copy data/record.mp4" 244 cmd_combine_audio = f"ffmpeg -y -i temp{self.opt.sessionid}.aac -i temp{self.opt.sessionid}.mp4 -c:v copy -c:a copy data/record.mp4"
245 - os.system(cmd_combine_audio) 245 + os.system(cmd_combine_audio)
246 #os.remove(output_path) 246 #os.remove(output_path)
247 247
248 def mirror_index(self,size, index): 248 def mirror_index(self,size, index):
@@ -252,8 +252,8 @@ class BaseReal: @@ -252,8 +252,8 @@ class BaseReal:
252 if turn % 2 == 0: 252 if turn % 2 == 0:
253 return res 253 return res
254 else: 254 else:
255 - return size - res - 1  
256 - 255 + return size - res - 1
  256 +
257 def get_audio_stream(self,audiotype): 257 def get_audio_stream(self,audiotype):
258 idx = self.custom_audio_index[audiotype] 258 idx = self.custom_audio_index[audiotype]
259 stream = self.custom_audio_cycle[audiotype][idx:idx+self.chunk] 259 stream = self.custom_audio_cycle[audiotype][idx:idx+self.chunk]
@@ -261,9 +261,9 @@ class BaseReal: @@ -261,9 +261,9 @@ class BaseReal:
261 if self.custom_audio_index[audiotype]>=self.custom_audio_cycle[audiotype].shape[0]: 261 if self.custom_audio_index[audiotype]>=self.custom_audio_cycle[audiotype].shape[0]:
262 self.curr_state = 1 #当前视频不循环播放,切换到静音状态 262 self.curr_state = 1 #当前视频不循环播放,切换到静音状态
263 return stream 263 return stream
264 -  
265 - def set_curr_state(self,audiotype, reinit):  
266 - print('set_curr_state:',audiotype) 264 +
  265 + def set_custom_state(self,audiotype, reinit=True):
  266 + print('set_custom_state:',audiotype)
267 self.curr_state = audiotype 267 self.curr_state = audiotype
268 if reinit: 268 if reinit:
269 self.custom_audio_index[audiotype] = 0 269 self.custom_audio_index[audiotype] = 0
@@ -179,8 +179,11 @@ print(f'[INFO] fitting light...') @@ -179,8 +179,11 @@ print(f'[INFO] fitting light...')
179 179
180 batch_size = 32 180 batch_size = 32
181 181
182 -device_default = torch.device("cuda:0")  
183 -device_render = torch.device("cuda:0") 182 +device_default = torch.device("cuda:0" if torch.cuda.is_available() else (
  183 + "mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
  184 +device_render = torch.device("cuda:0" if torch.cuda.is_available() else (
  185 + "mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
  186 +
184 renderer = Render_3DMM(arg_focal, h, w, batch_size, device_render) 187 renderer = Render_3DMM(arg_focal, h, w, batch_size, device_render)
185 188
186 sel_ids = np.arange(0, num_frames, int(num_frames / batch_size))[:batch_size] 189 sel_ids = np.arange(0, num_frames, int(num_frames / batch_size))[:batch_size]
@@ -83,7 +83,7 @@ class Render_3DMM(nn.Module): @@ -83,7 +83,7 @@ class Render_3DMM(nn.Module):
83 img_h=500, 83 img_h=500,
84 img_w=500, 84 img_w=500,
85 batch_size=1, 85 batch_size=1,
86 - device=torch.device("cuda:0"), 86 + device=torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")),
87 ): 87 ):
88 super(Render_3DMM, self).__init__() 88 super(Render_3DMM, self).__init__()
89 89
@@ -147,7 +147,7 @@ if __name__ == '__main__': @@ -147,7 +147,7 @@ if __name__ == '__main__':
147 147
148 seed_everything(opt.seed) 148 seed_everything(opt.seed)
149 149
150 - device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 150 + device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
151 151
152 model = NeRFNetwork(opt) 152 model = NeRFNetwork(opt)
153 153
@@ -442,7 +442,7 @@ class LPIPSMeter: @@ -442,7 +442,7 @@ class LPIPSMeter:
442 self.N = 0 442 self.N = 0
443 self.net = net 443 self.net = net
444 444
445 - self.device = device if device is not None else torch.device('cuda' if torch.cuda.is_available() else 'cpu') 445 + self.device = device if device is not None else torch.device('cuda' if torch.cuda.is_available() else ('mps' if hasattr(torch.backends, "mps") and torch.backends.mps.is_available() else 'cpu'))
446 self.fn = lpips.LPIPS(net=net).eval().to(self.device) 446 self.fn = lpips.LPIPS(net=net).eval().to(self.device)
447 447
448 def clear(self): 448 def clear(self):
@@ -456,13 +456,13 @@ class LPIPSMeter: @@ -456,13 +456,13 @@ class LPIPSMeter:
456 inp = inp.to(self.device) 456 inp = inp.to(self.device)
457 outputs.append(inp) 457 outputs.append(inp)
458 return outputs 458 return outputs
459 - 459 +
460 def update(self, preds, truths): 460 def update(self, preds, truths):
461 preds, truths = self.prepare_inputs(preds, truths) # [B, H, W, 3] --> [B, 3, H, W], range in [0, 1] 461 preds, truths = self.prepare_inputs(preds, truths) # [B, H, W, 3] --> [B, 3, H, W], range in [0, 1]
462 v = self.fn(truths, preds, normalize=True).item() # normalize=True: [0, 1] to [-1, 1] 462 v = self.fn(truths, preds, normalize=True).item() # normalize=True: [0, 1] to [-1, 1]
463 self.V += v 463 self.V += v
464 self.N += 1 464 self.N += 1
465 - 465 +
466 def measure(self): 466 def measure(self):
467 return self.V / self.N 467 return self.V / self.N
468 468
@@ -499,7 +499,7 @@ class LMDMeter: @@ -499,7 +499,7 @@ class LMDMeter:
499 499
500 self.V = 0 500 self.V = 0
501 self.N = 0 501 self.N = 0
502 - 502 +
503 def get_landmarks(self, img): 503 def get_landmarks(self, img):
504 504
505 if self.backend == 'dlib': 505 if self.backend == 'dlib':
@@ -515,7 +515,7 @@ class LMDMeter: @@ -515,7 +515,7 @@ class LMDMeter:
515 515
516 else: 516 else:
517 lms = self.predictor.get_landmarks(img)[-1] 517 lms = self.predictor.get_landmarks(img)[-1]
518 - 518 +
519 # self.vis_landmarks(img, lms) 519 # self.vis_landmarks(img, lms)
520 lms = lms.astype(np.float32) 520 lms = lms.astype(np.float32)
521 521
@@ -537,7 +537,7 @@ class LMDMeter: @@ -537,7 +537,7 @@ class LMDMeter:
537 inp = (inp * 255).astype(np.uint8) 537 inp = (inp * 255).astype(np.uint8)
538 outputs.append(inp) 538 outputs.append(inp)
539 return outputs 539 return outputs
540 - 540 +
541 def update(self, preds, truths): 541 def update(self, preds, truths):
542 # assert B == 1 542 # assert B == 1
543 preds, truths = self.prepare_inputs(preds[0], truths[0]) # [H, W, 3] numpy array 543 preds, truths = self.prepare_inputs(preds[0], truths[0]) # [H, W, 3] numpy array
@@ -553,13 +553,13 @@ class LMDMeter: @@ -553,13 +553,13 @@ class LMDMeter:
553 # avarage 553 # avarage
554 lms_pred = lms_pred - lms_pred.mean(0) 554 lms_pred = lms_pred - lms_pred.mean(0)
555 lms_truth = lms_truth - lms_truth.mean(0) 555 lms_truth = lms_truth - lms_truth.mean(0)
556 - 556 +
557 # distance 557 # distance
558 dist = np.sqrt(((lms_pred - lms_truth) ** 2).sum(1)).mean(0) 558 dist = np.sqrt(((lms_pred - lms_truth) ** 2).sum(1)).mean(0)
559 - 559 +
560 self.V += dist 560 self.V += dist
561 self.N += 1 561 self.N += 1
562 - 562 +
563 def measure(self): 563 def measure(self):
564 return self.V / self.N 564 return self.V / self.N
565 565
@@ -567,14 +567,14 @@ class LMDMeter: @@ -567,14 +567,14 @@ class LMDMeter:
567 writer.add_scalar(os.path.join(prefix, f"LMD ({self.backend})"), self.measure(), global_step) 567 writer.add_scalar(os.path.join(prefix, f"LMD ({self.backend})"), self.measure(), global_step)
568 568
569 def report(self): 569 def report(self):
570 - return f'LMD ({self.backend}) = {self.measure():.6f}'  
571 - 570 + return f'LMD ({self.backend}) = {self.measure():.6f}'
  571 +
572 572
573 class Trainer(object): 573 class Trainer(object):
574 - def __init__(self, 574 + def __init__(self,
575 name, # name of this experiment 575 name, # name of this experiment
576 opt, # extra conf 576 opt, # extra conf
577 - model, # network 577 + model, # network
578 criterion=None, # loss function, if None, assume inline implementation in train_step 578 criterion=None, # loss function, if None, assume inline implementation in train_step
579 optimizer=None, # optimizer 579 optimizer=None, # optimizer
580 ema_decay=None, # if use EMA, set the decay 580 ema_decay=None, # if use EMA, set the decay
@@ -596,7 +596,7 @@ class Trainer(object): @@ -596,7 +596,7 @@ class Trainer(object):
596 use_tensorboardX=True, # whether to use tensorboard for logging 596 use_tensorboardX=True, # whether to use tensorboard for logging
597 scheduler_update_every_step=False, # whether to call scheduler.step() after every train step 597 scheduler_update_every_step=False, # whether to call scheduler.step() after every train step
598 ): 598 ):
599 - 599 +
600 self.name = name 600 self.name = name
601 self.opt = opt 601 self.opt = opt
602 self.mute = mute 602 self.mute = mute
@@ -618,7 +618,11 @@ class Trainer(object): @@ -618,7 +618,11 @@ class Trainer(object):
618 self.flip_init_lips = self.opt.init_lips 618 self.flip_init_lips = self.opt.init_lips
619 self.time_stamp = time.strftime("%Y-%m-%d_%H-%M-%S") 619 self.time_stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
620 self.scheduler_update_every_step = scheduler_update_every_step 620 self.scheduler_update_every_step = scheduler_update_every_step
621 - self.device = device if device is not None else torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu') 621 + self.device = device if device is not None else torch.device(
  622 + f'cuda:{local_rank}' if torch.cuda.is_available() else (
  623 + 'mps' if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else 'cpu'
  624 + )
  625 + )
622 self.console = Console() 626 self.console = Console()
623 627
624 model.to(self.device) 628 model.to(self.device)
@@ -56,10 +56,8 @@ from ultralight.unet import Model @@ -56,10 +56,8 @@ from ultralight.unet import Model
56 from ultralight.audio2feature import Audio2Feature 56 from ultralight.audio2feature import Audio2Feature
57 from logger import logger 57 from logger import logger
58 58
59 -  
60 -device = 'cuda' if torch.cuda.is_available() else 'cpu'  
61 -logger.info('Using {} for inference.'.format(device))  
62 - 59 +device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
  60 +print('Using {} for inference.'.format(device))
63 61
64 def load_model(opt): 62 def load_model(opt):
65 audio_processor = Audio2Feature() 63 audio_processor = Audio2Feature()
@@ -44,8 +44,8 @@ from basereal import BaseReal @@ -44,8 +44,8 @@ from basereal import BaseReal
44 from tqdm import tqdm 44 from tqdm import tqdm
45 from logger import logger 45 from logger import logger
46 46
47 -device = 'cuda' if torch.cuda.is_available() else 'cpu'  
48 -logger.info('Using {} for inference.'.format(device)) 47 +device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
  48 +print('Using {} for inference.'.format(device))
49 49
50 def _load(checkpoint_path): 50 def _load(checkpoint_path):
51 if device == 'cuda': 51 if device == 'cuda':
@@ -51,7 +51,7 @@ from logger import logger @@ -51,7 +51,7 @@ from logger import logger
51 def load_model(): 51 def load_model():
52 # load model weights 52 # load model weights
53 audio_processor,vae, unet, pe = load_all_model() 53 audio_processor,vae, unet, pe = load_all_model()
54 - device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 54 + device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
55 timesteps = torch.tensor([0], device=device) 55 timesteps = torch.tensor([0], device=device)
56 pe = pe.half() 56 pe = pe.half()
57 vae.vae = vae.vae.half() 57 vae.vae = vae.vae.half()
@@ -64,7 +64,7 @@ def load_avatar(avatar_id): @@ -64,7 +64,7 @@ def load_avatar(avatar_id):
64 #self.video_path = '' #video_path 64 #self.video_path = '' #video_path
65 #self.bbox_shift = opt.bbox_shift 65 #self.bbox_shift = opt.bbox_shift
66 avatar_path = f"./data/avatars/{avatar_id}" 66 avatar_path = f"./data/avatars/{avatar_id}"
67 - full_imgs_path = f"{avatar_path}/full_imgs" 67 + full_imgs_path = f"{avatar_path}/full_imgs"
68 coords_path = f"{avatar_path}/coords.pkl" 68 coords_path = f"{avatar_path}/coords.pkl"
69 latents_out_path= f"{avatar_path}/latents.pt" 69 latents_out_path= f"{avatar_path}/latents.pt"
70 video_out_path = f"{avatar_path}/vid_output/" 70 video_out_path = f"{avatar_path}/vid_output/"
@@ -74,7 +74,7 @@ def load_avatar(avatar_id): @@ -74,7 +74,7 @@ def load_avatar(avatar_id):
74 # self.avatar_info = { 74 # self.avatar_info = {
75 # "avatar_id":self.avatar_id, 75 # "avatar_id":self.avatar_id,
76 # "video_path":self.video_path, 76 # "video_path":self.video_path,
77 - # "bbox_shift":self.bbox_shift 77 + # "bbox_shift":self.bbox_shift
78 # } 78 # }
79 79
80 input_latent_list_cycle = torch.load(latents_out_path) #,weights_only=True 80 input_latent_list_cycle = torch.load(latents_out_path) #,weights_only=True
@@ -124,19 +124,19 @@ def __mirror_index(size, index): @@ -124,19 +124,19 @@ def __mirror_index(size, index):
124 if turn % 2 == 0: 124 if turn % 2 == 0:
125 return res 125 return res
126 else: 126 else:
127 - return size - res - 1 127 + return size - res - 1
128 128
129 @torch.no_grad() 129 @torch.no_grad()
130 def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,audio_out_queue,res_frame_queue, 130 def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,audio_out_queue,res_frame_queue,
131 vae, unet, pe,timesteps): #vae, unet, pe,timesteps 131 vae, unet, pe,timesteps): #vae, unet, pe,timesteps
132 - 132 +
133 # vae, unet, pe = load_diffusion_model() 133 # vae, unet, pe = load_diffusion_model()
134 # device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 134 # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
135 # timesteps = torch.tensor([0], device=device) 135 # timesteps = torch.tensor([0], device=device)
136 # pe = pe.half() 136 # pe = pe.half()
137 # vae.vae = vae.vae.half() 137 # vae.vae = vae.vae.half()
138 # unet.model = unet.model.half() 138 # unet.model = unet.model.half()
139 - 139 +
140 length = len(input_latent_list_cycle) 140 length = len(input_latent_list_cycle)
141 index = 0 141 index = 0
142 count=0 142 count=0
@@ -169,7 +169,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a @@ -169,7 +169,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
169 latent = input_latent_list_cycle[idx] 169 latent = input_latent_list_cycle[idx]
170 latent_batch.append(latent) 170 latent_batch.append(latent)
171 latent_batch = torch.cat(latent_batch, dim=0) 171 latent_batch = torch.cat(latent_batch, dim=0)
172 - 172 +
173 # for i, (whisper_batch,latent_batch) in enumerate(gen): 173 # for i, (whisper_batch,latent_batch) in enumerate(gen):
174 audio_feature_batch = torch.from_numpy(whisper_batch) 174 audio_feature_batch = torch.from_numpy(whisper_batch)
175 audio_feature_batch = audio_feature_batch.to(device=unet.device, 175 audio_feature_batch = audio_feature_batch.to(device=unet.device,
@@ -179,8 +179,8 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a @@ -179,8 +179,8 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
179 # print('prepare time:',time.perf_counter()-t) 179 # print('prepare time:',time.perf_counter()-t)
180 # t=time.perf_counter() 180 # t=time.perf_counter()
181 181
182 - pred_latents = unet.model(latent_batch,  
183 - timesteps, 182 + pred_latents = unet.model(latent_batch,
  183 + timesteps,
184 encoder_hidden_states=audio_feature_batch).sample 184 encoder_hidden_states=audio_feature_batch).sample
185 # print('unet time:',time.perf_counter()-t) 185 # print('unet time:',time.perf_counter()-t)
186 # t=time.perf_counter() 186 # t=time.perf_counter()
@@ -203,7 +203,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a @@ -203,7 +203,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
203 #self.__pushmedia(res_frame,loop,audio_track,video_track) 203 #self.__pushmedia(res_frame,loop,audio_track,video_track)
204 res_frame_queue.put((res_frame,__mirror_index(length,index),audio_frames[i*2:i*2+2])) 204 res_frame_queue.put((res_frame,__mirror_index(length,index),audio_frames[i*2:i*2+2]))
205 index = index + 1 205 index = index + 1
206 - #print('total batch time:',time.perf_counter()-starttime) 206 + #print('total batch time:',time.perf_counter()-starttime)
207 logger.info('musereal inference processor stop') 207 logger.info('musereal inference processor stop')
208 208
209 class MuseReal(BaseReal): 209 class MuseReal(BaseReal):
@@ -226,12 +226,12 @@ class MuseReal(BaseReal): @@ -226,12 +226,12 @@ class MuseReal(BaseReal):
226 226
227 self.asr = MuseASR(opt,self,self.audio_processor) 227 self.asr = MuseASR(opt,self,self.audio_processor)
228 self.asr.warm_up() 228 self.asr.warm_up()
229 - 229 +
230 self.render_event = mp.Event() 230 self.render_event = mp.Event()
231 231
232 def __del__(self): 232 def __del__(self):
233 logger.info(f'musereal({self.sessionid}) delete') 233 logger.info(f'musereal({self.sessionid}) delete')
234 - 234 +
235 235
236 def __mirror_index(self, index): 236 def __mirror_index(self, index):
237 size = len(self.coord_list_cycle) 237 size = len(self.coord_list_cycle)
@@ -240,9 +240,9 @@ class MuseReal(BaseReal): @@ -240,9 +240,9 @@ class MuseReal(BaseReal):
240 if turn % 2 == 0: 240 if turn % 2 == 0:
241 return res 241 return res
242 else: 242 else:
243 - return size - res - 1 243 + return size - res - 1
244 244
245 - def __warm_up(self): 245 + def __warm_up(self):
246 self.asr.run_step() 246 self.asr.run_step()
247 whisper_chunks = self.asr.get_next_feat() 247 whisper_chunks = self.asr.get_next_feat()
248 whisper_batch = np.stack(whisper_chunks) 248 whisper_batch = np.stack(whisper_chunks)
@@ -260,30 +260,57 @@ class MuseReal(BaseReal): @@ -260,30 +260,57 @@ class MuseReal(BaseReal):
260 audio_feature_batch = self.pe(audio_feature_batch) 260 audio_feature_batch = self.pe(audio_feature_batch)
261 latent_batch = latent_batch.to(dtype=self.unet.model.dtype) 261 latent_batch = latent_batch.to(dtype=self.unet.model.dtype)
262 262
263 - pred_latents = self.unet.model(latent_batch,  
264 - self.timesteps, 263 + pred_latents = self.unet.model(latent_batch,
  264 + self.timesteps,
265 encoder_hidden_states=audio_feature_batch).sample 265 encoder_hidden_states=audio_feature_batch).sample
266 recon = self.vae.decode_latents(pred_latents) 266 recon = self.vae.decode_latents(pred_latents)
267 - 267 +
268 268
269 def process_frames(self,quit_event,loop=None,audio_track=None,video_track=None): 269 def process_frames(self,quit_event,loop=None,audio_track=None,video_track=None):
270 - 270 + enable_transition = True # 设置为False禁用过渡效果,True启用
  271 +
  272 + if enable_transition:
  273 + self.last_speaking = False
  274 + self.transition_start = time.time()
  275 + self.transition_duration = 0.1 # 过渡时间
  276 + self.last_silent_frame = None # 静音帧缓存
  277 + self.last_speaking_frame = None # 说话帧缓存
  278 +
271 while not quit_event.is_set(): 279 while not quit_event.is_set():
272 try: 280 try:
273 res_frame,idx,audio_frames = self.res_frame_queue.get(block=True, timeout=1) 281 res_frame,idx,audio_frames = self.res_frame_queue.get(block=True, timeout=1)
274 except queue.Empty: 282 except queue.Empty:
275 continue 283 continue
276 - if audio_frames[0][1]!=0 and audio_frames[1][1]!=0: #全为静音数据,只需要取fullimg 284 +
  285 + if enable_transition:
  286 + # 检测状态变化
  287 + current_speaking = not (audio_frames[0][1]!=0 and audio_frames[1][1]!=0)
  288 + if current_speaking != self.last_speaking:
  289 + logger.info(f"状态切换:{'说话' if self.last_speaking else '静音'} → {'说话' if current_speaking else '静音'}")
  290 + self.transition_start = time.time()
  291 + self.last_speaking = current_speaking
  292 +
  293 + if audio_frames[0][1]!=0 and audio_frames[1][1]!=0:
277 self.speaking = False 294 self.speaking = False
278 audiotype = audio_frames[0][1] 295 audiotype = audio_frames[0][1]
279 - if self.custom_index.get(audiotype) is not None: #有自定义视频 296 + if self.custom_index.get(audiotype) is not None:
280 mirindex = self.mirror_index(len(self.custom_img_cycle[audiotype]),self.custom_index[audiotype]) 297 mirindex = self.mirror_index(len(self.custom_img_cycle[audiotype]),self.custom_index[audiotype])
281 - combine_frame = self.custom_img_cycle[audiotype][mirindex] 298 + target_frame = self.custom_img_cycle[audiotype][mirindex]
282 self.custom_index[audiotype] += 1 299 self.custom_index[audiotype] += 1
283 - # if not self.custom_opt[audiotype].loop and self.custom_index[audiotype]>=len(self.custom_img_cycle[audiotype]):  
284 - # self.curr_state = 1 #当前视频不循环播放,切换到静音状态  
285 else: 300 else:
286 - combine_frame = self.frame_list_cycle[idx] 301 + target_frame = self.frame_list_cycle[idx]
  302 +
  303 + if enable_transition:
  304 + # 说话→静音过渡
  305 + if time.time() - self.transition_start < self.transition_duration and self.last_speaking_frame is not None:
  306 + alpha = min(1.0, (time.time() - self.transition_start) / self.transition_duration)
  307 + combine_frame = cv2.addWeighted(self.last_speaking_frame, 1-alpha, target_frame, alpha, 0)
  308 + else:
  309 + combine_frame = target_frame
  310 + # 缓存静音帧
  311 + self.last_silent_frame = combine_frame.copy()
  312 + else:
  313 + combine_frame = target_frame
287 else: 314 else:
288 self.speaking = True 315 self.speaking = True
289 bbox = self.coord_list_cycle[idx] 316 bbox = self.coord_list_cycle[idx]
@@ -291,20 +318,29 @@ class MuseReal(BaseReal): @@ -291,20 +318,29 @@ class MuseReal(BaseReal):
291 x1, y1, x2, y2 = bbox 318 x1, y1, x2, y2 = bbox
292 try: 319 try:
293 res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1)) 320 res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
294 - except: 321 + except Exception as e:
  322 + logger.warning(f"resize error: {e}")
295 continue 323 continue
296 mask = self.mask_list_cycle[idx] 324 mask = self.mask_list_cycle[idx]
297 mask_crop_box = self.mask_coords_list_cycle[idx] 325 mask_crop_box = self.mask_coords_list_cycle[idx]
298 - #combine_frame = get_image(ori_frame,res_frame,bbox)  
299 - #t=time.perf_counter()  
300 - combine_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)  
301 - #print('blending time:',time.perf_counter()-t)  
302 326
303 - image = combine_frame #(outputs['image'] * 255).astype(np.uint8) 327 + current_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
  328 + if enable_transition:
  329 + # 静音→说话过渡
  330 + if time.time() - self.transition_start < self.transition_duration and self.last_silent_frame is not None:
  331 + alpha = min(1.0, (time.time() - self.transition_start) / self.transition_duration)
  332 + combine_frame = cv2.addWeighted(self.last_silent_frame, 1-alpha, current_frame, alpha, 0)
  333 + else:
  334 + combine_frame = current_frame
  335 + # 缓存说话帧
  336 + self.last_speaking_frame = combine_frame.copy()
  337 + else:
  338 + combine_frame = current_frame
  339 +
  340 + image = combine_frame
304 new_frame = VideoFrame.from_ndarray(image, format="bgr24") 341 new_frame = VideoFrame.from_ndarray(image, format="bgr24")
305 asyncio.run_coroutine_threadsafe(video_track._queue.put((new_frame,None)), loop) 342 asyncio.run_coroutine_threadsafe(video_track._queue.put((new_frame,None)), loop)
306 self.record_video_data(image) 343 self.record_video_data(image)
307 - #self.recordq_video.put(new_frame)  
308 344
309 for audio_frame in audio_frames: 345 for audio_frame in audio_frames:
310 frame,type,eventpoint = audio_frame 346 frame,type,eventpoint = audio_frame
@@ -312,12 +348,8 @@ class MuseReal(BaseReal): @@ -312,12 +348,8 @@ class MuseReal(BaseReal):
312 new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0]) 348 new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0])
313 new_frame.planes[0].update(frame.tobytes()) 349 new_frame.planes[0].update(frame.tobytes())
314 new_frame.sample_rate=16000 350 new_frame.sample_rate=16000
315 - # if audio_track._queue.qsize()>10:  
316 - # time.sleep(0.1)  
317 asyncio.run_coroutine_threadsafe(audio_track._queue.put((new_frame,eventpoint)), loop) 351 asyncio.run_coroutine_threadsafe(audio_track._queue.put((new_frame,eventpoint)), loop)
318 self.record_audio_data(frame) 352 self.record_audio_data(frame)
319 - #self.notify(eventpoint)  
320 - #self.recordq_audio.put(new_frame)  
321 logger.info('musereal process_frames thread stop') 353 logger.info('musereal process_frames thread stop')
322 354
323 def render(self,quit_event,loop=None,audio_track=None,video_track=None): 355 def render(self,quit_event,loop=None,audio_track=None,video_track=None):
@@ -36,7 +36,7 @@ class UNet(): @@ -36,7 +36,7 @@ class UNet():
36 unet_config = json.load(f) 36 unet_config = json.load(f)
37 self.model = UNet2DConditionModel(**unet_config) 37 self.model = UNet2DConditionModel(**unet_config)
38 self.pe = PositionalEncoding(d_model=384) 38 self.pe = PositionalEncoding(d_model=384)
39 - self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 39 + self.device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
40 weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device) 40 weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device)
41 self.model.load_state_dict(weights) 41 self.model.load_state_dict(weights)
42 if use_float16: 42 if use_float16:
@@ -23,7 +23,7 @@ class VAE(): @@ -23,7 +23,7 @@ class VAE():
23 self.model_path = model_path 23 self.model_path = model_path
24 self.vae = AutoencoderKL.from_pretrained(self.model_path) 24 self.vae = AutoencoderKL.from_pretrained(self.model_path)
25 25
26 - self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 26 + self.device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
27 self.vae.to(self.device) 27 self.vae.to(self.device)
28 28
29 if use_float16: 29 if use_float16:
@@ -325,7 +325,7 @@ def create_musetalk_human(file, avatar_id): @@ -325,7 +325,7 @@ def create_musetalk_human(file, avatar_id):
325 325
326 326
327 # initialize the mmpose model 327 # initialize the mmpose model
328 -device = "cuda" if torch.cuda.is_available() else "cpu" 328 +device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
329 fa = FaceAlignment(1, flip_input=False, device=device) 329 fa = FaceAlignment(1, flip_input=False, device=device)
330 config_file = os.path.join(current_dir, 'utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py') 330 config_file = os.path.join(current_dir, 'utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py')
331 checkpoint_file = os.path.abspath(os.path.join(current_dir, '../models/dwpose/dw-ll_ucoco_384.pth')) 331 checkpoint_file = os.path.abspath(os.path.join(current_dir, '../models/dwpose/dw-ll_ucoco_384.pth'))
@@ -13,14 +13,14 @@ import torch @@ -13,14 +13,14 @@ import torch
13 from tqdm import tqdm 13 from tqdm import tqdm
14 14
15 # initialize the mmpose model 15 # initialize the mmpose model
16 -device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 16 +device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
17 config_file = './musetalk/utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py' 17 config_file = './musetalk/utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py'
18 checkpoint_file = './models/dwpose/dw-ll_ucoco_384.pth' 18 checkpoint_file = './models/dwpose/dw-ll_ucoco_384.pth'
19 model = init_model(config_file, checkpoint_file, device=device) 19 model = init_model(config_file, checkpoint_file, device=device)
20 20
21 # initialize the face detection model 21 # initialize the face detection model
22 -device = "cuda" if torch.cuda.is_available() else "cpu"  
23 -fa = FaceAlignment(LandmarksType._2D, flip_input=False,device=device) 22 +device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
  23 +fa = FaceAlignment(LandmarksType._2D, flip_input=False, device=device)
24 24
25 # maker if the bbox is not sufficient 25 # maker if the bbox is not sufficient
26 coord_placeholder = (0.0,0.0,0.0,0.0) 26 coord_placeholder = (0.0,0.0,0.0,0.0)
@@ -91,7 +91,7 @@ def load_model(name: str, device: Optional[Union[str, torch.device]] = None, dow @@ -91,7 +91,7 @@ def load_model(name: str, device: Optional[Union[str, torch.device]] = None, dow
91 """ 91 """
92 92
93 if device is None: 93 if device is None:
94 - device = "cuda" if torch.cuda.is_available() else "cpu" 94 + device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
95 if download_root is None: 95 if download_root is None:
96 download_root = os.getenv( 96 download_root = os.getenv(
97 "XDG_CACHE_HOME", 97 "XDG_CACHE_HOME",
@@ -78,17 +78,19 @@ def transcribe( @@ -78,17 +78,19 @@ def transcribe(
78 if dtype == torch.float16: 78 if dtype == torch.float16:
79 warnings.warn("FP16 is not supported on CPU; using FP32 instead") 79 warnings.warn("FP16 is not supported on CPU; using FP32 instead")
80 dtype = torch.float32 80 dtype = torch.float32
  81 + if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
  82 + warnings.warn("Performing inference on CPU when MPS is available")
81 83
82 if dtype == torch.float32: 84 if dtype == torch.float32:
83 decode_options["fp16"] = False 85 decode_options["fp16"] = False
84 86
85 mel = log_mel_spectrogram(audio) 87 mel = log_mel_spectrogram(audio)
86 - 88 +
87 all_segments = [] 89 all_segments = []
88 def add_segment( 90 def add_segment(
89 *, start: float, end: float, encoder_embeddings 91 *, start: float, end: float, encoder_embeddings
90 ): 92 ):
91 - 93 +
92 all_segments.append( 94 all_segments.append(
93 { 95 {
94 "start": start, 96 "start": start,
@@ -100,20 +102,20 @@ def transcribe( @@ -100,20 +102,20 @@ def transcribe(
100 num_frames = mel.shape[-1] 102 num_frames = mel.shape[-1]
101 seek = 0 103 seek = 0
102 previous_seek_value = seek 104 previous_seek_value = seek
103 - sample_skip = 3000 # 105 + sample_skip = 3000 #
104 with tqdm.tqdm(total=num_frames, unit='frames', disable=verbose is not False) as pbar: 106 with tqdm.tqdm(total=num_frames, unit='frames', disable=verbose is not False) as pbar:
105 while seek < num_frames: 107 while seek < num_frames:
106 # seek是开始的帧数 108 # seek是开始的帧数
107 end_seek = min(seek + sample_skip, num_frames) 109 end_seek = min(seek + sample_skip, num_frames)
108 segment = pad_or_trim(mel[:,seek:seek+sample_skip], N_FRAMES).to(model.device).to(dtype) 110 segment = pad_or_trim(mel[:,seek:seek+sample_skip], N_FRAMES).to(model.device).to(dtype)
109 - 111 +
110 single = segment.ndim == 2 112 single = segment.ndim == 2
111 if single: 113 if single:
112 segment = segment.unsqueeze(0) 114 segment = segment.unsqueeze(0)
113 if dtype == torch.float16: 115 if dtype == torch.float16:
114 segment = segment.half() 116 segment = segment.half()
115 audio_features, embeddings = model.encoder(segment, include_embeddings = True) 117 audio_features, embeddings = model.encoder(segment, include_embeddings = True)
116 - 118 +
117 encoder_embeddings = embeddings 119 encoder_embeddings = embeddings
118 #print(f"encoder_embeddings shape {encoder_embeddings.shape}") 120 #print(f"encoder_embeddings shape {encoder_embeddings.shape}")
119 add_segment( 121 add_segment(
@@ -124,7 +126,7 @@ def transcribe( @@ -124,7 +126,7 @@ def transcribe(
124 encoder_embeddings=encoder_embeddings, 126 encoder_embeddings=encoder_embeddings,
125 ) 127 )
126 seek+=sample_skip 128 seek+=sample_skip
127 - 129 +
128 return dict(segments=all_segments) 130 return dict(segments=all_segments)
129 131
130 132
@@ -135,7 +137,7 @@ def cli(): @@ -135,7 +137,7 @@ def cli():
135 parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe") 137 parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
136 parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use") 138 parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use")
137 parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default") 139 parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
138 - parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference") 140 + parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "mps", help="device to use for PyTorch inference")
139 parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs") 141 parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
140 parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages") 142 parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")
141 143
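
The transcribe() changes above keep the original fixed-window walk over the mel spectrogram: the loop advances `seek` by `sample_skip = 3000` frames at a time, pads or trims each slice to `N_FRAMES`, and stores per-segment encoder embeddings instead of decoded text. A minimal sketch of that windowing arithmetic, assuming Whisper's standard mel settings (16 kHz audio, hop length 160, hence 100 frames per second and 3000 frames per 30 s window); the helper below is illustrative and not part of the repo:

```python
# Sketch of the fixed-window segmentation used by the modified transcribe().
import numpy as np

SAMPLE_RATE = 16000
HOP_LENGTH = 160
FRAMES_PER_SECOND = SAMPLE_RATE // HOP_LENGTH  # 100 mel frames per second
N_FRAMES = 3000                                # one 30 s Whisper window

def iter_windows(mel: np.ndarray, window: int = N_FRAMES):
    """Yield (start_sec, end_sec, padded_segment) for each fixed window."""
    num_frames = mel.shape[-1]
    seek = 0
    while seek < num_frames:
        end_seek = min(seek + window, num_frames)
        segment = mel[:, seek:end_seek]
        if segment.shape[-1] < window:  # pad_or_trim equivalent for the tail
            pad = window - segment.shape[-1]
            segment = np.pad(segment, ((0, 0), (0, pad)))
        yield seek / FRAMES_PER_SECOND, end_seek / FRAMES_PER_SECOND, segment
        seek += window

# mel = np.random.randn(80, 7500)   # e.g. 75 s of audio -> 3 windows
# for start, end, seg in iter_windows(mel):
#     ...  # run the encoder on `seg` and keep the embeddings
```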
@@ -30,7 +30,7 @@ class NerfASR(BaseASR): @@ -30,7 +30,7 @@ class NerfASR(BaseASR):
30 def __init__(self, opt, parent, audio_processor,audio_model): 30 def __init__(self, opt, parent, audio_processor,audio_model):
31 super().__init__(opt,parent) 31 super().__init__(opt,parent)
32 32
33 - self.device = 'cuda' if torch.cuda.is_available() else 'cpu' 33 + self.device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
34 if 'esperanto' in self.opt.asr_model: 34 if 'esperanto' in self.opt.asr_model:
35 self.audio_dim = 44 35 self.audio_dim = 44
36 elif 'deepspeech' in self.opt.asr_model: 36 elif 'deepspeech' in self.opt.asr_model:
@@ -77,7 +77,7 @@ def load_model(opt): @@ -77,7 +77,7 @@ def load_model(opt):
77 seed_everything(opt.seed) 77 seed_everything(opt.seed)
78 logger.info(opt) 78 logger.info(opt)
79 79
80 - device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 80 + device = torch.device('cuda' if torch.cuda.is_available() else ('mps' if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else 'cpu'))
81 model = NeRFNetwork(opt) 81 model = NeRFNetwork(opt)
82 82
83 criterion = torch.nn.MSELoss(reduction='none') 83 criterion = torch.nn.MSELoss(reduction='none')
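
The device changes in the hunks above all encode the same preference order: CUDA if present, otherwise Apple's Metal (mps) backend, otherwise CPU. A small helper capturing that fallback, as a sketch (the function name is not from the repo); note that the `--device` default in the whisper CLI hunk falls back straight to `mps`, so on a Linux machine without CUDA you would still pass `--device cpu` explicitly:

```python
# Sketch of the cuda -> mps -> cpu preference used across the edited files.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Guard with hasattr so older torch builds without the mps backend still work.
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

# device = pick_device()
# model = model.to(device)
```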
@@ -90,7 +90,7 @@ class BaseTTS: @@ -90,7 +90,7 @@ class BaseTTS:
90 ########################################################################################### 90 ###########################################################################################
91 class EdgeTTS(BaseTTS): 91 class EdgeTTS(BaseTTS):
92 def txt_to_audio(self,msg): 92 def txt_to_audio(self,msg):
93 - voicename = "zh-CN-XiaoxiaoNeural" 93 + voicename = "zh-CN-YunxiaNeural"
94 text,textevent = msg 94 text,textevent = msg
95 t = time.time() 95 t = time.time()
96 asyncio.new_event_loop().run_until_complete(self.__main(voicename,text)) 96 asyncio.new_event_loop().run_until_complete(self.__main(voicename,text))
@@ -98,7 +98,7 @@ class EdgeTTS(BaseTTS): @@ -98,7 +98,7 @@ class EdgeTTS(BaseTTS):
98 if self.input_stream.getbuffer().nbytes<=0: #edgetts err 98 if self.input_stream.getbuffer().nbytes<=0: #edgetts err
99 logger.error('edgetts err!!!!!') 99 logger.error('edgetts err!!!!!')
100 return 100 return
101 - 101 +
102 self.input_stream.seek(0) 102 self.input_stream.seek(0)
103 stream = self.__create_bytes_stream(self.input_stream) 103 stream = self.__create_bytes_stream(self.input_stream)
104 streamlen = stream.shape[0] 104 streamlen = stream.shape[0]
@@ -107,15 +107,15 @@ class EdgeTTS(BaseTTS): @@ -107,15 +107,15 @@ class EdgeTTS(BaseTTS):
107 eventpoint=None 107 eventpoint=None
108 streamlen -= self.chunk 108 streamlen -= self.chunk
109 if idx==0: 109 if idx==0:
110 - eventpoint={'status':'start','text':text,'msgenvent':textevent} 110 + eventpoint={'status':'start','text':text,'msgevent':textevent}
111 elif streamlen<self.chunk: 111 elif streamlen<self.chunk:
112 - eventpoint={'status':'end','text':text,'msgenvent':textevent} 112 + eventpoint={'status':'end','text':text,'msgevent':textevent}
113 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint) 113 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
114 idx += self.chunk 114 idx += self.chunk
115 #if streamlen>0: #skip last frame(not 20ms) 115 #if streamlen>0: #skip last frame(not 20ms)
116 # self.queue.put(stream[idx:]) 116 # self.queue.put(stream[idx:])
117 self.input_stream.seek(0) 117 self.input_stream.seek(0)
118 - self.input_stream.truncate() 118 + self.input_stream.truncate()
119 119
120 def __create_bytes_stream(self,byte_stream): 120 def __create_bytes_stream(self,byte_stream):
121 #byte_stream=BytesIO(buffer) 121 #byte_stream=BytesIO(buffer)
@@ -126,13 +126,13 @@ class EdgeTTS(BaseTTS): @@ -126,13 +126,13 @@ class EdgeTTS(BaseTTS):
126 if stream.ndim > 1: 126 if stream.ndim > 1:
127 logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.') 127 logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
128 stream = stream[:, 0] 128 stream = stream[:, 0]
129 - 129 +
130 if sample_rate != self.sample_rate and stream.shape[0]>0: 130 if sample_rate != self.sample_rate and stream.shape[0]>0:
131 logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.') 131 logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
132 stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate) 132 stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
133 133
134 return stream 134 return stream
135 - 135 +
136 async def __main(self,voicename: str, text: str): 136 async def __main(self,voicename: str, text: str):
137 try: 137 try:
138 communicate = edge_tts.Communicate(text, voicename) 138 communicate = edge_tts.Communicate(text, voicename)
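
The EdgeTTS edits swap the default voice to `zh-CN-YunxiaNeural` and fix the event-key typo; the `__main` coroutine shown above still just streams MP3 chunks from edge-tts into `self.input_stream`. A standalone sketch of that streaming call, assuming the public `edge-tts` package API (Communicate.stream() yields dicts whose `"audio"` entries carry MP3 bytes):

```python
# Sketch: collect the MP3 byte stream for one utterance with edge-tts.
import asyncio
import edge_tts

async def synth(text: str, voice: str = "zh-CN-YunxiaNeural") -> bytes:
    communicate = edge_tts.Communicate(text, voice)
    audio = bytearray()
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            audio.extend(chunk["data"])
    return bytes(audio)

# mp3_bytes = asyncio.run(synth("你好"))
```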
@@ -153,12 +153,12 @@ class EdgeTTS(BaseTTS): @@ -153,12 +153,12 @@ class EdgeTTS(BaseTTS):
153 153
154 ########################################################################################### 154 ###########################################################################################
155 class FishTTS(BaseTTS): 155 class FishTTS(BaseTTS):
156 - def txt_to_audio(self,msg): 156 + def txt_to_audio(self,msg):
157 text,textevent = msg 157 text,textevent = msg
158 self.stream_tts( 158 self.stream_tts(
159 self.fish_speech( 159 self.fish_speech(
160 text, 160 text,
161 - self.opt.REF_FILE, 161 + self.opt.REF_FILE,
162 self.opt.REF_TEXT, 162 self.opt.REF_TEXT,
163 "zh", #en args.language, 163 "zh", #en args.language,
164 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url, 164 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
@@ -190,9 +190,9 @@ class FishTTS(BaseTTS): @@ -190,9 +190,9 @@ class FishTTS(BaseTTS):
190 if res.status_code != 200: 190 if res.status_code != 200:
191 logger.error("Error:%s", res.text) 191 logger.error("Error:%s", res.text)
192 return 192 return
193 - 193 +
194 first = True 194 first = True
195 - 195 +
196 for chunk in res.iter_content(chunk_size=17640): # 1764 44100*20ms*2 196 for chunk in res.iter_content(chunk_size=17640): # 1764 44100*20ms*2
197 #print('chunk len:',len(chunk)) 197 #print('chunk len:',len(chunk))
198 if first: 198 if first:
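
The `chunk_size` values used by the different TTS backends follow the formula spelled out in their inline comments: one 20 ms frame of 16-bit mono PCM is `sample_rate * 0.02 * 2` bytes, and the HTTP reads pull ten such frames at a time. A quick check of the numbers that appear in these hunks:

```python
def bytes_per_20ms(sample_rate: int) -> int:
    """Bytes in one 20 ms frame of 16-bit (2-byte) mono PCM."""
    return int(sample_rate * 0.02 * 2)

assert bytes_per_20ms(44100) == 1764   # FishTTS reads chunk_size=17640 (10 frames)
assert bytes_per_20ms(24000) == 960    # CosyVoiceTTS / XTTS read chunk_size=9600
assert bytes_per_20ms(16000) == 640    # TencentTTS reads chunk_size=6400
```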
@@ -209,7 +209,7 @@ class FishTTS(BaseTTS): @@ -209,7 +209,7 @@ class FishTTS(BaseTTS):
209 text,textevent = msg 209 text,textevent = msg
210 first = True 210 first = True
211 for chunk in audio_stream: 211 for chunk in audio_stream:
212 - if chunk is not None and len(chunk)>0: 212 + if chunk is not None and len(chunk)>0:
213 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767 213 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
214 stream = resampy.resample(x=stream, sr_orig=44100, sr_new=self.sample_rate) 214 stream = resampy.resample(x=stream, sr_orig=44100, sr_new=self.sample_rate)
215 #byte_stream=BytesIO(buffer) 215 #byte_stream=BytesIO(buffer)
@@ -219,22 +219,22 @@ class FishTTS(BaseTTS): @@ -219,22 +219,22 @@ class FishTTS(BaseTTS):
219 while streamlen >= self.chunk: 219 while streamlen >= self.chunk:
220 eventpoint=None 220 eventpoint=None
221 if first: 221 if first:
222 - eventpoint={'status':'start','text':text,'msgenvent':textevent} 222 + eventpoint={'status':'start','text':text,'msgevent':textevent}
223 first = False 223 first = False
224 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint) 224 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
225 streamlen -= self.chunk 225 streamlen -= self.chunk
226 idx += self.chunk 226 idx += self.chunk
227 - eventpoint={'status':'end','text':text,'msgenvent':textevent} 227 + eventpoint={'status':'end','text':text,'msgevent':textevent}
228 - self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint) 228 + self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
229 229
230 ########################################################################################### 230 ###########################################################################################
231 -class VoitsTTS(BaseTTS): 231 +class SovitsTTS(BaseTTS):
232 - def txt_to_audio(self,msg): 232 + def txt_to_audio(self,msg):
233 text,textevent = msg 233 text,textevent = msg
234 self.stream_tts( 234 self.stream_tts(
235 self.gpt_sovits( 235 self.gpt_sovits(
236 text, 236 text,
237 - self.opt.REF_FILE, 237 + self.opt.REF_FILE,
238 self.opt.REF_TEXT, 238 self.opt.REF_TEXT,
239 "zh", #en args.language, 239 "zh", #en args.language,
240 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url, 240 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
@@ -271,9 +271,9 @@ class VoitsTTS(BaseTTS): @@ -271,9 +271,9 @@ class VoitsTTS(BaseTTS):
271 if res.status_code != 200: 271 if res.status_code != 200:
272 logger.error("Error:%s", res.text) 272 logger.error("Error:%s", res.text)
273 return 273 return
274 - 274 +
275 first = True 275 first = True
276 - 276 +
277 for chunk in res.iter_content(chunk_size=None): #12800 1280 32K*20ms*2 277 for chunk in res.iter_content(chunk_size=None): #12800 1280 32K*20ms*2
278 logger.info('chunk len:%d',len(chunk)) 278 logger.info('chunk len:%d',len(chunk))
279 if first: 279 if first:
@@ -295,7 +295,7 @@ class VoitsTTS(BaseTTS): @@ -295,7 +295,7 @@ class VoitsTTS(BaseTTS):
295 if stream.ndim > 1: 295 if stream.ndim > 1:
296 logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.') 296 logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
297 stream = stream[:, 0] 297 stream = stream[:, 0]
298 - 298 +
299 if sample_rate != self.sample_rate and stream.shape[0]>0: 299 if sample_rate != self.sample_rate and stream.shape[0]>0:
300 logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.') 300 logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
301 stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate) 301 stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
@@ -306,7 +306,7 @@ class VoitsTTS(BaseTTS): @@ -306,7 +306,7 @@ class VoitsTTS(BaseTTS):
306 text,textevent = msg 306 text,textevent = msg
307 first = True 307 first = True
308 for chunk in audio_stream: 308 for chunk in audio_stream:
309 - if chunk is not None and len(chunk)>0: 309 + if chunk is not None and len(chunk)>0:
310 #stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767 310 #stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
311 #stream = resampy.resample(x=stream, sr_orig=32000, sr_new=self.sample_rate) 311 #stream = resampy.resample(x=stream, sr_orig=32000, sr_new=self.sample_rate)
312 byte_stream=BytesIO(chunk) 312 byte_stream=BytesIO(chunk)
@@ -316,22 +316,22 @@ class VoitsTTS(BaseTTS): @@ -316,22 +316,22 @@ class VoitsTTS(BaseTTS):
316 while streamlen >= self.chunk: 316 while streamlen >= self.chunk:
317 eventpoint=None 317 eventpoint=None
318 if first: 318 if first:
319 - eventpoint={'status':'start','text':text,'msgenvent':textevent} 319 + eventpoint={'status':'start','text':text,'msgevent':textevent}
320 first = False 320 first = False
321 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint) 321 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
322 streamlen -= self.chunk 322 streamlen -= self.chunk
323 idx += self.chunk 323 idx += self.chunk
324 - eventpoint={'status':'end','text':text,'msgenvent':textevent} 324 + eventpoint={'status':'end','text':text,'msgevent':textevent}
325 self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint) 325 self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
326 326
327 ########################################################################################### 327 ###########################################################################################
328 class CosyVoiceTTS(BaseTTS): 328 class CosyVoiceTTS(BaseTTS):
329 def txt_to_audio(self,msg): 329 def txt_to_audio(self,msg):
330 - text,textevent = msg 330 + text,textevent = msg
331 self.stream_tts( 331 self.stream_tts(
332 self.cosy_voice( 332 self.cosy_voice(
333 text, 333 text,
334 - self.opt.REF_FILE, 334 + self.opt.REF_FILE,
335 self.opt.REF_TEXT, 335 self.opt.REF_TEXT,
336 "zh", #en args.language, 336 "zh", #en args.language,
337 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url, 337 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
@@ -348,16 +348,16 @@ class CosyVoiceTTS(BaseTTS): @@ -348,16 +348,16 @@ class CosyVoiceTTS(BaseTTS):
348 try: 348 try:
349 files = [('prompt_wav', ('prompt_wav', open(reffile, 'rb'), 'application/octet-stream'))] 349 files = [('prompt_wav', ('prompt_wav', open(reffile, 'rb'), 'application/octet-stream'))]
350 res = requests.request("GET", f"{server_url}/inference_zero_shot", data=payload, files=files, stream=True) 350 res = requests.request("GET", f"{server_url}/inference_zero_shot", data=payload, files=files, stream=True)
351 - 351 +
352 end = time.perf_counter() 352 end = time.perf_counter()
353 logger.info(f"cosy_voice Time to make POST: {end-start}s") 353 logger.info(f"cosy_voice Time to make POST: {end-start}s")
354 354
355 if res.status_code != 200: 355 if res.status_code != 200:
356 logger.error("Error:%s", res.text) 356 logger.error("Error:%s", res.text)
357 return 357 return
358 - 358 +
359 first = True 359 first = True
360 - 360 +
361 for chunk in res.iter_content(chunk_size=9600): # 960 24K*20ms*2 361 for chunk in res.iter_content(chunk_size=9600): # 960 24K*20ms*2
362 if first: 362 if first:
363 end = time.perf_counter() 363 end = time.perf_counter()
@@ -372,7 +372,7 @@ class CosyVoiceTTS(BaseTTS): @@ -372,7 +372,7 @@ class CosyVoiceTTS(BaseTTS):
372 text,textevent = msg 372 text,textevent = msg
373 first = True 373 first = True
374 for chunk in audio_stream: 374 for chunk in audio_stream:
375 - if chunk is not None and len(chunk)>0: 375 + if chunk is not None and len(chunk)>0:
376 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767 376 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
377 stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate) 377 stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
378 #byte_stream=BytesIO(buffer) 378 #byte_stream=BytesIO(buffer)
@@ -382,13 +382,13 @@ class CosyVoiceTTS(BaseTTS): @@ -382,13 +382,13 @@ class CosyVoiceTTS(BaseTTS):
382 while streamlen >= self.chunk: 382 while streamlen >= self.chunk:
383 eventpoint=None 383 eventpoint=None
384 if first: 384 if first:
385 - eventpoint={'status':'start','text':text,'msgenvent':textevent} 385 + eventpoint={'status':'start','text':text,'msgevent':textevent}
386 first = False 386 first = False
387 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint) 387 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
388 streamlen -= self.chunk 388 streamlen -= self.chunk
389 idx += self.chunk 389 idx += self.chunk
390 - eventpoint={'status':'end','text':text,'msgenvent':textevent} 390 + eventpoint={'status':'end','text':text,'msgevent':textevent}
391 - self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint) 391 + self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
392 392
393 ########################################################################################### 393 ###########################################################################################
394 _PROTOCOL = "https://" 394 _PROTOCOL = "https://"
@@ -407,7 +407,7 @@ class TencentTTS(BaseTTS): @@ -407,7 +407,7 @@ class TencentTTS(BaseTTS):
407 self.sample_rate = 16000 407 self.sample_rate = 16000
408 self.volume = 0 408 self.volume = 0
409 self.speed = 0 409 self.speed = 0
410 - 410 +
411 def __gen_signature(self, params): 411 def __gen_signature(self, params):
412 sort_dict = sorted(params.keys()) 412 sort_dict = sorted(params.keys())
413 sign_str = "POST" + _HOST + _PATH + "?" 413 sign_str = "POST" + _HOST + _PATH + "?"
@@ -440,11 +440,11 @@ class TencentTTS(BaseTTS): @@ -440,11 +440,11 @@ class TencentTTS(BaseTTS):
440 return params 440 return params
441 441
442 def txt_to_audio(self,msg): 442 def txt_to_audio(self,msg):
443 - text,textevent = msg 443 + text,textevent = msg
444 self.stream_tts( 444 self.stream_tts(
445 self.tencent_voice( 445 self.tencent_voice(
446 text, 446 text,
447 - self.opt.REF_FILE, 447 + self.opt.REF_FILE,
448 self.opt.REF_TEXT, 448 self.opt.REF_TEXT,
449 "zh", #en args.language, 449 "zh", #en args.language,
450 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url, 450 self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
@@ -465,12 +465,12 @@ class TencentTTS(BaseTTS): @@ -465,12 +465,12 @@ class TencentTTS(BaseTTS):
465 try: 465 try:
466 res = requests.post(url, headers=headers, 466 res = requests.post(url, headers=headers,
467 data=json.dumps(params), stream=True) 467 data=json.dumps(params), stream=True)
468 - 468 +
469 end = time.perf_counter() 469 end = time.perf_counter()
470 logger.info(f"tencent Time to make POST: {end-start}s") 470 logger.info(f"tencent Time to make POST: {end-start}s")
471 - 471 +
472 first = True 472 first = True
473 - 473 +
474 for chunk in res.iter_content(chunk_size=6400): # 640 16K*20ms*2 474 for chunk in res.iter_content(chunk_size=6400): # 640 16K*20ms*2
475 #logger.info('chunk len:%d',len(chunk)) 475 #logger.info('chunk len:%d',len(chunk))
476 if first: 476 if first:
@@ -483,7 +483,7 @@ class TencentTTS(BaseTTS): @@ -483,7 +483,7 @@ class TencentTTS(BaseTTS):
483 except: 483 except:
484 end = time.perf_counter() 484 end = time.perf_counter()
485 logger.info(f"tencent Time to first chunk: {end-start}s") 485 logger.info(f"tencent Time to first chunk: {end-start}s")
486 - first = False 486 + first = False
487 if chunk and self.state==State.RUNNING: 487 if chunk and self.state==State.RUNNING:
488 yield chunk 488 yield chunk
489 except Exception as e: 489 except Exception as e:
@@ -494,7 +494,7 @@ class TencentTTS(BaseTTS): @@ -494,7 +494,7 @@ class TencentTTS(BaseTTS):
494 first = True 494 first = True
495 last_stream = np.array([],dtype=np.float32) 495 last_stream = np.array([],dtype=np.float32)
496 for chunk in audio_stream: 496 for chunk in audio_stream:
497 - if chunk is not None and len(chunk)>0: 497 + if chunk is not None and len(chunk)>0:
498 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767 498 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
499 stream = np.concatenate((last_stream,stream)) 499 stream = np.concatenate((last_stream,stream))
500 #stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate) 500 #stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
@@ -505,14 +505,14 @@ class TencentTTS(BaseTTS): @@ -505,14 +505,14 @@ class TencentTTS(BaseTTS):
505 while streamlen >= self.chunk: 505 while streamlen >= self.chunk:
506 eventpoint=None 506 eventpoint=None
507 if first: 507 if first:
508 - eventpoint={'status':'start','text':text,'msgenvent':textevent} 508 + eventpoint={'status':'start','text':text,'msgevent':textevent}
509 first = False 509 first = False
510 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint) 510 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
511 streamlen -= self.chunk 511 streamlen -= self.chunk
512 idx += self.chunk 512 idx += self.chunk
513 last_stream = stream[idx:] #get the remain stream 513 last_stream = stream[idx:] #get the remain stream
514 - eventpoint={'status':'end','text':text,'msgenvent':textevent} 514 + eventpoint={'status':'end','text':text,'msgevent':textevent}
515 - self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint) 515 + self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
516 516
517 ########################################################################################### 517 ###########################################################################################
518 518
@@ -522,7 +522,7 @@ class XTTS(BaseTTS): @@ -522,7 +522,7 @@ class XTTS(BaseTTS):
522 self.speaker = self.get_speaker(opt.REF_FILE, opt.TTS_SERVER) 522 self.speaker = self.get_speaker(opt.REF_FILE, opt.TTS_SERVER)
523 523
524 def txt_to_audio(self,msg): 524 def txt_to_audio(self,msg):
525 - text,textevent = msg 525 + text,textevent = msg
526 self.stream_tts( 526 self.stream_tts(
527 self.xtts( 527 self.xtts(
528 text, 528 text,
@@ -558,7 +558,7 @@ class XTTS(BaseTTS): @@ -558,7 +558,7 @@ class XTTS(BaseTTS):
558 return 558 return
559 559
560 first = True 560 first = True
561 - 561 +
562 for chunk in res.iter_content(chunk_size=9600): #24K*20ms*2 562 for chunk in res.iter_content(chunk_size=9600): #24K*20ms*2
563 if first: 563 if first:
564 end = time.perf_counter() 564 end = time.perf_counter()
@@ -568,12 +568,12 @@ class XTTS(BaseTTS): @@ -568,12 +568,12 @@ class XTTS(BaseTTS):
568 yield chunk 568 yield chunk
569 except Exception as e: 569 except Exception as e:
570 print(e) 570 print(e)
571 - 571 +
572 def stream_tts(self,audio_stream,msg): 572 def stream_tts(self,audio_stream,msg):
573 text,textevent = msg 573 text,textevent = msg
574 first = True 574 first = True
575 for chunk in audio_stream: 575 for chunk in audio_stream:
576 - if chunk is not None and len(chunk)>0: 576 + if chunk is not None and len(chunk)>0:
577 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767 577 stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
578 stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate) 578 stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
579 #byte_stream=BytesIO(buffer) 579 #byte_stream=BytesIO(buffer)
@@ -583,10 +583,10 @@ class XTTS(BaseTTS): @@ -583,10 +583,10 @@ class XTTS(BaseTTS):
583 while streamlen >= self.chunk: 583 while streamlen >= self.chunk:
584 eventpoint=None 584 eventpoint=None
585 if first: 585 if first:
586 - eventpoint={'status':'start','text':text,'msgenvent':textevent} 586 + eventpoint={'status':'start','text':text,'msgevent':textevent}
587 first = False 587 first = False
588 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint) 588 self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
589 streamlen -= self.chunk 589 streamlen -= self.chunk
590 idx += self.chunk 590 idx += self.chunk
591 - eventpoint={'status':'end','text':text,'msgenvent':textevent} 591 + eventpoint={'status':'end','text':text,'msgevent':textevent}
592 self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint) 592 self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
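
Every `stream_tts` variant touched above follows the same framing pattern, and the edits mainly rename the event key from `msgenvent` to `msgevent` (plus whitespace): the first emitted chunk carries a `start` event, later chunks carry none, and a trailing silent frame carries the `end` event. A condensed sketch of that pattern, with `put_audio_frame` standing in for `self.parent.put_audio_frame`:

```python
# Sketch of the shared start/end event framing used by the TTS stream_tts methods.
import numpy as np

CHUNK = 320  # one 20 ms frame at 16 kHz

def frame_stream(stream: np.ndarray, text: str, textevent, put_audio_frame):
    first = True
    idx, streamlen = 0, stream.shape[0]
    while streamlen >= CHUNK:
        eventpoint = None
        if first:
            eventpoint = {'status': 'start', 'text': text, 'msgevent': textevent}
            first = False
        put_audio_frame(stream[idx:idx + CHUNK], eventpoint)
        streamlen -= CHUNK
        idx += CHUNK
    # A final silent frame carries the 'end' event.
    eventpoint = {'status': 'end', 'text': text, 'msgevent': textevent}
    put_audio_frame(np.zeros(CHUNK, np.float32), eventpoint)
```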
@@ -236,7 +236,7 @@ if __name__ == '__main__': @@ -236,7 +236,7 @@ if __name__ == '__main__':
236 if hasattr(module, 'reparameterize'): 236 if hasattr(module, 'reparameterize'):
237 module.reparameterize() 237 module.reparameterize()
238 return model 238 return model
239 - device = torch.device("cuda") 239 + device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
240 def check_onnx(torch_out, torch_in, audio): 240 def check_onnx(torch_out, torch_in, audio):
241 onnx_model = onnx.load(onnx_path) 241 onnx_model = onnx.load(onnx_path)
242 onnx.checker.check_model(onnx_model) 242 onnx.checker.check_model(onnx_model)
  1 +<!DOCTYPE html>
  2 +<html lang="zh-CN">
  3 +<head>
  4 + <meta charset="UTF-8">
  5 + <meta name="viewport" content="width=device-width, initial-scale=1.0">
  6 + <title>livetalking数字人交互平台</title>
  7 + <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
  8 + <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.10.0/font/bootstrap-icons.css">
  9 + <style>
  10 + :root {
  11 + --primary-color: #4361ee;
  12 + --secondary-color: #3f37c9;
  13 + --accent-color: #4895ef;
  14 + --background-color: #f8f9fa;
  15 + --card-bg: #ffffff;
  16 + --text-color: #212529;
  17 + --border-radius: 10px;
  18 + --box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
  19 + }
  20 +
  21 + body {
  22 + font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  23 + background-color: var(--background-color);
  24 + color: var(--text-color);
  25 + min-height: 100vh;
  26 + padding-top: 20px;
  27 + }
  28 +
  29 + .dashboard-container {
  30 + max-width: 1400px;
  31 + margin: 0 auto;
  32 + padding: 20px;
  33 + }
  34 +
  35 + .card {
  36 + background-color: var(--card-bg);
  37 + border-radius: var(--border-radius);
  38 + box-shadow: var(--box-shadow);
  39 + border: none;
  40 + margin-bottom: 20px;
  41 + overflow: hidden;
  42 + }
  43 +
  44 + .card-header {
  45 + background-color: var(--primary-color);
  46 + color: white;
  47 + font-weight: 600;
  48 + padding: 15px 20px;
  49 + border-bottom: none;
  50 + }
  51 +
  52 + .video-container {
  53 + position: relative;
  54 + width: 100%;
  55 + background-color: #000;
  56 + border-radius: var(--border-radius);
  57 + overflow: hidden;
  58 + display: flex;
  59 + justify-content: center;
  60 + align-items: center;
  61 + }
  62 +
  63 + video {
  64 + max-width: 100%;
  65 + max-height: 100%;
  66 + display: block;
  67 + border-radius: var(--border-radius);
  68 + }
  69 +
  70 + .controls-container {
  71 + padding: 20px;
  72 + }
  73 +
  74 + .btn-primary {
  75 + background-color: var(--primary-color);
  76 + border-color: var(--primary-color);
  77 + }
  78 +
  79 + .btn-primary:hover {
  80 + background-color: var(--secondary-color);
  81 + border-color: var(--secondary-color);
  82 + }
  83 +
  84 + .btn-outline-primary {
  85 + color: var(--primary-color);
  86 + border-color: var(--primary-color);
  87 + }
  88 +
  89 + .btn-outline-primary:hover {
  90 + background-color: var(--primary-color);
  91 + color: white;
  92 + }
  93 +
  94 + .form-control {
  95 + border-radius: var(--border-radius);
  96 + padding: 10px 15px;
  97 + border: 1px solid #ced4da;
  98 + }
  99 +
  100 + .form-control:focus {
  101 + border-color: var(--accent-color);
  102 + box-shadow: 0 0 0 0.25rem rgba(67, 97, 238, 0.25);
  103 + }
  104 +
  105 + .status-indicator {
  106 + width: 10px;
  107 + height: 10px;
  108 + border-radius: 50%;
  109 + display: inline-block;
  110 + margin-right: 5px;
  111 + }
  112 +
  113 + .status-connected {
  114 + background-color: #28a745;
  115 + }
  116 +
  117 + .status-disconnected {
  118 + background-color: #dc3545;
  119 + }
  120 +
  121 + .status-connecting {
  122 + background-color: #ffc107;
  123 + }
  124 +
  125 + .asr-container {
  126 + height: 300px;
  127 + overflow-y: auto;
  128 + padding: 15px;
  129 + background-color: #f8f9fa;
  130 + border-radius: var(--border-radius);
  131 + border: 1px solid #ced4da;
  132 + }
  133 +
  134 + .asr-text {
  135 + margin-bottom: 10px;
  136 + padding: 10px;
  137 + background-color: white;
  138 + border-radius: var(--border-radius);
  139 + box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);
  140 + }
  141 +
  142 + .user-message {
  143 + background-color: #e3f2fd;
  144 + border-left: 4px solid var(--primary-color);
  145 + }
  146 +
  147 + .system-message {
  148 + background-color: #f1f8e9;
  149 + border-left: 4px solid #8bc34a;
  150 + }
  151 +
  152 + .recording-indicator {
  153 + position: absolute;
  154 + top: 15px;
  155 + right: 15px;
  156 + background-color: rgba(220, 53, 69, 0.8);
  157 + color: white;
  158 + padding: 5px 10px;
  159 + border-radius: 20px;
  160 + font-size: 0.8rem;
  161 + display: none;
  162 + }
  163 +
  164 + .recording-indicator.active {
  165 + display: flex;
  166 + align-items: center;
  167 + }
  168 +
  169 + .recording-indicator .blink {
  170 + width: 10px;
  171 + height: 10px;
  172 + background-color: #fff;
  173 + border-radius: 50%;
  174 + margin-right: 5px;
  175 + animation: blink 1s infinite;
  176 + }
  177 +
  178 + @keyframes blink {
  179 + 0% { opacity: 1; }
  180 + 50% { opacity: 0.3; }
  181 + 100% { opacity: 1; }
  182 + }
  183 +
  184 + .mode-switch {
  185 + margin-bottom: 20px;
  186 + }
  187 +
  188 + .nav-tabs .nav-link {
  189 + color: var(--text-color);
  190 + border: none;
  191 + padding: 10px 20px;
  192 + border-radius: var(--border-radius) var(--border-radius) 0 0;
  193 + }
  194 +
  195 + .nav-tabs .nav-link.active {
  196 + color: var(--primary-color);
  197 + background-color: var(--card-bg);
  198 + border-bottom: 3px solid var(--primary-color);
  199 + font-weight: 600;
  200 + }
  201 +
  202 + .tab-content {
  203 + padding: 20px;
  204 + background-color: var(--card-bg);
  205 + border-radius: 0 0 var(--border-radius) var(--border-radius);
  206 + }
  207 +
  208 + .settings-panel {
  209 + padding: 15px;
  210 + background-color: #f8f9fa;
  211 + border-radius: var(--border-radius);
  212 + margin-top: 15px;
  213 + }
  214 +
  215 + .footer {
  216 + text-align: center;
  217 + margin-top: 30px;
  218 + padding: 20px 0;
  219 + color: #6c757d;
  220 + font-size: 0.9rem;
  221 + }
  222 +
  223 + .voice-record-btn {
  224 + width: 60px;
  225 + height: 60px;
  226 + border-radius: 50%;
  227 + background-color: var(--primary-color);
  228 + color: white;
  229 + display: flex;
  230 + justify-content: center;
  231 + align-items: center;
  232 + cursor: pointer;
  233 + transition: all 0.2s ease;
  234 + box-shadow: 0 2px 5px rgba(0,0,0,0.2);
  235 + margin: 0 auto;
  236 + }
  237 +
  238 + .voice-record-btn:hover {
  239 + background-color: var(--secondary-color);
  240 + transform: scale(1.05);
  241 + }
  242 +
  243 + .voice-record-btn:active {
  244 + background-color: #dc3545;
  245 + transform: scale(0.95);
  246 + }
  247 +
  248 + .voice-record-btn i {
  249 + font-size: 24px;
  250 + }
  251 +
  252 + .voice-record-label {
  253 + text-align: center;
  254 + margin-top: 10px;
  255 + font-size: 14px;
  256 + color: #6c757d;
  257 + }
  258 +
  259 + .video-size-control {
  260 + margin-top: 15px;
  261 + }
  262 +
  263 + .recording-pulse {
  264 + animation: pulse 1.5s infinite;
  265 + }
  266 +
  267 + @keyframes pulse {
  268 + 0% {
  269 + box-shadow: 0 0 0 0 rgba(220, 53, 69, 0.7);
  270 + }
  271 + 70% {
  272 + box-shadow: 0 0 0 15px rgba(220, 53, 69, 0);
  273 + }
  274 + 100% {
  275 + box-shadow: 0 0 0 0 rgba(220, 53, 69, 0);
  276 + }
  277 + }
  278 + </style>
  279 +</head>
  280 +<body>
  281 + <div class="dashboard-container">
  282 + <div class="row">
  283 + <div class="col-12">
  284 + <h1 class="text-center mb-4">livetalking数字人交互平台</h1>
  285 + </div>
  286 + </div>
  287 +
  288 + <div class="row">
  289 + <!-- 视频区域 -->
  290 + <div class="col-lg-8">
  291 + <div class="card">
  292 + <div class="card-header d-flex justify-content-between align-items-center">
  293 + <div>
  294 + <span class="status-indicator status-disconnected" id="connection-status"></span>
  295 + <span id="status-text">未连接</span>
  296 + </div>
  297 + </div>
  298 + <div class="card-body p-0">
  299 + <div class="video-container">
  300 + <video id="video" autoplay playsinline></video>
  301 + <div class="recording-indicator" id="recording-indicator">
  302 + <div class="blink"></div>
  303 + <span>录制中</span>
  304 + </div>
  305 + </div>
  306 +
  307 + <div class="controls-container">
  308 + <div class="row">
  309 + <div class="col-md-6 mb-3">
  310 + <button class="btn btn-primary w-100" id="start">
  311 + <i class="bi bi-play-fill"></i> 开始连接
  312 + </button>
  313 + <button class="btn btn-danger w-100" id="stop" style="display: none;">
  314 + <i class="bi bi-stop-fill"></i> 停止连接
  315 + </button>
  316 + </div>
  317 + <div class="col-md-6 mb-3">
  318 + <div class="d-flex">
  319 + <button class="btn btn-outline-primary flex-grow-1 me-2" id="btn_start_record">
  320 + <i class="bi bi-record-fill"></i> 开始录制
  321 + </button>
  322 + <button class="btn btn-outline-danger flex-grow-1" id="btn_stop_record" disabled>
  323 + <i class="bi bi-stop-fill"></i> 停止录制
  324 + </button>
  325 + </div>
  326 + </div>
  327 + </div>
  328 +
  329 + <div class="row">
  330 + <div class="col-12">
  331 + <div class="video-size-control">
  332 + <label for="video-size-slider" class="form-label">视频大小调节: <span id="video-size-value">100%</span></label>
  333 + <input type="range" class="form-range" id="video-size-slider" min="50" max="150" value="100">
  334 + </div>
  335 + </div>
  336 + </div>
  337 +
  338 + <div class="settings-panel mt-3">
  339 + <div class="row">
  340 + <div class="col-md-12">
  341 + <div class="form-check form-switch mb-3">
  342 + <input class="form-check-input" type="checkbox" id="use-stun">
  343 + <label class="form-check-label" for="use-stun">使用STUN服务器</label>
  344 + </div>
  345 + </div>
  346 + </div>
  347 + </div>
  348 + </div>
  349 + </div>
  350 + </div>
  351 + </div>
  352 +
  353 + <!-- 右侧交互 -->
  354 + <div class="col-lg-4">
  355 + <div class="card">
  356 + <div class="card-header">
  357 + <ul class="nav nav-tabs card-header-tabs" id="interaction-tabs" role="tablist">
  358 + <li class="nav-item" role="presentation">
  359 + <button class="nav-link active" id="chat-tab" data-bs-toggle="tab" data-bs-target="#chat" type="button" role="tab" aria-controls="chat" aria-selected="true">对话模式</button>
  360 + </li>
  361 + <li class="nav-item" role="presentation">
  362 + <button class="nav-link" id="tts-tab" data-bs-toggle="tab" data-bs-target="#tts" type="button" role="tab" aria-controls="tts" aria-selected="false">朗读模式</button>
  363 + </li>
  364 + </ul>
  365 + </div>
  366 + <div class="card-body">
  367 + <div class="tab-content" id="interaction-tabs-content">
  368 + <!-- 对话模式 -->
  369 + <div class="tab-pane fade show active" id="chat" role="tabpanel" aria-labelledby="chat-tab">
  370 + <div class="asr-container mb-3" id="chat-messages">
  371 + <div class="asr-text system-message">
  372 + 系统: 欢迎使用livetalking,请点击"开始连接"按钮开始对话。
  373 + </div>
  374 + </div>
  375 +
  376 + <form id="chat-form">
  377 + <div class="input-group mb-3">
  378 + <textarea class="form-control" id="chat-message" rows="3" placeholder="输入您想对数字人说的话..."></textarea>
  379 + <button class="btn btn-primary" type="submit">
  380 + <i class="bi bi-send"></i> 发送
  381 + </button>
  382 + </div>
  383 + </form>
  384 +
  385 + <!-- 按住说话按钮 -->
  386 + <div class="voice-record-btn" id="voice-record-btn">
  387 + <i class="bi bi-mic-fill"></i>
  388 + </div>
  389 + <div class="voice-record-label">按住说话,松开发送</div>
  390 + </div>
  391 +
  392 + <!-- 朗读模式 -->
  393 + <div class="tab-pane fade" id="tts" role="tabpanel" aria-labelledby="tts-tab">
  394 + <form id="echo-form">
  395 + <div class="mb-3">
  396 + <label for="message" class="form-label">输入要朗读的文本</label>
  397 + <textarea class="form-control" id="message" rows="6" placeholder="输入您想让数字人朗读的文字..."></textarea>
  398 + </div>
  399 + <button type="submit" class="btn btn-primary w-100">
  400 + <i class="bi bi-volume-up"></i> 朗读文本
  401 + </button>
  402 + </form>
  403 + </div>
  404 + </div>
  405 + </div>
  406 + </div>
  407 + </div>
  408 + </div>
  409 +
  410 + <div class="footer">
  411 + <p>Made with ❤️ by Marstaos | Frontend & Performance Optimization</p>
  412 + </div>
  413 + </div>
  414 +
  415 + <!-- 隐藏的会话ID -->
  416 + <input type="hidden" id="sessionid" value="0">
  417 +
  418 +
  419 + <script src="client.js"></script>
  420 + <script src="srs.sdk.js"></script>
  421 + <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
  422 + <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
  423 + <script>
  424 + $(document).ready(function() {
  425 + $('#video-size-slider').on('input', function() {
  426 + const value = $(this).val();
  427 + $('#video-size-value').text(value + '%');
  428 + $('#video').css('width', value + '%');
  429 + });
  430 + function updateConnectionStatus(status) {
  431 + const statusIndicator = $('#connection-status');
  432 + const statusText = $('#status-text');
  433 +
  434 + statusIndicator.removeClass('status-connected status-disconnected status-connecting');
  435 +
  436 + switch(status) {
  437 + case 'connected':
  438 + statusIndicator.addClass('status-connected');
  439 + statusText.text('已连接');
  440 + break;
  441 + case 'connecting':
  442 + statusIndicator.addClass('status-connecting');
  443 + statusText.text('连接中...');
  444 + break;
  445 + case 'disconnected':
  446 + default:
  447 + statusIndicator.addClass('status-disconnected');
  448 + statusText.text('未连接');
  449 + break;
  450 + }
  451 + }
  452 +
  453 + // 添加聊天消息
  454 + function addChatMessage(message, type = 'user') {
  455 + const messagesContainer = $('#chat-messages');
  456 + const messageClass = type === 'user' ? 'user-message' : 'system-message';
  457 + const sender = type === 'user' ? '您' : '数字人';
  458 +
  459 + const messageElement = $(`
  460 + <div class="asr-text ${messageClass}">
  461 + ${sender}: ${message}
  462 + </div>
  463 + `);
  464 +
  465 + messagesContainer.append(messageElement);
  466 + messagesContainer.scrollTop(messagesContainer[0].scrollHeight);
  467 + }
  468 +
  469 + // 开始/停止按钮
  470 + $('#start').click(function() {
  471 + updateConnectionStatus('connecting');
  472 + start();
  473 + $(this).hide();
  474 + $('#stop').show();
  475 +
  476 + // 添加定时器检查视频流是否已加载
  477 + let connectionCheckTimer = setInterval(function() {
  478 + const video = document.getElementById('video');
  479 + // 检查视频是否有数据
  480 + if (video.readyState >= 3 && video.videoWidth > 0) {
  481 + updateConnectionStatus('connected');
  482 + clearInterval(connectionCheckTimer);
  483 + }
  484 + }, 2000); // 每2秒检查一次
  485 +
  486 + // 60秒后如果还是连接中状态,就停止检查
  487 + setTimeout(function() {
  488 + if (connectionCheckTimer) {
  489 + clearInterval(connectionCheckTimer);
  490 + }
  491 + }, 60000);
  492 + });
  493 +
  494 + $('#stop').click(function() {
  495 + stop();
  496 + $(this).hide();
  497 + $('#start').show();
  498 + updateConnectionStatus('disconnected');
  499 + });
  500 +
  501 + // 录制功能
  502 + $('#btn_start_record').click(function() {
  503 + console.log('Starting recording...');
  504 + fetch('/record', {
  505 + body: JSON.stringify({
  506 + type: 'start_record',
  507 + sessionid: parseInt(document.getElementById('sessionid').value),
  508 + }),
  509 + headers: {
  510 + 'Content-Type': 'application/json'
  511 + },
  512 + method: 'POST'
  513 + }).then(function(response) {
  514 + if (response.ok) {
  515 + console.log('Recording started.');
  516 + $('#btn_start_record').prop('disabled', true);
  517 + $('#btn_stop_record').prop('disabled', false);
  518 + $('#recording-indicator').addClass('active');
  519 + } else {
  520 + console.error('Failed to start recording.');
  521 + }
  522 + }).catch(function(error) {
  523 + console.error('Error:', error);
  524 + });
  525 + });
  526 +
  527 + $('#btn_stop_record').click(function() {
  528 + console.log('Stopping recording...');
  529 + fetch('/record', {
  530 + body: JSON.stringify({
  531 + type: 'end_record',
  532 + sessionid: parseInt(document.getElementById('sessionid').value),
  533 + }),
  534 + headers: {
  535 + 'Content-Type': 'application/json'
  536 + },
  537 + method: 'POST'
  538 + }).then(function(response) {
  539 + if (response.ok) {
  540 + console.log('Recording stopped.');
  541 + $('#btn_start_record').prop('disabled', false);
  542 + $('#btn_stop_record').prop('disabled', true);
  543 + $('#recording-indicator').removeClass('active');
  544 + } else {
  545 + console.error('Failed to stop recording.');
  546 + }
  547 + }).catch(function(error) {
  548 + console.error('Error:', error);
  549 + });
  550 + });
  551 +
  552 + $('#echo-form').on('submit', function(e) {
  553 + e.preventDefault();
  554 + var message = $('#message').val();
  555 + if (!message.trim()) return;
  556 +
  557 + console.log('Sending echo message:', message);
  558 +
  559 + fetch('/human', {
  560 + body: JSON.stringify({
  561 + text: message,
  562 + type: 'echo',
  563 + interrupt: true,
  564 + sessionid: parseInt(document.getElementById('sessionid').value),
  565 + }),
  566 + headers: {
  567 + 'Content-Type': 'application/json'
  568 + },
  569 + method: 'POST'
  570 + });
  571 +
  572 + $('#message').val('');
  573 + addChatMessage(`已发送朗读请求: "${message}"`, 'system');
  574 + });
  575 +
  576 + // 聊天模式表单提交
  577 + $('#chat-form').on('submit', function(e) {
  578 + e.preventDefault();
  579 + var message = $('#chat-message').val();
  580 + if (!message.trim()) return;
  581 +
  582 + console.log('Sending chat message:', message);
  583 +
  584 + fetch('/human', {
  585 + body: JSON.stringify({
  586 + text: message,
  587 + type: 'chat',
  588 + interrupt: true,
  589 + sessionid: parseInt(document.getElementById('sessionid').value),
  590 + }),
  591 + headers: {
  592 + 'Content-Type': 'application/json'
  593 + },
  594 + method: 'POST'
  595 + });
  596 +
  597 + addChatMessage(message, 'user');
  598 + $('#chat-message').val('');
  599 + });
  600 +
  601 + // 按住说话功能
  602 + let mediaRecorder;
  603 + let audioChunks = [];
  604 + let isRecording = false;
  605 + let recognition;
  606 +
  607 + // 检查浏览器是否支持语音识别
  608 + const isSpeechRecognitionSupported = 'webkitSpeechRecognition' in window || 'SpeechRecognition' in window;
  609 +
  610 + if (isSpeechRecognitionSupported) {
  611 + recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
  612 + recognition.continuous = true;
  613 + recognition.interimResults = true;
  614 + recognition.lang = 'zh-CN';
  615 +
  616 + recognition.onresult = function(event) {
  617 + let interimTranscript = '';
  618 + let finalTranscript = '';
  619 +
  620 + for (let i = event.resultIndex; i < event.results.length; ++i) {
  621 + if (event.results[i].isFinal) {
  622 + finalTranscript += event.results[i][0].transcript;
  623 + } else {
  624 + interimTranscript += event.results[i][0].transcript;
  625 + $('#chat-message').val(interimTranscript);
  626 + }
  627 + }
  628 +
  629 + if (finalTranscript) {
  630 + $('#chat-message').val(finalTranscript);
  631 + }
  632 + };
  633 +
  634 + recognition.onerror = function(event) {
  635 + console.error('语音识别错误:', event.error);
  636 + };
  637 + }
  638 +
  639 + // 按住说话按钮事件
  640 + $('#voice-record-btn').on('mousedown touchstart', function(e) {
  641 + e.preventDefault();
  642 + startRecording();
  643 + }).on('mouseup mouseleave touchend', function() {
  644 + if (isRecording) {
  645 + stopRecording();
  646 + }
  647 + });
  648 +
  649 + // 开始录音
  650 + function startRecording() {
  651 + if (isRecording) return;
  652 +
  653 + navigator.mediaDevices.getUserMedia({ audio: true })
  654 + .then(function(stream) {
  655 + audioChunks = [];
  656 + mediaRecorder = new MediaRecorder(stream);
  657 +
  658 + mediaRecorder.ondataavailable = function(e) {
  659 + if (e.data.size > 0) {
  660 + audioChunks.push(e.data);
  661 + }
  662 + };
  663 +
  664 + mediaRecorder.start();
  665 + isRecording = true;
  666 +
  667 + $('#voice-record-btn').addClass('recording-pulse');
  668 + $('#voice-record-btn').css('background-color', '#dc3545');
  669 +
  670 + if (recognition) {
  671 + recognition.start();
  672 + }
  673 + })
  674 + .catch(function(error) {
  675 + console.error('无法访问麦克风:', error);
  676 + alert('无法访问麦克风,请检查浏览器权限设置。');
  677 + });
  678 + }
  679 +
  680 + function stopRecording() {
  681 + if (!isRecording) return;
  682 +
  683 + mediaRecorder.stop();
  684 + isRecording = false;
  685 +
  686 + // 停止所有音轨
  687 + mediaRecorder.stream.getTracks().forEach(track => track.stop());
  688 +
  689 + // 视觉反馈恢复
  690 + $('#voice-record-btn').removeClass('recording-pulse');
  691 + $('#voice-record-btn').css('background-color', '');
  692 +
  693 + // 停止语音识别
  694 + if (recognition) {
  695 + recognition.stop();
  696 + }
  697 +
  698 + // 获取识别的文本并发送
  699 + setTimeout(function() {
  700 + const recognizedText = $('#chat-message').val().trim();
  701 + if (recognizedText) {
  702 + // 发送识别的文本
  703 + fetch('/human', {
  704 + body: JSON.stringify({
  705 + text: recognizedText,
  706 + type: 'chat',
  707 + interrupt: true,
  708 + sessionid: parseInt(document.getElementById('sessionid').value),
  709 + }),
  710 + headers: {
  711 + 'Content-Type': 'application/json'
  712 + },
  713 + method: 'POST'
  714 + });
  715 +
  716 + addChatMessage(recognizedText, 'user');
  717 + $('#chat-message').val('');
  718 + }
  719 + }, 500);
  720 + }
  721 +
  722 + // WebRTC 相关功能
  723 + if (typeof window.onWebRTCConnected === 'function') {
  724 + const originalOnConnected = window.onWebRTCConnected;
  725 + window.onWebRTCConnected = function() {
  726 + updateConnectionStatus('connected');
  727 + if (originalOnConnected) originalOnConnected();
  728 + };
  729 + } else {
  730 + window.onWebRTCConnected = function() {
  731 + updateConnectionStatus('connected');
  732 + };
  733 + }
  734 +
  735 + // 当连接断开时更新状态
  736 + if (typeof window.onWebRTCDisconnected === 'function') {
  737 + const originalOnDisconnected = window.onWebRTCDisconnected;
  738 + window.onWebRTCDisconnected = function() {
  739 + updateConnectionStatus('disconnected');
  740 + if (originalOnDisconnected) originalOnDisconnected();
  741 + };
  742 + } else {
  743 + window.onWebRTCDisconnected = function() {
  744 + updateConnectionStatus('disconnected');
  745 + };
  746 + }
  747 +
  748 + // SRS WebRTC播放功能
  749 + var sdk = null; // 全局处理器,用于在重新发布时进行清理
  750 +
  751 + function startPlay() {
  752 + // 关闭之前的连接
  753 + if (sdk) {
  754 + sdk.close();
  755 + }
  756 +
  757 + sdk = new SrsRtcWhipWhepAsync();
  758 + $('#video').prop('srcObject', sdk.stream);
  759 +
  760 + var host = window.location.hostname;
  761 + var url = "http://" + host + ":1985/rtc/v1/whep/?app=live&stream=livestream";
  762 +
  763 + sdk.play(url).then(function(session) {
  764 + console.log('WebRTC播放已启动,会话ID:', session.sessionid);
  765 + }).catch(function(reason) {
  766 + sdk.close();
  767 + console.error('WebRTC播放失败:', reason);
  768 + });
  769 + }
  770 + });
  771 + </script>
  772 +</body>
  773 +</html>
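
The new dashboard page drives two backend routes: `/human` with `type: 'chat'` or `type: 'echo'` (the same payload is used for typed text and for push-to-talk transcripts) and `/record` with `start_record`/`end_record`. The same requests can be issued from Python; the host, port, and session id below are assumptions matching a default single-session deployment on port 8010, so adjust them to your setup:

```python
# Sketch: drive the endpoints the dashboard page calls, from Python instead of the browser.
import requests

BASE = "http://127.0.0.1:8010"  # assumed default app.py port

def speak(text: str, sessionid: int = 0, mode: str = "echo") -> None:
    """mode='echo' has the avatar read the text verbatim; mode='chat' routes it to the dialogue pipeline."""
    requests.post(f"{BASE}/human", json={
        "text": text,
        "type": mode,
        "interrupt": True,
        "sessionid": sessionid,
    })

def record(start: bool, sessionid: int = 0) -> None:
    requests.post(f"{BASE}/record", json={
        "type": "start_record" if start else "end_record",
        "sessionid": sessionid,
    })

# speak("你好, 欢迎使用livetalking", mode="echo")
# record(True); ...; record(False)
```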