冯杨

Synced with the official GitHub upstream, up to commits on Apr 18, 2025

a9c36c76e569107b5a39b3de8afd6e016b24d662
... ... @@ -15,4 +15,8 @@ pretrained
*.mp4
.DS_Store
workspace/log_ngp.txt
.idea
\ No newline at end of file
.idea
models/
*.log
dist
\ No newline at end of file
... ...
Real-time interactive streaming digital human with synchronized audio and video dialogue. It can basically achieve commercial-grade results.
[Effect of wav2lip](https://www.bilibili.com/video/BV1scwBeyELA/) | [Effect of ernerf](https://www.bilibili.com/video/BV1G1421z73r/) | [Effect of musetalk](https://www.bilibili.com/video/BV1gm421N7vQ/)
## News
- December 8, 2024: Improved multi-concurrency; GPU memory no longer increases with the number of concurrent connections.
- December 21, 2024: Added model warm-up for wav2lip and musetalk to solve the problem of stuttering during the first inference. Thanks to [@heimaojinzhangyz](https://github.com/heimaojinzhangyz)
- December 28, 2024: Added the digital human model Ultralight-Digital-Human. Thanks to [@lijihua2017](https://github.com/lijihua2017)
- February 7, 2025: Added fish-speech tts
- February 21, 2025: Added the open-source model wav2lip256. Thanks to @不蠢不蠢
- March 2, 2025: Added Tencent's speech synthesis service
- March 16, 2025: Supports Mac GPU inference. Thanks to [@GcsSloop](https://github.com/GcsSloop)
## Features
1. Supports multiple digital human models: ernerf, musetalk, wav2lip, Ultralight-Digital-Human
2. Supports voice cloning
3. Supports interrupting the digital human while it is speaking
4. Supports full-body video stitching
5. Supports rtmp and webrtc
6. Supports video arrangement: Play custom videos when not speaking
7. Supports multi-concurrency
## 1. Installation
Tested on Ubuntu 20.04, Python 3.10, PyTorch 1.12 and CUDA 11.3
### 1.1 Install dependency
```bash
conda create -n nerfstream python=3.10
conda activate nerfstream
# If the cuda version is not 11.3 (confirm the version by running nvidia-smi), install the corresponding version of pytorch according to <https://pytorch.org/get-started/previous-versions/>
conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
# If you need to train the ernerf model, install the following libraries
# pip install "git+https://github.com/facebookresearch/pytorch3d.git"
# pip install tensorflow-gpu==2.8.0
# pip install --upgrade "protobuf<=3.20.1"
```
Common installation issues: [FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html)
For setting up the Linux CUDA environment, you can refer to this article: https://zhuanlan.zhihu.com/p/674972886
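As a quick sanity check after installation (this snippet is an addition to this write-up, not part of the upstream README), you can confirm that the installed PyTorch build actually sees a GPU before moving on:
```python
# Sanity check (assumed helper, not from the upstream README):
# print the torch version and whether CUDA or Apple MPS is visible.
import torch

print("torch version:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("mps available:", hasattr(torch.backends, "mps") and torch.backends.mps.is_available())
```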
## 2. Quick Start
- Download the models
Quark Cloud Disk <https://pan.quark.cn/s/83a750323ef0>
Google Drive <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing>
Copy wav2lip256.pth to the models folder of this project and rename it to wav2lip.pth;
Extract wav2lip256_avatar1.tar.gz and copy the entire folder to the data/avatars folder of this project.
- Run
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1
Open http://serverip:8010/webrtcapi.html in a browser. First click 'start' to play the digital human video; then enter any text in the text box and submit it. The digital human will broadcast this text (see the HTTP example at the end of this section).
<font color=red>The server side needs to open ports tcp:8010; udp:1-65536</font>
If you need to purchase a high-definition wav2lip model for commercial use, see this [link](https://livetalking-doc.readthedocs.io/zh-cn/latest/service.html#wav2lip).
- Quick experience
<https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> Create an instance from this image and it will run out of the box.
If you cannot access Hugging Face, set the following before running:
```
export HF_ENDPOINT=https://hf-mirror.com
```
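Besides the web page, the broadcast from the Run step above can also be triggered over HTTP. The following is a minimal sketch, assuming the `/human` endpoint and the JSON fields (`text`, `type`, `interrupt`, `sessionid`) that the bundled dashboard page posts; adjust host, port and session id to your deployment.
```python
# Minimal sketch: ask the running server to read a sentence aloud.
# The endpoint path and field names mirror the bundled web frontend.
import requests

resp = requests.post(
    "http://serverip:8010/human",
    json={
        "text": "Hello, this is a test broadcast.",
        "type": "echo",      # 'echo' = read the text aloud, as the dashboard's read-aloud mode does
        "interrupt": True,   # interrupt any ongoing speech first
        "sessionid": 0,      # default session id used by the web page
    },
    timeout=10,
)
print(resp.status_code, resp.text)
```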
## 3. More Usage
Usage instructions: <https://livetalking-doc.readthedocs.io/en/latest>
## 4. Docker Run
The previous installation steps are not needed; just run it directly.
```
docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v
```
The code is in /root/metahuman-stream. First run git pull to get the latest code, then execute the commands as in steps 2 and 3.
The following images are provided:
- autodl image: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>
[autodl Tutorial](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
- ucloud image: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>
Any port can be opened, and there is no need to additionally deploy an srs service.
[ucloud Tutorial](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
## 5. TODO
- [x] Added chatgpt to enable digital human dialogue
- [x] Voice cloning
- [x] Replace the digital human with a video when it is silent
- [x] MuseTalk
- [x] Wav2Lip
- [x] Ultralight-Digital-Human
---
If this project is helpful to you, please give it a star. Friends who are interested are also welcome to join in and improve this project together.
* Knowledge Planet: https://t.zsxq.com/7NMyO, which accumulates high-quality FAQs, best-practice experience, and solutions to common problems.
* WeChat Official Account: Digital Human Technology
![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&amp;from=appmsg)
\ No newline at end of file
... ...
[English](./README-EN.md) | Chinese
Real-time interactive streaming digital human with synchronized audio and video dialogue. It can basically achieve commercial-grade results.
[Effect of wav2lip](https://www.bilibili.com/video/BV1scwBeyELA/) | [Effect of ernerf](https://www.bilibili.com/video/BV1G1421z73r/) | [Effect of musetalk](https://www.bilibili.com/video/BV1gm421N7vQ/)
## To avoid confusion with 3D digital humans, the original project metahuman-stream has been renamed to livetalking; the original links remain valid
## News
- 2024.12.8 Improved multi-concurrency; GPU memory no longer increases with the number of concurrent connections
- 2024.12.21 Added model warm-up for wav2lip and musetalk to fix stuttering on the first inference. Thanks to [@heimaojinzhangyz](https://github.com/heimaojinzhangyz)
- 2024.12.28 Added the digital human model Ultralight-Digital-Human. Thanks to [@lijihua2017](https://github.com/lijihua2017)
- 2025.2.7 Added fish-speech tts
- 2025.2.21 Added the open-source wav2lip256 model. Thanks to @不蠢不蠢
- 2025.3.2 Added Tencent's speech synthesis service
- 2025.3.16 Supports Mac GPU inference. Thanks to [@GcsSloop](https://github.com/GcsSloop)
## Features
1. Supports multiple digital human models: ernerf, musetalk, wav2lip, Ultralight-Digital-Human
2. Supports voice cloning
3. Supports interrupting the digital human while it is speaking
4. Supports full-body video stitching
5. Supports rtmp and webrtc
6. Supports video arrangement: play custom videos when not speaking
7. Supports multi-concurrency
... ... @@ -33,67 +31,61 @@ Tested on Ubuntu 20.04, Python3.10, Pytorch 1.12 and CUDA 11.3
```bash
conda create -n nerfstream python=3.10
conda activate nerfstream
# If the cuda version is not 11.3 (check with nvidia-smi), install the matching pytorch build according to <https://pytorch.org/get-started/previous-versions/>
conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
# If you need to train the ernerf model, install the following libraries
# pip install "git+https://github.com/facebookresearch/pytorch3d.git"
# pip install tensorflow-gpu==2.8.0
# pip install --upgrade "protobuf<=3.20.1"
```
Common installation issues: [FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html)
For setting up the Linux CUDA environment, you can refer to this article: https://zhuanlan.zhihu.com/p/674972886
## 2. Quick Start
- Download the models
Baidu Cloud Disk <https://pan.baidu.com/s/1yOsQ06-RIDTJd3HFCw4wtA> password: ltua
Quark Cloud Disk <https://pan.quark.cn/s/83a750323ef0>
Google Drive <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing>
Copy wav2lip256.pth to the models folder of this project and rename it to wav2lip.pth;
Extract wav2lip256_avatar1.tar.gz and copy the entire folder to the data/avatars folder of this project.
- Run
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1 --preload 2
To start avatar No. 3 on the GPU: python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar3 --preload 2
Open http://serverip:8010/webrtcapi.html in a browser. First click 'start' to play the digital human video; then enter any text in the text box and submit it. The digital human will broadcast this text.
<font color=red>The server side needs to open ports tcp:8010; udp:1-65536 </font>
If you need a high-definition wav2lip model for commercial use, see this [link](https://livetalking-doc.readthedocs.io/zh-cn/latest/service.html#wav2lip)
- Quick experience
<https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> Create an instance from this image and it will run out of the box.
If you cannot access Hugging Face, set the following before running:
```
export HF_ENDPOINT=https://hf-mirror.com
```
## 3. More Usage
Usage instructions: <https://livetalking-doc.readthedocs.io/>
## 4. Docker Run
The previous installation steps are not needed; just run it directly.
```
docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v
```
The code is in /root/metahuman-stream. First run git pull to get the latest code, then execute the commands as in steps 2 and 3.
The following images are provided:
- autodl image: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>
[autodl tutorial](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
- ucloud image: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>
Any port can be opened, and there is no need to additionally deploy an srs service.
[ucloud tutorial](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
## 5. TODO
- [x] Added chatgpt to enable digital human dialogue
- [x] Voice cloning
- [x] Replace the digital human with a video when it is silent
- [x] MuseTalk
... ... @@ -101,9 +93,8 @@ docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/c
- [x] Ultralight-Digital-Human
---
If this project is helpful to you, please give it a star. Friends who are interested are also welcome to join in and improve this project together.
- Knowledge Planet (知识星球): https://t.zsxq.com/7NMyO, which accumulates high-quality FAQs, best-practice experience, and solutions to common problems
- WeChat Official Account: 数字人技术 (Digital Human Technology)
![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&from=appmsg)
... ...
... ... @@ -201,7 +201,7 @@ async def set_audiotype(request):
params = await request.json()
sessionid = params.get('sessionid',0)
nerfreals[sessionid].set_curr_state(params['audiotype'],params['reinit'])
nerfreals[sessionid].set_custom_state(params['audiotype'],params['reinit'])
return web.Response(
content_type="application/json",
... ... @@ -495,6 +495,8 @@ if __name__ == '__main__':
elif opt.transport=='rtcpush':
pagename='rtcpushapi.html'
logger.info('start http server; http://<serverip>:'+str(opt.listenport)+'/'+pagename)
logger.info('如果使用webrtc,推荐访问webrtc集成前端: http://<serverip>:'+str(opt.listenport)+'/dashboard.html')
def run_server(runner):
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
... ...
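For reference, the `set_audiotype` handler shown above keeps the same request shape after the rename from `set_curr_state` to `set_custom_state`. Below is a hedged sketch of driving it over HTTP; the URL path is assumed from the handler name and `audiotype=2` is only an example value for a previously configured custom video (check your route table and custom-video config).
```python
# Sketch: switch a session to a custom "not speaking" video state.
# The URL path and example audiotype value are assumptions; sessionid/audiotype/reinit
# are the fields read by the handler in the hunk above.
import requests

requests.post(
    "http://serverip:8010/set_audiotype",
    json={"sessionid": 0, "audiotype": 2, "reinit": True},
    timeout=10,
)
```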
... ... @@ -35,7 +35,7 @@ import soundfile as sf
import av
from fractions import Fraction
from ttsreal import EdgeTTS,VoitsTTS,XTTS,CosyVoiceTTS,FishTTS,TencentTTS
from ttsreal import EdgeTTS,SovitsTTS,XTTS,CosyVoiceTTS,FishTTS,TencentTTS
from logger import logger
from tqdm import tqdm
... ... @@ -57,7 +57,7 @@ class BaseReal:
if opt.tts == "edgetts":
self.tts = EdgeTTS(opt,self)
elif opt.tts == "gpt-sovits":
self.tts = VoitsTTS(opt,self)
self.tts = SovitsTTS(opt,self)
elif opt.tts == "xtts":
self.tts = XTTS(opt,self)
elif opt.tts == "cosyvoice":
... ... @@ -66,7 +66,7 @@ class BaseReal:
self.tts = FishTTS(opt,self)
elif opt.tts == "tencent":
self.tts = TencentTTS(opt,self)
self.speaking = False
self.recording = False
... ... @@ -84,11 +84,11 @@ class BaseReal:
def put_msg_txt(self,msg,eventpoint=None):
self.tts.put_msg_txt(msg,eventpoint)
def put_audio_frame(self,audio_chunk,eventpoint=None): #16khz 20ms pcm
self.asr.put_audio_frame(audio_chunk,eventpoint)
def put_audio_file(self,filebyte):
def put_audio_file(self,filebyte):
input_stream = BytesIO(filebyte)
stream = self.__create_bytes_stream(input_stream)
streamlen = stream.shape[0]
... ... @@ -97,7 +97,7 @@ class BaseReal:
self.put_audio_frame(stream[idx:idx+self.chunk])
streamlen -= self.chunk
idx += self.chunk
def __create_bytes_stream(self,byte_stream):
#byte_stream=BytesIO(buffer)
stream, sample_rate = sf.read(byte_stream) # [T*sample_rate,] float64
... ... @@ -107,7 +107,7 @@ class BaseReal:
if stream.ndim > 1:
logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
stream = stream[:, 0]
if sample_rate != self.sample_rate and stream.shape[0]>0:
logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
... ... @@ -120,7 +120,7 @@ class BaseReal:
def is_speaking(self)->bool:
return self.speaking
def __loadcustom(self):
for item in self.opt.customopt:
logger.info(item)
... ... @@ -155,9 +155,9 @@ class BaseReal:
'-s', "{}x{}".format(self.width, self.height),
'-r', str(25),
'-i', '-',
'-pix_fmt', 'yuv420p',
'-pix_fmt', 'yuv420p',
'-vcodec', "h264",
#'-f' , 'flv',
#'-f' , 'flv',
f'temp{self.opt.sessionid}.mp4']
self._record_video_pipe = subprocess.Popen(command, shell=False, stdin=subprocess.PIPE)
... ... @@ -169,7 +169,7 @@ class BaseReal:
'-ar', '16000',
'-i', '-',
'-acodec', 'aac',
#'-f' , 'wav',
#'-f' , 'wav',
f'temp{self.opt.sessionid}.aac']
self._record_audio_pipe = subprocess.Popen(acommand, shell=False, stdin=subprocess.PIPE)
... ... @@ -177,10 +177,10 @@ class BaseReal:
# self.recordq_video.queue.clear()
# self.recordq_audio.queue.clear()
# self.container = av.open(path, mode="w")
# process_thread = Thread(target=self.record_frame, args=())
# process_thread.start()
def record_video_data(self,image):
if self.width == 0:
print("image.shape:",image.shape)
... ... @@ -191,14 +191,14 @@ class BaseReal:
def record_audio_data(self,frame):
if self.recording:
self._record_audio_pipe.stdin.write(frame.tostring())
# def record_frame(self):
# def record_frame(self):
# videostream = self.container.add_stream("libx264", rate=25)
# videostream.codec_context.time_base = Fraction(1, 25)
# audiostream = self.container.add_stream("aac")
# audiostream.codec_context.time_base = Fraction(1, 16000)
# init = True
# framenum = 0
# framenum = 0
# while self.recording:
# try:
# videoframe = self.recordq_video.get(block=True, timeout=1)
... ... @@ -231,18 +231,18 @@ class BaseReal:
# self.recordq_video.queue.clear()
# self.recordq_audio.queue.clear()
# print('record thread stop')
def stop_recording(self):
"""停止录制视频"""
if not self.recording:
return
self.recording = False
self._record_video_pipe.stdin.close() #wait()
self.recording = False
self._record_video_pipe.stdin.close() #wait()
self._record_video_pipe.wait()
self._record_audio_pipe.stdin.close()
self._record_audio_pipe.wait()
cmd_combine_audio = f"ffmpeg -y -i temp{self.opt.sessionid}.aac -i temp{self.opt.sessionid}.mp4 -c:v copy -c:a copy data/record.mp4"
os.system(cmd_combine_audio)
os.system(cmd_combine_audio)
#os.remove(output_path)
def mirror_index(self,size, index):
... ... @@ -252,8 +252,8 @@ class BaseReal:
if turn % 2 == 0:
return res
else:
return size - res - 1
return size - res - 1
def get_audio_stream(self,audiotype):
idx = self.custom_audio_index[audiotype]
stream = self.custom_audio_cycle[audiotype][idx:idx+self.chunk]
... ... @@ -261,9 +261,9 @@ class BaseReal:
if self.custom_audio_index[audiotype]>=self.custom_audio_cycle[audiotype].shape[0]:
self.curr_state = 1 #the current video does not loop; switch to the silent state
return stream
def set_curr_state(self,audiotype, reinit):
print('set_curr_state:',audiotype)
def set_custom_state(self,audiotype, reinit=True):
print('set_custom_state:',audiotype)
self.curr_state = audiotype
if reinit:
self.custom_audio_index[audiotype] = 0
... ...
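The `mirror_index` helper in this file (and again in `musereal.py`) is what makes looping avatar frames play forward and then backward instead of snapping back to the first frame. A standalone sketch of the same logic, with the resulting index sequence as an example:
```python
# Standalone sketch of BaseReal.mirror_index: ping-pong through a list of `size` frames.
def mirror_index(size: int, index: int) -> int:
    turn, res = divmod(index, size)
    return res if turn % 2 == 0 else size - res - 1

# For size=3 the indices go 0,1,2,2,1,0,0,1,... instead of 0,1,2,0,1,2,...
assert [mirror_index(3, i) for i in range(8)] == [0, 1, 2, 2, 1, 0, 0, 1]
```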
... ... @@ -179,8 +179,11 @@ print(f'[INFO] fitting light...')
batch_size = 32
device_default = torch.device("cuda:0")
device_render = torch.device("cuda:0")
device_default = torch.device("cuda:0" if torch.cuda.is_available() else (
"mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
device_render = torch.device("cuda:0" if torch.cuda.is_available() else (
"mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
renderer = Render_3DMM(arg_focal, h, w, batch_size, device_render)
sel_ids = np.arange(0, num_frames, int(num_frames / batch_size))[:batch_size]
... ...
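The same `cuda → mps → cpu` fallback expression is inlined at each call site throughout this change set. Functionally it is equivalent to a small helper like the following sketch (the helper itself is not part of the diff):
```python
# Equivalent device-selection helper (assumed refactoring, not in the upstream code):
# prefer CUDA, then Apple-Silicon MPS, then CPU.
import torch

def pick_device(gpu_index: int = 0) -> torch.device:
    if torch.cuda.is_available():
        return torch.device(f"cuda:{gpu_index}")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```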
... ... @@ -83,7 +83,7 @@ class Render_3DMM(nn.Module):
img_h=500,
img_w=500,
batch_size=1,
device=torch.device("cuda:0"),
device=torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")),
):
super(Render_3DMM, self).__init__()
... ...
... ... @@ -147,7 +147,7 @@ if __name__ == '__main__':
seed_everything(opt.seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
model = NeRFNetwork(opt)
... ...
... ... @@ -442,7 +442,7 @@ class LPIPSMeter:
self.N = 0
self.net = net
self.device = device if device is not None else torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.device = device if device is not None else torch.device('cuda' if torch.cuda.is_available() else ('mps' if hasattr(torch.backends, "mps") and torch.backends.mps.is_available() else 'cpu'))
self.fn = lpips.LPIPS(net=net).eval().to(self.device)
def clear(self):
... ... @@ -456,13 +456,13 @@ class LPIPSMeter:
inp = inp.to(self.device)
outputs.append(inp)
return outputs
def update(self, preds, truths):
preds, truths = self.prepare_inputs(preds, truths) # [B, H, W, 3] --> [B, 3, H, W], range in [0, 1]
v = self.fn(truths, preds, normalize=True).item() # normalize=True: [0, 1] to [-1, 1]
self.V += v
self.N += 1
def measure(self):
return self.V / self.N
... ... @@ -499,7 +499,7 @@ class LMDMeter:
self.V = 0
self.N = 0
def get_landmarks(self, img):
if self.backend == 'dlib':
... ... @@ -515,7 +515,7 @@ class LMDMeter:
else:
lms = self.predictor.get_landmarks(img)[-1]
# self.vis_landmarks(img, lms)
lms = lms.astype(np.float32)
... ... @@ -537,7 +537,7 @@ class LMDMeter:
inp = (inp * 255).astype(np.uint8)
outputs.append(inp)
return outputs
def update(self, preds, truths):
# assert B == 1
preds, truths = self.prepare_inputs(preds[0], truths[0]) # [H, W, 3] numpy array
... ... @@ -553,13 +553,13 @@ class LMDMeter:
# avarage
lms_pred = lms_pred - lms_pred.mean(0)
lms_truth = lms_truth - lms_truth.mean(0)
# distance
dist = np.sqrt(((lms_pred - lms_truth) ** 2).sum(1)).mean(0)
self.V += dist
self.N += 1
def measure(self):
return self.V / self.N
... ... @@ -567,14 +567,14 @@ class LMDMeter:
writer.add_scalar(os.path.join(prefix, f"LMD ({self.backend})"), self.measure(), global_step)
def report(self):
return f'LMD ({self.backend}) = {self.measure():.6f}'
return f'LMD ({self.backend}) = {self.measure():.6f}'
class Trainer(object):
def __init__(self,
def __init__(self,
name, # name of this experiment
opt, # extra conf
model, # network
model, # network
criterion=None, # loss function, if None, assume inline implementation in train_step
optimizer=None, # optimizer
ema_decay=None, # if use EMA, set the decay
... ... @@ -596,7 +596,7 @@ class Trainer(object):
use_tensorboardX=True, # whether to use tensorboard for logging
scheduler_update_every_step=False, # whether to call scheduler.step() after every train step
):
self.name = name
self.opt = opt
self.mute = mute
... ... @@ -618,7 +618,11 @@ class Trainer(object):
self.flip_init_lips = self.opt.init_lips
self.time_stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
self.scheduler_update_every_step = scheduler_update_every_step
self.device = device if device is not None else torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu')
self.device = device if device is not None else torch.device(
f'cuda:{local_rank}' if torch.cuda.is_available() else (
'mps' if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else 'cpu'
)
)
self.console = Console()
model.to(self.device)
... ...
... ... @@ -56,10 +56,8 @@ from ultralight.unet import Model
from ultralight.audio2feature import Audio2Feature
from logger import logger
device = 'cuda' if torch.cuda.is_available() else 'cpu'
logger.info('Using {} for inference.'.format(device))
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
print('Using {} for inference.'.format(device))
def load_model(opt):
audio_processor = Audio2Feature()
... ...
... ... @@ -44,8 +44,8 @@ from basereal import BaseReal
from tqdm import tqdm
from logger import logger
device = 'cuda' if torch.cuda.is_available() else 'cpu'
logger.info('Using {} for inference.'.format(device))
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
print('Using {} for inference.'.format(device))
def _load(checkpoint_path):
if device == 'cuda':
... ...
... ... @@ -51,7 +51,7 @@ from logger import logger
def load_model():
# load model weights
audio_processor,vae, unet, pe = load_all_model()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
timesteps = torch.tensor([0], device=device)
pe = pe.half()
vae.vae = vae.vae.half()
... ... @@ -64,7 +64,7 @@ def load_avatar(avatar_id):
#self.video_path = '' #video_path
#self.bbox_shift = opt.bbox_shift
avatar_path = f"./data/avatars/{avatar_id}"
full_imgs_path = f"{avatar_path}/full_imgs"
full_imgs_path = f"{avatar_path}/full_imgs"
coords_path = f"{avatar_path}/coords.pkl"
latents_out_path= f"{avatar_path}/latents.pt"
video_out_path = f"{avatar_path}/vid_output/"
... ... @@ -74,7 +74,7 @@ def load_avatar(avatar_id):
# self.avatar_info = {
# "avatar_id":self.avatar_id,
# "video_path":self.video_path,
# "bbox_shift":self.bbox_shift
# "bbox_shift":self.bbox_shift
# }
input_latent_list_cycle = torch.load(latents_out_path) #,weights_only=True
... ... @@ -124,19 +124,19 @@ def __mirror_index(size, index):
if turn % 2 == 0:
return res
else:
return size - res - 1
return size - res - 1
@torch.no_grad()
def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,audio_out_queue,res_frame_queue,
vae, unet, pe,timesteps): #vae, unet, pe,timesteps
# vae, unet, pe = load_diffusion_model()
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# timesteps = torch.tensor([0], device=device)
# pe = pe.half()
# vae.vae = vae.vae.half()
# unet.model = unet.model.half()
length = len(input_latent_list_cycle)
index = 0
count=0
... ... @@ -169,7 +169,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
latent = input_latent_list_cycle[idx]
latent_batch.append(latent)
latent_batch = torch.cat(latent_batch, dim=0)
# for i, (whisper_batch,latent_batch) in enumerate(gen):
audio_feature_batch = torch.from_numpy(whisper_batch)
audio_feature_batch = audio_feature_batch.to(device=unet.device,
... ... @@ -179,8 +179,8 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
# print('prepare time:',time.perf_counter()-t)
# t=time.perf_counter()
pred_latents = unet.model(latent_batch,
timesteps,
pred_latents = unet.model(latent_batch,
timesteps,
encoder_hidden_states=audio_feature_batch).sample
# print('unet time:',time.perf_counter()-t)
# t=time.perf_counter()
... ... @@ -203,7 +203,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
#self.__pushmedia(res_frame,loop,audio_track,video_track)
res_frame_queue.put((res_frame,__mirror_index(length,index),audio_frames[i*2:i*2+2]))
index = index + 1
#print('total batch time:',time.perf_counter()-starttime)
#print('total batch time:',time.perf_counter()-starttime)
logger.info('musereal inference processor stop')
class MuseReal(BaseReal):
... ... @@ -226,12 +226,12 @@ class MuseReal(BaseReal):
self.asr = MuseASR(opt,self,self.audio_processor)
self.asr.warm_up()
self.render_event = mp.Event()
def __del__(self):
logger.info(f'musereal({self.sessionid}) delete')
def __mirror_index(self, index):
size = len(self.coord_list_cycle)
... ... @@ -240,9 +240,9 @@ class MuseReal(BaseReal):
if turn % 2 == 0:
return res
else:
return size - res - 1
return size - res - 1
def __warm_up(self):
def __warm_up(self):
self.asr.run_step()
whisper_chunks = self.asr.get_next_feat()
whisper_batch = np.stack(whisper_chunks)
... ... @@ -260,30 +260,57 @@ class MuseReal(BaseReal):
audio_feature_batch = self.pe(audio_feature_batch)
latent_batch = latent_batch.to(dtype=self.unet.model.dtype)
pred_latents = self.unet.model(latent_batch,
self.timesteps,
pred_latents = self.unet.model(latent_batch,
self.timesteps,
encoder_hidden_states=audio_feature_batch).sample
recon = self.vae.decode_latents(pred_latents)
def process_frames(self,quit_event,loop=None,audio_track=None,video_track=None):
enable_transition = True  # set to False to disable the transition effect, True to enable it
if enable_transition:
self.last_speaking = False
self.transition_start = time.time()
self.transition_duration = 0.1  # transition duration in seconds
self.last_silent_frame = None  # cached silent frame
self.last_speaking_frame = None  # cached speaking frame
while not quit_event.is_set():
try:
res_frame,idx,audio_frames = self.res_frame_queue.get(block=True, timeout=1)
except queue.Empty:
continue
if audio_frames[0][1]!=0 and audio_frames[1][1]!=0: #all-silent data, just take the full image
if enable_transition:
# detect state changes
current_speaking = not (audio_frames[0][1]!=0 and audio_frames[1][1]!=0)
if current_speaking != self.last_speaking:
logger.info(f"状态切换:{'说话' if self.last_speaking else '静音'} → {'说话' if current_speaking else '静音'}")
self.transition_start = time.time()
self.last_speaking = current_speaking
if audio_frames[0][1]!=0 and audio_frames[1][1]!=0:
self.speaking = False
audiotype = audio_frames[0][1]
if self.custom_index.get(audiotype) is not None: #a custom video is configured
if self.custom_index.get(audiotype) is not None:
mirindex = self.mirror_index(len(self.custom_img_cycle[audiotype]),self.custom_index[audiotype])
combine_frame = self.custom_img_cycle[audiotype][mirindex]
target_frame = self.custom_img_cycle[audiotype][mirindex]
self.custom_index[audiotype] += 1
# if not self.custom_opt[audiotype].loop and self.custom_index[audiotype]>=len(self.custom_img_cycle[audiotype]):
# self.curr_state = 1 #the current video does not loop; switch to the silent state
else:
combine_frame = self.frame_list_cycle[idx]
target_frame = self.frame_list_cycle[idx]
if enable_transition:
# speaking → silent transition
if time.time() - self.transition_start < self.transition_duration and self.last_speaking_frame is not None:
alpha = min(1.0, (time.time() - self.transition_start) / self.transition_duration)
combine_frame = cv2.addWeighted(self.last_speaking_frame, 1-alpha, target_frame, alpha, 0)
else:
combine_frame = target_frame
# cache the silent frame
self.last_silent_frame = combine_frame.copy()
else:
combine_frame = target_frame
else:
self.speaking = True
bbox = self.coord_list_cycle[idx]
... ... @@ -291,20 +318,29 @@ class MuseReal(BaseReal):
x1, y1, x2, y2 = bbox
try:
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
except:
except Exception as e:
logger.warning(f"resize error: {e}")
continue
mask = self.mask_list_cycle[idx]
mask_crop_box = self.mask_coords_list_cycle[idx]
#combine_frame = get_image(ori_frame,res_frame,bbox)
#t=time.perf_counter()
combine_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
#print('blending time:',time.perf_counter()-t)
image = combine_frame #(outputs['image'] * 255).astype(np.uint8)
current_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
if enable_transition:
# silent → speaking transition
if time.time() - self.transition_start < self.transition_duration and self.last_silent_frame is not None:
alpha = min(1.0, (time.time() - self.transition_start) / self.transition_duration)
combine_frame = cv2.addWeighted(self.last_silent_frame, 1-alpha, current_frame, alpha, 0)
else:
combine_frame = current_frame
# cache the speaking frame
self.last_speaking_frame = combine_frame.copy()
else:
combine_frame = current_frame
image = combine_frame
new_frame = VideoFrame.from_ndarray(image, format="bgr24")
asyncio.run_coroutine_threadsafe(video_track._queue.put((new_frame,None)), loop)
self.record_video_data(image)
#self.recordq_video.put(new_frame)
for audio_frame in audio_frames:
frame,type,eventpoint = audio_frame
... ... @@ -312,12 +348,8 @@ class MuseReal(BaseReal):
new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0])
new_frame.planes[0].update(frame.tobytes())
new_frame.sample_rate=16000
# if audio_track._queue.qsize()>10:
# time.sleep(0.1)
asyncio.run_coroutine_threadsafe(audio_track._queue.put((new_frame,eventpoint)), loop)
self.record_audio_data(frame)
#self.notify(eventpoint)
#self.recordq_audio.put(new_frame)
logger.info('musereal process_frames thread stop')
def render(self,quit_event,loop=None,audio_track=None,video_track=None):
... ...
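The transition logic added above crossfades between a cached frame from the previous state and the current target frame during the first `transition_duration` seconds after a speaking/silent switch. Reduced to a standalone sketch (assuming both frames are same-sized BGR uint8 images, as they are in `process_frames`):
```python
# Standalone sketch of the speaking <-> silent crossfade used in process_frames above.
import time
import cv2

def blend_transition(prev_frame, target_frame, transition_start, transition_duration=0.1):
    """Blend prev_frame into target_frame for transition_duration seconds after a state switch."""
    if prev_frame is None:
        return target_frame
    elapsed = time.time() - transition_start
    if elapsed >= transition_duration:
        return target_frame
    alpha = min(1.0, elapsed / transition_duration)
    # weighted sum: starts close to prev_frame (alpha ~ 0) and ends at target_frame (alpha ~ 1)
    return cv2.addWeighted(prev_frame, 1 - alpha, target_frame, alpha, 0)
```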
... ... @@ -36,7 +36,7 @@ class UNet():
unet_config = json.load(f)
self.model = UNet2DConditionModel(**unet_config)
self.pe = PositionalEncoding(d_model=384)
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device)
self.model.load_state_dict(weights)
if use_float16:
... ...
... ... @@ -23,7 +23,7 @@ class VAE():
self.model_path = model_path
self.vae = AutoencoderKL.from_pretrained(self.model_path)
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
self.vae.to(self.device)
if use_float16:
... ...
... ... @@ -325,7 +325,7 @@ def create_musetalk_human(file, avatar_id):
# initialize the mmpose model
device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
fa = FaceAlignment(1, flip_input=False, device=device)
config_file = os.path.join(current_dir, 'utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py')
checkpoint_file = os.path.abspath(os.path.join(current_dir, '../models/dwpose/dw-ll_ucoco_384.pth'))
... ...
... ... @@ -13,14 +13,14 @@ import torch
from tqdm import tqdm
# initialize the mmpose model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
config_file = './musetalk/utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py'
checkpoint_file = './models/dwpose/dw-ll_ucoco_384.pth'
model = init_model(config_file, checkpoint_file, device=device)
# initialize the face detection model
device = "cuda" if torch.cuda.is_available() else "cpu"
fa = FaceAlignment(LandmarksType._2D, flip_input=False,device=device)
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
fa = FaceAlignment(LandmarksType._2D, flip_input=False, device=device)
# maker if the bbox is not sufficient
coord_placeholder = (0.0,0.0,0.0,0.0)
... ...
... ... @@ -91,7 +91,7 @@ def load_model(name: str, device: Optional[Union[str, torch.device]] = None, dow
"""
if device is None:
device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
if download_root is None:
download_root = os.getenv(
"XDG_CACHE_HOME",
... ...
... ... @@ -78,17 +78,19 @@ def transcribe(
if dtype == torch.float16:
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
dtype = torch.float32
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
warnings.warn("Performing inference on CPU when MPS is available")
if dtype == torch.float32:
decode_options["fp16"] = False
mel = log_mel_spectrogram(audio)
all_segments = []
def add_segment(
*, start: float, end: float, encoder_embeddings
):
all_segments.append(
{
"start": start,
... ... @@ -100,20 +102,20 @@ def transcribe(
num_frames = mel.shape[-1]
seek = 0
previous_seek_value = seek
sample_skip = 3000 #
sample_skip = 3000 #
with tqdm.tqdm(total=num_frames, unit='frames', disable=verbose is not False) as pbar:
while seek < num_frames:
# seek is the starting frame index
end_seek = min(seek + sample_skip, num_frames)
segment = pad_or_trim(mel[:,seek:seek+sample_skip], N_FRAMES).to(model.device).to(dtype)
single = segment.ndim == 2
if single:
segment = segment.unsqueeze(0)
if dtype == torch.float16:
segment = segment.half()
audio_features, embeddings = model.encoder(segment, include_embeddings = True)
encoder_embeddings = embeddings
#print(f"encoder_embeddings shape {encoder_embeddings.shape}")
add_segment(
... ... @@ -124,7 +126,7 @@ def transcribe(
encoder_embeddings=encoder_embeddings,
)
seek+=sample_skip
return dict(segments=all_segments)
... ... @@ -135,7 +137,7 @@ def cli():
parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use")
parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "mps", help="device to use for PyTorch inference")
parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")
... ...
... ... @@ -30,7 +30,7 @@ class NerfASR(BaseASR):
def __init__(self, opt, parent, audio_processor,audio_model):
super().__init__(opt,parent)
self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
self.device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
if 'esperanto' in self.opt.asr_model:
self.audio_dim = 44
elif 'deepspeech' in self.opt.asr_model:
... ...
... ... @@ -77,7 +77,7 @@ def load_model(opt):
seed_everything(opt.seed)
logger.info(opt)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cuda' if torch.cuda.is_available() else ('mps' if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else 'cpu'))
model = NeRFNetwork(opt)
criterion = torch.nn.MSELoss(reduction='none')
... ...
... ... @@ -90,7 +90,7 @@ class BaseTTS:
###########################################################################################
class EdgeTTS(BaseTTS):
def txt_to_audio(self,msg):
voicename = "zh-CN-XiaoxiaoNeural"
voicename = "zh-CN-YunxiaNeural"
text,textevent = msg
t = time.time()
asyncio.new_event_loop().run_until_complete(self.__main(voicename,text))
... ... @@ -98,7 +98,7 @@ class EdgeTTS(BaseTTS):
if self.input_stream.getbuffer().nbytes<=0: #edgetts err
logger.error('edgetts err!!!!!')
return
self.input_stream.seek(0)
stream = self.__create_bytes_stream(self.input_stream)
streamlen = stream.shape[0]
... ... @@ -107,15 +107,15 @@ class EdgeTTS(BaseTTS):
eventpoint=None
streamlen -= self.chunk
if idx==0:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
elif streamlen<self.chunk:
eventpoint={'status':'end','text':text,'msgenvent':textevent}
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
idx += self.chunk
#if streamlen>0: #skip last frame(not 20ms)
# self.queue.put(stream[idx:])
self.input_stream.seek(0)
self.input_stream.truncate()
self.input_stream.truncate()
def __create_bytes_stream(self,byte_stream):
#byte_stream=BytesIO(buffer)
... ... @@ -126,13 +126,13 @@ class EdgeTTS(BaseTTS):
if stream.ndim > 1:
logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
stream = stream[:, 0]
if sample_rate != self.sample_rate and stream.shape[0]>0:
logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
return stream
async def __main(self,voicename: str, text: str):
try:
communicate = edge_tts.Communicate(text, voicename)
... ... @@ -153,12 +153,12 @@ class EdgeTTS(BaseTTS):
###########################################################################################
class FishTTS(BaseTTS):
def txt_to_audio(self,msg):
def txt_to_audio(self,msg):
text,textevent = msg
self.stream_tts(
self.fish_speech(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -190,9 +190,9 @@ class FishTTS(BaseTTS):
if res.status_code != 200:
logger.error("Error:%s", res.text)
return
first = True
for chunk in res.iter_content(chunk_size=17640): # 1764 44100*20ms*2
#print('chunk len:',len(chunk))
if first:
... ... @@ -209,7 +209,7 @@ class FishTTS(BaseTTS):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = resampy.resample(x=stream, sr_orig=44100, sr_new=self.sample_rate)
#byte_stream=BytesIO(buffer)
... ... @@ -219,22 +219,22 @@ class FishTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
class VoitsTTS(BaseTTS):
def txt_to_audio(self,msg):
class SovitsTTS(BaseTTS):
def txt_to_audio(self,msg):
text,textevent = msg
self.stream_tts(
self.gpt_sovits(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -271,9 +271,9 @@ class VoitsTTS(BaseTTS):
if res.status_code != 200:
logger.error("Error:%s", res.text)
return
first = True
for chunk in res.iter_content(chunk_size=None): #12800 1280 32K*20ms*2
logger.info('chunk len:%d',len(chunk))
if first:
... ... @@ -295,7 +295,7 @@ class VoitsTTS(BaseTTS):
if stream.ndim > 1:
logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
stream = stream[:, 0]
if sample_rate != self.sample_rate and stream.shape[0]>0:
logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
... ... @@ -306,7 +306,7 @@ class VoitsTTS(BaseTTS):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
#stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
#stream = resampy.resample(x=stream, sr_orig=32000, sr_new=self.sample_rate)
byte_stream=BytesIO(chunk)
... ... @@ -316,22 +316,22 @@ class VoitsTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
class CosyVoiceTTS(BaseTTS):
def txt_to_audio(self,msg):
text,textevent = msg
text,textevent = msg
self.stream_tts(
self.cosy_voice(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -348,16 +348,16 @@ class CosyVoiceTTS(BaseTTS):
try:
files = [('prompt_wav', ('prompt_wav', open(reffile, 'rb'), 'application/octet-stream'))]
res = requests.request("GET", f"{server_url}/inference_zero_shot", data=payload, files=files, stream=True)
end = time.perf_counter()
logger.info(f"cosy_voice Time to make POST: {end-start}s")
if res.status_code != 200:
logger.error("Error:%s", res.text)
return
first = True
for chunk in res.iter_content(chunk_size=9600): # 960 24K*20ms*2
if first:
end = time.perf_counter()
... ... @@ -372,7 +372,7 @@ class CosyVoiceTTS(BaseTTS):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
#byte_stream=BytesIO(buffer)
... ... @@ -382,13 +382,13 @@ class CosyVoiceTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
_PROTOCOL = "https://"
... ... @@ -407,7 +407,7 @@ class TencentTTS(BaseTTS):
self.sample_rate = 16000
self.volume = 0
self.speed = 0
def __gen_signature(self, params):
sort_dict = sorted(params.keys())
sign_str = "POST" + _HOST + _PATH + "?"
... ... @@ -440,11 +440,11 @@ class TencentTTS(BaseTTS):
return params
def txt_to_audio(self,msg):
text,textevent = msg
text,textevent = msg
self.stream_tts(
self.tencent_voice(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -465,12 +465,12 @@ class TencentTTS(BaseTTS):
try:
res = requests.post(url, headers=headers,
data=json.dumps(params), stream=True)
end = time.perf_counter()
logger.info(f"tencent Time to make POST: {end-start}s")
first = True
for chunk in res.iter_content(chunk_size=6400): # 640 16K*20ms*2
#logger.info('chunk len:%d',len(chunk))
if first:
... ... @@ -483,7 +483,7 @@ class TencentTTS(BaseTTS):
except:
end = time.perf_counter()
logger.info(f"tencent Time to first chunk: {end-start}s")
first = False
first = False
if chunk and self.state==State.RUNNING:
yield chunk
except Exception as e:
... ... @@ -494,7 +494,7 @@ class TencentTTS(BaseTTS):
first = True
last_stream = np.array([],dtype=np.float32)
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = np.concatenate((last_stream,stream))
#stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
... ... @@ -505,14 +505,14 @@ class TencentTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
last_stream = stream[idx:] #get the remain stream
eventpoint={'status':'end','text':text,'msgenvent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
... ... @@ -522,7 +522,7 @@ class XTTS(BaseTTS):
self.speaker = self.get_speaker(opt.REF_FILE, opt.TTS_SERVER)
def txt_to_audio(self,msg):
text,textevent = msg
text,textevent = msg
self.stream_tts(
self.xtts(
text,
... ... @@ -558,7 +558,7 @@ class XTTS(BaseTTS):
return
first = True
for chunk in res.iter_content(chunk_size=9600): #24K*20ms*2
if first:
end = time.perf_counter()
... ... @@ -568,12 +568,12 @@ class XTTS(BaseTTS):
yield chunk
except Exception as e:
print(e)
def stream_tts(self,audio_stream,msg):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
#byte_stream=BytesIO(buffer)
... ... @@ -583,10 +583,10 @@ class XTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
\ No newline at end of file
... ...
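All of the TTS classes above share the same `stream_tts` pattern: decode the HTTP audio stream, resample it to the session sample rate, slice it into fixed 20 ms chunks, and attach `start`/`end` eventpoints (note the `msgenvent` → `msgevent` key fix in this change). A condensed sketch of that pattern, not a drop-in replacement:
```python
# Condensed sketch of the shared stream_tts chunking pattern.
import numpy as np

def stream_pcm(parent, pcm: np.ndarray, chunk: int, text: str, textevent) -> None:
    """Feed float32 PCM to parent.put_audio_frame in fixed-size chunks with start/end eventpoints."""
    first = True
    idx, remaining = 0, pcm.shape[0]
    while remaining >= chunk:
        eventpoint = None
        if first:
            eventpoint = {'status': 'start', 'text': text, 'msgevent': textevent}
            first = False
        parent.put_audio_frame(pcm[idx:idx + chunk], eventpoint)
        remaining -= chunk
        idx += chunk
    # signal the end with one silent frame, as the classes above do
    parent.put_audio_frame(np.zeros(chunk, np.float32),
                           {'status': 'end', 'text': text, 'msgevent': textevent})
```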
... ... @@ -236,7 +236,7 @@ if __name__ == '__main__':
if hasattr(module, 'reparameterize'):
module.reparameterize()
return model
device = torch.device("cuda")
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
def check_onnx(torch_out, torch_in, audio):
onnx_model = onnx.load(onnx_path)
onnx.checker.check_model(onnx_model)
... ...
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>livetalking数字人交互平台</title>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.10.0/font/bootstrap-icons.css">
<style>
:root {
--primary-color: #4361ee;
--secondary-color: #3f37c9;
--accent-color: #4895ef;
--background-color: #f8f9fa;
--card-bg: #ffffff;
--text-color: #212529;
--border-radius: 10px;
--box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: var(--background-color);
color: var(--text-color);
min-height: 100vh;
padding-top: 20px;
}
.dashboard-container {
max-width: 1400px;
margin: 0 auto;
padding: 20px;
}
.card {
background-color: var(--card-bg);
border-radius: var(--border-radius);
box-shadow: var(--box-shadow);
border: none;
margin-bottom: 20px;
overflow: hidden;
}
.card-header {
background-color: var(--primary-color);
color: white;
font-weight: 600;
padding: 15px 20px;
border-bottom: none;
}
.video-container {
position: relative;
width: 100%;
background-color: #000;
border-radius: var(--border-radius);
overflow: hidden;
display: flex;
justify-content: center;
align-items: center;
}
video {
max-width: 100%;
max-height: 100%;
display: block;
border-radius: var(--border-radius);
}
.controls-container {
padding: 20px;
}
.btn-primary {
background-color: var(--primary-color);
border-color: var(--primary-color);
}
.btn-primary:hover {
background-color: var(--secondary-color);
border-color: var(--secondary-color);
}
.btn-outline-primary {
color: var(--primary-color);
border-color: var(--primary-color);
}
.btn-outline-primary:hover {
background-color: var(--primary-color);
color: white;
}
.form-control {
border-radius: var(--border-radius);
padding: 10px 15px;
border: 1px solid #ced4da;
}
.form-control:focus {
border-color: var(--accent-color);
box-shadow: 0 0 0 0.25rem rgba(67, 97, 238, 0.25);
}
.status-indicator {
width: 10px;
height: 10px;
border-radius: 50%;
display: inline-block;
margin-right: 5px;
}
.status-connected {
background-color: #28a745;
}
.status-disconnected {
background-color: #dc3545;
}
.status-connecting {
background-color: #ffc107;
}
.asr-container {
height: 300px;
overflow-y: auto;
padding: 15px;
background-color: #f8f9fa;
border-radius: var(--border-radius);
border: 1px solid #ced4da;
}
.asr-text {
margin-bottom: 10px;
padding: 10px;
background-color: white;
border-radius: var(--border-radius);
box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);
}
.user-message {
background-color: #e3f2fd;
border-left: 4px solid var(--primary-color);
}
.system-message {
background-color: #f1f8e9;
border-left: 4px solid #8bc34a;
}
.recording-indicator {
position: absolute;
top: 15px;
right: 15px;
background-color: rgba(220, 53, 69, 0.8);
color: white;
padding: 5px 10px;
border-radius: 20px;
font-size: 0.8rem;
display: none;
}
.recording-indicator.active {
display: flex;
align-items: center;
}
.recording-indicator .blink {
width: 10px;
height: 10px;
background-color: #fff;
border-radius: 50%;
margin-right: 5px;
animation: blink 1s infinite;
}
@keyframes blink {
0% { opacity: 1; }
50% { opacity: 0.3; }
100% { opacity: 1; }
}
.mode-switch {
margin-bottom: 20px;
}
.nav-tabs .nav-link {
color: var(--text-color);
border: none;
padding: 10px 20px;
border-radius: var(--border-radius) var(--border-radius) 0 0;
}
.nav-tabs .nav-link.active {
color: var(--primary-color);
background-color: var(--card-bg);
border-bottom: 3px solid var(--primary-color);
font-weight: 600;
}
.tab-content {
padding: 20px;
background-color: var(--card-bg);
border-radius: 0 0 var(--border-radius) var(--border-radius);
}
.settings-panel {
padding: 15px;
background-color: #f8f9fa;
border-radius: var(--border-radius);
margin-top: 15px;
}
.footer {
text-align: center;
margin-top: 30px;
padding: 20px 0;
color: #6c757d;
font-size: 0.9rem;
}
.voice-record-btn {
width: 60px;
height: 60px;
border-radius: 50%;
background-color: var(--primary-color);
color: white;
display: flex;
justify-content: center;
align-items: center;
cursor: pointer;
transition: all 0.2s ease;
box-shadow: 0 2px 5px rgba(0,0,0,0.2);
margin: 0 auto;
}
.voice-record-btn:hover {
background-color: var(--secondary-color);
transform: scale(1.05);
}
.voice-record-btn:active {
background-color: #dc3545;
transform: scale(0.95);
}
.voice-record-btn i {
font-size: 24px;
}
.voice-record-label {
text-align: center;
margin-top: 10px;
font-size: 14px;
color: #6c757d;
}
.video-size-control {
margin-top: 15px;
}
.recording-pulse {
animation: pulse 1.5s infinite;
}
@keyframes pulse {
0% {
box-shadow: 0 0 0 0 rgba(220, 53, 69, 0.7);
}
70% {
box-shadow: 0 0 0 15px rgba(220, 53, 69, 0);
}
100% {
box-shadow: 0 0 0 0 rgba(220, 53, 69, 0);
}
}
</style>
</head>
<body>
<div class="dashboard-container">
<div class="row">
<div class="col-12">
<h1 class="text-center mb-4">livetalking数字人交互平台</h1>
</div>
</div>
<div class="row">
<!-- Video area -->
<div class="col-lg-8">
<div class="card">
<div class="card-header d-flex justify-content-between align-items-center">
<div>
<span class="status-indicator status-disconnected" id="connection-status"></span>
<span id="status-text">未连接</span>
</div>
</div>
<div class="card-body p-0">
<div class="video-container">
<video id="video" autoplay playsinline></video>
<div class="recording-indicator" id="recording-indicator">
<div class="blink"></div>
<span>录制中</span>
</div>
</div>
<div class="controls-container">
<div class="row">
<div class="col-md-6 mb-3">
<button class="btn btn-primary w-100" id="start">
<i class="bi bi-play-fill"></i> 开始连接
</button>
<button class="btn btn-danger w-100" id="stop" style="display: none;">
<i class="bi bi-stop-fill"></i> 停止连接
</button>
</div>
<div class="col-md-6 mb-3">
<div class="d-flex">
<button class="btn btn-outline-primary flex-grow-1 me-2" id="btn_start_record">
<i class="bi bi-record-fill"></i> 开始录制
</button>
<button class="btn btn-outline-danger flex-grow-1" id="btn_stop_record" disabled>
<i class="bi bi-stop-fill"></i> 停止录制
</button>
</div>
</div>
</div>
<div class="row">
<div class="col-12">
<div class="video-size-control">
<label for="video-size-slider" class="form-label">视频大小调节: <span id="video-size-value">100%</span></label>
<input type="range" class="form-range" id="video-size-slider" min="50" max="150" value="100">
</div>
</div>
</div>
<div class="settings-panel mt-3">
<div class="row">
<div class="col-md-12">
<div class="form-check form-switch mb-3">
<input class="form-check-input" type="checkbox" id="use-stun">
<label class="form-check-label" for="use-stun">使用STUN服务器</label>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- Right-side interaction panel -->
<div class="col-lg-4">
<div class="card">
<div class="card-header">
<ul class="nav nav-tabs card-header-tabs" id="interaction-tabs" role="tablist">
<li class="nav-item" role="presentation">
<button class="nav-link active" id="chat-tab" data-bs-toggle="tab" data-bs-target="#chat" type="button" role="tab" aria-controls="chat" aria-selected="true">对话模式</button>
</li>
<li class="nav-item" role="presentation">
<button class="nav-link" id="tts-tab" data-bs-toggle="tab" data-bs-target="#tts" type="button" role="tab" aria-controls="tts" aria-selected="false">朗读模式</button>
</li>
</ul>
</div>
<div class="card-body">
<div class="tab-content" id="interaction-tabs-content">
<!-- Chat mode -->
<div class="tab-pane fade show active" id="chat" role="tabpanel" aria-labelledby="chat-tab">
<div class="asr-container mb-3" id="chat-messages">
<div class="asr-text system-message">
系统: 欢迎使用livetalking,请点击"开始连接"按钮开始对话。
</div>
</div>
<form id="chat-form">
<div class="input-group mb-3">
<textarea class="form-control" id="chat-message" rows="3" placeholder="输入您想对数字人说的话..."></textarea>
<button class="btn btn-primary" type="submit">
<i class="bi bi-send"></i> 发送
</button>
</div>
</form>
<!-- Push-to-talk button -->
<div class="voice-record-btn" id="voice-record-btn">
<i class="bi bi-mic-fill"></i>
</div>
<div class="voice-record-label">按住说话,松开发送</div>
</div>
<!-- Read-aloud mode -->
<div class="tab-pane fade" id="tts" role="tabpanel" aria-labelledby="tts-tab">
<form id="echo-form">
<div class="mb-3">
<label for="message" class="form-label">输入要朗读的文本</label>
<textarea class="form-control" id="message" rows="6" placeholder="输入您想让数字人朗读的文字..."></textarea>
</div>
<button type="submit" class="btn btn-primary w-100">
<i class="bi bi-volume-up"></i> 朗读文本
</button>
</form>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="footer">
<p>Made with ❤️ by Marstaos | Frontend & Performance Optimization</p>
</div>
</div>
<!-- Hidden session ID -->
<input type="hidden" id="sessionid" value="0">
<script src="client.js"></script>
<script src="srs.sdk.js"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<script>
$(document).ready(function() {
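// Video size slider: scale the <video> element between 50% and 150% of its container width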
$('#video-size-slider').on('input', function() {
const value = $(this).val();
$('#video-size-value').text(value + '%');
$('#video').css('width', value + '%');
});
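// Update the connection status indicator dot and label text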
function updateConnectionStatus(status) {
const statusIndicator = $('#connection-status');
const statusText = $('#status-text');
statusIndicator.removeClass('status-connected status-disconnected status-connecting');
switch(status) {
case 'connected':
statusIndicator.addClass('status-connected');
statusText.text('已连接');
break;
case 'connecting':
statusIndicator.addClass('status-connecting');
statusText.text('连接中...');
break;
case 'disconnected':
default:
statusIndicator.addClass('status-disconnected');
statusText.text('未连接');
break;
}
}
// Append a chat message to the chat panel
function addChatMessage(message, type = 'user') {
const messagesContainer = $('#chat-messages');
const messageClass = type === 'user' ? 'user-message' : 'system-message';
const sender = type === 'user' ? '您' : '数字人';
// Build the element with .text() so user input is not interpreted as HTML
const messageElement = $('<div>')
.addClass('asr-text ' + messageClass)
.text(sender + ': ' + message);
messagesContainer.append(messageElement);
messagesContainer.scrollTop(messagesContainer[0].scrollHeight);
}
// Start / stop connection buttons
$('#start').click(function() {
updateConnectionStatus('connecting');
start();
$(this).hide();
$('#stop').show();
// Poll periodically to check whether the video stream has started playing
let connectionCheckTimer = setInterval(function() {
const video = document.getElementById('video');
// The video has data once readyState >= 3 (HAVE_FUTURE_DATA) and frames have a non-zero width
if (video.readyState >= 3 && video.videoWidth > 0) {
updateConnectionStatus('connected');
clearInterval(connectionCheckTimer);
}
}, 2000); // Poll every 2 seconds
// Stop polling after 60 seconds if the stream still has not started
setTimeout(function() {
if (connectionCheckTimer) {
clearInterval(connectionCheckTimer);
}
}, 60000);
});
$('#stop').click(function() {
stop();
$(this).hide();
$('#start').show();
updateConnectionStatus('disconnected');
});
// Recording controls: POST to /record with type 'start_record' / 'end_record' and the current sessionid
$('#btn_start_record').click(function() {
console.log('Starting recording...');
fetch('/record', {
body: JSON.stringify({
type: 'start_record',
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).then(function(response) {
if (response.ok) {
console.log('Recording started.');
$('#btn_start_record').prop('disabled', true);
$('#btn_stop_record').prop('disabled', false);
$('#recording-indicator').addClass('active');
} else {
console.error('Failed to start recording.');
}
}).catch(function(error) {
console.error('Error:', error);
});
});
$('#btn_stop_record').click(function() {
console.log('Stopping recording...');
fetch('/record', {
body: JSON.stringify({
type: 'end_record',
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).then(function(response) {
if (response.ok) {
console.log('Recording stopped.');
$('#btn_start_record').prop('disabled', false);
$('#btn_stop_record').prop('disabled', true);
$('#recording-indicator').removeClass('active');
} else {
console.error('Failed to stop recording.');
}
}).catch(function(error) {
console.error('Error:', error);
});
});
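// Read-aloud mode: send the text to /human with type 'echo' so the avatar reads it aloud, interrupting any current speech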
$('#echo-form').on('submit', function(e) {
e.preventDefault();
var message = $('#message').val();
if (!message.trim()) return;
console.log('Sending echo message:', message);
fetch('/human', {
body: JSON.stringify({
text: message,
type: 'echo',
interrupt: true,
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).catch(function(error) {
console.error('Error:', error);
});
$('#message').val('');
addChatMessage(`已发送朗读请求: "${message}"`, 'system');
});
// Chat mode: send the text to /human with type 'chat', interrupting any current speech
$('#chat-form').on('submit', function(e) {
e.preventDefault();
var message = $('#chat-message').val();
if (!message.trim()) return;
console.log('Sending chat message:', message);
fetch('/human', {
body: JSON.stringify({
text: message,
type: 'chat',
interrupt: true,
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).catch(function(error) {
console.error('Error:', error);
});
addChatMessage(message, 'user');
$('#chat-message').val('');
});
// Push-to-talk: hold to speak, release to send
let mediaRecorder;
let audioChunks = [];
let isRecording = false;
let recognition;
// Check whether the browser supports the Web Speech API (SpeechRecognition / webkitSpeechRecognition)
const isSpeechRecognitionSupported = 'webkitSpeechRecognition' in window || 'SpeechRecognition' in window;
if (isSpeechRecognitionSupported) {
recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = 'zh-CN';
recognition.onresult = function(event) {
let interimTranscript = '';
let finalTranscript = '';
for (let i = event.resultIndex; i < event.results.length; ++i) {
if (event.results[i].isFinal) {
finalTranscript += event.results[i][0].transcript;
} else {
interimTranscript += event.results[i][0].transcript;
$('#chat-message').val(interimTranscript);
}
}
if (finalTranscript) {
$('#chat-message').val(finalTranscript);
}
};
recognition.onerror = function(event) {
console.error('Speech recognition error:', event.error);
};
}
// Push-to-talk button: start recording on press, stop on release or when the pointer leaves
$('#voice-record-btn').on('mousedown touchstart', function(e) {
e.preventDefault();
startRecording();
}).on('mouseup mouseleave touchend', function() {
if (isRecording) {
stopRecording();
}
});
// Start capturing microphone audio and begin speech recognition
function startRecording() {
if (isRecording) return;
navigator.mediaDevices.getUserMedia({ audio: true })
.then(function(stream) {
audioChunks = [];
mediaRecorder = new MediaRecorder(stream);
mediaRecorder.ondataavailable = function(e) {
if (e.data.size > 0) {
audioChunks.push(e.data);
}
};
mediaRecorder.start();
isRecording = true;
$('#voice-record-btn').addClass('recording-pulse');
$('#voice-record-btn').css('background-color', '#dc3545');
if (recognition) {
recognition.start();
}
})
.catch(function(error) {
console.error('Unable to access the microphone:', error);
alert('无法访问麦克风,请检查浏览器权限设置。');
});
}
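// Stop recording: release the microphone, stop speech recognition, then send the recognized text to /human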
function stopRecording() {
if (!isRecording) return;
mediaRecorder.stop();
isRecording = false;
// Stop all audio tracks to release the microphone
mediaRecorder.stream.getTracks().forEach(track => track.stop());
// Restore the button's visual state
$('#voice-record-btn').removeClass('recording-pulse');
$('#voice-record-btn').css('background-color', '');
// Stop speech recognition
if (recognition) {
recognition.stop();
}
// Wait briefly for the final recognition result, then send it
setTimeout(function() {
const recognizedText = $('#chat-message').val().trim();
if (recognizedText) {
// Send the recognized text as a chat message
fetch('/human', {
body: JSON.stringify({
text: recognizedText,
type: 'chat',
interrupt: true,
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).catch(function(error) {
console.error('Error:', error);
});
addChatMessage(recognizedText, 'user');
$('#chat-message').val('');
}
}, 500);
}
// Wrap any existing WebRTC connection callback so the status indicator is also updated
if (typeof window.onWebRTCConnected === 'function') {
const originalOnConnected = window.onWebRTCConnected;
window.onWebRTCConnected = function() {
updateConnectionStatus('connected');
if (originalOnConnected) originalOnConnected();
};
} else {
window.onWebRTCConnected = function() {
updateConnectionStatus('connected');
};
}
// Likewise update the status when the connection drops
if (typeof window.onWebRTCDisconnected === 'function') {
const originalOnDisconnected = window.onWebRTCDisconnected;
window.onWebRTCDisconnected = function() {
updateConnectionStatus('disconnected');
if (originalOnDisconnected) originalOnDisconnected();
};
} else {
window.onWebRTCDisconnected = function() {
updateConnectionStatus('disconnected');
};
}
// SRS WebRTC (WHEP) playback
var sdk = null; // Global handle so the previous session can be cleaned up when replaying
function startPlay() {
// Close any previous connection
if (sdk) {
sdk.close();
}
sdk = new SrsRtcWhipWhepAsync();
$('#video').prop('srcObject', sdk.stream);
var host = window.location.hostname;
var url = "http://" + host + ":1985/rtc/v1/whep/?app=live&stream=livestream";
sdk.play(url).then(function(session) {
console.log('WebRTC playback started, session id:', session.sessionid);
}).catch(function(reason) {
sdk.close();
console.error('WebRTC playback failed:', reason);
});
}
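// Note: startPlay() is defined inside this ready() closure and is not wired to any button here; it can be used when the stream is published through an SRS server (WHEP pull on port 1985) instead of the default WebRTC path.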
});
</script>
</body>
</html>