冯杨

Synced with the official GitHub upstream, up to commits on Apr 18, 2025

a9c36c76e569107b5a39b3de8afd6e016b24d662
... ... @@ -15,4 +15,8 @@ pretrained
*.mp4
.DS_Store
workspace/log_ngp.txt
.idea
\ No newline at end of file
.idea
models/
*.log
dist
\ No newline at end of file
... ...
Real-time interactive streaming digital human with synchronized audio and video dialogue. It can basically achieve commercial-grade results.
[Effect of wav2lip](https://www.bilibili.com/video/BV1scwBeyELA/) | [Effect of ernerf](https://www.bilibili.com/video/BV1G1421z73r/) | [Effect of musetalk](https://www.bilibili.com/video/BV1gm421N7vQ/)
## News
- December 8, 2024: Improved multi-concurrency; GPU memory no longer increases with the number of concurrent connections.
- December 21, 2024: Added model warm-up for wav2lip and musetalk to solve the problem of stuttering during the first inference. Thanks to [@heimaojinzhangyz](https://github.com/heimaojinzhangyz)
- December 28, 2024: Added the digital human model Ultralight-Digital-Human. Thanks to [@lijihua2017](https://github.com/lijihua2017)
- February 7, 2025: Added fish-speech tts
- February 21, 2025: Added the open-source model wav2lip256. Thanks to @不蠢不蠢
- March 2, 2025: Added Tencent's speech synthesis service
- March 16, 2025: Supports Mac GPU inference. Thanks to [@GcsSloop](https://github.com/GcsSloop)
## Features
1. Supports multiple digital human models: ernerf, musetalk, wav2lip, Ultralight-Digital-Human
2. Supports voice cloning
3. Supports interrupting the digital human while it is speaking
4. Supports full-body video stitching
5. Supports rtmp and webrtc
6. Supports video arrangement: Play custom videos when not speaking
7. Supports multi-concurrency
## 1. Installation
Tested on Ubuntu 20.04, Python 3.10, PyTorch 1.12 and CUDA 11.3
### 1.1 Install dependency
```bash
conda create -n nerfstream python=3.10
conda activate nerfstream
# If the cuda version is not 11.3 (confirm the version by running nvidia-smi), install the corresponding version of pytorch according to <https://pytorch.org/get-started/previous-versions/>
conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
# If you need to train the ernerf model, install the following libraries
# pip install "git+https://github.com/facebookresearch/pytorch3d.git"
# pip install tensorflow-gpu==2.8.0
# pip install --upgrade "protobuf<=3.20.1"
```
Common installation issues: [FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html)
For setting up the Linux CUDA environment, you can refer to this article: https://zhuanlan.zhihu.com/p/674972886
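As a quick sanity check after installation (this snippet is an addition to this write-up, not part of the upstream README), you can confirm that the installed PyTorch build actually sees a GPU before moving on:
```python
# Sanity check (assumed helper, not from the upstream README):
# print the torch version and whether CUDA or Apple MPS is visible.
import torch

print("torch version:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("mps available:", hasattr(torch.backends, "mps") and torch.backends.mps.is_available())
```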
## 2. Quick Start
- Download the models
Quark Cloud Disk <https://pan.quark.cn/s/83a750323ef0>
Google Drive <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing>
Copy wav2lip256.pth to the models folder of this project and rename it to wav2lip.pth;
Extract wav2lip256_avatar1.tar.gz and copy the entire folder to the data/avatars folder of this project.
- Run
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1
Open http://serverip:8010/webrtcapi.html in a browser. First click 'start' to play the digital human video; then enter any text in the text box and submit it. The digital human will broadcast this text (see the HTTP example at the end of this section).
<font color=red>The server side needs to open ports tcp:8010; udp:1-65536</font>
If you need to purchase a high-definition wav2lip model for commercial use, see this [link](https://livetalking-doc.readthedocs.io/zh-cn/latest/service.html#wav2lip).
- Quick experience
<https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> Create an instance from this image and it will run out of the box.
If you cannot access Hugging Face, set the following before running:
```
export HF_ENDPOINT=https://hf-mirror.com
```
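Besides the web page, the broadcast from the Run step above can also be triggered over HTTP. The following is a minimal sketch, assuming the `/human` endpoint and the JSON fields (`text`, `type`, `interrupt`, `sessionid`) that the bundled dashboard page posts; adjust host, port and session id to your deployment.
```python
# Minimal sketch: ask the running server to read a sentence aloud.
# The endpoint path and field names mirror the bundled web frontend.
import requests

resp = requests.post(
    "http://serverip:8010/human",
    json={
        "text": "Hello, this is a test broadcast.",
        "type": "echo",      # 'echo' = read the text aloud, as the dashboard's read-aloud mode does
        "interrupt": True,   # interrupt any ongoing speech first
        "sessionid": 0,      # default session id used by the web page
    },
    timeout=10,
)
print(resp.status_code, resp.text)
```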
## 3. More Usage
Usage instructions: <https://livetalking-doc.readthedocs.io/en/latest>
## 4. Docker Run
The previous installation steps are not needed; just run it directly.
```
docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v
```
The code is in /root/metahuman-stream. First run git pull to get the latest code, then execute the commands as in steps 2 and 3.
The following images are provided:
- autodl image: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>
[autodl Tutorial](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
- ucloud image: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>
Any port can be opened, and there is no need to additionally deploy an srs service.
[ucloud Tutorial](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
## 5. TODO
- [x] Added chatgpt to enable digital human dialogue
- [x] Voice cloning
- [x] Replace the digital human with a video when it is silent
- [x] MuseTalk
- [x] Wav2Lip
- [x] Ultralight-Digital-Human
---
If this project is helpful to you, please give it a star. Friends who are interested are also welcome to join in and improve this project together.
* Knowledge Planet: https://t.zsxq.com/7NMyO, which accumulates high-quality FAQs, best-practice experience, and solutions to common problems.
* WeChat Official Account: Digital Human Technology
![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&amp;from=appmsg)
\ No newline at end of file
... ...
[English](./README-EN.md) | Chinese
Real-time interactive streaming digital human with synchronized audio and video dialogue. It can basically achieve commercial-grade results.
[Effect of wav2lip](https://www.bilibili.com/video/BV1scwBeyELA/) | [Effect of ernerf](https://www.bilibili.com/video/BV1G1421z73r/) | [Effect of musetalk](https://www.bilibili.com/video/BV1gm421N7vQ/)
## To avoid confusion with 3D digital humans, the original project metahuman-stream has been renamed to livetalking; the original links remain valid
## News
- 2024.12.8 Improved multi-concurrency; GPU memory no longer increases with the number of concurrent connections
- 2024.12.21 Added model warm-up for wav2lip and musetalk to fix stuttering on the first inference. Thanks to [@heimaojinzhangyz](https://github.com/heimaojinzhangyz)
- 2024.12.28 Added the digital human model Ultralight-Digital-Human. Thanks to [@lijihua2017](https://github.com/lijihua2017)
- 2025.2.7 Added fish-speech tts
- 2025.2.21 Added the open-source wav2lip256 model. Thanks to @不蠢不蠢
- 2025.3.2 Added Tencent's speech synthesis service
- 2025.3.16 Supports Mac GPU inference. Thanks to [@GcsSloop](https://github.com/GcsSloop)
## Features
1. Supports multiple digital human models: ernerf, musetalk, wav2lip, Ultralight-Digital-Human
2. Supports voice cloning
3. Supports interrupting the digital human while it is speaking
4. Supports full-body video stitching
5. Supports rtmp and webrtc
6. Supports video arrangement: play custom videos when not speaking
7. Supports multi-concurrency
... ... @@ -33,67 +31,61 @@ Tested on Ubuntu 20.04, Python3.10, Pytorch 1.12 and CUDA 11.3
```bash
conda create -n nerfstream python=3.10
conda activate nerfstream
# If the cuda version is not 11.3 (check with nvidia-smi), install the matching pytorch build according to <https://pytorch.org/get-started/previous-versions/>
conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
# If you need to train the ernerf model, install the following libraries
# pip install "git+https://github.com/facebookresearch/pytorch3d.git"
# pip install tensorflow-gpu==2.8.0
# pip install --upgrade "protobuf<=3.20.1"
```
Common installation issues: [FAQ](https://livetalking-doc.readthedocs.io/en/latest/faq.html)
For setting up the Linux CUDA environment, you can refer to this article: https://zhuanlan.zhihu.com/p/674972886
## 2. Quick Start
- Download the models
Baidu Cloud Disk <https://pan.baidu.com/s/1yOsQ06-RIDTJd3HFCw4wtA> password: ltua
Quark Cloud Disk <https://pan.quark.cn/s/83a750323ef0>
Google Drive <https://drive.google.com/drive/folders/1FOC_MD6wdogyyX_7V1d4NDIO7P9NlSAJ?usp=sharing>
Copy wav2lip256.pth to the models folder of this project and rename it to wav2lip.pth;
Extract wav2lip256_avatar1.tar.gz and copy the entire folder to the data/avatars folder of this project.
- Run
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar1 --preload 2
To start avatar No. 3 on the GPU: python app.py --transport webrtc --model wav2lip --avatar_id wav2lip256_avatar3 --preload 2
Open http://serverip:8010/webrtcapi.html in a browser. First click 'start' to play the digital human video; then enter any text in the text box and submit it. The digital human will broadcast this text.
<font color=red>The server side needs to open ports tcp:8010; udp:1-65536 </font>
If you need a high-definition wav2lip model for commercial use, see this [link](https://livetalking-doc.readthedocs.io/zh-cn/latest/service.html#wav2lip)
- Quick experience
<https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_GitHub_livetalking1.3> Create an instance from this image and it will run out of the box.
If you cannot access Hugging Face, set the following before running:
```
export HF_ENDPOINT=https://hf-mirror.com
```
## 3. More Usage
Usage instructions: <https://livetalking-doc.readthedocs.io/>
## 4. Docker Run
The previous installation steps are not needed; just run it directly.
```
docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/codewithgpu2/lipku-metahuman-stream:2K9qaMBu8v
```
The code is in /root/metahuman-stream. First run git pull to get the latest code, then execute the commands as in steps 2 and 3.
The following images are provided:
- autodl image: <https://www.codewithgpu.com/i/lipku/metahuman-stream/base>
[autodl tutorial](https://livetalking-doc.readthedocs.io/en/latest/autodl/README.html)
- ucloud image: <https://www.compshare.cn/images-detail?ImageID=compshareImage-18tpjhhxoq3j&referral_code=3XW3852OBmnD089hMMrtuU&ytag=GPU_livetalking1.3>
Any port can be opened, and there is no need to additionally deploy an srs service.
[ucloud tutorial](https://livetalking-doc.readthedocs.io/en/latest/ucloud/ucloud.html)
## 5. TODO
- [x] Added chatgpt to enable digital human dialogue
- [x] Voice cloning
- [x] Replace the digital human with a video when it is silent
- [x] MuseTalk
... ... @@ -101,9 +93,8 @@ docker run --gpus all -it --network=host --rm registry.cn-beijing.aliyuncs.com/c
- [x] Ultralight-Digital-Human
---
If this project is helpful to you, please give it a star. Friends who are interested are also welcome to join in and improve this project together.
- Knowledge Planet (知识星球): https://t.zsxq.com/7NMyO, which accumulates high-quality FAQs, best-practice experience, and solutions to common problems
- WeChat Official Account: 数字人技术 (Digital Human Technology)
![](https://mmbiz.qpic.cn/sz_mmbiz_jpg/l3ZibgueFiaeyfaiaLZGuMGQXnhLWxibpJUS2gfs8Dje6JuMY8zu2tVyU9n8Zx1yaNncvKHBMibX0ocehoITy5qQEZg/640?wxfrom=12&tp=wxpic&usePicPrefetch=1&wx_fmt=jpeg&from=appmsg)
... ...
... ... @@ -201,7 +201,7 @@ async def set_audiotype(request):
params = await request.json()
sessionid = params.get('sessionid',0)
nerfreals[sessionid].set_curr_state(params['audiotype'],params['reinit'])
nerfreals[sessionid].set_custom_state(params['audiotype'],params['reinit'])
return web.Response(
content_type="application/json",
... ... @@ -495,6 +495,8 @@ if __name__ == '__main__':
elif opt.transport=='rtcpush':
pagename='rtcpushapi.html'
logger.info('start http server; http://<serverip>:'+str(opt.listenport)+'/'+pagename)
logger.info('如果使用webrtc,推荐访问webrtc集成前端: http://<serverip>:'+str(opt.listenport)+'/dashboard.html')
def run_server(runner):
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
... ...
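For reference, the `set_audiotype` handler shown above keeps the same request shape after the rename from `set_curr_state` to `set_custom_state`. Below is a hedged sketch of driving it over HTTP; the URL path is assumed from the handler name and `audiotype=2` is only an example value for a previously configured custom video (check your route table and custom-video config).
```python
# Sketch: switch a session to a custom "not speaking" video state.
# The URL path and example audiotype value are assumptions; sessionid/audiotype/reinit
# are the fields read by the handler in the hunk above.
import requests

requests.post(
    "http://serverip:8010/set_audiotype",
    json={"sessionid": 0, "audiotype": 2, "reinit": True},
    timeout=10,
)
```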
... ... @@ -35,7 +35,7 @@ import soundfile as sf
import av
from fractions import Fraction
from ttsreal import EdgeTTS,VoitsTTS,XTTS,CosyVoiceTTS,FishTTS,TencentTTS
from ttsreal import EdgeTTS,SovitsTTS,XTTS,CosyVoiceTTS,FishTTS,TencentTTS
from logger import logger
from tqdm import tqdm
... ... @@ -57,7 +57,7 @@ class BaseReal:
if opt.tts == "edgetts":
self.tts = EdgeTTS(opt,self)
elif opt.tts == "gpt-sovits":
self.tts = VoitsTTS(opt,self)
self.tts = SovitsTTS(opt,self)
elif opt.tts == "xtts":
self.tts = XTTS(opt,self)
elif opt.tts == "cosyvoice":
... ... @@ -66,7 +66,7 @@ class BaseReal:
self.tts = FishTTS(opt,self)
elif opt.tts == "tencent":
self.tts = TencentTTS(opt,self)
self.speaking = False
self.recording = False
... ... @@ -84,11 +84,11 @@ class BaseReal:
def put_msg_txt(self,msg,eventpoint=None):
self.tts.put_msg_txt(msg,eventpoint)
def put_audio_frame(self,audio_chunk,eventpoint=None): #16khz 20ms pcm
self.asr.put_audio_frame(audio_chunk,eventpoint)
def put_audio_file(self,filebyte):
def put_audio_file(self,filebyte):
input_stream = BytesIO(filebyte)
stream = self.__create_bytes_stream(input_stream)
streamlen = stream.shape[0]
... ... @@ -97,7 +97,7 @@ class BaseReal:
self.put_audio_frame(stream[idx:idx+self.chunk])
streamlen -= self.chunk
idx += self.chunk
def __create_bytes_stream(self,byte_stream):
#byte_stream=BytesIO(buffer)
stream, sample_rate = sf.read(byte_stream) # [T*sample_rate,] float64
... ... @@ -107,7 +107,7 @@ class BaseReal:
if stream.ndim > 1:
logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
stream = stream[:, 0]
if sample_rate != self.sample_rate and stream.shape[0]>0:
logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
... ... @@ -120,7 +120,7 @@ class BaseReal:
def is_speaking(self)->bool:
return self.speaking
def __loadcustom(self):
for item in self.opt.customopt:
logger.info(item)
... ... @@ -155,9 +155,9 @@ class BaseReal:
'-s', "{}x{}".format(self.width, self.height),
'-r', str(25),
'-i', '-',
'-pix_fmt', 'yuv420p',
'-pix_fmt', 'yuv420p',
'-vcodec', "h264",
#'-f' , 'flv',
#'-f' , 'flv',
f'temp{self.opt.sessionid}.mp4']
self._record_video_pipe = subprocess.Popen(command, shell=False, stdin=subprocess.PIPE)
... ... @@ -169,7 +169,7 @@ class BaseReal:
'-ar', '16000',
'-i', '-',
'-acodec', 'aac',
#'-f' , 'wav',
#'-f' , 'wav',
f'temp{self.opt.sessionid}.aac']
self._record_audio_pipe = subprocess.Popen(acommand, shell=False, stdin=subprocess.PIPE)
... ... @@ -177,10 +177,10 @@ class BaseReal:
# self.recordq_video.queue.clear()
# self.recordq_audio.queue.clear()
# self.container = av.open(path, mode="w")
# process_thread = Thread(target=self.record_frame, args=())
# process_thread.start()
def record_video_data(self,image):
if self.width == 0:
print("image.shape:",image.shape)
... ... @@ -191,14 +191,14 @@ class BaseReal:
def record_audio_data(self,frame):
if self.recording:
self._record_audio_pipe.stdin.write(frame.tostring())
# def record_frame(self):
# def record_frame(self):
# videostream = self.container.add_stream("libx264", rate=25)
# videostream.codec_context.time_base = Fraction(1, 25)
# audiostream = self.container.add_stream("aac")
# audiostream.codec_context.time_base = Fraction(1, 16000)
# init = True
# framenum = 0
# framenum = 0
# while self.recording:
# try:
# videoframe = self.recordq_video.get(block=True, timeout=1)
... ... @@ -231,18 +231,18 @@ class BaseReal:
# self.recordq_video.queue.clear()
# self.recordq_audio.queue.clear()
# print('record thread stop')
def stop_recording(self):
"""停止录制视频"""
if not self.recording:
return
self.recording = False
self._record_video_pipe.stdin.close() #wait()
self.recording = False
self._record_video_pipe.stdin.close() #wait()
self._record_video_pipe.wait()
self._record_audio_pipe.stdin.close()
self._record_audio_pipe.wait()
cmd_combine_audio = f"ffmpeg -y -i temp{self.opt.sessionid}.aac -i temp{self.opt.sessionid}.mp4 -c:v copy -c:a copy data/record.mp4"
os.system(cmd_combine_audio)
os.system(cmd_combine_audio)
#os.remove(output_path)
def mirror_index(self,size, index):
... ... @@ -252,8 +252,8 @@ class BaseReal:
if turn % 2 == 0:
return res
else:
return size - res - 1
return size - res - 1
def get_audio_stream(self,audiotype):
idx = self.custom_audio_index[audiotype]
stream = self.custom_audio_cycle[audiotype][idx:idx+self.chunk]
... ... @@ -261,9 +261,9 @@ class BaseReal:
if self.custom_audio_index[audiotype]>=self.custom_audio_cycle[audiotype].shape[0]:
self.curr_state = 1 #the current video does not loop; switch to the silent state
return stream
def set_curr_state(self,audiotype, reinit):
print('set_curr_state:',audiotype)
def set_custom_state(self,audiotype, reinit=True):
print('set_custom_state:',audiotype)
self.curr_state = audiotype
if reinit:
self.custom_audio_index[audiotype] = 0
... ...
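The `mirror_index` helper in this file (and again in `musereal.py`) is what makes looping avatar frames play forward and then backward instead of snapping back to the first frame. A standalone sketch of the same logic, with the resulting index sequence as an example:
```python
# Standalone sketch of BaseReal.mirror_index: ping-pong through a list of `size` frames.
def mirror_index(size: int, index: int) -> int:
    turn, res = divmod(index, size)
    return res if turn % 2 == 0 else size - res - 1

# For size=3 the indices go 0,1,2,2,1,0,0,1,... instead of 0,1,2,0,1,2,...
assert [mirror_index(3, i) for i in range(8)] == [0, 1, 2, 2, 1, 0, 0, 1]
```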
... ... @@ -179,8 +179,11 @@ print(f'[INFO] fitting light...')
batch_size = 32
device_default = torch.device("cuda:0")
device_render = torch.device("cuda:0")
device_default = torch.device("cuda:0" if torch.cuda.is_available() else (
"mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
device_render = torch.device("cuda:0" if torch.cuda.is_available() else (
"mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
renderer = Render_3DMM(arg_focal, h, w, batch_size, device_render)
sel_ids = np.arange(0, num_frames, int(num_frames / batch_size))[:batch_size]
... ...
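The same `cuda → mps → cpu` fallback expression is inlined at each call site throughout this change set. Functionally it is equivalent to a small helper like the following sketch (the helper itself is not part of the diff):
```python
# Equivalent device-selection helper (assumed refactoring, not in the upstream code):
# prefer CUDA, then Apple-Silicon MPS, then CPU.
import torch

def pick_device(gpu_index: int = 0) -> torch.device:
    if torch.cuda.is_available():
        return torch.device(f"cuda:{gpu_index}")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```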
... ... @@ -83,7 +83,7 @@ class Render_3DMM(nn.Module):
img_h=500,
img_w=500,
batch_size=1,
device=torch.device("cuda:0"),
device=torch.device("cuda:0" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")),
):
super(Render_3DMM, self).__init__()
... ...
... ... @@ -147,7 +147,7 @@ if __name__ == '__main__':
seed_everything(opt.seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
model = NeRFNetwork(opt)
... ...
... ... @@ -442,7 +442,7 @@ class LPIPSMeter:
self.N = 0
self.net = net
self.device = device if device is not None else torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.device = device if device is not None else torch.device('cuda' if torch.cuda.is_available() else ('mps' if hasattr(torch.backends, "mps") and torch.backends.mps.is_available() else 'cpu'))
self.fn = lpips.LPIPS(net=net).eval().to(self.device)
def clear(self):
... ... @@ -456,13 +456,13 @@ class LPIPSMeter:
inp = inp.to(self.device)
outputs.append(inp)
return outputs
def update(self, preds, truths):
preds, truths = self.prepare_inputs(preds, truths) # [B, H, W, 3] --> [B, 3, H, W], range in [0, 1]
v = self.fn(truths, preds, normalize=True).item() # normalize=True: [0, 1] to [-1, 1]
self.V += v
self.N += 1
def measure(self):
return self.V / self.N
... ... @@ -499,7 +499,7 @@ class LMDMeter:
self.V = 0
self.N = 0
def get_landmarks(self, img):
if self.backend == 'dlib':
... ... @@ -515,7 +515,7 @@ class LMDMeter:
else:
lms = self.predictor.get_landmarks(img)[-1]
# self.vis_landmarks(img, lms)
lms = lms.astype(np.float32)
... ... @@ -537,7 +537,7 @@ class LMDMeter:
inp = (inp * 255).astype(np.uint8)
outputs.append(inp)
return outputs
def update(self, preds, truths):
# assert B == 1
preds, truths = self.prepare_inputs(preds[0], truths[0]) # [H, W, 3] numpy array
... ... @@ -553,13 +553,13 @@ class LMDMeter:
# avarage
lms_pred = lms_pred - lms_pred.mean(0)
lms_truth = lms_truth - lms_truth.mean(0)
# distance
dist = np.sqrt(((lms_pred - lms_truth) ** 2).sum(1)).mean(0)
self.V += dist
self.N += 1
def measure(self):
return self.V / self.N
... ... @@ -567,14 +567,14 @@ class LMDMeter:
writer.add_scalar(os.path.join(prefix, f"LMD ({self.backend})"), self.measure(), global_step)
def report(self):
return f'LMD ({self.backend}) = {self.measure():.6f}'
return f'LMD ({self.backend}) = {self.measure():.6f}'
class Trainer(object):
def __init__(self,
def __init__(self,
name, # name of this experiment
opt, # extra conf
model, # network
model, # network
criterion=None, # loss function, if None, assume inline implementation in train_step
optimizer=None, # optimizer
ema_decay=None, # if use EMA, set the decay
... ... @@ -596,7 +596,7 @@ class Trainer(object):
use_tensorboardX=True, # whether to use tensorboard for logging
scheduler_update_every_step=False, # whether to call scheduler.step() after every train step
):
self.name = name
self.opt = opt
self.mute = mute
... ... @@ -618,7 +618,11 @@ class Trainer(object):
self.flip_init_lips = self.opt.init_lips
self.time_stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
self.scheduler_update_every_step = scheduler_update_every_step
self.device = device if device is not None else torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu')
self.device = device if device is not None else torch.device(
f'cuda:{local_rank}' if torch.cuda.is_available() else (
'mps' if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else 'cpu'
)
)
self.console = Console()
model.to(self.device)
... ...
... ... @@ -56,10 +56,8 @@ from ultralight.unet import Model
from ultralight.audio2feature import Audio2Feature
from logger import logger
device = 'cuda' if torch.cuda.is_available() else 'cpu'
logger.info('Using {} for inference.'.format(device))
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
print('Using {} for inference.'.format(device))
def load_model(opt):
audio_processor = Audio2Feature()
... ...
... ... @@ -44,8 +44,8 @@ from basereal import BaseReal
from tqdm import tqdm
from logger import logger
device = 'cuda' if torch.cuda.is_available() else 'cpu'
logger.info('Using {} for inference.'.format(device))
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
print('Using {} for inference.'.format(device))
def _load(checkpoint_path):
if device == 'cuda':
... ...
... ... @@ -51,7 +51,7 @@ from logger import logger
def load_model():
# load model weights
audio_processor,vae, unet, pe = load_all_model()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
timesteps = torch.tensor([0], device=device)
pe = pe.half()
vae.vae = vae.vae.half()
... ... @@ -64,7 +64,7 @@ def load_avatar(avatar_id):
#self.video_path = '' #video_path
#self.bbox_shift = opt.bbox_shift
avatar_path = f"./data/avatars/{avatar_id}"
full_imgs_path = f"{avatar_path}/full_imgs"
full_imgs_path = f"{avatar_path}/full_imgs"
coords_path = f"{avatar_path}/coords.pkl"
latents_out_path= f"{avatar_path}/latents.pt"
video_out_path = f"{avatar_path}/vid_output/"
... ... @@ -74,7 +74,7 @@ def load_avatar(avatar_id):
# self.avatar_info = {
# "avatar_id":self.avatar_id,
# "video_path":self.video_path,
# "bbox_shift":self.bbox_shift
# "bbox_shift":self.bbox_shift
# }
input_latent_list_cycle = torch.load(latents_out_path) #,weights_only=True
... ... @@ -124,19 +124,19 @@ def __mirror_index(size, index):
if turn % 2 == 0:
return res
else:
return size - res - 1
return size - res - 1
@torch.no_grad()
def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,audio_out_queue,res_frame_queue,
vae, unet, pe,timesteps): #vae, unet, pe,timesteps
# vae, unet, pe = load_diffusion_model()
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# timesteps = torch.tensor([0], device=device)
# pe = pe.half()
# vae.vae = vae.vae.half()
# unet.model = unet.model.half()
length = len(input_latent_list_cycle)
index = 0
count=0
... ... @@ -169,7 +169,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
latent = input_latent_list_cycle[idx]
latent_batch.append(latent)
latent_batch = torch.cat(latent_batch, dim=0)
# for i, (whisper_batch,latent_batch) in enumerate(gen):
audio_feature_batch = torch.from_numpy(whisper_batch)
audio_feature_batch = audio_feature_batch.to(device=unet.device,
... ... @@ -179,8 +179,8 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
# print('prepare time:',time.perf_counter()-t)
# t=time.perf_counter()
pred_latents = unet.model(latent_batch,
timesteps,
pred_latents = unet.model(latent_batch,
timesteps,
encoder_hidden_states=audio_feature_batch).sample
# print('unet time:',time.perf_counter()-t)
# t=time.perf_counter()
... ... @@ -203,7 +203,7 @@ def inference(render_event,batch_size,input_latent_list_cycle,audio_feat_queue,a
#self.__pushmedia(res_frame,loop,audio_track,video_track)
res_frame_queue.put((res_frame,__mirror_index(length,index),audio_frames[i*2:i*2+2]))
index = index + 1
#print('total batch time:',time.perf_counter()-starttime)
#print('total batch time:',time.perf_counter()-starttime)
logger.info('musereal inference processor stop')
class MuseReal(BaseReal):
... ... @@ -226,12 +226,12 @@ class MuseReal(BaseReal):
self.asr = MuseASR(opt,self,self.audio_processor)
self.asr.warm_up()
self.render_event = mp.Event()
def __del__(self):
logger.info(f'musereal({self.sessionid}) delete')
def __mirror_index(self, index):
size = len(self.coord_list_cycle)
... ... @@ -240,9 +240,9 @@ class MuseReal(BaseReal):
if turn % 2 == 0:
return res
else:
return size - res - 1
return size - res - 1
def __warm_up(self):
def __warm_up(self):
self.asr.run_step()
whisper_chunks = self.asr.get_next_feat()
whisper_batch = np.stack(whisper_chunks)
... ... @@ -260,30 +260,57 @@ class MuseReal(BaseReal):
audio_feature_batch = self.pe(audio_feature_batch)
latent_batch = latent_batch.to(dtype=self.unet.model.dtype)
pred_latents = self.unet.model(latent_batch,
self.timesteps,
pred_latents = self.unet.model(latent_batch,
self.timesteps,
encoder_hidden_states=audio_feature_batch).sample
recon = self.vae.decode_latents(pred_latents)
def process_frames(self,quit_event,loop=None,audio_track=None,video_track=None):
enable_transition = True  # set to False to disable the transition effect, True to enable it
if enable_transition:
self.last_speaking = False
self.transition_start = time.time()
self.transition_duration = 0.1  # transition duration in seconds
self.last_silent_frame = None  # cached silent frame
self.last_speaking_frame = None  # cached speaking frame
while not quit_event.is_set():
try:
res_frame,idx,audio_frames = self.res_frame_queue.get(block=True, timeout=1)
except queue.Empty:
continue
if audio_frames[0][1]!=0 and audio_frames[1][1]!=0: #all-silent data, just take the full image
if enable_transition:
# detect state changes
current_speaking = not (audio_frames[0][1]!=0 and audio_frames[1][1]!=0)
if current_speaking != self.last_speaking:
logger.info(f"状态切换:{'说话' if self.last_speaking else '静音'} → {'说话' if current_speaking else '静音'}")
self.transition_start = time.time()
self.last_speaking = current_speaking
if audio_frames[0][1]!=0 and audio_frames[1][1]!=0:
self.speaking = False
audiotype = audio_frames[0][1]
if self.custom_index.get(audiotype) is not None: #a custom video is configured
if self.custom_index.get(audiotype) is not None:
mirindex = self.mirror_index(len(self.custom_img_cycle[audiotype]),self.custom_index[audiotype])
combine_frame = self.custom_img_cycle[audiotype][mirindex]
target_frame = self.custom_img_cycle[audiotype][mirindex]
self.custom_index[audiotype] += 1
# if not self.custom_opt[audiotype].loop and self.custom_index[audiotype]>=len(self.custom_img_cycle[audiotype]):
# self.curr_state = 1 #the current video does not loop; switch to the silent state
else:
combine_frame = self.frame_list_cycle[idx]
target_frame = self.frame_list_cycle[idx]
if enable_transition:
# speaking → silent transition
if time.time() - self.transition_start < self.transition_duration and self.last_speaking_frame is not None:
alpha = min(1.0, (time.time() - self.transition_start) / self.transition_duration)
combine_frame = cv2.addWeighted(self.last_speaking_frame, 1-alpha, target_frame, alpha, 0)
else:
combine_frame = target_frame
# cache the silent frame
self.last_silent_frame = combine_frame.copy()
else:
combine_frame = target_frame
else:
self.speaking = True
bbox = self.coord_list_cycle[idx]
... ... @@ -291,20 +318,29 @@ class MuseReal(BaseReal):
x1, y1, x2, y2 = bbox
try:
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
except:
except Exception as e:
logger.warning(f"resize error: {e}")
continue
mask = self.mask_list_cycle[idx]
mask_crop_box = self.mask_coords_list_cycle[idx]
#combine_frame = get_image(ori_frame,res_frame,bbox)
#t=time.perf_counter()
combine_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
#print('blending time:',time.perf_counter()-t)
image = combine_frame #(outputs['image'] * 255).astype(np.uint8)
current_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
if enable_transition:
# silent → speaking transition
if time.time() - self.transition_start < self.transition_duration and self.last_silent_frame is not None:
alpha = min(1.0, (time.time() - self.transition_start) / self.transition_duration)
combine_frame = cv2.addWeighted(self.last_silent_frame, 1-alpha, current_frame, alpha, 0)
else:
combine_frame = current_frame
# cache the speaking frame
self.last_speaking_frame = combine_frame.copy()
else:
combine_frame = current_frame
image = combine_frame
new_frame = VideoFrame.from_ndarray(image, format="bgr24")
asyncio.run_coroutine_threadsafe(video_track._queue.put((new_frame,None)), loop)
self.record_video_data(image)
#self.recordq_video.put(new_frame)
for audio_frame in audio_frames:
frame,type,eventpoint = audio_frame
... ... @@ -312,12 +348,8 @@ class MuseReal(BaseReal):
new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0])
new_frame.planes[0].update(frame.tobytes())
new_frame.sample_rate=16000
# if audio_track._queue.qsize()>10:
# time.sleep(0.1)
asyncio.run_coroutine_threadsafe(audio_track._queue.put((new_frame,eventpoint)), loop)
self.record_audio_data(frame)
#self.notify(eventpoint)
#self.recordq_audio.put(new_frame)
logger.info('musereal process_frames thread stop')
def render(self,quit_event,loop=None,audio_track=None,video_track=None):
... ...
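The transition logic added above crossfades between a cached frame from the previous state and the current target frame during the first `transition_duration` seconds after a speaking/silent switch. Reduced to a standalone sketch (assuming both frames are same-sized BGR uint8 images, as they are in `process_frames`):
```python
# Standalone sketch of the speaking <-> silent crossfade used in process_frames above.
import time
import cv2

def blend_transition(prev_frame, target_frame, transition_start, transition_duration=0.1):
    """Blend prev_frame into target_frame for transition_duration seconds after a state switch."""
    if prev_frame is None:
        return target_frame
    elapsed = time.time() - transition_start
    if elapsed >= transition_duration:
        return target_frame
    alpha = min(1.0, elapsed / transition_duration)
    # weighted sum: starts close to prev_frame (alpha ~ 0) and ends at target_frame (alpha ~ 1)
    return cv2.addWeighted(prev_frame, 1 - alpha, target_frame, alpha, 0)
```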
... ... @@ -36,7 +36,7 @@ class UNet():
unet_config = json.load(f)
self.model = UNet2DConditionModel(**unet_config)
self.pe = PositionalEncoding(d_model=384)
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device)
self.model.load_state_dict(weights)
if use_float16:
... ...
... ... @@ -23,7 +23,7 @@ class VAE():
self.model_path = model_path
self.vae = AutoencoderKL.from_pretrained(self.model_path)
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
self.vae.to(self.device)
if use_float16:
... ...
... ... @@ -325,7 +325,7 @@ def create_musetalk_human(file, avatar_id):
# initialize the mmpose model
device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
fa = FaceAlignment(1, flip_input=False, device=device)
config_file = os.path.join(current_dir, 'utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py')
checkpoint_file = os.path.abspath(os.path.join(current_dir, '../models/dwpose/dw-ll_ucoco_384.pth'))
... ...
... ... @@ -13,14 +13,14 @@ import torch
from tqdm import tqdm
# initialize the mmpose model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
config_file = './musetalk/utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py'
checkpoint_file = './models/dwpose/dw-ll_ucoco_384.pth'
model = init_model(config_file, checkpoint_file, device=device)
# initialize the face detection model
device = "cuda" if torch.cuda.is_available() else "cpu"
fa = FaceAlignment(LandmarksType._2D, flip_input=False,device=device)
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
fa = FaceAlignment(LandmarksType._2D, flip_input=False, device=device)
# maker if the bbox is not sufficient
coord_placeholder = (0.0,0.0,0.0,0.0)
... ...
... ... @@ -91,7 +91,7 @@ def load_model(name: str, device: Optional[Union[str, torch.device]] = None, dow
"""
if device is None:
device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
if download_root is None:
download_root = os.getenv(
"XDG_CACHE_HOME",
... ...
... ... @@ -78,17 +78,19 @@ def transcribe(
if dtype == torch.float16:
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
dtype = torch.float32
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
warnings.warn("Performing inference on CPU when MPS is available")
if dtype == torch.float32:
decode_options["fp16"] = False
mel = log_mel_spectrogram(audio)
all_segments = []
def add_segment(
*, start: float, end: float, encoder_embeddings
):
all_segments.append(
{
"start": start,
... ... @@ -100,20 +102,20 @@ def transcribe(
num_frames = mel.shape[-1]
seek = 0
previous_seek_value = seek
sample_skip = 3000 #
sample_skip = 3000 #
with tqdm.tqdm(total=num_frames, unit='frames', disable=verbose is not False) as pbar:
while seek < num_frames:
# seek is the starting frame index
end_seek = min(seek + sample_skip, num_frames)
segment = pad_or_trim(mel[:,seek:seek+sample_skip], N_FRAMES).to(model.device).to(dtype)
single = segment.ndim == 2
if single:
segment = segment.unsqueeze(0)
if dtype == torch.float16:
segment = segment.half()
audio_features, embeddings = model.encoder(segment, include_embeddings = True)
encoder_embeddings = embeddings
#print(f"encoder_embeddings shape {encoder_embeddings.shape}")
add_segment(
... ... @@ -124,7 +126,7 @@ def transcribe(
encoder_embeddings=encoder_embeddings,
)
seek+=sample_skip
return dict(segments=all_segments)
... ... @@ -135,7 +137,7 @@ def cli():
parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use")
parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "mps", help="device to use for PyTorch inference")
parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")
... ...
... ... @@ -30,7 +30,7 @@ class NerfASR(BaseASR):
def __init__(self, opt, parent, audio_processor,audio_model):
super().__init__(opt,parent)
self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
self.device = "cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu")
if 'esperanto' in self.opt.asr_model:
self.audio_dim = 44
elif 'deepspeech' in self.opt.asr_model:
... ...
... ... @@ -77,7 +77,7 @@ def load_model(opt):
seed_everything(opt.seed)
logger.info(opt)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cuda' if torch.cuda.is_available() else ('mps' if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else 'cpu'))
model = NeRFNetwork(opt)
criterion = torch.nn.MSELoss(reduction='none')
... ...
... ... @@ -90,7 +90,7 @@ class BaseTTS:
###########################################################################################
class EdgeTTS(BaseTTS):
def txt_to_audio(self,msg):
voicename = "zh-CN-XiaoxiaoNeural"
voicename = "zh-CN-YunxiaNeural"
text,textevent = msg
t = time.time()
asyncio.new_event_loop().run_until_complete(self.__main(voicename,text))
... ... @@ -98,7 +98,7 @@ class EdgeTTS(BaseTTS):
if self.input_stream.getbuffer().nbytes<=0: #edgetts err
logger.error('edgetts err!!!!!')
return
self.input_stream.seek(0)
stream = self.__create_bytes_stream(self.input_stream)
streamlen = stream.shape[0]
... ... @@ -107,15 +107,15 @@ class EdgeTTS(BaseTTS):
eventpoint=None
streamlen -= self.chunk
if idx==0:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
elif streamlen<self.chunk:
eventpoint={'status':'end','text':text,'msgenvent':textevent}
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
idx += self.chunk
#if streamlen>0: #skip last frame(not 20ms)
# self.queue.put(stream[idx:])
self.input_stream.seek(0)
self.input_stream.truncate()
self.input_stream.truncate()
def __create_bytes_stream(self,byte_stream):
#byte_stream=BytesIO(buffer)
... ... @@ -126,13 +126,13 @@ class EdgeTTS(BaseTTS):
if stream.ndim > 1:
logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
stream = stream[:, 0]
if sample_rate != self.sample_rate and stream.shape[0]>0:
logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
return stream
async def __main(self,voicename: str, text: str):
try:
communicate = edge_tts.Communicate(text, voicename)
... ... @@ -153,12 +153,12 @@ class EdgeTTS(BaseTTS):
###########################################################################################
class FishTTS(BaseTTS):
def txt_to_audio(self,msg):
def txt_to_audio(self,msg):
text,textevent = msg
self.stream_tts(
self.fish_speech(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -190,9 +190,9 @@ class FishTTS(BaseTTS):
if res.status_code != 200:
logger.error("Error:%s", res.text)
return
first = True
for chunk in res.iter_content(chunk_size=17640): # 1764 44100*20ms*2
#print('chunk len:',len(chunk))
if first:
... ... @@ -209,7 +209,7 @@ class FishTTS(BaseTTS):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = resampy.resample(x=stream, sr_orig=44100, sr_new=self.sample_rate)
#byte_stream=BytesIO(buffer)
... ... @@ -219,22 +219,22 @@ class FishTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
class VoitsTTS(BaseTTS):
def txt_to_audio(self,msg):
class SovitsTTS(BaseTTS):
def txt_to_audio(self,msg):
text,textevent = msg
self.stream_tts(
self.gpt_sovits(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -271,9 +271,9 @@ class VoitsTTS(BaseTTS):
if res.status_code != 200:
logger.error("Error:%s", res.text)
return
first = True
for chunk in res.iter_content(chunk_size=None): #12800 1280 32K*20ms*2
logger.info('chunk len:%d',len(chunk))
if first:
... ... @@ -295,7 +295,7 @@ class VoitsTTS(BaseTTS):
if stream.ndim > 1:
logger.info(f'[WARN] audio has {stream.shape[1]} channels, only use the first.')
stream = stream[:, 0]
if sample_rate != self.sample_rate and stream.shape[0]>0:
logger.info(f'[WARN] audio sample rate is {sample_rate}, resampling into {self.sample_rate}.')
stream = resampy.resample(x=stream, sr_orig=sample_rate, sr_new=self.sample_rate)
... ... @@ -306,7 +306,7 @@ class VoitsTTS(BaseTTS):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
#stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
#stream = resampy.resample(x=stream, sr_orig=32000, sr_new=self.sample_rate)
byte_stream=BytesIO(chunk)
... ... @@ -316,22 +316,22 @@ class VoitsTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
class CosyVoiceTTS(BaseTTS):
def txt_to_audio(self,msg):
text,textevent = msg
text,textevent = msg
self.stream_tts(
self.cosy_voice(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -348,16 +348,16 @@ class CosyVoiceTTS(BaseTTS):
try:
files = [('prompt_wav', ('prompt_wav', open(reffile, 'rb'), 'application/octet-stream'))]
res = requests.request("GET", f"{server_url}/inference_zero_shot", data=payload, files=files, stream=True)
end = time.perf_counter()
logger.info(f"cosy_voice Time to make POST: {end-start}s")
if res.status_code != 200:
logger.error("Error:%s", res.text)
return
first = True
for chunk in res.iter_content(chunk_size=9600): # 960 24K*20ms*2
if first:
end = time.perf_counter()
... ... @@ -372,7 +372,7 @@ class CosyVoiceTTS(BaseTTS):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
#byte_stream=BytesIO(buffer)
... ... @@ -382,13 +382,13 @@ class CosyVoiceTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
_PROTOCOL = "https://"
... ... @@ -407,7 +407,7 @@ class TencentTTS(BaseTTS):
self.sample_rate = 16000
self.volume = 0
self.speed = 0
def __gen_signature(self, params):
sort_dict = sorted(params.keys())
sign_str = "POST" + _HOST + _PATH + "?"
... ... @@ -440,11 +440,11 @@ class TencentTTS(BaseTTS):
return params
def txt_to_audio(self,msg):
text,textevent = msg
text,textevent = msg
self.stream_tts(
self.tencent_voice(
text,
self.opt.REF_FILE,
self.opt.REF_FILE,
self.opt.REF_TEXT,
"zh", #en args.language,
self.opt.TTS_SERVER, #"http://127.0.0.1:5000", #args.server_url,
... ... @@ -465,12 +465,12 @@ class TencentTTS(BaseTTS):
try:
res = requests.post(url, headers=headers,
data=json.dumps(params), stream=True)
end = time.perf_counter()
logger.info(f"tencent Time to make POST: {end-start}s")
first = True
for chunk in res.iter_content(chunk_size=6400): # 640 16K*20ms*2
#logger.info('chunk len:%d',len(chunk))
if first:
... ... @@ -483,7 +483,7 @@ class TencentTTS(BaseTTS):
except:
end = time.perf_counter()
logger.info(f"tencent Time to first chunk: {end-start}s")
first = False
first = False
if chunk and self.state==State.RUNNING:
yield chunk
except Exception as e:
... ... @@ -494,7 +494,7 @@ class TencentTTS(BaseTTS):
first = True
last_stream = np.array([],dtype=np.float32)
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = np.concatenate((last_stream,stream))
#stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
... ... @@ -505,14 +505,14 @@ class TencentTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
last_stream = stream[idx:] #get the remain stream
eventpoint={'status':'end','text':text,'msgenvent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
###########################################################################################
... ... @@ -522,7 +522,7 @@ class XTTS(BaseTTS):
self.speaker = self.get_speaker(opt.REF_FILE, opt.TTS_SERVER)
def txt_to_audio(self,msg):
text,textevent = msg
text,textevent = msg
self.stream_tts(
self.xtts(
text,
... ... @@ -558,7 +558,7 @@ class XTTS(BaseTTS):
return
first = True
for chunk in res.iter_content(chunk_size=9600): #24K*20ms*2
if first:
end = time.perf_counter()
... ... @@ -568,12 +568,12 @@ class XTTS(BaseTTS):
yield chunk
except Exception as e:
print(e)
def stream_tts(self,audio_stream,msg):
text,textevent = msg
first = True
for chunk in audio_stream:
if chunk is not None and len(chunk)>0:
if chunk is not None and len(chunk)>0:
stream = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32767
stream = resampy.resample(x=stream, sr_orig=24000, sr_new=self.sample_rate)
#byte_stream=BytesIO(buffer)
... ... @@ -583,10 +583,10 @@ class XTTS(BaseTTS):
while streamlen >= self.chunk:
eventpoint=None
if first:
eventpoint={'status':'start','text':text,'msgenvent':textevent}
eventpoint={'status':'start','text':text,'msgevent':textevent}
first = False
self.parent.put_audio_frame(stream[idx:idx+self.chunk],eventpoint)
streamlen -= self.chunk
idx += self.chunk
eventpoint={'status':'end','text':text,'msgenvent':textevent}
eventpoint={'status':'end','text':text,'msgevent':textevent}
self.parent.put_audio_frame(np.zeros(self.chunk,np.float32),eventpoint)
\ No newline at end of file
... ...
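All of the TTS classes above share the same `stream_tts` pattern: decode the HTTP audio stream, resample it to the session sample rate, slice it into fixed 20 ms chunks, and attach `start`/`end` eventpoints (note the `msgenvent` → `msgevent` key fix in this change). A condensed sketch of that pattern, not a drop-in replacement:
```python
# Condensed sketch of the shared stream_tts chunking pattern.
import numpy as np

def stream_pcm(parent, pcm: np.ndarray, chunk: int, text: str, textevent) -> None:
    """Feed float32 PCM to parent.put_audio_frame in fixed-size chunks with start/end eventpoints."""
    first = True
    idx, remaining = 0, pcm.shape[0]
    while remaining >= chunk:
        eventpoint = None
        if first:
            eventpoint = {'status': 'start', 'text': text, 'msgevent': textevent}
            first = False
        parent.put_audio_frame(pcm[idx:idx + chunk], eventpoint)
        remaining -= chunk
        idx += chunk
    # signal the end with one silent frame, as the classes above do
    parent.put_audio_frame(np.zeros(chunk, np.float32),
                           {'status': 'end', 'text': text, 'msgevent': textevent})
```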
... ... @@ -236,7 +236,7 @@ if __name__ == '__main__':
if hasattr(module, 'reparameterize'):
module.reparameterize()
return model
device = torch.device("cuda")
device = torch.device("cuda" if torch.cuda.is_available() else ("mps" if (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()) else "cpu"))
def check_onnx(torch_out, torch_in, audio):
onnx_model = onnx.load(onnx_path)
onnx.checker.check_model(onnx_model)
... ...
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>livetalking数字人交互平台</title>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.10.0/font/bootstrap-icons.css">
<style>
:root {
--primary-color: #4361ee;
--secondary-color: #3f37c9;
--accent-color: #4895ef;
--background-color: #f8f9fa;
--card-bg: #ffffff;
--text-color: #212529;
--border-radius: 10px;
--box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: var(--background-color);
color: var(--text-color);
min-height: 100vh;
padding-top: 20px;
}
.dashboard-container {
max-width: 1400px;
margin: 0 auto;
padding: 20px;
}
.card {
background-color: var(--card-bg);
border-radius: var(--border-radius);
box-shadow: var(--box-shadow);
border: none;
margin-bottom: 20px;
overflow: hidden;
}
.card-header {
background-color: var(--primary-color);
color: white;
font-weight: 600;
padding: 15px 20px;
border-bottom: none;
}
.video-container {
position: relative;
width: 100%;
background-color: #000;
border-radius: var(--border-radius);
overflow: hidden;
display: flex;
justify-content: center;
align-items: center;
}
video {
max-width: 100%;
max-height: 100%;
display: block;
border-radius: var(--border-radius);
}
.controls-container {
padding: 20px;
}
.btn-primary {
background-color: var(--primary-color);
border-color: var(--primary-color);
}
.btn-primary:hover {
background-color: var(--secondary-color);
border-color: var(--secondary-color);
}
.btn-outline-primary {
color: var(--primary-color);
border-color: var(--primary-color);
}
.btn-outline-primary:hover {
background-color: var(--primary-color);
color: white;
}
.form-control {
border-radius: var(--border-radius);
padding: 10px 15px;
border: 1px solid #ced4da;
}
.form-control:focus {
border-color: var(--accent-color);
box-shadow: 0 0 0 0.25rem rgba(67, 97, 238, 0.25);
}
.status-indicator {
width: 10px;
height: 10px;
border-radius: 50%;
display: inline-block;
margin-right: 5px;
}
.status-connected {
background-color: #28a745;
}
.status-disconnected {
background-color: #dc3545;
}
.status-connecting {
background-color: #ffc107;
}
.asr-container {
height: 300px;
overflow-y: auto;
padding: 15px;
background-color: #f8f9fa;
border-radius: var(--border-radius);
border: 1px solid #ced4da;
}
.asr-text {
margin-bottom: 10px;
padding: 10px;
background-color: white;
border-radius: var(--border-radius);
box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1);
}
.user-message {
background-color: #e3f2fd;
border-left: 4px solid var(--primary-color);
}
.system-message {
background-color: #f1f8e9;
border-left: 4px solid #8bc34a;
}
.recording-indicator {
position: absolute;
top: 15px;
right: 15px;
background-color: rgba(220, 53, 69, 0.8);
color: white;
padding: 5px 10px;
border-radius: 20px;
font-size: 0.8rem;
display: none;
}
.recording-indicator.active {
display: flex;
align-items: center;
}
.recording-indicator .blink {
width: 10px;
height: 10px;
background-color: #fff;
border-radius: 50%;
margin-right: 5px;
animation: blink 1s infinite;
}
@keyframes blink {
0% { opacity: 1; }
50% { opacity: 0.3; }
100% { opacity: 1; }
}
.mode-switch {
margin-bottom: 20px;
}
.nav-tabs .nav-link {
color: var(--text-color);
border: none;
padding: 10px 20px;
border-radius: var(--border-radius) var(--border-radius) 0 0;
}
.nav-tabs .nav-link.active {
color: var(--primary-color);
background-color: var(--card-bg);
border-bottom: 3px solid var(--primary-color);
font-weight: 600;
}
.tab-content {
padding: 20px;
background-color: var(--card-bg);
border-radius: 0 0 var(--border-radius) var(--border-radius);
}
.settings-panel {
padding: 15px;
background-color: #f8f9fa;
border-radius: var(--border-radius);
margin-top: 15px;
}
.footer {
text-align: center;
margin-top: 30px;
padding: 20px 0;
color: #6c757d;
font-size: 0.9rem;
}
.voice-record-btn {
width: 60px;
height: 60px;
border-radius: 50%;
background-color: var(--primary-color);
color: white;
display: flex;
justify-content: center;
align-items: center;
cursor: pointer;
transition: all 0.2s ease;
box-shadow: 0 2px 5px rgba(0,0,0,0.2);
margin: 0 auto;
}
.voice-record-btn:hover {
background-color: var(--secondary-color);
transform: scale(1.05);
}
.voice-record-btn:active {
background-color: #dc3545;
transform: scale(0.95);
}
.voice-record-btn i {
font-size: 24px;
}
.voice-record-label {
text-align: center;
margin-top: 10px;
font-size: 14px;
color: #6c757d;
}
.video-size-control {
margin-top: 15px;
}
.recording-pulse {
animation: pulse 1.5s infinite;
}
@keyframes pulse {
0% {
box-shadow: 0 0 0 0 rgba(220, 53, 69, 0.7);
}
70% {
box-shadow: 0 0 0 15px rgba(220, 53, 69, 0);
}
100% {
box-shadow: 0 0 0 0 rgba(220, 53, 69, 0);
}
}
</style>
</head>
<body>
<div class="dashboard-container">
<div class="row">
<div class="col-12">
<h1 class="text-center mb-4">livetalking数字人交互平台</h1>
</div>
</div>
<div class="row">
<!-- Video area -->
<div class="col-lg-8">
<div class="card">
<div class="card-header d-flex justify-content-between align-items-center">
<div>
<span class="status-indicator status-disconnected" id="connection-status"></span>
<span id="status-text">未连接</span>
</div>
</div>
<div class="card-body p-0">
<div class="video-container">
<video id="video" autoplay playsinline></video>
<div class="recording-indicator" id="recording-indicator">
<div class="blink"></div>
<span>录制中</span>
</div>
</div>
<div class="controls-container">
<div class="row">
<div class="col-md-6 mb-3">
<button class="btn btn-primary w-100" id="start">
<i class="bi bi-play-fill"></i> 开始连接
</button>
<button class="btn btn-danger w-100" id="stop" style="display: none;">
<i class="bi bi-stop-fill"></i> 停止连接
</button>
</div>
<div class="col-md-6 mb-3">
<div class="d-flex">
<button class="btn btn-outline-primary flex-grow-1 me-2" id="btn_start_record">
<i class="bi bi-record-fill"></i> 开始录制
</button>
<button class="btn btn-outline-danger flex-grow-1" id="btn_stop_record" disabled>
<i class="bi bi-stop-fill"></i> 停止录制
</button>
</div>
</div>
</div>
<div class="row">
<div class="col-12">
<div class="video-size-control">
<label for="video-size-slider" class="form-label">视频大小调节: <span id="video-size-value">100%</span></label>
<input type="range" class="form-range" id="video-size-slider" min="50" max="150" value="100">
</div>
</div>
</div>
<div class="settings-panel mt-3">
<div class="row">
<div class="col-md-12">
<div class="form-check form-switch mb-3">
<input class="form-check-input" type="checkbox" id="use-stun">
<label class="form-check-label" for="use-stun">使用STUN服务器</label>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- Right-side interaction panel -->
<div class="col-lg-4">
<div class="card">
<div class="card-header">
<ul class="nav nav-tabs card-header-tabs" id="interaction-tabs" role="tablist">
<li class="nav-item" role="presentation">
<button class="nav-link active" id="chat-tab" data-bs-toggle="tab" data-bs-target="#chat" type="button" role="tab" aria-controls="chat" aria-selected="true">对话模式</button>
</li>
<li class="nav-item" role="presentation">
<button class="nav-link" id="tts-tab" data-bs-toggle="tab" data-bs-target="#tts" type="button" role="tab" aria-controls="tts" aria-selected="false">朗读模式</button>
</li>
</ul>
</div>
<div class="card-body">
<div class="tab-content" id="interaction-tabs-content">
<!-- Chat mode -->
<div class="tab-pane fade show active" id="chat" role="tabpanel" aria-labelledby="chat-tab">
<div class="asr-container mb-3" id="chat-messages">
<div class="asr-text system-message">
系统: 欢迎使用livetalking,请点击"开始连接"按钮开始对话。
</div>
</div>
<form id="chat-form">
<div class="input-group mb-3">
<textarea class="form-control" id="chat-message" rows="3" placeholder="输入您想对数字人说的话..."></textarea>
<button class="btn btn-primary" type="submit">
<i class="bi bi-send"></i> 发送
</button>
</div>
</form>
<!-- Push-to-talk button -->
<div class="voice-record-btn" id="voice-record-btn">
<i class="bi bi-mic-fill"></i>
</div>
<div class="voice-record-label">按住说话,松开发送</div>
</div>
<!-- Read-aloud mode -->
<div class="tab-pane fade" id="tts" role="tabpanel" aria-labelledby="tts-tab">
<form id="echo-form">
<div class="mb-3">
<label for="message" class="form-label">输入要朗读的文本</label>
<textarea class="form-control" id="message" rows="6" placeholder="输入您想让数字人朗读的文字..."></textarea>
</div>
<button type="submit" class="btn btn-primary w-100">
<i class="bi bi-volume-up"></i> 朗读文本
</button>
</form>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="footer">
<p>Made with ❤️ by Marstaos | Frontend & Performance Optimization</p>
</div>
</div>
<!-- Hidden session ID -->
<input type="hidden" id="sessionid" value="0">
<script src="client.js"></script>
<script src="srs.sdk.js"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<script>
$(document).ready(function() {
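// Video size slider: scale the <video> element between 50% and 150% of its container width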
$('#video-size-slider').on('input', function() {
const value = $(this).val();
$('#video-size-value').text(value + '%');
$('#video').css('width', value + '%');
});
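// Update the connection status indicator dot and label text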
function updateConnectionStatus(status) {
const statusIndicator = $('#connection-status');
const statusText = $('#status-text');
statusIndicator.removeClass('status-connected status-disconnected status-connecting');
switch(status) {
case 'connected':
statusIndicator.addClass('status-connected');
statusText.text('已连接');
break;
case 'connecting':
statusIndicator.addClass('status-connecting');
statusText.text('连接中...');
break;
case 'disconnected':
default:
statusIndicator.addClass('status-disconnected');
statusText.text('未连接');
break;
}
}
// Append a chat message to the chat panel
function addChatMessage(message, type = 'user') {
const messagesContainer = $('#chat-messages');
const messageClass = type === 'user' ? 'user-message' : 'system-message';
const sender = type === 'user' ? '您' : '数字人';
// Build the element with .text() so user input is not interpreted as HTML
const messageElement = $('<div>')
.addClass('asr-text ' + messageClass)
.text(sender + ': ' + message);
messagesContainer.append(messageElement);
messagesContainer.scrollTop(messagesContainer[0].scrollHeight);
}
// Start / stop connection buttons
$('#start').click(function() {
updateConnectionStatus('connecting');
start();
$(this).hide();
$('#stop').show();
// Poll periodically to check whether the video stream has started playing
let connectionCheckTimer = setInterval(function() {
const video = document.getElementById('video');
// The video has data once readyState >= 3 (HAVE_FUTURE_DATA) and frames have a non-zero width
if (video.readyState >= 3 && video.videoWidth > 0) {
updateConnectionStatus('connected');
clearInterval(connectionCheckTimer);
}
}, 2000); // Poll every 2 seconds
// Stop polling after 60 seconds if the stream still has not started
setTimeout(function() {
if (connectionCheckTimer) {
clearInterval(connectionCheckTimer);
}
}, 60000);
});
$('#stop').click(function() {
stop();
$(this).hide();
$('#start').show();
updateConnectionStatus('disconnected');
});
// Recording controls: POST to /record with type 'start_record' / 'end_record' and the current sessionid
$('#btn_start_record').click(function() {
console.log('Starting recording...');
fetch('/record', {
body: JSON.stringify({
type: 'start_record',
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).then(function(response) {
if (response.ok) {
console.log('Recording started.');
$('#btn_start_record').prop('disabled', true);
$('#btn_stop_record').prop('disabled', false);
$('#recording-indicator').addClass('active');
} else {
console.error('Failed to start recording.');
}
}).catch(function(error) {
console.error('Error:', error);
});
});
$('#btn_stop_record').click(function() {
console.log('Stopping recording...');
fetch('/record', {
body: JSON.stringify({
type: 'end_record',
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).then(function(response) {
if (response.ok) {
console.log('Recording stopped.');
$('#btn_start_record').prop('disabled', false);
$('#btn_stop_record').prop('disabled', true);
$('#recording-indicator').removeClass('active');
} else {
console.error('Failed to stop recording.');
}
}).catch(function(error) {
console.error('Error:', error);
});
});
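// Read-aloud mode: send the text to /human with type 'echo' so the avatar reads it aloud, interrupting any current speech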
$('#echo-form').on('submit', function(e) {
e.preventDefault();
var message = $('#message').val();
if (!message.trim()) return;
console.log('Sending echo message:', message);
fetch('/human', {
body: JSON.stringify({
text: message,
type: 'echo',
interrupt: true,
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).catch(function(error) {
console.error('Error:', error);
});
$('#message').val('');
addChatMessage(`已发送朗读请求: "${message}"`, 'system');
});
// Chat mode: send the text to /human with type 'chat', interrupting any current speech
$('#chat-form').on('submit', function(e) {
e.preventDefault();
var message = $('#chat-message').val();
if (!message.trim()) return;
console.log('Sending chat message:', message);
fetch('/human', {
body: JSON.stringify({
text: message,
type: 'chat',
interrupt: true,
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).catch(function(error) {
console.error('Error:', error);
});
addChatMessage(message, 'user');
$('#chat-message').val('');
});
// Push-to-talk: hold to speak, release to send
let mediaRecorder;
let audioChunks = [];
let isRecording = false;
let recognition;
// Check whether the browser supports the Web Speech API (SpeechRecognition / webkitSpeechRecognition)
const isSpeechRecognitionSupported = 'webkitSpeechRecognition' in window || 'SpeechRecognition' in window;
if (isSpeechRecognitionSupported) {
recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = 'zh-CN';
recognition.onresult = function(event) {
let interimTranscript = '';
let finalTranscript = '';
for (let i = event.resultIndex; i < event.results.length; ++i) {
if (event.results[i].isFinal) {
finalTranscript += event.results[i][0].transcript;
} else {
interimTranscript += event.results[i][0].transcript;
$('#chat-message').val(interimTranscript);
}
}
if (finalTranscript) {
$('#chat-message').val(finalTranscript);
}
};
recognition.onerror = function(event) {
console.error('Speech recognition error:', event.error);
};
}
// Push-to-talk button: start recording on press, stop on release or when the pointer leaves
$('#voice-record-btn').on('mousedown touchstart', function(e) {
e.preventDefault();
startRecording();
}).on('mouseup mouseleave touchend', function() {
if (isRecording) {
stopRecording();
}
});
// Start capturing microphone audio and begin speech recognition
function startRecording() {
if (isRecording) return;
navigator.mediaDevices.getUserMedia({ audio: true })
.then(function(stream) {
audioChunks = [];
mediaRecorder = new MediaRecorder(stream);
mediaRecorder.ondataavailable = function(e) {
if (e.data.size > 0) {
audioChunks.push(e.data);
}
};
mediaRecorder.start();
isRecording = true;
$('#voice-record-btn').addClass('recording-pulse');
$('#voice-record-btn').css('background-color', '#dc3545');
if (recognition) {
recognition.start();
}
})
.catch(function(error) {
console.error('Unable to access the microphone:', error);
alert('无法访问麦克风,请检查浏览器权限设置。');
});
}
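// Stop recording: release the microphone, stop speech recognition, then send the recognized text to /human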
function stopRecording() {
if (!isRecording) return;
mediaRecorder.stop();
isRecording = false;
// Stop all audio tracks to release the microphone
mediaRecorder.stream.getTracks().forEach(track => track.stop());
// Restore the button's visual state
$('#voice-record-btn').removeClass('recording-pulse');
$('#voice-record-btn').css('background-color', '');
// Stop speech recognition
if (recognition) {
recognition.stop();
}
// Wait briefly for the final recognition result, then send it
setTimeout(function() {
const recognizedText = $('#chat-message').val().trim();
if (recognizedText) {
// Send the recognized text as a chat message
fetch('/human', {
body: JSON.stringify({
text: recognizedText,
type: 'chat',
interrupt: true,
sessionid: parseInt(document.getElementById('sessionid').value),
}),
headers: {
'Content-Type': 'application/json'
},
method: 'POST'
}).catch(function(error) {
console.error('Error:', error);
});
addChatMessage(recognizedText, 'user');
$('#chat-message').val('');
}
}, 500);
}
// Wrap any existing WebRTC connection callback so the status indicator is also updated
if (typeof window.onWebRTCConnected === 'function') {
const originalOnConnected = window.onWebRTCConnected;
window.onWebRTCConnected = function() {
updateConnectionStatus('connected');
if (originalOnConnected) originalOnConnected();
};
} else {
window.onWebRTCConnected = function() {
updateConnectionStatus('connected');
};
}
// Likewise update the status when the connection drops
if (typeof window.onWebRTCDisconnected === 'function') {
const originalOnDisconnected = window.onWebRTCDisconnected;
window.onWebRTCDisconnected = function() {
updateConnectionStatus('disconnected');
if (originalOnDisconnected) originalOnDisconnected();
};
} else {
window.onWebRTCDisconnected = function() {
updateConnectionStatus('disconnected');
};
}
// SRS WebRTC (WHEP) playback
var sdk = null; // Global handle so the previous session can be cleaned up when replaying
function startPlay() {
// Close any previous connection
if (sdk) {
sdk.close();
}
sdk = new SrsRtcWhipWhepAsync();
$('#video').prop('srcObject', sdk.stream);
var host = window.location.hostname;
var url = "http://" + host + ":1985/rtc/v1/whep/?app=live&stream=livestream";
sdk.play(url).then(function(session) {
console.log('WebRTC playback started, session id:', session.sessionid);
}).catch(function(reason) {
sdk.close();
console.error('WebRTC playback failed:', reason);
});
}
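// Note: startPlay() is defined inside this ready() closure and is not wired to any button here; it can be used when the stream is published through an SRS server (WHEP pull on port 1985) instead of the default WebRTC path.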
});
</script>
</body>
</html>