【Title】: How to receive SIP audio and send a wav stream to the Google Speech Recognition API in Node?
【Posted】: 2019-11-09 02:24:49
【Question】:

So far I have been trying sipster, but it has some daunting limitations (e.g. lack of configurability). Any ideas how to do this? Perhaps with an Asterisk wrapper for Node such as asterisk-manager?

In a bit more detail, the basic idea is to:

  • run a virtual SIP client that can receive SIP connections
  • convert the audio from that connection to regular wav format
  • stream the wav audio to the Google Speech API
  • have other ways to act on the SIP stream from Node, e.g. playing sounds
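
For the "convert to regular wav" step above, a minimal sketch in Node: raw 16-bit signed-linear PCM, which is what Asterisk typically hands out, only needs a 44-byte RIFF header prepended to become a wav file. The function name `pcmToWav` and the 8000 Hz default are illustrative assumptions, not from any library:

```javascript
// Wrap raw 16-bit little-endian PCM samples in a minimal RIFF/WAVE header.
// A sketch, not a full WAV library; defaults match a G.711-derived 8 kHz
// mono stream as produced by Asterisk's slin format.
function pcmToWav(pcm, sampleRate = 8000, channels = 1, bitsPerSample = 16) {
  const byteRate = sampleRate * channels * bitsPerSample / 8;
  const blockAlign = channels * bitsPerSample / 8;
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);             // fmt sub-chunk size
  header.writeUInt16LE(1, 20);              // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);     // data sub-chunk size
  return Buffer.concat([header, pcm]);
}
```

The resulting buffer can be written to disk or streamed directly; the Speech API also accepts headerless LINEAR16, so this step is only needed when a tool in the chain insists on a wav container.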

【Comments】:

  • sipster is configurable: you can pass pjsua2 configuration options to init(). Those options can be found in the pjsua2 documentation; they are not listed in the sipster docs because there are many of them and that would duplicate the documentation.
  • Assuming your "wav stream" means "streaming/continuous" in the Google docs, you need to go the gRPC / proto-buffers route on the Google side. You should look at your API for access to the bytes of the audio buffer... assuming their encoding fmt && bit rate are compatible with the Speech API input, you just ArrayCopy.myAudioBytes() && write to the goog.api.channel you opened for speech...

Tags: node.js audio speech-recognition asterisk sip


【Solution 1】:

This post is quite old now, and it looks like things have improved a lot on Google's side, both in the speech processor itself, which keeps getting more accurate, and on the Node.js side, where the client interfacing with the Google Cloud Speech API is updated regularly.

Following @arheops' suggestion, you may want to look at Asterisk's EAGI together with Node.js in order to get audio samples transcribed by Google.

The following EAGI bash script may help in that regard (a detailed explanation is available here):

#!/bin/bash

# Read all variables sent by Asterisk and store them; this script doesn't use them
declare -a array
while read -e ARG && [ "$ARG" ] ; do
        array=($(echo $ARG | sed -e 's/://'))
        export ${array[0]}=${array[1]}
done

# First argument is language
case "$1" in
"fr-FR" | "en-GB" | "es-ES" | "it-IT" )
  LANG=$1
  ;;
*)
  LANG=en-US
  ;;
esac

NODECMD=$(which node)

# Second argument is a timeout, in seconds: the duration to wait for voice input from the caller.
DURATION=$2
SAMPLE_RATE=8000
SAMPLE_SIZE_BYTES=2
let "SAMPLE_SIZE_BITS = SAMPLE_SIZE_BYTES * 8"

# EAGI_AUDIO_FORMAT is an asterisk variable that specifies the sample rate and
# sample size (usually 16 bits per sample) of the caller's voice stream.
# Depending on the codec used here, you can get sample rate values ranging from
# 8000Hz (e.g. G.711 uLaw) to 48000Hz (e.g. opus).
echo "GET VARIABLE EAGI_AUDIO_FORMAT"
read line
EAGI_AUDIO_FORMAT=$(echo $line | sed -r 's/.*\((.*)\).*/\1/')

# DURATION seconds of audio input amount to ( SAMPLE_RATE * SAMPLE_SIZE_BYTES ) * DURATION bytes
# - SAMPLE_RATE is set as per EAGI_AUDIO_FORMAT
# - SAMPLE_SIZE_BYTES is set to 2 (16 bits per sample)
#
# We don't do much here to adapt to the sample rate; this code could be improved
case "${EAGI_AUDIO_FORMAT}" in
"slin48")
  SAMPLE_RATE=48000
  ;;
*)
  SAMPLE_RATE=8000
  ;;
esac

# Temporary file to store raw audio samples
AUDIO_FILE=/tmp/audio-${SAMPLE_SIZE_BITS}_bits-${SAMPLE_RATE}_hz-${DURATION}_sec.raw

# We use `dd` here to copy the raw audio samples we're getting from file
# descriptor 3 (this is the Enhanced version in EAGI) to the temporary file.
# The number of blocks to copy is a function of the DURATION to record audio and
# the sample rate. SAMPLE_SIZE_BYTES cannot be changed as it is assumed that each
# sample is 16 bits in size.
let "COUNT = SAMPLE_RATE * SAMPLE_SIZE_BYTES * DURATION"
# By default, dd stores blocks of 512 bytes
let "BLOCKS = COUNT / 512"
echo "exec noop \"Number of bytes to store : ${COUNT}\""
read line

echo "exec noop \"Number of dd blocks to store : ${BLOCKS}\""
read line

echo "exec playback \"beep\""
read line

dd if=/dev/fd/3 count=${BLOCKS} of=${AUDIO_FILE}
echo "exec noop \"File saved !\""
read line

echo "exec noop \"AUDIO_FILE : ${AUDIO_FILE}\""
read line
echo "exec noop \"SAMPLE_RATE : ${SAMPLE_RATE}\""
read line
echo "exec noop \"LANG : ${LANG}\""
read line

# Submit audio to Google Cloud Speech API and get the result
export GOOGLE_APPLICATION_CREDENTIALS=/usr/local/node_programs/service_account_file.json
RES=$(${NODECMD} /usr/local/node_programs/nodejs-speech/samples/recognize.js sync ${AUDIO_FILE} -e LINEAR16 -r ${SAMPLE_RATE} -l ${LANG})

# clean up result returned from recognize.js :
# - remove new lines
# - remove 'Transcription :' header
RES=$(echo $RES | tr -d '\n' | sed -e 's/Transcription: \(.*$\)/\1/')

# Set GOOGLE_TRANSCRIPTION_RESULT variable, remove temporary file
# and continue dialplan execution
echo "set variable GOOGLE_TRANSCRIPTION_RESULT \"${RES}\""
read line

/bin/rm -f ${AUDIO_FILE}

exit 0
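
The variable-reading loop at the top of the script can also be done on the Node side, which is useful if you want to replace the bash wrapper entirely. A sketch of parsing the (E)AGI environment block that Asterisk writes to stdin; `parseAgiEnv` is an illustrative helper name, not part of any AGI library:

```javascript
// Asterisk starts an (E)AGI session by writing "agi_name: value" lines to
// stdin, terminated by a blank line. This parses that header block into an
// object, mirroring the "read all variables" loop of the bash script.
function parseAgiEnv(text) {
  const env = {};
  for (const line of text.split('\n')) {
    if (line.trim() === '') break;   // blank line ends the AGI header block
    const i = line.indexOf(':');
    if (i === -1) continue;          // skip malformed lines
    env[line.slice(0, i)] = line.slice(i + 1).trim();
  }
  return env;
}
```

In a real EAGI program you would buffer process.stdin until the blank line, call this once, and then exchange AGI commands and responses just as the bash script does with echo/read.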

Hope this helps!

【Discussion】:

【Solution 2】:

The simplest way: use the Asterisk EAGI interface and read the sound from the stdin/stream into Google.

However, the Google speech recognition API is not very stable at the moment. Some days it just stops working, then starts working again the next day.

【Discussion】:

  • I tried that, but it didn't work. Could you share some sample code?