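[Speech Recognition] Speech Endpoint Detection and Its Python Implementation
- 1. Framing the Speech Signal
- 2. Endpoint Detection Methods
- 2.1 Short-Time Energy
- 2.2 Short-Time Zero-Crossing Rate
- 3. Python Implementation
Accurately detecting where human speech starts and ends in a received audio signal is a prerequisite for speech recognition. This post introduces a basic speech endpoint detection method based on the short-time zero-crossing rate and short-time energy, along with a Python implementation. The figure shows a speech signal; the region inside the red box is human speech: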

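1. Framing the Speech Signal
A speech signal is a time series with long-term randomness and short-term stationarity. Long-term randomness means that, over time, the signal behaves as a random process. Short-term stationarity means that its characteristics stay essentially constant over a short interval, because the muscles involved in speaking have inertia and cannot move from one state to another instantaneously. Speech is usually fairly stationary over 10-30 ms, so the first step in speech signal processing is almost always to split the signal into frames, with a frame length typically of 10-30 ms.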
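As a quick illustration of the 10-30 ms guideline, the sketch below converts a frame duration into a number of samples. The helper name `frame_length_in_samples` and the two sample rates are assumptions made for this example; they do not come from the original post.

```python
def frame_length_in_samples(duration_ms: float, sample_rate_hz: int) -> int:
    """Convert a frame duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000.0))


# Hypothetical sample rates, chosen only for illustration.
for rate in (8000, 16000):
    shortest = frame_length_in_samples(10, rate)
    longest = frame_length_in_samples(30, rate)
    print(f"{rate} Hz: {shortest} to {longest} samples per frame")
```

At 8 kHz, for instance, a 30 ms frame is 240 samples and a 10 ms frame is 80 samples.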
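Framing is usually done with a sliding window, such as a rectangular window or a Hamming window. The window length determines how much of the original signal each frame contains. When the hop (the distance the window slides at each step) equals the window length, the frames do not overlap; when the hop is smaller than the window length, adjacent frames overlap. This post uses a rectangular window for framing:
Rectangular window: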
$$h(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
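The sliding-window framing described above can be sketched with NumPy index arithmetic. This is a minimal illustration, not the post's own `enframe` function: the name `split_into_frames` and the toy signal are assumptions for the example, and the rectangular window is applied implicitly because every sample inside a frame is kept with weight 1.

```python
import numpy as np


def split_into_frames(signal, frame_len, hop):
    """Slide a rectangular window over `signal` and stack the frames row-wise."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = (len(signal) - frame_len) // hop + 1
    starts = hop * np.arange(n_frames)[:, None]   # first sample index of each frame
    offsets = np.arange(frame_len)[None, :]       # position inside a frame
    return signal[starts + offsets]


samples = np.arange(10)                           # toy signal 0, 1, ..., 9
no_overlap = split_into_frames(samples, frame_len=4, hop=4)
overlap = split_into_frames(samples, frame_len=4, hop=2)
print(no_overlap.shape)   # (2, 4): hop equals frame length, no shared samples
print(overlap.shape)      # (4, 4): hop is half the frame length, 50% overlap
```

With a hop of 2 and a frame length of 4, each frame shares its last two samples with the start of the next one.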
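2. Endpoint Detection Methods
Endpoint detection means finding the points where human speech starts and ends. The difference between the short-time characteristics of speech and those of non-speech makes these points detectable. This post introduces a method that combines short-time energy and the short-time zero-crossing rate.
2.1 Short-Time Energy
The short-time average energy of the n-th frame is defined as: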
$$E_n = \sum_{m=n-N+1}^{n} \left[ x(m)\, w(n-m) \right]^2$$
Frames that contain speech typically have a higher short-time average energy than frames that do not.
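A minimal sketch of this definition follows, assuming a rectangular window so that the weight on every sample in the frame is 1. The function name `short_time_energy` and the two made-up frames are illustrative only.

```python
import numpy as np


def short_time_energy(frames):
    """Sum of squared samples in each frame (rectangular window, weight 1)."""
    frames = np.asarray(frames, dtype=float)
    return np.sum(frames ** 2, axis=1)


# Two made-up frames: near-silence followed by a louder segment.
frames = np.array([
    [0.01, -0.02, 0.01, 0.00],
    [0.50, -0.40, 0.60, -0.50],
])
energy = short_time_energy(frames)
print(energy)
```

The second frame's energy is several orders of magnitude larger than the first, which is the gap an energy threshold exploits.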
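2.2 Short-Time Zero-Crossing Rate
A zero crossing occurs when the signal passes through zero, that is, when two adjacent samples have opposite signs. The zero-crossing count is the number of times the samples change sign.
The average short-time zero-crossing count of the n-th frame is: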
$$Z_n = \sum_{m=n-N+1}^{n} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m)$$
where
$$w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
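The formula can be sketched for a single non-empty frame as follows. Each sign change contributes 2 to the sum, so the 1/(2N) weight turns the result into crossings per sample. The function name `zero_crossing_rate` and the optional `previous_sample` argument, which stands in for x(m-1) at the first sample of the frame, are assumptions made for this example.

```python
import numpy as np


def zero_crossing_rate(frame, previous_sample=None):
    """Z_n for one frame: sum of |sgn x(m) - sgn x(m-1)| weighted by 1/(2N)."""
    frame = np.asarray(frame, dtype=float)
    if previous_sample is None:
        # Assumption: with no earlier sample, count no crossing at the first sample.
        previous_sample = frame[0]
    extended = np.concatenate(([previous_sample], frame))
    sign_changes = np.abs(np.diff(np.sign(extended)))
    return sign_changes.sum() / (2 * len(frame))


print(zero_crossing_rate([1, -1, 1, -1]))   # alternating signs: 3 crossings in 4 samples
print(zero_crossing_rate([1, 2, 3, 4]))     # no sign change
```

Note that `np.sign` returns 0 for a sample that is exactly zero, so a transition through an exact zero is counted as two half-crossings.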
3. Python Implementation
import wave

import numpy as np
import matplotlib.pyplot as plt


def read(data_path):
    '''Read a speech signal from a WAV file (assumes 16-bit mono PCM).'''
    f = wave.open(data_path, 'rb')
    params = f.getparams()
    # Number of channels, sample width in bytes, sample rate, number of sample frames
    nchannels, sampwidth, framerate, nframes = params[:4]
    str_data = f.readframes(nframes)  # raw audio as bytes
    f.close()
    # Interpret the bytes as 16-bit integers and convert to float
    # (np.frombuffer replaces the deprecated binary mode of np.fromstring)
    wavedata = np.frombuffer(str_data, dtype=np.short).astype(np.float64)
    # Normalize the amplitude to the range [-1, 1]
    wavedata = wavedata / np.max(np.abs(wavedata))
    return wavedata, nframes, framerate


def plot(data, time):
    plt.plot(time, data)
    plt.grid(True)
    plt.show()


def enframe(data, win, inc):
    '''Split speech data into frames.
    input:  data (1-D array): speech signal
            win (int or 1-D array): window length, or a window whose length is used
            inc (int): hop, the distance the window moves at each step
    output: f (2-D array): the samples under each window position, one row per frame
    '''
    nx = len(data)  # length of the speech signal
    try:
        nwin = len(win)
    except TypeError:
        nwin = 1
    if nwin == 1:
        wlen = win
    else:
        wlen = nwin
    nf = int(np.fix((nx - wlen) / inc) + 1)  # number of window positions
    # Element j of frame i is sample inc * i + j of the signal
    indf = inc * np.arange(nf).reshape(-1, 1)
    inds = np.arange(wlen).reshape(1, -1)
    f = np.asarray(data, dtype=float)[indf + inds]
    return f


def point_check(wavedata, win, inc):
    '''Speech endpoint detection.
    input:  wavedata (1-D array): original speech signal
    output: StartPoint (int): start endpoint (frame index)
            EndPoint (int): end endpoint (frame index)
    '''
    # 1. Short-time zero-crossing rate
    FrameTemp1 = enframe(wavedata[0:-1], win, inc)
    FrameTemp2 = enframe(wavedata[1:], win, inc)
    # A crossing occurs where adjacent samples have opposite signs ...
    signs = (FrameTemp1 * FrameTemp2) < 0
    # ... and differ by more than 0.01, which ignores low-level jitter around zero
    diffs = np.abs(FrameTemp1 - FrameTemp2) > 0.01
    zcr = list(np.sum(signs & diffs, axis=1))

    # 2. Short-time energy, computed here as the sum of absolute amplitudes per frame
    amp = list(np.abs(enframe(wavedata, win, inc)).sum(axis=1))

    # Set the thresholds
    ZcrLow = max([round(np.mean(zcr) * 0.1), 3])    # low zero-crossing threshold
    ZcrHigh = max([round(max(zcr) * 0.1), 5])       # high zero-crossing threshold
    AmpLow = min([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1])   # low energy threshold
    AmpHigh = max([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1])  # high energy threshold

    # Endpoint detection
    MaxSilence = 8   # longest allowed gap inside speech, in frames
    MinAudio = 16    # shortest accepted speech segment, in frames
    Status = 0       # state: 0 = silence, 1 = transition, 2 = speech, 3 = end
    HoldTime = 0     # speech duration, in frames
    SilenceTime = 0  # gap duration, in frames
    print('Starting endpoint detection')
    StartPoint = 0
    for n in range(len(zcr)):
        if Status == 0 or Status == 1:
            if amp[n] > AmpHigh or zcr[n] > ZcrHigh:
                StartPoint = n - HoldTime
                Status = 2
                HoldTime = HoldTime + 1
                SilenceTime = 0
            elif amp[n] > AmpLow or zcr[n] > ZcrLow:
                Status = 1
                HoldTime = HoldTime + 1
            else:
                Status = 0
                HoldTime = 0
        elif Status == 2:
            if amp[n] > AmpLow or zcr[n] > ZcrLow:
                HoldTime = HoldTime + 1
            else:
                SilenceTime = SilenceTime + 1
                if SilenceTime < MaxSilence:
                    HoldTime = HoldTime + 1
                elif (HoldTime - SilenceTime) < MinAudio:
                    Status = 0
                    HoldTime = 0
                    SilenceTime = 0
                else:
                    Status = 3
        elif Status == 3:
            break
        if Status == 3:
            break
    HoldTime = HoldTime - SilenceTime
    EndPoint = StartPoint + HoldTime
    return StartPoint, EndPoint, FrameTemp1


if __name__ == '__main__':
    data_path = 'audio_data.wav'
    win = 240
    inc = 80
    wavedata, nframes, framerate = read(data_path)
    time_list = np.array(range(0, nframes)) * (1.0 / framerate)
    plot(wavedata, time_list)
    StartPoint, EndPoint, FrameTemp = point_check(wavedata, win, inc)
    print('StartPoint (frame):', StartPoint, 'EndPoint (frame):', EndPoint)
    # The original post calls check_signal here, but its definition is not
    # included in the post, so the call is left commented out:
    # checkdata, Framecheck = check_signal(StartPoint, EndPoint, FrameTemp, win, inc)
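`point_check` returns `StartPoint` and `EndPoint` as frame indices, so locating the speech segment on the waveform needs a conversion back to samples or seconds. The sketch below shows one possible mapping. The helper `frames_to_seconds` is hypothetical, and it assumes the end frame is counted in full; the frame indices and the 8 kHz sample rate in the usage lines are made-up values, while `win` and `inc` match the script.

```python
def frames_to_seconds(start_frame, end_frame, win, inc, framerate):
    """Map frame indices to (start, end) times in seconds."""
    start_sample = start_frame * inc      # first sample of the start frame
    end_sample = end_frame * inc + win    # one past the last sample of the end frame
    return start_sample / framerate, end_sample / framerate


# Made-up frame indices and sample rate, for illustration only.
start_s, end_s = frames_to_seconds(100, 250, win=240, inc=80, framerate=8000)
print(f"speech from {start_s:.2f} s to {end_s:.2f} s")
```

The same sample indices can be used to slice `wavedata` directly if only the detected segment is needed.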
Endpoint detection result:
