Thesis Information

Material Type
Thesis (dissertation)

Author
강병옥 (Chungbuk National University, Chungbuk National University Graduate School)

Advisor
권오욱

Year of Publication
2018

Copyright
Theses of Chungbuk National University are protected by copyright.

Abstract · Keywords

This paper describes studies conducted on acoustic models robust to variations in speakers and speaking styles for automatic speech recognition (ASR) systems.
With regard to acoustic models robust to variation in speaking styles, we analyze the characteristics of Korean atypical spontaneous speech and attempt to improve automatic speech recognition performance. Atypical spontaneous speech is a speaking style used in everyday life that includes phenomena such as abnormal speaking rate, ambiguous pronunciation, lengthening, repetition/correction, filled pauses, aborted articulation, and colloquial expressions. These phenomena are believed to be the main cause of speech recognition errors. In this paper, the linguistic and acoustic characteristics common to atypical spontaneous speech are identified, and the effects of these attributes on speech recognition errors are analyzed. Through a quantitative analysis of speech recognition errors in tagged Korean atypical spontaneous speech data, we find that, among the various attributes of atypical spontaneous speech, acoustic characteristics such as ambiguous pronunciation and abnormal speaking rate are the main causes of performance degradation in speech recognition systems. Based on these analysis results, we apply a two-step approach to the acoustic model and achieve an average error rate reduction (ERR) of 49.0% compared to the baseline model. Through this stepwise process of improving the acoustic model, we analyze the changes in speech recognition error patterns and the proportion of the error rate attributable to each tagged attribute of Korean atypical spontaneous speech.
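Here ERR denotes the relative reduction of the word error rate (WER) with respect to the baseline. A minimal sketch of that computation, using hypothetical WER values chosen only to illustrate a 49.0% reduction (the numbers are not taken from the thesis):

```python
def error_rate_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative error rate reduction (ERR) of an improved WER over a baseline WER."""
    return (baseline_wer - improved_wer) / baseline_wer

# Illustrative numbers only: a baseline WER of 30.0% lowered to 15.3%
# corresponds to a 49.0% relative error rate reduction.
print(f"{error_rate_reduction(30.0, 15.3):.1%}")  # 49.0%
```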
To improve the performance of acoustic models robust to speaker variation, we propose a new method for combining multiple acoustic models in Gaussian mixture model (GMM) space, which merges several sub-acoustic models to generate an optimal GMM-based acoustic model for a new target speech recognition task. After forming a large pool of states with multiple acoustic characteristics by gathering HMM states from several individually trained sub-models, an acoustic model with states optimal for the new task is generated by combining pairs of states that occupy a similar acoustic space using the proposed algorithm. To evaluate the proposed acoustic modeling method, we perform experiments on a non-native English speech recognition task for English-as-a-second-language learning systems and on a noise-robust speech recognition task for car navigation systems. For the non-native speech recognition task, we generate a combined native and non-native acoustic model. The experimental results show that the proposed method of combining native and non-native models in GMM space achieves an average ERR of 12.2% compared to the conventional method. For the noise-robust speech recognition task for car navigation systems, we apply the proposed method to combine an acoustic model trained on clean speech data with an acoustic model trained on noise-added speech data. The proposed method achieves ERRs of 7.5% to 20.6% under various driving conditions compared to the conventional multi-condition training method.
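The following sketch illustrates the general idea of pairing HMM states from two sub-models that occupy a similar acoustic space and pooling their mixture components. It is not the thesis's algorithm: the thesis uses a log-likelihood-based criterion (Section 4.2.2) and re-adjusts state and GMM parameter weights afterwards (Section 4.2.4), whereas this sketch substitutes a symmetric KL divergence between moment-matched Gaussians as a stand-in similarity measure; the data layout and the threshold are hypothetical.

```python
import numpy as np

def gmm_moment_match(weights, means, variances):
    """Collapse a diagonal-covariance GMM into one Gaussian by moment matching."""
    mu = np.sum(weights[:, None] * means, axis=0)
    var = np.sum(weights[:, None] * (variances + means ** 2), axis=0) - mu ** 2
    return mu, var

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1.0)
    return kl12 + kl21

def combine_state_pools(pool_a, pool_b, threshold=5.0):
    """Greedily pair states from two sub-models whose GMMs lie close in acoustic
    space and pool their components; unmatched states pass through unchanged.
    Each state is a dict with 'weights' (K,), 'means' (K, D), 'vars' (K, D)."""
    combined, used = [], set()
    for sa in pool_a:
        mu_a, var_a = gmm_moment_match(sa["weights"], sa["means"], sa["vars"])
        best_j, best_d = None, threshold
        for j, sb in enumerate(pool_b):
            if j in used:
                continue
            mu_b, var_b = gmm_moment_match(sb["weights"], sb["means"], sb["vars"])
            d = sym_kl_diag(mu_a, var_a, mu_b, var_b)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            combined.append(sa)
            continue
        sb = pool_b[best_j]
        used.add(best_j)
        combined.append({  # merged state: equal prior on each sub-model's components
            "weights": np.concatenate([sa["weights"], sb["weights"]]) * 0.5,
            "means": np.concatenate([sa["means"], sb["means"]]),
            "vars": np.concatenate([sa["vars"], sb["vars"]]),
        })
    combined.extend(sb for j, sb in enumerate(pool_b) if j not in used)
    return combined
```

The subsequent weight re-adjustment and the likelihood computation over adaptation data described in the thesis are omitted from this sketch.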
After the success of deep neural network (DNN)-based acoustic models applied to large vocabulary continuous speech recognition (LVCSR) was first reported, most speech recognition systems currently in service have adopted DNN-based acoustic models. By extending the previous method of combining multiple acoustic models in GMM space, we propose a new method for training DNN-based acoustic models for non-native speech recognition. The proposed method consists of determining multi-set state clusters with various acoustic properties, training a DNN-based acoustic model with the pre-determined multi-set state clusters as output nodes, and recognizing input speech with the trained acoustic model. In the proposed method, the hidden nodes of the DNN are shared, but the output nodes are separated to accommodate the different acoustic properties of native and non-native speech. On an English speech recognition task, the proposed method achieves ERRs of 2.1% and 14.5% for Korean and English speakers, respectively, compared to the conventional DNN-based acoustic model, which already has a high level of speech recognition accuracy.
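As a rough sketch of the shared-hidden, multi-set-output structure described above, the module below shares its hidden layers between native and non-native speech while keeping a separate output layer per state-cluster set. The feature dimension, layer sizes, state-cluster counts, and the choice of PyTorch are illustrative assumptions; the abstract does not specify them.

```python
import torch
import torch.nn as nn

class MultiSetDNNAcousticModel(nn.Module):
    """Shared hidden layers with one output layer per state-cluster set
    (hypothetical dimensions; for illustration only)."""
    def __init__(self, feat_dim=440, hidden_dim=1024, n_hidden=5,
                 n_native_states=4000, n_nonnative_states=4000):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        self.shared = nn.Sequential(*layers)                              # shared hidden nodes
        self.native_head = nn.Linear(hidden_dim, n_native_states)         # native state clusters
        self.nonnative_head = nn.Linear(hidden_dim, n_nonnative_states)   # non-native state clusters

    def forward(self, x):
        h = self.shared(x)
        # Logits over both state sets; a decoder could use either head
        # depending on the expected speaker group.
        return self.native_head(h), self.nonnative_head(h)

# Quick shape check on a batch of spliced feature vectors.
model = MultiSetDNNAcousticModel()
native_logits, nonnative_logits = model(torch.randn(8, 440))
print(native_logits.shape, nonnative_logits.shape)  # (8, 4000) (8, 4000)
```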

Table of Contents

Ⅰ. Introduction
1.1 Research Background
1.2 Research Content
1.3 Organization of the Thesis
Ⅱ. Background Theory
2.1 Overview of Speech Recognition Systems
2.2 Overview of Acoustic Models for Speech Recognition
2.2.1 GMM-HMM-Based Acoustic Models
2.2.2 DNN-HMM-Based Acoustic Models
Ⅲ. Speech Recognition Robust to Speaking-Style Variation
3.1 Conversational Speech Recognition
3.2 Characteristics of Conversational Speech
3.3 Analysis of Conversational Speech Recognition Errors
3.4 Improving Conversational Speech Recognition Performance
3.5 Improving a DNN-Based Speech Recognition System for Recorded Speech Data
3.5.1 Improving the DNN-Based Acoustic Model
3.5.2 Optimization for Simultaneous Multi-Channel Processing
3.6 Summary
Ⅳ. DNN-Based Speech Recognition Robust to Non-Native Speakers
4.1 Research Overview and Organization
4.2 Combining Multiple Acoustic Models in GMM Space
4.2.1 Training Sub-Acoustic Models
4.2.2 Log-Likelihood Computation for State Combination
4.2.3 Combining States Occupying a Similar Acoustic Space
4.2.4 Adjusting State and GMM Parameter Weights
4.3 DNN Acoustic Models Based on Multi-Set State Clusters
4.3.1 DNN Training Method
4.3.2 Recognition Method
4.4 Experimental Results and Discussion
4.4.1 Speech Recognition for Native and Non-Native Speakers
4.4.2 DNN-Based Speech Recognition for Native and Non-Native Speakers
4.4.3 Speech Recognition for Car Navigation Systems
4.5 Summary
Ⅴ. Conclusion
5.1 Summary of Research
5.2 Future Work
Appendix A. DNN Architecture
Appendix B. DNN Training
References
