CST-UNet: Cross Swin Transformer Enhanced U-Net with Masked Bottleneck for Single-Channel Speech Enhancement
Published in Circuits Systems and Signal Processing, 2024
Speech enhancement performance has improved significantly with the introduction of deep learning models, especially methods based on the Long Short-Term Memory architecture. However, these methods face challenges such as high computational complexity and redundancy in the input features. To address these issues, we propose a U-Net-based approach that uses an encoder/decoder to extract more concise features, improving single-channel speech enhancement performance while reducing computational complexity. The proposed method includes a Cross-Swin-Transformer block and a masked bottleneck module, which down-sample features while preserving detailed representations through skip connections and carefully designed blocks. The bottleneck module extracts coarse representations of the hidden features as masks. We evaluated our method against other U-Net-based approaches on the VCTK and DNS corpora using the CBAK, eSTOI, PESQ, STOI, and SI-SDR metrics. The results demonstrate that the proposed method achieves promising performance while significantly reducing computational complexity.
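The overall design described above, an encoder/decoder that down-samples features, a bottleneck that estimates a mask over the hidden features, and skip connections that preserve detail, can be sketched in PyTorch. Note that the layer counts, channel widths, kernel sizes, and sigmoid gating below are illustrative assumptions, not the paper's exact configuration, and plain convolutions stand in for the Cross-Swin-Transformer blocks:

```python
import torch
import torch.nn as nn

class MaskedUNetSketch(nn.Module):
    """Toy U-Net with a masked bottleneck (illustrative, not the paper's CST-UNet)."""

    def __init__(self, ch=16):
        super().__init__()
        # Encoder: two strided convolutions down-sample the input spectrogram
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        # Bottleneck: predicts a coarse sigmoid mask that gates the hidden features
        self.mask = nn.Sequential(nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.Sigmoid())
        # Decoder: transposed convolutions up-sample back to the input resolution
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(2 * ch, 1, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                  # kept for the skip connection
        e2 = self.enc2(e1)
        b = e2 * self.mask(e2)             # masked bottleneck: mask applied to hidden features
        d2 = self.dec2(b)
        # Skip connection: concatenate encoder features to restore detail
        return self.dec1(torch.cat([d2, e1], dim=1))

spec = torch.randn(1, 1, 64, 64)           # e.g. one magnitude-spectrogram patch
out = MaskedUNetSketch()(spec)
print(out.shape)                           # torch.Size([1, 1, 64, 64])
```

The masking-at-the-bottleneck idea means the coarsest features only gate, rather than replace, the representation, while the skip connections carry fine time-frequency detail past the down-sampling path.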
Recommended citation: Zhang, Z., Chen, W., Guo, W., Liu, Y., Yang, J., & Liu, H. (2024). CST-UNet: Cross Swin Transformer Enhanced U-Net with Masked Bottleneck for Single-Channel Speech Enhancement. Circuits, Systems, and Signal Processing, 43(9), 5989-6010.
Download Paper | Download Slides | Download BibTeX
