Article details

Research area
Speech recognition

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)


Fan Yang, Paul Vozila

Semi-supervised Chinese word segmentation using partial-label learning with conditional random fields


Online web data encodes rich knowledge. For example, punctuation and entity tags in Wikipedia data define some of the word boundaries in a sentence. In this paper we adopt partial-label learning with conditional random fields to exploit this knowledge for semi-supervised Chinese word segmentation. The basic idea of partial-label learning is to optimize a cost function that marginalizes the probability mass over the constrained space that encodes this knowledge. With the integration of domain adaptation techniques such as Easy Adapt, our approach reaches an F-measure of 95.98% on the CTB-6 corpus, a significant improvement over both the supervised baseline and a previously proposed approach, namely constrained decode.
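
The marginalization idea in the abstract can be illustrated with a small sketch. For a linear-chain CRF, the partial-label log-likelihood of a sentence is the log of the probability mass of all tag sequences consistent with the known boundaries, which is the difference of two log partition functions: one over the constrained lattice and one over the full lattice. The two-tag scheme (0 = B, word-initial; 1 = I, word-internal), the function names, and the random scores below are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    m = np.where(np.isfinite(m), m, 0.0)  # guard slices that are all -inf
    with np.errstate(divide="ignore"):
        return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_partition(emit, trans, allowed=None):
    """Forward algorithm: log of the summed exp-scores of all tag paths.

    emit:    (T, K) per-position tag scores
    trans:   (K, K) scores, trans[i, j] for moving from tag i to tag j
    allowed: optional (T, K) boolean mask; False entries are excluded
    """
    if allowed is not None:
        emit = np.where(allowed, emit, -np.inf)
    alpha = emit[0]
    for t in range(1, emit.shape[0]):
        alpha = log_sum_exp(alpha[:, None] + trans, axis=0) + emit[t]
    return log_sum_exp(alpha, axis=0)

def partial_label_log_likelihood(emit, trans, allowed):
    """log of the probability mass of all paths inside the constrained space."""
    return log_partition(emit, trans, allowed) - log_partition(emit, trans)

# Hypothetical 4-character sentence with randomly drawn scores.
rng = np.random.default_rng(0)
emit = rng.normal(size=(4, 2))
trans = rng.normal(size=(2, 2))

# Assumed constraint: characters 0 and 2 are known to start a word (tag B),
# for example because punctuation or an entity tag precedes them.
allowed = np.ones((4, 2), dtype=bool)
allowed[0] = [True, False]
allowed[2] = [True, False]

print(partial_label_log_likelihood(emit, trans, allowed))
```

When every position is fully labeled, the mask admits a single path and the quantity reduces to the ordinary supervised CRF log-likelihood; when nothing is constrained, it is zero and the sentence contributes no training signal.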
