Article details

Research area
Speech recognition

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing


Fan Yang, Paul Vozila

An empirical study of semi-supervised Chinese word segmentation using co-training


In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using character-level features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
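
The co-training loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `MaxMatchSegmenter` class is a hypothetical toy stand-in (greedy longest match against a learned vocabulary) for both the language-model-based and the CRF-based segmenter, and the names `co_train`, `seed`, `unlabeled`, and `rounds` are assumptions made for illustration. The abstract does not say how automatically segmented sentences are selected, so this sketch simply passes all of them to the peer model.

```python
from typing import List, Sequence

Segmented = List[str]  # one sentence, represented as a list of words


class MaxMatchSegmenter:
    """Toy stand-in segmenter (hypothetical): greedy longest match."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len  # longest word this model will consider
        self.vocab = set()

    def train(self, data: Sequence[Segmented]) -> None:
        # "Training" here just collects every word seen in segmented data.
        self.vocab = {word for sentence in data for word in sentence}

    def segment(self, text: str) -> Segmented:
        words, i = [], 0
        while i < len(text):
            # Try the longest candidate first; fall back to one character.
            for n in range(min(self.max_len, len(text) - i), 0, -1):
                if n == 1 or text[i:i + n] in self.vocab:
                    words.append(text[i:i + n])
                    i += n
                    break
        return words


def co_train(a, b, seed: Sequence[Segmented],
             unlabeled: Sequence[str], rounds: int = 2):
    """Assumed co-training loop: each model retrains on its peer's output."""
    a.train(seed)
    b.train(seed)
    for _ in range(rounds):
        labels_a = [a.segment(s) for s in unlabeled]
        labels_b = [b.segment(s) for s in unlabeled]
        # Each segmenter learns from the seed data plus the other's labels.
        a.train(list(seed) + labels_b)
        b.train(list(seed) + labels_a)
    return a, b


seed = [["北京", "大学"], ["我", "爱", "北京"]]
unlabeled = ["我爱北京大学"]
seg_a, seg_b = co_train(MaxMatchSegmenter(2), MaxMatchSegmenter(4),
                        seed, unlabeled)
result = seg_a.segment("我爱北京大学")
```

In a realistic setting the two models would differ in kind (word-level versus character-level view), and some confidence or agreement filter would typically decide which automatically segmented sentences are added; both choices are left open here.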
