Article details

Research area
Speech recognition

Location
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Date
2013

Author(s)
Fan Yang, Paul Vozila

An Empirical Study of Semi-Supervised Chinese Word Segmentation Using Co-Training

Synopsis:

In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using character-level features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.
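
The co-training loop described in the synopsis can be illustrated with a minimal Python sketch. It is not the authors' implementation. The two toy models below (a lexicon-based maximum-matching segmenter and a character-bigram boundary voter) are simple stand-ins for the word-level language model segmenter and the CRF sequence labeler. The names `co_train`, `train_lexicon`, `train_boundary`, and `confident`, as well as the `rounds` and `threshold` parameters and the confidence scores, are assumptions made for illustration. The synopsis does not say how the paper selects automatically segmented sentences.

```python
from typing import Callable, Dict, List, Tuple

Segmentation = List[str]
Labeled = List[Tuple[str, Segmentation]]
# A segmenter maps a sentence to (segmentation, confidence in [0, 1]).
Segmenter = Callable[[str], Tuple[Segmentation, float]]
Trainer = Callable[[Labeled], Segmenter]


def train_lexicon(data: Labeled) -> Segmenter:
    """Toy word-based model: forward maximum matching over seen words."""
    lexicon = {w for _, seg in data for w in seg}
    max_len = max((len(w) for w in lexicon), default=1)

    def segment(sentence: str) -> Tuple[Segmentation, float]:
        out, i, known = [], 0, 0
        while i < len(sentence):
            for n in range(min(max_len, len(sentence) - i), 0, -1):
                word = sentence[i:i + n]
                if word in lexicon or n == 1:  # single char is the fallback
                    out.append(word)
                    known += n if word in lexicon else 0
                    i += n
                    break
        # Hypothetical confidence: share of characters covered by known words.
        return out, (known / len(sentence) if sentence else 0.0)

    return segment


def train_boundary(data: Labeled) -> Segmenter:
    """Toy character-based model: vote on a cut between each character pair."""
    votes: Dict[str, int] = {}
    for _, seg in data:
        text, cuts, pos = "".join(seg), set(), 0
        for w in seg[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(1, len(text)):
            pair = text[i - 1:i + 1]
            votes[pair] = votes.get(pair, 0) + (1 if i in cuts else -1)

    def segment(sentence: str) -> Tuple[Segmentation, float]:
        out, start, seen = [], 0, 0
        for i in range(1, len(sentence)):
            pair = sentence[i - 1:i + 1]
            seen += pair in votes
            if votes.get(pair, 1) > 0:  # unseen pairs default to a cut
                out.append(sentence[start:i])
                start = i
        if sentence:
            out.append(sentence[start:])
        # Hypothetical confidence: share of character pairs seen in training.
        return out, (seen / (len(sentence) - 1) if len(sentence) > 1 else 0.0)

    return segment


def confident(segmenter: Segmenter, pool: List[str], threshold: float) -> Labeled:
    """Keep only automatically segmented sentences the model is sure about."""
    labeled = []
    for sentence in pool:
        seg, conf = segmenter(sentence)
        if conf >= threshold:
            labeled.append((sentence, seg))
    return labeled


def co_train(train_a: Trainer, train_b: Trainer, seed: Labeled,
             unlabeled: List[str], rounds: int = 3,
             threshold: float = 0.9) -> Tuple[Segmenter, Segmenter]:
    """Each round, every model is retrained on the seed data plus the
    confident output of the *other* model on the unlabeled pool."""
    seg_a, seg_b = train_a(list(seed)), train_b(list(seed))
    for _ in range(rounds):
        new_for_b = confident(seg_a, unlabeled, threshold)
        new_for_a = confident(seg_b, unlabeled, threshold)
        seg_a = train_a(list(seed) + new_for_a)
        seg_b = train_b(list(seed) + new_for_b)
    return seg_a, seg_b


seed = [("我爱北京", ["我", "爱", "北京"])]
unlabeled = ["北京爱我"]
seg_a, seg_b = co_train(train_lexicon, train_boundary, seed, unlabeled, rounds=1)
```

In this toy run the character-based model starts out unsure about the unlabeled sentence, because most of its character pairs are unseen. After one round it has been retrained on the word-based model's confident segmentation of that sentence, which is the mutual-improvement idea the synopsis describes.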
