crawl.java ~ to get text data from websites
Match.java ~ to align the English and Chinese data by paragraph
sentences.java ~ to split the paragraph data into sentences
MergeFile.java ~ to merge many files into one
80-20.py and 20-half.py ~ to split the files into training (80%), development (10%) and test (10%) data
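
The 80/10/10 split in the last step can be sketched as below. This is only an illustration and not the code in 80-20.py or 20-half.py: the function name split_80_10_10, the seed parameter, and the choice to shuffle before splitting are all assumptions. It takes aligned (English, Chinese) pairs so that both sides of the corpus end up in the same subset.

```python
import random


def split_80_10_10(pairs, seed=0):
    """Split aligned pairs into train (80%), development (10%) and test (10%).

    Hypothetical helper: the name, the seed and the shuffling are assumptions,
    not taken from the original scripts.
    """
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    # First cut: 80% for training (integer arithmetic avoids float rounding).
    n_train = len(shuffled) * 8 // 10
    # Second cut: the remaining 20% is halved into development and test.
    n_dev = (len(shuffled) - n_train) // 2

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Toy data standing in for aligned sentence pairs.
    pairs = [("en %d" % i, "zh %d" % i) for i in range(100)]
    train, dev, test = split_80_10_10(pairs)
    print(len(train), len(dev), len(test))
```

Splitting pairs rather than two separate files keeps the English and Chinese sentences aligned after shuffling.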