A paraphrase can be defined as a sentence that expresses the same meaning as another sentence using different words. Paraphrases can be identified, generated or extracted. The proposed task focuses on sentence-level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi). Identifying paraphrases in Indian languages is difficult because it requires both evaluating the semantic similarity of the underlying content and understanding the morphological variations of the language. Paraphrase identification is closely connected with paraphrase generation and extraction. A paraphrase identification system can improve a paraphrase generation system by choosing the best candidate from the list of paraphrase candidates that the generation system produces. Paraphrase identification is also used to validate paraphrase extraction systems and machine translation systems. In question answering (QA) systems, paraphrase identification plays a vital role in matching the question asked by the user to the original questions so that the best answer can be chosen. Plagiarism detection is another task that relies on paraphrase identification to detect sentences that are paraphrases of others.
One of the most commonly used corpora for paraphrase detection is the MSRP corpus (Dolan and Brockett 2005), which contains 5,801 English sentence pairs drawn from news articles, manually labelled as paraphrases (67%) or non-paraphrases (33%). Since no annotated paraphrase corpora or automated semantic interpretation systems are available for Indian languages to date, creating benchmark paraphrase data and using it in open shared-task competitions will encourage further research on Indian languages.
Sub Task 1: Given a pair of sentences from the newspaper domain, the task is to classify them as paraphrases (P) or not paraphrases (NP).
Sub Task 2: Given two sentences from the newspaper domain, the task is to identify whether they are completely equivalent (E), roughly equivalent (RE) or not equivalent (NE). This task is similar to Sub Task 1; the main difference is that the pair is tagged on a 3-point scale instead of a binary one.
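To make the two label sets concrete, the following is a minimal sketch of a word-overlap baseline, not a method prescribed by the task. The function names and the thresholds (0.5 for Sub Task 1, 0.3 and 0.7 for Sub Task 2) are assumptions chosen only for illustration; a real system would tune them on training data and would need morphological processing for the four languages.

```python
# Hypothetical baseline: Jaccard word overlap mapped to the task labels.
# Thresholds are illustrative assumptions, not values given by the task.

PUNCT = ".,;:!?\"'()[]{}।-"  # includes the danda used in Hindi and Punjabi


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (tok.strip(PUNCT) for tok in sentence.lower().split())
    return {tok for tok in tokens if tok}


def jaccard(sent_a, sent_b):
    """Return |A ∩ B| / |A ∪ B| over the two token sets (0.0 if both are empty)."""
    a, b = tokenize(sent_a), tokenize(sent_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_subtask1(sent_a, sent_b, threshold=0.5):
    """Binary decision: paraphrase (P) or not paraphrase (NP)."""
    return "P" if jaccard(sent_a, sent_b) >= threshold else "NP"


def classify_subtask2(sent_a, sent_b, low=0.3, high=0.7):
    """3-point decision: E, RE or NE."""
    score = jaccard(sent_a, sent_b)
    if score >= high:
        return "E"
    if score >= low:
        return "RE"
    return "NE"


if __name__ == "__main__":
    s1 = "The minister opened the new bridge on Monday"
    s2 = "The minister visited the new bridge"
    s3 = "Rain is expected in Chennai tomorrow"
    print(classify_subtask1(s1, s2), classify_subtask2(s1, s2))
    print(classify_subtask1(s1, s3), classify_subtask2(s1, s3))
```

Surface overlap of this kind ignores inflection and word order, which is exactly where the morphologically rich target languages are expected to make the task harder than this sketch suggests.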
1. Automatic paraphrasing for Indian languages
2. Semantic similarity between sentences and documents
3. Summarization and textual entailment
4. Plagiarism detection for Indian languages
Anand Kumar M, CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
Soman K P, CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
Prof. Ramanan, RelAgent Pvt Ltd, Chennai
Prof. N. Deiva Sundaram, NDS Lingsoft Solutions Pvt. Ltd., Chennai
Prof. Rajendran S, CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
Dr. V. Dhanalakshmi, Assistant Director, Tamil Virtual Academy, Chennai
Dr. Govind D, CEN, Amrita Vishwa Vidyapeetham
Mr. Vijay Krishnan Menon, CEN, Amrita Vishwa Vidyapeetham
Mr. Barathi Ganesh, Data Science practitioner, TCS Cochin
Dolan, W. B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.
Sundaram, M. S., Anand Kumar, M., & Soman, K. P. (2015). AMRITA CEN@SemEval-2015: Paraphrase detection for Twitter using unsupervised feature learning with recursive autoencoders. In SemEval-2015, p. 45.
Mahalakshmi, S., Anand Kumar, M., & Soman, K. P. (2015). Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.
Wu, W., Ju, Y.-C., Li, X., & Wang, Y.-Y. (2010). Paraphrase detection on SMS messages in automobiles. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5326-5329).
Socher, R., Huang, E. H., Pennington, J., Manning, C. D., & Ng, A. Y. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems (pp. 801-809).