Shared task on Detecting Paraphrases in Indian Languages (DPIL)

held in conjunction with FIRE 2016 @ ISI – Kolkata (8-10 Dec 2016)

About CENTask Description

Download Dataset

               Download Workshop Proceedings

No of Visitors

Task description

Paraphrase can be defined as “the same meaning of a sentence is expressed in another sentence using different words”. Paraphrases can be identified, generated or extracted. The proposed task is focused on sentence level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). Identifying paraphrases in Indian languages is a difficult task, because evaluating the semantic similarity of the underlying content and the understanding the morphological variations of the language are more critical. Paraphrase identification is strongly connected with generation and extraction of paraphrases. The paraphrase identification systems improve the performance of a paraphrase generation in terms of choosing the best paraphrase candidate from the list of paraphrases candidates generated by paraphrases generation system. Paraphrase Identification is also used in validating the paraphrase extraction system and the machine translation system. In QA system, Paraphrase Identification plays a vital role in matching the questions asked by the user to the original questions for choosing the best answer. Plagiarism detection is another task which needs the Paraphrase Identification technique to detect the sentences which are paraphrases of others.

One of the most commonly used corpora for paraphrase detection is the MSRP corpus (Dolan and Brockett 2005), which contains 5,801 English sentence pairs from news articles manually labelled with 67% paraphrases and 33% non-paraphrases. Since there are no annotated corpora or automated semantic interpretation systems available for Indian languages till date, creating benchmark data for paraphrases and utilizing that data in open shared task competitions will motivate the research community for further research in Indian languages.

Sub Task 1: Given a pair of sentences from news paper domain, the task is to classify them as paraphrases (P) or not paraphrases (NP).

Sub Task 2: Given two sentences from news paper domain, the task is to identify whether they are completely equivalent (E) or roughly equivalent (RE) or not equivalent (NE). This task is similar to the subtask 1, but the main difference is 3-point scale tag in paraphrases.



1. Automatic paraphrasing for Indian languages

2. Semantic similarity between sentences and documents

3. Summarization and Text entailment.

4. Plagiarism detection for Indian Languages.

Output format



Registration closed,


  • Tamil
  • Malayalam
  • Hindi
  • Punjabi
Fill the Data use agreement form and send it to

Important Dates

Training Data Release

Training data for all language released

Test Data Release

Testing data for all language released

Run Submission Deadline

5th September 2016

Results Declared

15th September 2016

Working Notes Due

10th October 2016


8-10 Dec 2016

evaluation and results

To see results, check here


  • 1. Fill the Data use agreement form and send it to
  • 2. Tamil and Malayalam train data released on 15th July 15th July
  • 3. Registration ends on 25th July
  • 4. Training data for Hindi and Punjabi were released
  • 5. More than 35 teams were registered for this Shared task
  • 6. Details about Sarwan awrads were released
  • 7. Testing data released
  • 8. Output format for result submission released.
Anand Kumar M, CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
Soman K P , CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
Advisory committee

Prof. Ramanan, RelAgent Pvt Ltd, Chennai
Prof. N. Deiva Sundaram, NDS Lingsoft Solutions Pvt. Ltd., Chennai
Prof. Rajendran S, CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
Dr. V. Dhanalakshmi, Assistant Director, Tamil Virtual Academy,Chennai
Dr. Govind D , CEN,Amrita Vishwa Vidyapeetham
Mr. Vijay Krishnan Menon ,CEN,Amrita Vishwa Vidyapeetham
Mr. Barathi Ganesh , Data Science practitioner,TCS Cochin

Sarwan awards


Dolan, W. B., & Brockett, C. (2005, October). Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.

Sundaram, Mahalakshmi Shanmuga, Anand Kumar M, and Soman Kotti Padannayil. "AMRITA CEN@ SemEval-2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders." SemEval-2015 (2015): 45.

Mahalakshmi, S., Anand Kumar, M., Soman, K.P. Paraphrase detection for Tamil language using deep learning algorithm, (2015) International Journal of Applied Engineering Research, 10 (17), pp. 13929-13934

Wei Wu., Yun-Cheng Ju., Xiao Li and Ye-Yi Wang. 2010. Paraphrase detection on SMS messages in automobiles. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on (pp. 5326-5329).

Richard Socher., Eric H. Huang., Jeffrey Pennin., Christopher D. Manning and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in Neural Information Processing Systems (pp. 801-809).

Sarwan Awards

Sarwan award (cash prize or travel grant) will be given to the top performing team in each language (totally four). Team members should participate in both subtasks and final working notes submission are eligible for receiving this award.


You Have any Queries? That's great! Give us a call or send us an email and we will get back to you as soon as possible!


Phone: 0 422 2685000 (5594)