Friday, November 4, 2011

Switching Languages: Perl to Python


Python

Today I announced that I will be switching from Perl to Python to build a large part of Project MADA. I read through a lot of literature and a lot of debates before finally making my decision. I initially thought I'd write a blog post describing my sources, but I thought that this would get the message across best.

And a bit of credit to what finally pushed me over.

Time to head back into code mode.

Friday, October 21, 2011

The Current State of Plagiarism Detection

 

pla·gia·rism/ˈplājəˌrizəm/

Noun:

The practice of taking someone else's work or ideas and passing them off as one's own.

Synonyms:

plagiary - crib

As seen when searching on Google for "plagiarism definition".

Plagiarism among academics and even professionals is by no means a small issue. The growth of the internet has brought innumerable ways of locating work relevant to one's domain, and copy-pasting someone else's work has never been easier. As a result, several companies have sprung up promising solutions that detect whether the text in a document comes from an unattributed source. Just typing the word plagiarism into Google brings up the two leaders in the automated plagiarism detection world, Turnitin and iThenticate. Both work admirably well but unfortunately fail to address the problem in a way that is difficult to break. The reason for this is a narrow understanding of what plagiarism is. But first, how do the current systems work?

How the current generation Plagiarism Detection Systems work

The current generation of plagiarism detection systems is built to identify only one kind of plagiarism: stolen work, or rather work that already existed somewhere before. What happens in the system is this:

  1. The individual checking the document uploads the document to their website.
  2. The system takes that document and feeds it into its servers to be broken down and processed.
  3. The document is broken down line by line and paragraph by paragraph, then compared to the documents and web pages residing in the enormous database owned by the plagiarism detection company, and a full list of matches is compiled and returned.
  4. Depending on the matching percentage, the system recommends whether the document should be treated as heavily plagiarized. Otherwise, the lecturer or individual who uploaded the document can make a decision based on the data returned.
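To make the steps above concrete, here is a minimal sketch in Python of the break-down-and-compare idea. It is only an illustration under loud assumptions: the `database` dict, the function names and the exact sentence-matching rule are all made up for this post, and real systems such as Turnitin do not publish their algorithms, so this is not how they actually work.

```python
import re

def split_sentences(text):
    """Naively split text into lowercase sentences on '.', '!' and '?'."""
    parts = re.split(r"[.!?]+", text.lower())
    # Collapse repeated whitespace and drop empty fragments.
    return [" ".join(p.split()) for p in parts if p.strip()]

def match_report(submission, database):
    """Return (match percentage, [(sentence, source name), ...]).

    `database` is a hypothetical dict mapping a source name to its full text.
    """
    # Index every known sentence by the first source it was seen in.
    known = {}
    for name, text in database.items():
        for sentence in split_sentences(text):
            known.setdefault(sentence, name)

    sentences = split_sentences(submission)
    matches = [(s, known[s]) for s in sentences if s in known]
    percent = 100.0 * len(matches) / len(sentences) if sentences else 0.0
    return percent, matches

# Toy data, invented for this example.
database = {
    "source_a": "The cat sat on the mat. It was a sunny day.",
    "source_b": "Plagiarism is a serious problem in academia.",
}
submission = (
    "The cat sat on the mat. I wrote this line myself. "
    "Plagiarism is a serious problem in academia. So is this one."
)
percent, matches = match_report(submission, database)
print(percent)  # 50.0: two of the four sentences were found in the database
```

Exact sentence matching is the crudest possible comparison, which is part of the point: anything not already in the database simply cannot be matched.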

Thus, in this process, two factors make a company successful.

  • The quality and the size of the database. Both can grow rapidly if the market turns towards a particular system for detecting plagiarism, since each document uploaded further increases the database size, making that particular company even more attractive to potential customers.
  • The quality of the search algorithm in terms of speed and accuracy. In this industry, both matter. A small batch upload can have upwards of 30 documents per class and about 4 classes coming in at a time. 120 documents scanned at 10 minutes per document is 20 hours. Aiming for one minute per document still means 2 hours! That’s a lot of time, but it’s amongst the industry best. Then there is the quality of the search, which I’ll come to later.
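The batch arithmetic above is easy to check with a few lines of Python. The function name is made up, and the sketch assumes documents are scanned one after another with no parallelism, which is the same assumption the numbers above make.

```python
def batch_hours(classes, docs_per_class, minutes_per_doc):
    """Hours needed to scan a batch, assuming strictly sequential scanning."""
    total_docs = classes * docs_per_class
    return total_docs * minutes_per_doc / 60

print(batch_hours(4, 30, 10))  # 20.0 hours at 10 minutes per document
print(batch_hours(4, 30, 1))   # 2.0 hours at 1 minute per document
```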

Of these two factors, the first is extremely difficult to catch up on in a short period of time unless a massive amount of money is being pumped into the company. But again, all of this fails in the face of the ways a system like this can be broken.

Breaking the system

The first way one can break the system is fairly simple. Changing a few words here and there in a piece of text ripped off from another website or document has been shown to drop the reported percentage of plagiarism from 70% to 30%. The only solution is to keep refining the search algorithm to account for synonyms, which might solve the problem, but it’s a difficult call.
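Here is a small sketch of why swapping a few words hurts so much, and how a synonym table could claw some of it back. Everything in it is an assumption made for illustration: the word-trigram overlap score, the function names and the tiny hand-written synonym table are mine, not the method of any real detection product.

```python
def normalize(text, synonyms):
    """Lowercase the text and map each word to its canonical synonym, if any."""
    return " ".join(synonyms.get(w, w) for w in text.lower().split())

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_percent(suspect, source, n=3, synonyms=None):
    """Percentage of the suspect's word n-grams that also appear in the source."""
    if synonyms:
        suspect = normalize(suspect, synonyms)
        source = normalize(source, synonyms)
    a, b = shingles(suspect, n), shingles(source, n)
    return 100.0 * len(a & b) / len(a) if a else 0.0

source = "the quick brown fox jumps over the lazy dog near the river bank"
edited = "the fast brown fox jumps over the idle dog near the river bank"

# Hypothetical synonym table mapping variants to one canonical word.
synonyms = {"fast": "quick", "idle": "lazy"}

print(overlap_percent(source, source))                     # 100.0
print(round(overlap_percent(edited, source), 1))           # 54.5
print(overlap_percent(edited, source, synonyms=synonyms))  # 100.0
```

Two changed words out of thirteen knock out five of the eleven trigrams. The catch with the synonym table is the difficult call mentioned above: it has to be built and maintained, and mapping too aggressively will start flagging text that is genuinely different.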

The second way breaks the system entirely, and it comes from looking at plagiarism in a way that no one else seems to. The definition given above states very clearly that taking someone else's work and passing it off as one's own is plagiarism. The natural assumption on reading this is that someone steals an existing piece of work and attempts to pass it off as their own. This, however, is too narrow. There are two ways of committing plagiarism that fall under the above definition, and the second is the truly difficult one to detect. It is commonly referred to as "outsourcing". Think about it. When one outsources part or all of a document to another individual and then submits it under their own name, this also falls under the definition given above. The problem is that since this work never existed before, existing systems see it as completely new and original text and return it as a perfectly clean document when it is obviously not.

Enter Project MADA

Project MADA attempts to tackle plagiarism from the incredibly difficult angle of solving the second breaking point of existing systems. MADA (remember: pronounced "Mordor") is being built using groundbreaking algorithms in Natural Language Processing and, in the future, Artificial Intelligence. But why do we need Project MADA?

The issue of multiple authorship is something that an individual reading a document can often naturally ‘sense’, especially in an intellectually challenging environment such as academia. Human perception, however, cannot be used as an argument in a case as serious as an ethics review. Humans have natural tendencies to be emotionally biased, and thus their motives can be questioned at any time. What is required is an automated, systematic, quantitative method of presenting data to show that a document may not satisfy ethics standards because of the doubtful origins of its content.
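To give a feel for what "quantitative" could mean here, below is a toy sketch that turns writing style into numbers and compares two passages. This is emphatically not MADA's algorithm. The two features (average sentence length and average word length), the function names and the distance measure are assumptions picked only because they are simple to show, and a large distance is a hint worth a closer look, never proof of anything.

```python
import re

def style_features(text):
    """Return (average words per sentence, average letters per word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return (avg_sentence_len, avg_word_len)

def style_distance(text_a, text_b):
    """Sum of absolute differences between the two passages' features."""
    fa, fb = style_features(text_a), style_features(text_b)
    return sum(abs(x - y) for x, y in zip(fa, fb))

# Toy passages, invented for this example.
plain = "I am here. You are there."
dense = "Notwithstanding considerable methodological heterogeneity, comparative evaluations remain indispensable."

print(style_features(plain))         # (3.0, 3.0)
print(style_distance(plain, plain))  # 0.0
print(style_distance(plain, dense) > 0)
```

The appeal of numbers like these is that they can be put on the table in a review without anyone having to argue about the reader's motives.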

What is needed… is Project MADA.

Thursday, October 20, 2011

Project MADA: Going Live

 

It’s been less than a year since I last demoed this system to my lecturers at APIIT, and since then Project MADA has been very much a dead project. Yesterday, however, thanks to my managing to open my big mouth at Refresh Colombo, I found myself having to do a demo once again, but with a slight difference this time: the commitment. Unlike the demo at university, which was mainly to showcase the system I worked on for 4 months, this demonstration at Refresh Colombo was all about the product and getting up there and saying “I am going to do this. I am going to finish what I started”.

Now that I’ve done that in front of a crowd that I know included potential investors as well, if I don’t continue to work on this project, I’ll just be a lot of hot air.

So I want to dedicate this first post to Nazly, Milad, Aloka and Hamid and everyone else who gets Refresh Colombo up and running every month, because if Project MADA is successful, and I hope it is, the first place I’m coming back to talk about it will be Refresh.

Going Live

Unfortunately I don’t have the video, but when it comes up I’ll be sure to post it here to show what was demoed at Refresh. What I do want to discuss is the term ‘going live’ in this context. In the context of MADA, ‘going live’ did not mean opening up the web app or giving people an application to download, but rather having a first public viewing and promising to send out alpha invites as early as possible. Going live here meant something more like the demo one would see at a Y Combinator demo day. And with it, again, the commitment to a (possibly) lifelong project.

Future Updates

So there are going to be a plethora of updates coming up very soon, discussing everything from the ideas behind the project to the execution, future plans, business models, revenue models and even a few sneak peeks at what goes on behind the scenes to make MADA work. But if you are scratching your head wondering what on earth MADA is… hang in there. An About page is coming up soon.