Monday, June 08, 2015

An Excellent Experimental Framework.

Recently I have been working through a lecture series on Deep Learning for NLP.  It's a topic that I have long been interested in, and I'm learning a lot as I go along.  It has been a welcome relief from thesis writing.

In the first 15 minutes of the fifth lecture, Richard Socher outlines the steps that his students should take for their class projects. I wanted to reproduce them here as I think they are really useful for anybody who is running Machine Learning experiments in Natural Language Processing. Although the steps are aimed at a class project in a university course, I think they are applicable to anybody starting out in NLP, and helpful as a reminder to all who are established.  I had to figure these steps out myself as I went along, so it is very encouraging to see them being taught to students.  The eight steps from the lecture (with my notes), plus a ninth of my own, are as follows. Or, if you'd prefer, scroll to the bottom of the page and watch the first fifteen minutes of the video there.

Step 1 - Define Task

Before you make a start you need to know what your task is going to be.  Read the literature. Read the literature's background literature. Email researchers who you find interesting and ask about their work. Skype them if they'll let you.  Make sure you know clearly what your task is.

Step 2 - Define Dataset

There are a lot of ready-made datasets that you can go and grab.  You'll already know what these are if you've read the literature thoroughly.  If there is no suitable dataset for your task, then you will need to build one.  This is a whole other area and can lead to an entire paper in itself.

Step 3 - Define Your Metric

How are you going to evaluate your results?  Is it the right way? Often your dataset will naturally lead you to one evaluation metric.  If not, look at what other people are using. Try to understand a metric before using it.  This will speed up your analysis later.
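As one way of getting to know a metric before relying on it, here is a minimal sketch that computes precision, recall and F1 by hand for a single label.  The function name, the labels and the toy data are all made up for illustration; they do not come from the lecture.

```python
def precision_recall_f1(gold, predicted, positive_label):
    """Return (precision, recall, F1) for one label of interest."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # Count the three outcomes that matter for the chosen label.
    true_pos = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    false_pos = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    false_neg = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    # Guard against division by zero when a label is never predicted or never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data, invented for this example.
gold = ["pos", "neg", "pos", "neg", "pos"]
predicted = ["pos", "pos", "neg", "neg", "pos"]
precision, recall, f1 = precision_recall_f1(gold, predicted, "pos")
print(precision, recall, f1)
```

Writing the counts out like this makes it obvious what a false positive or a false negative means for your particular task, which is where the later analysis usually starts.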

Step 4 - Split Your Dataset

Training, Validation, Testing. Make sure you have these three partitions.  If you're feeling really confident, don't look at the testing data until you absolutely have to, then run your algorithm on it as few times as possible. Once should be enough.  The training set is the meat and bones of your algorithm's learning and the validation set is for fiddling with parameters.  But be careful, the more you fiddle with parameters, the more likely you are to overfit to your validation set.
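A minimal sketch of such a split, assuming the examples fit in memory and can be shuffled freely.  The 80/10/10 proportions, the seed and the function name are my own choices for illustration, not something prescribed in the lecture.

```python
import random


def split_dataset(examples, train_frac=0.8, valid_frac=0.1, seed=13):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    shuffled = list(examples)
    # A fixed seed keeps the partitions identical across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # Whatever is left over becomes the test set.
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Fixing the seed matters: if the split changes every time you run the script, examples from the test set will eventually leak into training.  A random shuffle is not always appropriate either; if your examples come from the same documents or authors, you may want to split by document instead.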

Step 5 - Establish a Baseline

What would be a sensible baseline?  Random decisions? Common sense judgements?  Maximum class? A competitor's algorithm?  Think carefully about this.  How will your model be different from the baseline?
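The maximum class baseline is the easiest one to get running.  Here is a minimal sketch for a classification task; the function names and the toy labels are invented for illustration.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a 'model' that always predicts the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda example: most_common_label


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy labels, invented for this example.
train_labels = ["neg", "neg", "pos", "neg"]
test_examples = ["text a", "text b", "text c", "text d"]
test_labels = ["neg", "pos", "neg", "neg"]

baseline = majority_class_baseline(train_labels)
baseline_predictions = [baseline(x) for x in test_examples]
baseline_accuracy = accuracy(test_labels, baseline_predictions)
print(baseline_accuracy)
```

If your class distribution is skewed, a baseline like this can score surprisingly well, which is exactly why it is worth knowing the number before you report anything else.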

Step 6 - Implement an Existing (Neural Net) Model

Neural Net is in brackets because it is specific to the course in the video.  Go find an interesting ML model and apply it.  I have started to use WEKA for all my ML experiments and I find it really efficient.

Step 7 - Always Be Close to Your Data

Go look at your data.  If you're tagging something look at what's been tagged.  If you're generating something read the resultant text.  Analyse the data. Make metadata. Analyse the metadata.  Repeat. Where do things go wrong?  If you have a pipeline, are some modules better than others? Are errors propagating through?
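One simple way to make that metadata is to count which mistakes happen most often and keep the offending examples to hand for reading.  The sketch below assumes a tagging task with gold and predicted labels as parallel lists; the function names, tokens and tags are made up for illustration.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count each (gold, predicted) pair where the two labels disagree."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


def collect_errors(examples, gold, predicted):
    """Return (example, gold, predicted) triples for every mistake."""
    return [(x, g, p) for x, g, p in zip(examples, gold, predicted) if g != p]


# Toy tagging output, invented for this example.
tokens = ["the", "run", "was", "fast"]
gold_tags = ["DET", "NOUN", "VERB", "ADJ"]
predicted_tags = ["DET", "VERB", "VERB", "ADV"]

confusions = confusion_counts(gold_tags, predicted_tags)
mistakes = collect_errors(tokens, gold_tags, predicted_tags)

# Most frequent confusions first, then the individual examples to read.
for (gold_tag, predicted_tag), count in confusions.most_common():
    print(gold_tag, "->", predicted_tag, count)
for token, gold_tag, predicted_tag in mistakes:
    print(token, gold_tag, predicted_tag)
```

The counts tell you where to look; the examples themselves tell you why.  In a pipeline, running the same check on the output of each module is a quick way to see where errors first appear.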

Step 8 - Try Different Models

Play around a bit.  Maybe another model will perform much better or worse than the first one you try.  If you analyse the differences in performance, you might start to realise something interesting about your task.  I would recommend setting up a framework for your experiments, so that you can quickly change things around and re-run with different models / parameters.
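Such a framework does not need to be elaborate.  As a minimal sketch, assume every model is wrapped in a function that takes the training set and returns a predictor; swapping models is then a one-line change.  The two toy "models", the data and all the names below are hypothetical stand-ins, not real systems.

```python
from collections import Counter


def train_majority(train_set):
    """Toy model: always predict the most frequent training label."""
    label = Counter(y for _, y in train_set).most_common(1)[0][0]
    return lambda text: label


def train_keyword(train_set, keyword="good"):
    """Toy model: ignores the training data and predicts 'pos' if the keyword appears."""
    return lambda text: "pos" if keyword in text.split() else "neg"


def run_experiments(trainers, train_set, valid_set):
    """Train every model on the same data and report validation accuracy."""
    results = {}
    for name, trainer in trainers.items():
        model = trainer(train_set)
        correct = sum(1 for text, label in valid_set if model(text) == label)
        results[name] = correct / len(valid_set)
    return results


# Toy data, invented for this example.
train_set = [("good film", "pos"), ("bad film", "neg"), ("dull plot", "neg")]
valid_set = [("good acting", "pos"), ("bad script", "neg")]

results = run_experiments(
    {"majority": train_majority, "keyword": train_keyword},
    train_set,
    valid_set,
)
print(results)
```

Because every model sees the same partitions and is scored by the same code, any difference in the numbers comes from the model rather than from the plumbing.  Note that the comparison uses the validation set, leaving the test set untouched.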

Step 9 - Do Something New

OK, this isn't a step in the video per se. But, it is the extension of the methodology.  Now that you have tried something, you can see where it was good and where it was bad.  Now is the time to figure out how to make it better.  Different models, different features, different ways of using features, are all good places to start.

The video is below and also accessible at this link:

If you are interested in the topic then the course syllabus, with all videos and lecture notes, is here: