Based on lectures 17 and 18
In this lab you will build a text classification model for classifying tweets into 14 different categories.
BUILD A TEXT CLASSIFIER
- Download the following file
There are two files in the compressed file:
- Tweets.14cat.train contains 2504 tweets to be used for training the classification model
- Tweets.14cat.test contains 625 tweets to be used for testing
The format of the files is as follows (tab separated):
tweet_ID tweet category
- You need to build a module to extract the bag-of-words (BOW) features from the training file (some of the steps here are specific to Python; if you use another language, you will need to find the equivalent methods). Apply the following:
- Read the tweets and preprocess them (remove links + tokenise). Do not apply stemming or stopping at this stage
- Find all the unique terms, and give each of them a unique ID (from 0 to the number of terms minus one).
- Then you should create a count matrix where each row is a document and each column corresponds to a word in the vocabulary.
- The value of element (i,j) is the count of the number of times the word j appears in document i.
- Since most documents will not contain many words from the total vocabulary, it is recommended that you use a sparse matrix. This means that only the nonzero values are actually stored in memory.
- For example, you may find it convenient to use the dictionary-of-keys sparse matrix from scipy.sparse. You may need to install scipy if you have never used it before.
- See the Examples section of the dok_matrix page (and lecture 18) for an example of how you can fill in the values of this kind of matrix.
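A minimal sketch of filling such a matrix, using toy documents and a toy vocabulary rather than the actual tweet data:

```python
# Minimal example of filling a scipy dok_matrix with word counts.
# The documents and vocabulary here are toy data for illustration.
from scipy.sparse import dok_matrix

docs = [["i", "love", "gaming"], ["gaming", "gaming", "news"]]
vocab = {"i": 0, "love": 1, "gaming": 2, "news": 3}

X = dok_matrix((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc:
        X[i, vocab[word]] += 1  # only nonzero entries are stored

print(X[1, vocab["gaming"]])  # count of "gaming" in document 1 -> prints 2
```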
- Let's call this matrix X.
- You should also make a mapping from the categories to numeric IDs, so the list of correct categories ['Gaming', 'Gaming', 'Entertainment', ...] becomes [0, 0, 1, ...], where the same category is always mapped to the same number, from 0 to (number of categories - 1).
- Use this mapping to convert the category for each tweet to a number. Call this list of correct categories y.
- Once the training and test features are ready, you can train a classifier and test its performance. In this lab, we will use a multiclass SVM classifier.
- First instantiate a model: model = sklearn.svm.SVC(C=1000)
- The C parameter could be optimized for better performance. However, for this stage, just set it to 1000.
- Then fit the model to your data: model.fit(X,y) where X and y are defined as above.
- To make predictions about tweets, convert them to the same format as X and run y_pred = model.predict(X).
- y_pred now contains a list of the predicted categories corresponding to the documents in X.
- Use the same module from the previous steps to convert the test file of tweets into a feature matrix as well, using the same format.
- Make sure the same words are mapped to the same IDs as used in the training data. So if "dog" has the ID 453 in the training data, it should have the same ID in the test.
- For terms in the test file that do not exist in the training data (and therefore have no corresponding feature ID, since they never appear in the training data), either ignore these terms or define a new feature ID that corresponds to OOV (out-of-vocabulary) terms.
- Finally, classify the tweets from the testing set. Do NOT train the model again (model.fit()); only make new predictions given the testing data as input to model.predict().
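The pipeline above can be sketched end-to-end as follows. This is a simplified sketch on inline toy data: replace `train_lines` and `test_lines` with the contents of Tweets.14cat.train and Tweets.14cat.test (tab-separated: ID, tweet, category), and note that the tokeniser below is a naive assumption, not a required choice.

```python
# Hedged end-to-end sketch: BOW features -> SVM -> test predictions.
import re
from scipy.sparse import dok_matrix
from sklearn import svm

def preprocess(text):
    text = re.sub(r"http\S+", "", text)      # remove links
    return re.findall(r"\w+", text.lower())  # naive tokenisation (assumption)

def parse(lines):
    docs, cats = [], []
    for line in lines:
        _, tweet, cat = line.split("\t")     # tab-separated: ID, tweet, category
        docs.append(preprocess(tweet))
        cats.append(cat)
    return docs, cats

def build_matrix(docs, vocab, oov_id):
    # One extra column collects counts for out-of-vocabulary terms.
    X = dok_matrix((len(docs), len(vocab) + 1), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc:
            X[i, vocab.get(w, oov_id)] += 1
    return X

# Toy stand-ins for the real files:
train_lines = ["1\tnew game release http://t.co/x\tGaming",
               "2\tgreat movie tonight\tEntertainment",
               "3\tplay this game now\tGaming",
               "4\tconcert and movie news\tEntertainment"]
test_lines = ["5\tgame out now\tGaming"]

train_docs, train_cats = parse(train_lines)
test_docs, _ = parse(test_lines)

vocab = {w: i for i, w in enumerate(sorted({w for d in train_docs for w in d}))}
cat2id = {c: i for i, c in enumerate(sorted(set(train_cats)))}
oov_id = len(vocab)

X = build_matrix(train_docs, vocab, oov_id)
y = [cat2id[c] for c in train_cats]

model = svm.SVC(C=1000)
model.fit(X, y)                                  # train once, on training data only

X_test = build_matrix(test_docs, vocab, oov_id)  # same word -> same ID as in training
y_pred = model.predict(X_test)                   # never call fit() on the test set
```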
EVALUATION
Build a module that has the following outputs given the list of predicted values (category ids) and the list of correct categories (category ids):
- Accuracy
- Precision, recall, and F1 for each of the classes
- Macro-F1 score of the system
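The required outputs can be computed with sklearn.metrics (or by hand from the per-class true/false positive counts); a minimal sketch of such a module:

```python
# Sketch of an evaluation module: accuracy, per-class P/R/F1, and Macro-F1.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score

def evaluate(y_true, y_pred):
    print("Accuracy:", accuracy_score(y_true, y_pred))
    # Per-class precision, recall, and F1 (one entry per category ID).
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
    for c, (pc, rc, fc) in enumerate(zip(p, r, f1)):
        print(f"class {c}: P={pc:.3f} R={rc:.3f} F1={fc:.3f}")
    # Macro-F1: unweighted mean of the per-class F1 scores.
    print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))

evaluate([0, 0, 1, 1], [0, 1, 1, 1])
```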
MODIFYING YOUR CLASSIFIER
Think about what can make a better classification.
You can think about trying the following:
- Apply stemming and stopping.
- Duplicate hashtag words (e.g. convert "#car" to "#car car").
- Expand tweet text that contains a link with the page title of that link.
- Try non-textual features: Tweet length, presence of hashtags, links, emojis, etc.
- Use TF-IDF features instead of just counts.
- Can you try another classifier and compare the results?
- Get more training data! You can check this paper for potential ideas on the same task.
- Any other ideas you have about how to make the model better.
Try at least two possible methods for improving the classification.
- Find out which method achieves the best performance (measured by Macro-F1), and generate the confusion matrix among all classes for it.
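The confusion matrix can be generated with sklearn; in this sketch, y_true and y_pred stand for the category-ID lists of your best system (toy values are used here for illustration):

```python
# Confusion matrix for the best-performing system (toy labels shown).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 2, 2]   # correct category IDs (toy example)
y_pred = [0, 1, 1, 2, 0]   # predicted category IDs (toy example)

cm = confusion_matrix(y_true, y_pred)
# cm[i][j] = number of tweets of true class i predicted as class j
print(cm)
```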
NOTE
Parts of this lab are part of CW2, so please try to finish it properly.