Natural Language Processing

A. Naive Bayes to Classify Movie Reviews Based on Sentiment

In this project, I predict whether a movie review is positive or negative based on the text of the review. I implemented Naive Bayes for this classification and also tried its scikit-learn implementation. Read my post on the detailed implementation of this model or check out my Jupyter Notebook.
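
As a rough illustration of the idea (not the code from the notebook), here is a minimal multinomial Naive Bayes sketch in plain Python. The toy reviews, the whitespace tokenizer, the add-one (Laplace) smoothing, and the function names `train_naive_bayes` and `predict` are all assumptions made for this example.

```python
import math
from collections import Counter

def train_naive_bayes(reviews):
    """Count words per class. `reviews` is a list of (text, label) pairs."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in reviews:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab

def predict(text, model):
    """Return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        # Log prior: fraction of training reviews with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, made up for this sketch.
toy_reviews = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("boring plot and terrible acting", "neg"),
    ("a dull and terrible film", "neg"),
]
model = train_naive_bayes(toy_reviews)
prediction = predict("great moving story", model)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.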

B. Predicting Upvotes on Hacker News Data

Hacker News is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they’re more visible to the community. In this project, I predicted the number of upvotes the articles received based on the text of their headlines. I used a “bag of words” model to turn the headlines into word-count features and trained a linear regression model that predicts the number of upvotes a headline would receive. Read my post on the detailed implementation of this model or check out my Jupyter notebook.
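
The pipeline can be sketched roughly as follows, using NumPy. The headlines and upvote counts are made up, and `bag_of_words` and `predict_upvotes` are hypothetical helper names, so treat this as an illustration of the technique and not as the notebook's code.

```python
import numpy as np

def bag_of_words(headlines):
    """Return (word -> column index, matrix of word counts per headline)."""
    tokenized = [h.lower().split() for h in headlines]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(headlines), len(vocab)))
    for row, tokens in enumerate(tokenized):
        for word in tokens:
            counts[row, index[word]] += 1
    return index, counts

# Made-up headlines and upvote counts, for illustration only.
headlines = [
    "show hn a tiny python web framework",
    "ask hn how do you learn rust",
    "python 3 release notes",
    "why rust is fast",
]
upvotes = np.array([12.0, 30.0, 7.0, 45.0])

index, X = bag_of_words(headlines)
# Append a column of ones so the model can learn an intercept.
X_bias = np.hstack([X, np.ones((len(headlines), 1))])
# Least-squares fit of the linear regression weights.
weights, *_ = np.linalg.lstsq(X_bias, upvotes, rcond=None)

def predict_upvotes(headline):
    """Predict upvotes for a new headline; unknown words are ignored."""
    row = np.zeros(len(index) + 1)
    row[-1] = 1.0
    for word in headline.lower().split():
        if word in index:
            row[index[word]] += 1
    return float(row @ weights)
```

With only four headlines and many more word columns, the fit reproduces the training targets exactly, which says nothing about how well it generalizes. A real dataset needs a held-out test set to measure error.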

Predicting Digits from their Handwritten Images

In this project, I work with the popular MNIST dataset using TensorFlow and TFLearn. MNIST is a simple computer vision dataset. It consists of 60,000 training samples and 10,000 testing samples of handwritten, labeled digits, 0 through 9. See example below:

A. Deep Neural Network with TensorFlow

In the first part of this project, I train a deep neural network on the MNIST training set using TensorFlow. My implementation gives an accuracy of 95% in just 10 epochs. That is not considered a strong result for this dataset, however: well-tuned neural networks exceed 99% accuracy on MNIST. Read my post on the detailed implementation of this model or visit my Jupyter notebook on GitHub.
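
To show what such a network computes, here is a small NumPy sketch of a fully connected network with one hidden layer, trained with plain gradient descent. It is not the TensorFlow code from the notebook: the synthetic random data, the layer sizes, the learning rate, and the number of steps are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 32 "images" of 784 pixels, labels 0-9.
X = rng.random((32, 784))
y = rng.integers(0, 10, size=32)
Y = np.eye(10)[y]  # one-hot encoded labels

# One hidden layer of 16 ReLU units, then a 10-way softmax output.
W1 = 0.01 * rng.standard_normal((784, 16))
b1 = np.zeros(16)
W2 = 0.01 * rng.standard_normal((16, 10))
b2 = np.zeros(10)

def forward(inputs):
    """Return hidden activations and class probabilities."""
    hidden = np.maximum(0, inputs @ W1 + b1)
    logits = hidden @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return hidden, exp / exp.sum(axis=1, keepdims=True)

losses = []
lr = 0.05
for step in range(30):
    hidden, probs = forward(X)
    # Mean cross-entropy loss over the batch.
    losses.append(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))
    # Backpropagation through softmax, output layer, ReLU, hidden layer.
    d_logits = (probs - Y) / len(X)
    d_W2 = hidden.T @ d_logits
    d_b2 = d_logits.sum(axis=0)
    d_hidden = d_logits @ W2.T
    d_hidden[hidden <= 0] = 0
    d_W1 = X.T @ d_hidden
    d_b1 = d_hidden.sum(axis=0)
    # Gradient descent update.
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2
```

A framework such as TensorFlow derives the backward pass automatically, so only the forward computation and the loss need to be written by hand.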

B. Convolutional Neural Network with TensorFlow

To further improve the accuracy in recognizing digits, I implement a state-of-the-art Convolutional Neural Network (ConvNet) on the same dataset. For the same number of epochs, the accuracy improves to 98%. Read my post on the detailed implementation of this model or check out my Jupyter notebook on GitHub.
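
The two building blocks that distinguish a ConvNet from a fully connected network are convolution and pooling. The NumPy sketch below shows both on a synthetic 28x28 image. The hand-written edge-detecting kernel and the helper names `conv2d` and `max_pool2x2` are assumptions for illustration; in a trained network the kernel values are learned.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel (stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row or column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Synthetic image: dark left half, bright right half.
image = np.zeros((28, 28))
image[:, 14:] = 1.0

# Hand-written vertical edge detector (assumed, not learned).
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = np.maximum(0, conv2d(image, kernel))  # convolution + ReLU
pooled = max_pool2x2(feature_map)                   # 26x26 -> 13x13
```

The feature map responds only where the window straddles the dark-to-bright edge, and pooling halves the resolution while keeping that response.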

C. Convolutional Neural Network with TFLearn

I extend the implementation of ConvNets from part B to TFLearn, a higher-level abstraction over TensorFlow. The code is less verbose in TFLearn, easier to interpret, and less prone to errors. Since the computational model is identical to part B, I get similar accuracy. Check out my Jupyter Notebook here.

Clustering NBA Players

Unsupervised Machine Learning using KMeans Clustering

Point guards play one of the most crucial roles on a team because their primary responsibility is to create scoring opportunities for the team. I visualize the types of point guards and group similar point guards together using the popular k-means clustering algorithm. I use the Assist to Turnover Ratio and Points Per Game as features. For a detailed implementation of the k-means algorithm, both from scratch and with sklearn, please read my blog post or check out my Jupyter notebook on GitHub.
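
A from-scratch version of k-means can be sketched in a few lines of NumPy. The six players below are made up, and the `kmeans` function, its random initialization, and the fixed iteration count are assumptions for this example, not the notebook's implementation.

```python
import numpy as np

def kmeans(points, k, n_iters=10, seed=0):
    """Cluster `points` (n x d array) into k groups; return labels, centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# Made-up players: columns are points per game, assist-to-turnover ratio.
players = np.array([
    [8.0, 2.0],
    [9.0, 2.2],
    [10.0, 2.1],
    [22.0, 3.0],
    [24.0, 2.8],
    [25.0, 3.2],
])
labels, centroids = kmeans(players, k=2)
```

Because the two features are on different scales, points per game dominates the distance here. Standardizing the features first is a common remedy.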

Titanic - Who Survived the Disaster?

Machine Learning Project with Kaggle

The sinking of the Titanic is one of the most famous shipwrecks in history and led to discussions of better safety regulations for ships. One substantial safety issue was that there were not enough lifeboats for everyone on board, which meant that some passengers were prioritized over others for the lifeboats. The goal of the competition is to build machine learning models that predict whether a passenger survived from attributes such as age, sex, and cabin class. I looked at the correlation between various features and survival frequency to select features. I also engineered a new feature called “family size” and imputed missing values in the dataset. Finally, I ran logistic regression and achieved an accuracy of about 80% in predicting survival on the Kaggle test dataset.
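
The feature engineering and imputation steps might look like the sketch below. The rows are made up, and two things are assumptions: that “family size” is computed as siblings/spouses plus parents/children plus the passenger (using the Kaggle columns `SibSp` and `Parch`), and that missing ages are filled with the median. The project itself may have done either differently.

```python
from statistics import median

# Made-up passengers; None marks a missing age.
passengers = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Age": None, "SibSp": 0, "Parch": 0},
    {"Age": 38.0, "SibSp": 1, "Parch": 2},
    {"Age": 4.0, "SibSp": 3, "Parch": 1},
]

def add_family_size(rows):
    """Assumed definition: relatives aboard plus the passenger."""
    for row in rows:
        row["FamilySize"] = row["SibSp"] + row["Parch"] + 1

def impute_age(rows):
    """Fill missing ages with the median of the known ages (assumed strategy)."""
    known_ages = [row["Age"] for row in rows if row["Age"] is not None]
    fill_value = median(known_ages)
    for row in rows:
        if row["Age"] is None:
            row["Age"] = fill_value

add_family_size(passengers)
impute_age(passengers)
family_sizes = [row["FamilySize"] for row in passengers]
```

The median is less sensitive than the mean to a few very old or very young passengers, which is one reason it is a common choice for imputing age.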

Classification of Iris Varieties

A. Supervised Machine Learning using KNN-classification

Three Iris varieties were used in the Iris flower data set introduced by Ronald Fisher in his famous 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. In this project, I use this popular dataset to predict the variety of an iris. The prediction is based on the shape of the iris flower, represented by its sepal length, sepal width, petal length and petal width, as shown in the image. I implemented k-nearest neighbor classification and also tested for the optimal value of ‘k’. View my Jupyter Notebook on GitHub or read my blog post if you are interested in a detailed walk-through.
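
A minimal sketch of k-nearest neighbor classification and of comparing values of k follows. The nine hand-typed measurements, the Euclidean distance, and the leave-one-out evaluation are assumptions for this example; `knn_predict` and `leave_one_out_accuracy` are hypothetical names.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Classify each point using all the others and report the hit rate."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        correct += knn_predict(rest, features, k) == label
    return correct / len(data)

# Illustrative rows: (sepal length, sepal width, petal length, petal width).
flowers = [
    ((5.0, 3.4, 1.5, 0.2), "setosa"),
    ((4.8, 3.1, 1.4, 0.2), "setosa"),
    ((5.2, 3.6, 1.5, 0.3), "setosa"),
    ((6.0, 2.8, 4.5, 1.4), "versicolor"),
    ((5.7, 2.7, 4.1, 1.3), "versicolor"),
    ((6.2, 2.9, 4.4, 1.4), "versicolor"),
    ((6.5, 3.0, 5.8, 2.2), "virginica"),
    ((6.8, 3.1, 5.9, 2.3), "virginica"),
    ((6.4, 2.9, 5.6, 2.0), "virginica"),
]

prediction = knn_predict(flowers, (6.6, 3.0, 5.7, 2.1), k=3)
accuracy_by_k = {k: leave_one_out_accuracy(flowers, k) for k in (1, 3)}
```

On a dataset this small, a large k pulls in neighbors from other varieties, which is why the choice of k is worth testing on held-out data.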

B. Supervised Machine Learning using Neural Networks on the Same Dataset

I also classified the same Iris varieties using neural networks. A detailed blog post on this part is coming soon; in the meantime, view my Jupyter Notebook on GitHub.