Inspiration

The dataset itself inspired us to use machine learning and NLP because it contained so much information hidden inside reviews written in human language. We hoped that trying different models would give us insights into how people encode their sentiments into their reviews, and that at least one model would give us a decent method of predicting sentiment given a particular review.

What it does

Our model uses Google's Word2Vec algorithm to first learn semantic information and context from the raw reviews, and then vectorizes reviews using this learned context. We treat these vectorized reviews as feature vectors that we augment with the other features in the dataset, and then pass these feature vectors as training data to different supervised learning algorithms such as LR, LDA, and neural nets.
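As a rough illustration of the vectorize-then-augment step, the sketch below turns a tokenized review into a single feature vector and appends extra dataset features to it. It rests on assumptions that are not stated in this write-up: reviews are vectorized by averaging their word vectors, and the names `vectorize_review`, `augment`, and `toy_vectors` are hypothetical. A plain dictionary stands in for the trained Word2Vec lookup.

```python
from typing import Dict, List, Sequence


def vectorize_review(tokens: Sequence[str],
                     word_vectors: Dict[str, List[float]],
                     dim: int) -> List[float]:
    """Average the vectors of all known tokens (assumed pooling strategy).

    Tokens missing from the vocabulary are skipped; a review with no
    known tokens maps to the zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def augment(review_vector: Sequence[float],
            extra_features: Sequence[float]) -> List[float]:
    """Append other numeric dataset features to the review vector."""
    return list(review_vector) + [float(x) for x in extra_features]


# Toy 2-dimensional vectors standing in for learned Word2Vec embeddings.
toy_vectors = {"great": [1.0, 0.0], "food": [0.0, 1.0]}

# "zzz" is out of vocabulary and is ignored; 3 and 12 are made-up extra features.
row = augment(vectorize_review(["great", "food", "zzz"], toy_vectors, dim=2), [3, 12])
print(row)  # [0.5, 0.5, 3.0, 12.0]
```

Rows built this way can be stacked into a matrix and handed to any supervised learner that accepts fixed-length numeric features.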

How we built it

We used scikit-learn to implement the learning models and gensim for Word2Vec, following the skip-gram variant of the algorithm. Once this was written, all that remained was to load and clean the data using pandas and nltk and pass it to the learning algorithms.
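The cleaning step might look something like the sketch below. It is only an approximation under stated assumptions: a regular expression and a tiny hand-written stop-word set stand in for nltk's tokenizer and stop-word list, and the write-up does not say whether stop words were actually removed. The names `clean_review`, `STOP_WORDS`, and `TOKEN_PATTERN` are hypothetical.

```python
import re
from typing import List

# Tiny illustrative stop-word set; a stand-in for a full list such as nltk's.
STOP_WORDS = frozenset({"the", "a", "an", "and", "was", "is", "it", "to", "of"})

# Lowercase words, optionally with one internal apostrophe (e.g. "don't").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def clean_review(text: str) -> List[str]:
    """Lowercase a raw review, split it into word tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_review("The food was GREAT, and the service is fast!"))
# ['food', 'great', 'service', 'fast']
```

Token lists in this shape are what a Word2Vec trainer expects as input: one list of words per review.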

Challenges we ran into

The biggest challenge was the training time required to run all the different models we wanted to use to determine which was the most accurate. The size of the dataset and the gensim library were also issues, as we needed a lot of space to work with these locally.

Accomplishments we're proud of

We're proud of our implementation of Word2Vec, which seems to effectively learn semantic and contextual relationships between keywords in the reviews. This allowed our models to predict star ratings as effectively as they did.

What we learned

This project taught us about the techniques used in NLP and sentiment analysis, and about which supervised learning algorithms are most effective for certain types of language data. By trying all the algorithms, we got a sense of which worked best with our data and our vectorization.

What's next for Yelp Prediction Challenge

The next step would be to improve the NLP aspect of the project: if we can analyze sentiment more effectively by better capturing context in reviews, we expect our prediction accuracy to improve. Given more training time, we could also experiment with more complex models to see how they perform on the dataset.
