COURSERA CAPSTONE PROJECT SWIFTKEY

To improve accuracy, the algorithm uses Jelinek-Mercer smoothing, which combines trigram, bigram, and unigram probabilities. The accuracy of the prediction depends on the continuity of the text entered. The application is straightforward to use and can be easily adapted to many educational and commercial uses. As depicted below, the user begins by typing some text, without punctuation, in the supplied input box. When the user enters a word or phrase, the app uses the predictive algorithm to suggest the most likely successive word.
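The interpolation described above can be sketched as follows. This is a minimal Python illustration, not the project's own code: the function names and the weights in `lambdas` are assumptions chosen for the example, and in practice the weights would be tuned on held-out data.

```python
from collections import Counter

def build_counts(tokens):
    """Count unigrams, bigrams, and trigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams

def interpolated_prob(word, w1, w2, counts, lambdas=(0.6, 0.3, 0.1)):
    """Estimate P(word | w1 w2) as a weighted sum of the trigram, bigram,
    and unigram maximum-likelihood estimates (Jelinek-Mercer smoothing).
    The lambdas are assumed values and must sum to 1."""
    unigrams, bigrams, trigrams = counts
    total = sum(unigrams.values())
    l3, l2, l1 = lambdas
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(w2, word)] / unigrams[w2] if unigrams[w2] else 0.0
    p_tri = (trigrams[(w1, w2, word)] / bigrams[(w1, w2)]
             if bigrams[(w1, w2)] else 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni
```

Because the unigram term is never zero for a word seen in the corpus, a candidate still receives some probability even when its trigram and bigram were never observed.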

Remove Profanity Words: Profanity words are removed from the corpus data. Datasets can be found https: As part of the prediction model, the generated stems will be used to generate an algorithm that matches input phrases in order to predict the word that will be displayed next. Cleaning the data is a critical step for the n-gram and tokenization process. Note that the document term matrix is built from a sample of all 3 documents, so the visualizations shown below cover all 3 datasets in scope. Less data has its cost: I assume it will decrease the accuracy of the prediction.

Data Processing: After we load the libraries, our first step is to get the data set from the Coursera website.

It has provided some interesting facts about what the data looks like. Using the algorithm, a Shiny Natural Language Processing application was developed that accepts a phrase as input, suggests word completions from unigrams, and predicts the most likely next word based on the linear interpolation of trigrams, bigrams, and unigrams.

Data Visualization: Now that the data is cleaned, we can visualize it to better understand what we are working with.


Coursera Swiftkey Word Prediction Capstone Project

There are 3 files, drawn from blogs, news, and Twitter data. During data processing we noticed that the data sets are very large. White paper can be found http: We must clean the data set.


Create Uni-grams: A uni-gram frequency table is created for the corpus. Term frequencies are identified for the most common words in the dataset and a frequency table is created.
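A frequency table of this kind can be sketched in a few lines. The helper below is a hypothetical Python illustration (the name `ngram_frequency_table` is not from the project); the same function covers uni-grams, bi-grams, and tri-grams by changing `n`.

```python
from collections import Counter

def ngram_frequency_table(tokens, n):
    """Return (ngram, count) pairs sorted from most to least frequent.
    An n-gram is represented as its words joined by single spaces."""
    # Zip n shifted copies of the token list to form consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common()
```

For example, `ngram_frequency_table(tokens, 1)` gives the uni-gram table and `ngram_frequency_table(tokens, 3)` the tri-gram table for the same token list.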


Create Tri-grams: A tri-gram frequency table is created for the corpus.


The stored n-gram frequencies of the source corpus are used to predict the successive word in a sequence of words.
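One simple way to use stored n-gram frequencies for prediction is a back-off lookup: try the longest available context first and fall back to shorter ones. The sketch below is a hypothetical Python illustration of that idea, not the interpolation the project describes elsewhere; `build_next_word_index` and `predict_next` are names invented for this example.

```python
from collections import Counter, defaultdict

def build_next_word_index(tokens, max_context=2):
    """Map each context (a tuple of 0..max_context preceding words)
    to a Counter of the words that followed it in the corpus."""
    index = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_context + 1):
            if i - k >= 0:
                index[tuple(tokens[i - k:i])][word] += 1
    return index

def predict_next(phrase, index, max_context=2):
    """Return the most frequent next word for the longest known context,
    backing off to shorter contexts; None if nothing is stored."""
    words = phrase.lower().split()
    for k in range(max_context, -1, -1):
        context = tuple(words[-k:]) if k else ()
        if len(context) == k and context in index:
            return index[context].most_common(1)[0][0]
    return None
```

The empty context `()` holds plain uni-gram counts, so an unseen phrase still yields the most common word in the corpus.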

A profanity filter based on Google’s bad words list was also applied to all output.

Load Dataset and Clean the Data: Loading the dataset.
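The output filter mentioned above can be sketched as a simple set lookup. This is a hypothetical Python illustration: `filter_profanity` is an invented name, and `banned_words` stands in for whatever list is loaded (the source uses Google's bad words list, whose loading is not shown here).

```python
def filter_profanity(suggestions, banned_words):
    """Drop any suggested word that appears in the banned list.
    Matching is case-insensitive and on whole words only."""
    banned = {word.lower() for word in banned_words}
    return [s for s in suggestions if s.lower() not in banned]
```

Filtering the output rather than only the corpus guards against banned words that reach the suggestions through another path, such as word completion.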

Higher-order n-grams will have lower frequencies than lower-order n-grams. This project will focus on the English language datasets. The dataset is then cleansed: non-word characters, punctuation, and extra whitespace are removed, and the text is converted to lower case. A corpus is a body of text, usually containing a large number of sentences.
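The cleaning steps listed above can be sketched with two regular expressions. This is a minimal Python illustration under the assumption that only the letters a to z are kept; `clean_text` is a hypothetical name, and the project's actual rules may differ (for example, in how apostrophes or digits are handled).

```python
import re

def clean_text(line):
    """Lower-case the text, replace every character that is not a letter
    or whitespace with a space, then collapse runs of whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", " ", line)
    return re.sub(r"\s+", " ", line).strip()
```

Replacing unwanted characters with a space, rather than deleting them, avoids accidentally joining two words that were separated only by punctuation.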

The project includes but is not limited to: Coursera Data Science Capstone: The dataset consists of 3 files, all in the English language.



This preliminary report aims to build an understanding of the data set.

Tokenize and Clean Dataset: Tokenization is performed by splitting each line into sentences.
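Splitting a line into sentences, and then each sentence into word tokens, can be sketched as below. This is a hypothetical Python illustration: it assumes a sentence ends at `.`, `!`, or `?` followed by whitespace, which is a simplification (abbreviations such as "Dr." would be split incorrectly).

```python
import re

def split_sentences(line):
    """Split a line after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Return the lower-cased word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", sentence.lower())
```

Tokenizing per sentence keeps n-grams from spanning a sentence boundary, where the word order carries little predictive value.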

Coursera Data Science Capstone: SwiftKey Project

Speed will be important as we move to the Shiny application.



The resulting application will be published as a Shiny app that is open for review by anyone interested. The dataset for this project is sourced from this website. Create Bi-grams: A bi-gram frequency table is created for the corpus. Cleaning means, among other things, changing alphabetical characters to lower case, removing whitespace, and removing punctuation.

The next step of this capstone project is to refine the predictive algorithm model and deploy it as a Shiny app.

Sample the corpus and create the document term matrix.
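Sampling and building a document term matrix can be sketched as follows. This is a hypothetical Python illustration, not the project's code: the function names, the `fraction` parameter, and the fixed `seed` are assumptions made for the example, and documents are split on whitespace only.

```python
import random
from collections import Counter

def sample_lines(lines, fraction, seed=42):
    """Return a reproducible random sample of the given lines."""
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction)) if lines else 0
    return rng.sample(lines, k)

def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix
```

With one row per source file (blogs, news, Twitter), the column sums of the matrix give the overall term frequencies used for the visualizations.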