Introduction
What is Amazon’s Best Seller Rank?
How did we get the initial sales data? I have been selling professionally on Amazon and tracking my own sales against ranks for all my products across various categories. Additionally, I interviewed other professional sellers to get an approximate idea of their sales.
With all the data obtained, cleaned and set up, I entered the next phase of design: choosing the best framework for predictive analysis.
Enter Spark
While no clear answer emerged, we certainly know that Spark alone is not the future of data analysis. We depended on a number of third-party tools, like Stanford NLP for natural language parsing and TensorFlow for image analysis, but Spark formed the backbone of almost all of it.
While there might be some truth to the above chart, I tend to believe Spark has not reached peak hype yet. Or maybe it only seems that way from down under in Australia, and Spark has already passed its peak in Silicon Valley.
This trained model is then read by Spark within our Spring Boot application to quickly make predictions and process incoming information from web users in real time. With all the infrastructure set up, we had estimated a week to complete the linear regression algorithms, or a worst-case scenario of two weeks if the problem turned out to be the more complex log-linear regression.
Houston, we have a problem
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(100)
  .setRegParam(0.4)
So it looked like a log-linear model, and I assumed the Poisson family of GeneralizedLinearRegression would be a good fit. We changed the GLM family to Poisson and ran the tests a few more times; however, the Mean Squared Error and the RMSE were still far too high.
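To make the log-linear hunch concrete: a log-linear (power-law) relationship, sales ≈ exp(a) · bsr^b, becomes ordinary least squares after taking logs of both variables. Here is a minimal, self-contained sketch in plain Scala; the object, its names, and the data are illustrative, not the production Spark pipeline:

```scala
// Illustrative sketch: a log-linear model assumes
//   sales = exp(a) * bsr^b, i.e. log(sales) = a + b * log(bsr),
// so fitting it reduces to ordinary least squares on log-transformed data.
object LogLinearFit {
  // Closed-form OLS for y = a + b * x over paired samples.
  def ols(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    val n  = xs.size.toDouble
    val mx = xs.sum / n
    val my = ys.sum / n
    val b  = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum /
             xs.map(x => math.pow(x - mx, 2)).sum
    val a  = my - b * mx
    (a, b)
  }

  // Fit on the logs of rank and sales.
  def fit(bsr: Seq[Double], sales: Seq[Double]): (Double, Double) =
    ols(bsr.map(math.log), sales.map(math.log))

  // Predict sales for a given rank from the fitted (a, b).
  def predict(a: Double, b: Double, rank: Double): Double =
    math.exp(a + b * math.log(rank))
}
```

With exact power-law data the fit recovers the exponent exactly; real sales-versus-rank data is far noisier, which is part of why the off-the-shelf GLM families struggled here.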
I suspect Spark has issues dealing with sparse data.
With the limited amount of input data, Spark's MSE was too high, and even for our own products, for which we knew the sales, the predicted sales were way off the mark.
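For reference, the error metrics in question are simple to state precisely. A minimal sketch in plain Scala (the object name and data are illustrative, not the production pipeline):

```scala
// Illustrative sketch of the metrics being monitored: MSE is the mean of
// squared prediction errors, and RMSE is its square root, which puts the
// error back in the same units as sales.
object RegressionMetrics {
  def mse(actual: Seq[Double], predicted: Seq[Double]): Double =
    actual.zip(predicted).map { case (a, p) => math.pow(a - p, 2) }.sum / actual.size

  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double =
    math.sqrt(mse(actual, predicted))
}
```

For example, actuals of (100, 50, 20) against predictions of (110, 45, 30) give an MSE of 75.0; an RMSE on the order of the sales figures themselves is what "way off the mark" looks like numerically.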
Over the next few weeks I spent my time trying out every combination of regression family on Spark, and none of them gave the desired results. I had absolutely no idea how to proceed, which reminded me of this quote by Dan Ariely:
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”
My last resort was to use a Deep Neural Network. I had completed Andrew Ng's course on Machine Learning when he first launched it, and I cannot recommend it enough. Now my aim was to use a neural network to solve this problem.
We must go Deeper.
I eventually decided to keep Spark for big data analysis but turn to other libraries for Deep Learning.
I narrowed down my choices to two libraries:
1) DeepLearning4J
2) Sparkling Water (H2O)
I was particularly impressed with this article that discusses how H2O deep learning was used to predict crimes and arrests in San Francisco and Chicago. Since our future use cases are similar, where we will be predicting fraudulent users and fraudulent competitors, I decided to plunge into Deep Learning using H2O rather than attempting to work around the Spark ML issues.
Sparkling Water (H2O) runs within the Spark framework, so I could use the integrated stack without replacing Spark. This was a definite bonus for me. The documentation was also nicely done, and I was thoroughly impressed with H2O's web UI, Flow.
I could test my data and algorithms in the browser without writing any code.
The web UI provided excellent insights into the data and, true to my beliefs, the Deep Learning neural networks produced exceptional results.
Using the optimized parameters from H2O Flow, I quickly coded the Deep Learning network in my CLI Program.
val train = result('categoryIndex, 'bsr, 'sales)

// Configure Deep Learning algorithm
val dlParams = new DeepLearningParameters()
dlParams._train = train
dlParams._response_column = 'sales
dlParams._fast_mode = false
dlParams._epochs = 30
dlParams._nfolds = 3
dlParams._distribution = DistributionFamily.gaussian

val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get
//save the model
ModelSerializationSupport.exportH2OModel(dlModel, new File("/data/deeplearning.bin").toURI)
In the web API (Spring Boot) application, I read this model in and used it to make predictions in real time for web users.
def startup(): Unit = {
  dlModel = ModelSerializationSupport.loadH2OModel(new File("/data/deeplearning.bin").toURI)
  println("Initialization Of BSR Deep learning Module complete")
}
def predict(categoryIndex: Int, bsr: Int): Double = {
  if (null == dlModel) {
    startup()
  }
  println("\n====> Making prediction with help of DeepLearning model\n")
  val caseClassDS = Seq(InputBSR(categoryIndex, bsr, 0)).toDS()
  val finalresult = dlModel.score(caseClassDS)('predict)
  val finaldf = asDataFrame(finalresult)(sqlContext)
  val predictedSales = finaldf.first().getDouble(0)
  println(s"For category index ${categoryIndex} and BSR ${bsr} the result is ... ${predictedSales}")
  predictedSales
}
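A small design note on the load-on-first-use pattern above: a plain null check before startup() is not synchronized, so two concurrent web requests could both trigger the model load. Scala's built-in lazy val gives the same behavior with thread-safe initialization. A hedged sketch, where ModelHolder and loadModel are hypothetical stand-ins rather than the production code:

```scala
// Illustrative alternative to the null-check guard: a lazy val is
// initialized exactly once, on first access, with JVM-level
// synchronization handled by the compiler.
object ModelHolder {
  // Stand-in for the H2O model load; the real application would call
  // ModelSerializationSupport.loadH2OModel(...) here instead.
  private def loadModel(): String = "deeplearning.bin loaded"

  // Loaded on first access, thread-safely, then cached for all callers.
  lazy val model: String = loadModel()
}
```

Every caller then just references ModelHolder.model, and the first reference pays the load cost.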
The end result is this:
As you can see from the screenshot, we can only predict sales for products that have Best Seller Ranks in a top-level category, since we have trained the neural network only on sales data from top-level categories.
As we keep collecting data and our algorithm gains sufficient confidence to predict sales in lower-level categories, the app will start making predictions for more products.
This, IMHO, is the best thing about Big Data and Deep Learning: the machine never stops learning, and as more data is fed into it, the algorithms automatically start making better predictions.
The Chrome extension is available here on the Chrome Web Store.