
Machine learning is the next big wave


Machine learning is the next big wave after the Internet, one that is going to change lives, especially in the developing world. As an AI-focused company, it is fascinating for us to see firsthand how it is making life easier for our customers.
It is hard to know what challenges ecommerce sellers face in the developing world unless we speak to them directly, and for us at AiHello, talking to one of the biggest sellers on Amazon India was a very informative session.
In the process of trying to solve our customers' problems, we learned how differently Amazon India works compared to Amazon US or Europe, and we came away impressed with how ingeniously Amazon India has adapted to local challenges, thriving and succeeding against all competition.
We recently signed up an Amazon Professional Enterprise seller who sells around 10,000 different items (SKUs) on Amazon India and was interested in trying our software to ease the pain of selling.
The major challenge for this customer was the commute time of their staff, and they were very interested in working with us to solve it. At first glance, this challenge seemed completely off topic for us.
We focus on data analytics and deep learning to help online ecommerce sellers sell better, and we had no idea what we could possibly do about traffic in India. The seller wanted a phone call to see if we could help, saying he already had an idea of how to solve it.
The main problem he described was this: Amazon requires the seller to have all orders placed by customers in the morning packed and ready for shipment by 2 pm. Since Amazon customers shop online around the clock, orders start pouring in from 7 am.
The employees who pack the items to Amazon's standards leave home around 8 am and, depending on the traffic that day, arrive anywhere between 9 am and 10 am. This uncertainty was a big challenge for him, and as far as we could see we had no way to solve it: we are an IT company running Apache Spark on Azure, thousands of kilometers away from the traffic and bad roads causing our customer's commute problems.
We got on a phone call with the customer, and his proposition was very interesting. He wanted to know if it was possible to predict, one day in advance, which items he would sell. His reasoning was that if he knew today, with reasonable accuracy, what he would be selling tomorrow, he could have those items prepacked well in advance and ready to ship the next day.
That way he wouldn't have to worry about his staff arriving a few hours late in the morning, as most of the work would already have been done the previous evening.
With the problem well defined in software engineering terms, we had a clear path forward. Sales prediction had always been on our roadmap, but sellers in the US and Europe were not particularly interested in it; they cared more about knowing when their stock would run out. Amazon India sellers, on the other hand, seemed far less concerned about running out of stock: most of them have an effectively unlimited supply of goods, and an India-specific program known as "Seller Flex" allows them to manage their own warehouses.
After a quick internal discussion, we decided to go ahead with Spark ML's LinearRegression. Apache Spark was already the workhorse of most of our machine learning routines (along with H2O), and adding linear regression didn't seem like much of a problem. In this blog post we give a quick outline of the steps and the code we wrote for this routine.
Our idea was to use linear regression to extrapolate the historical sales of all items and choose the top-selling items to be prepacked. We knew we might eventually need deep learning to improve precision, but for now simple regression would suffice.
First we read the customer's historical sales data from our local database into a DataFrame.
// Pull the historical order data from our local MySQL database over JDBC
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/database")
  .option("dbtable", "order_logs")
  .load()
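A couple of practical notes, not from the original post: most MySQL instances will also need credentials passed as JDBC options, and a quick schema check confirms the columns used later (the option names are standard Spark JDBC options; the values are placeholders):
// Credentials are placeholders; add them to the reader above if your database needs them:
//   .option("user", "dbuser")
//   .option("password", "dbpass")

// Sanity check that PurchaseDate, SKU and Quantity came through as expected
jdbcDF.printSchema()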
Our next step is to extract the day of the month and the day of the week from the purchase date. The logic is that sales of an item depend heavily on the day of the month (sales are usually high right after salary dates) and on the day of the week (for this customer, for example, sales are lowest on Sunday and highest on Monday).
We used Spark's built-in date functions to get the day of the month and the day of the week.
import org.apache.spark.sql.functions.{dayofmonth, date_format}
import org.apache.spark.sql.types.{DateType, IntegerType}

// Derive day-of-month and day-of-week features from the purchase timestamp
val formattedDF = jdbcDF
  .withColumn("DayOfMonth", dayofmonth(jdbcDF("PurchaseDate")).cast(IntegerType))
  .withColumn("LocalDate", jdbcDF("PurchaseDate").cast(DateType))
  .withColumn("DayOfWeek", date_format(jdbcDF("PurchaseDate"), "u").cast(IntegerType))
The purchase date in our database is stored as a timestamp, so we also extract the date portion (without the time part).
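A quick spot check of the derived columns (the show call below is just illustrative, not part of the original pipeline):
// With the "u" date pattern, DayOfWeek runs from 1 (Monday) to 7 (Sunday)
formattedDF.select("PurchaseDate", "LocalDate", "DayOfMonth", "DayOfWeek").show(3)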
Finally, we kept just the day of month, day of week and quantity sold:
import org.apache.spark.sql.functions.sum

// Total units sold per SKU and (DayOfMonth, DayOfWeek); named "label" for Spark ML
val finaldf = formattedDF
  .groupBy("SKU", "DayOfMonth", "DayOfWeek")
  .agg(sum("Quantity").cast(IntegerType).as("label"))
Here we summed the quantity sold per SKU for each day-of-month and day-of-week combination and named the result "label", since this is the column Spark ML will learn to predict.
Now we had all the data necessary to run through a Spark machine learning algorithm and get predictions.
We ran it through a simple linear regression, which produced pretty good output.
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
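We kept the default settings. If tuning were needed, LinearRegression exposes the usual knobs; the values below are placeholders rather than what we actually used:
// Placeholder hyperparameters, shown only to illustrate the available setters
val lrTuned = new LinearRegression()
  .setMaxIter(100)         // maximum optimisation iterations
  .setRegParam(0.1)        // regularisation strength
  .setElasticNetParam(0.5) // mix between L1 and L2 regularisation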
Then we create a series of pipeline stages:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Convert the SKU string into a numeric category index
val SKUIndexer = new StringIndexer()
  .setInputCol("SKU")
  .setOutputCol("SKUIndex")

// One-hot encode the day-of-month and day-of-week columns
val dayOfMonthEncoder = new OneHotEncoder()
  .setInputCol("DayOfMonth")
  .setOutputCol("DayOfMonthVec")
val dayOfWeekEncoder = new OneHotEncoder()
  .setInputCol("DayOfWeek")
  .setOutputCol("DayOfWeekVec")

// Assemble all feature columns into a single "features" vector
val assembler = new VectorAssembler()
  .setInputCols(Array("SKUIndex", "DayOfWeekVec", "DayOfMonthVec"))
  .setOutputCol("features")
Essentially, we are doing the following steps:
1) With the SKUIndexer (a StringIndexer), we convert the product's unique ID from a string into a numeric category index.
2 & 3) We one-hot encode the day of month and the day of week (a toy example of what one-hot encoding produces is shown after this list).
4) We assemble the category index, day of month and day of week into a single vector. Machine learning algorithms that take multiple inputs expect them combined into one vector, so we labelled this column "features"; by default, Spark ML uses the column named "features" as the input to its algorithms.
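As a toy illustration of what one-hot encoding does (not part of the pipeline above; this standalone snippet assumes Spark 3.x, where OneHotEncoder is fit before it transforms):
import org.apache.spark.ml.feature.OneHotEncoder
import spark.implicits._

// Encode a day-of-week index (1 = Monday ... 7 = Sunday) into a sparse 0/1 vector
val days = Seq(1, 2, 7).toDF("DayOfWeek")
val toyEncoder = new OneHotEncoder()
  .setInputCol("DayOfWeek")
  .setOutputCol("DayOfWeekVec")
toyEncoder.fit(days).transform(days).show(false)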
We then assemble the entire pipeline, with the LinearRegression as the final stage:
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline()
  .setStages(Array(SKUIndexer, dayOfWeekEncoder, dayOfMonthEncoder, assembler, lr))
We fit the pipeline to our dataset:
val pipModel = pipeline.fit(finaldf)
and then run the fitted model over the same dataset to see how well it performed:
val finModel = pipModel.transform(finaldf)
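For a more quantitative check than eyeballing rows, one could also compute the root-mean-square error of these predictions. This is a sketch using Spark's RegressionEvaluator, not something from the original post; note that evaluating on the training data makes the number optimistic:
import org.apache.spark.ml.evaluation.RegressionEvaluator

// RMSE between the actual quantity sold ("label") and the model output
val rmse = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
  .evaluate(finModel)
println(s"RMSE = $rmse")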
After the algorithm has run, we can print a subset of the resulting DataFrame to check how well it has done:
finModel.select("SKU", "DayOfMonth", "DayOfWeek", "label", "prediction").show()
+--------+----------+---------+-----+------------------+
|     SKU|DayOfMonth|DayOfWeek|label|        prediction|
+--------+----------+---------+-----+------------------+
|NTSNB250|         5|        7|   22|23.384718598617278|
|NTSNB250|        13|        4|   32|32.685634716206605|
|NTSNB250|        10|        3|   34|33.973867040835799|
|NTSNB250|        29|        4|   25|22.854316730005847|
|NTSNB250|        29|        3|   24|22.961526995444813|
+--------+----------+---------+-----+------------------+
We can see that it has performed quite well. There is still a margin of error, but we anticipate that as we collect more of the customer's historical data, our predictions will keep getting better. And when we eventually move from regression to deep learning, the results may improve a bit further.
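To close the loop on the seller's original question, here is a sketch (not from the original post; the row construction and the top-100 cutoff are illustrative) of how the fitted pipeline could score every SKU for tomorrow, so the highest-predicted items can be prepacked the evening before:
import java.time.LocalDate
import org.apache.spark.sql.functions.{desc, lit}

// Build one candidate row per SKU for tomorrow's date
val tomorrow = LocalDate.now().plusDays(1)
val tomorrowDF = finaldf.select("SKU").distinct()
  .withColumn("DayOfMonth", lit(tomorrow.getDayOfMonth))
  .withColumn("DayOfWeek", lit(tomorrow.getDayOfWeek.getValue)) // ISO: Monday = 1 ... Sunday = 7

// Score with the already-fitted pipeline and keep the SKUs predicted to sell the most
val topSellers = pipModel.transform(tomorrowDF)
  .orderBy(desc("prediction"))
  .limit(100)
topSellers.show()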
Intel also has a deep learning library for Spark, BigDL: https://github.com/intel-analytics/BigDL.
This is really interesting to us, since we run Apache Spark on Intel hardware and do not have access to GPUs on our Azure platform.
