Using Incorta and the PySpark Linear Regression ML Package to Predict eCommerce Customer Spending

Project Overview

The dataset comes from kaggle.com.
Assumption: the eCommerce company is based in New York City and sells clothing online, but it also offers in-store style and clothing advice sessions. Customers come into the store, meet with a personal stylist, then go home and order the clothes they want through either the mobile app or the website.

The goal is to predict 'Yearly_Amount_Spent' for each customer.

Here are the features or attributes collected in the dataset:

  • 'Avg__Session_Length'
  • 'Time_on_App'
  • 'Time_on_Website'
  • 'Length_of_Membership'



Step 1: Upload the CSV file to Incorta

Upload the CSV file to Incorta, then add a file table named Ecommerce_Customer to the schema (SparkTesting in this example).




Step 2: Read the Ecommerce Customer file

The CSV file loaded into Incorta can be read into a PySpark DataFrame by its fully qualified table name, SparkTesting.Ecommerce_Customer, using the read function:

df=read("SparkTesting.Ecommerce_Customer")
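A quick sanity check after loading can catch a renamed or missing column early. The sketch below assumes `df` is the DataFrame returned by the `read(...)` call above and relies only on the standard `columns` attribute of a PySpark DataFrame. The label name 'Yearly_Amount_Spent' is taken from Step 7, and `missing_columns` is a hypothetical helper, not part of Incorta or PySpark.

```python
# Columns the later steps rely on: the four features plus the label.
REQUIRED_COLS = [
    "Avg__Session_Length",
    "Time_on_App",
    "Time_on_Website",
    "Length_of_Membership",
    "Yearly_Amount_Spent",
]


def missing_columns(df, required=REQUIRED_COLS):
    """Return the names in `required` that are absent from df.columns."""
    present = set(df.columns)
    return [name for name in required if name not in present]
```

In the notebook, `missing_columns(df)` should return an empty list; calling `df.printSchema()` afterwards shows the column types Incorta inferred from the CSV.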








Step 3: Assemble the feature vector with VectorAssembler

Use VectorAssembler to combine 'Avg__Session_Length', 'Time_on_App', 'Time_on_Website' and 'Length_of_Membership' into a single feature vector column called 'features'.
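As a minimal sketch of this step, the helper below wraps PySpark's `VectorAssembler`. It assumes `df` is the DataFrame from Step 2 and that the label column is named 'Yearly_Amount_Spent' (the name used in Step 7). The function name `assemble_features` is hypothetical, and the import sits inside the function only so the helper can be defined before a Spark session exists.

```python
# Feature columns as listed in the project overview.
FEATURE_COLS = [
    "Avg__Session_Length",
    "Time_on_App",
    "Time_on_Website",
    "Length_of_Membership",
]
LABEL_COL = "Yearly_Amount_Spent"  # assumed label column name


def assemble_features(df, feature_cols=FEATURE_COLS, label_col=LABEL_COL):
    """Return a DataFrame with a 'features' vector column plus the label."""
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    # transform() appends the 'features' column; keep only what the model needs.
    return assembler.transform(df).select("features", label_col)
```

A typical call would be `vector_df = assemble_features(df)`, after which `vector_df.show(3)` displays one dense vector per customer next to the yearly spend.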




Step 4: Split data into Training and Testing data sets

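PySpark's randomSplit function assigns rows to each set at random, which helps ensure that the training and testing sets are similar. I split the data into two sets: 70% for training and 30% for testing. Because rows are sampled individually, the resulting sizes are close to, but not exactly, 70% and 30%.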
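The split can be sketched as follows, assuming `vector_df` is a PySpark DataFrame that already holds the 'features' column and the label. The fixed `seed` is an addition for reproducibility and is not mentioned in the original post; `split_train_test` is a hypothetical helper name.

```python
def split_train_test(vector_df, train_fraction=0.7, seed=42):
    """Randomly split a DataFrame into (train, test) by approximate fraction."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Weights are relative; 0.7 / 0.3 gives roughly 70% training, 30% testing.
    weights = [train_fraction, 1.0 - train_fraction]
    # A fixed seed (an assumption here) makes the split repeatable across runs.
    train_df, test_df = vector_df.randomSplit(weights, seed=seed)
    return train_df, test_df
```

Calling `train_df, test_df = split_train_test(vector_df)` and then comparing `train_df.count()` with `test_df.count()` is a quick way to confirm the proportions came out as expected.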

Step 5: Train the model

I used PySpark's LinearRegression to fit a linear regression model on the training set and extracted the model's summary statistics.
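A minimal sketch of the training step is shown below. It assumes `train_df` contains a 'features' vector column and a label column named 'Yearly_Amount_Spent', and it uses the default LinearRegression settings, since the post does not state which hyperparameters were used. The helper names `train_linear_regression` and `summarize_model` are hypothetical.

```python
def train_linear_regression(train_df, label_col="Yearly_Amount_Spent"):
    """Fit a PySpark linear regression model on the training DataFrame."""
    from pyspark.ml.regression import LinearRegression

    # Default hyperparameters (no regularization); an assumption, not the
    # author's confirmed configuration.
    lr = LinearRegression(featuresCol="features", labelCol=label_col)
    return lr.fit(train_df)


def summarize_model(model):
    """Collect the fitted parameters and training-set summary statistics."""
    summary = model.summary  # statistics computed on the training data
    return {
        "coefficients": list(model.coefficients),  # one per feature, in order
        "intercept": model.intercept,
        "rmse": summary.rootMeanSquaredError,
        "r2": summary.r2,
    }
```

For example, `model = train_linear_regression(train_df)` followed by `summarize_model(model)` returns the coefficients in the same order as the assembled feature columns, which makes it easy to see which attribute carries the most weight.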





Step 6: Test the model

To evaluate the linear regression model, I ran it on the held-out 30% of the data and measured RMSE (root mean squared error) and R2 (coefficient of determination).
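To make the two metrics concrete, the sketch below first writes out their standard definitions in plain Python, then shows one way to obtain them from PySpark's `RegressionEvaluator`. It assumes `model` is a fitted linear regression model, `test_df` has 'features' and 'Yearly_Amount_Spent' columns, and the helper names `rmse`, `r2` and `evaluate_on_test` are hypothetical.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


def evaluate_on_test(model, test_df, label_col="Yearly_Amount_Spent"):
    """Score a fitted model on the held-out set with PySpark's evaluator."""
    from pyspark.ml.evaluation import RegressionEvaluator

    predictions = model.transform(test_df)  # adds a 'prediction' column
    evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction")
    return {
        "rmse": evaluator.evaluate(predictions, {evaluator.metricName: "rmse"}),
        "r2": evaluator.evaluate(predictions, {evaluator.metricName: "r2"}),
    }
```

RMSE is in the same units as the label (dollars per year here), so lower is better, while R2 close to 1 means the features explain most of the variation in yearly spend.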



Step 7: Prediction

The final product is a trained model that can predict 'Yearly_Amount_Spent' for unlabeled data, that is, for customers whose yearly spend is not yet known.
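As a rough sketch of how the trained model could be applied, the code below shows both what a linear model computes for a single customer and how a whole DataFrame of unlabeled customers might be scored. It assumes `model` is the fitted PySpark linear regression model and `unlabeled_df` has the same four feature columns as the training data; `predict_one` and `predict_spend` are hypothetical helper names.

```python
FEATURE_COLS = [
    "Avg__Session_Length",
    "Time_on_App",
    "Time_on_Website",
    "Length_of_Membership",
]


def predict_one(coefficients, intercept, features):
    """Linear model prediction: intercept + sum(coefficient * feature)."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


def predict_spend(model, unlabeled_df, feature_cols=FEATURE_COLS):
    """Add a 'prediction' column to a DataFrame of unlabeled customers."""
    from pyspark.ml.feature import VectorAssembler

    # New data must be assembled the same way as the training data.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    scored = model.transform(assembler.transform(unlabeled_df))
    return scored.select(*feature_cols, "prediction")
```

`predict_one` is only there to show the arithmetic behind each row; in the notebook, `predict_spend(model, unlabeled_df).show(5)` would display the predicted yearly spend next to the input attributes.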




