Using Incorta and the PySpark Linear Regression ML package to predict eCommerce customer spending

Project Overview

The dataset comes from kaggle.com.
Assumption: an eCommerce company based in New York City sells clothing online and also offers in-store style and clothing advice sessions. Customers come into the store, meet with a personal stylist, and can then go home and order the clothes they want through either the mobile app or the website.

The goal is to predict 'Yearly_Amount_Spent' for each customer.

Here are the features or attributes collected in the dataset:

  • 'Avg__Session_Length'
  • 'Time_on_App'
  • 'Time_on_Website'
  • 'Length_of_Membership'



Step 1: Upload the CSV file to Incorta

Upload the CSV file to Incorta, then add a file table named Ecommerce_Customer to the SparkTesting schema.




Step 2: Read the Ecommerce Customer file

Use PySpark to read the table SparkTesting.Ecommerce_Customer. The CSV file loaded into Incorta can be read into a PySpark DataFrame with:

df = read("SparkTesting.Ecommerce_Customer")








Step 3: Assemble the features with VectorAssembler

Use VectorAssembler to combine 'Avg__Session_Length', 'Time_on_App', 'Time_on_Website' and 'Length_of_Membership' into a single feature vector column called 'features'.
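To make concrete what the assembled column holds, here is a minimal plain-Python sketch rather than the PySpark call itself. In PySpark the same step is normally done with `VectorAssembler(inputCols=[...], outputCol="features")` followed by `.transform(df)`. The `assemble` helper and the sample values below are hypothetical and only illustrate the shape of the result.

```python
# The four input columns, in the order they will appear in the vector.
FEATURE_COLS = [
    "Avg__Session_Length",
    "Time_on_App",
    "Time_on_Website",
    "Length_of_Membership",
]

def assemble(row, input_cols=FEATURE_COLS):
    """Return a copy of `row` with an added 'features' list (hypothetical helper)."""
    out = dict(row)
    # Column order matters: the model's coefficients line up with this order.
    out["features"] = [float(row[c]) for c in input_cols]
    return out

# Made-up values, for illustration only (not taken from the dataset).
sample = {
    "Avg__Session_Length": 33.0,
    "Time_on_App": 12.0,
    "Time_on_Website": 37.0,
    "Length_of_Membership": 3.5,
    "Yearly_Amount_Spent": 500.0,
}
assembled = assemble(sample)
print(assembled["features"])
```

The original columns are kept next to the new 'features' entry, which mirrors how the assembled DataFrame still carries the label column for training.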




Step 4: Split data into Training and Testing data sets

The PySpark randomSplit function randomly samples the data, which helps ensure that the training and testing sets are similar. I split the data into two sets: 70% for training and 30% for testing.
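The idea behind a weighted random split can be sketched in plain Python. This is an illustration of the technique, not PySpark's implementation; in PySpark the call would look like `df.randomSplit([0.7, 0.3], seed=...)`. The `random_split` helper, the seed, and the 500 placeholder row ids are assumptions made for the example.

```python
import random

def random_split(rows, weights=(0.7, 0.3), seed=42):
    """Assign each row to train or test by an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_fraction = weights[0] / sum(weights)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(500))  # placeholder row ids, not real data
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Because every row is drawn independently, the resulting sizes are only approximately 70% and 30%, which is also true of a random split in PySpark.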

Step 5: Train the model

I used PySpark LinearRegression to train a linear regression model on the training set and extracted the model summary statistics.
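For readers who want to see what fitting a linear regression involves, the following is a small plain-Python sketch of ordinary least squares solved through the normal equations. It is not how PySpark is invoked; there, the usual pattern is `LinearRegression(featuresCol="features", labelCol="Yearly_Amount_Spent").fit(train)`, with the fitted values available as `model.intercept` and `model.coefficients`. The function name and the toy data are hypothetical.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns (intercept, coefficients).
    """
    # Prepend a constant 1.0 so the first weight acts as the intercept.
    A = [[1.0] + [float(v) for v in row] for row in X]
    n = len(A[0])
    # Augmented matrix [X^T X | X^T y].
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    weights = [M[i][n] for i in range(n)]
    return weights[0], weights[1:]

# Toy data generated from y = 10 + 2*x1 + 3*x2 (not the eCommerce dataset).
X_toy = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y_toy = [18.0, 17.0, 28.0, 27.0, 35.0]
intercept, coefficients = fit_linear_regression(X_toy, y_toy)
print(intercept, coefficients)
```

Since the toy targets follow the formula exactly, the fit recovers the intercept and both coefficients up to floating-point error. Real data is noisy, so the summary statistics matter there.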





Step 6: Test model

To evaluate the linear regression model, I applied it to the 30% of the data held out for testing and measured RMSE (root mean squared error) and R2 (coefficient of determination).
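Both metrics are simple to state, so a plain-Python sketch may help when reading the numbers. RMSE is the square root of the mean squared prediction error, in the same units as the target. R2 is one minus the ratio of residual variation to total variation. In PySpark these are typically obtained from `RegressionEvaluator` with `metricName="rmse"` or `metricName="r2"`. The sample values below are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up yearly spend values and predictions, for illustration only.
actual = [500.0, 600.0, 400.0, 550.0]
predicted = [510.0, 590.0, 420.0, 540.0]
print(rmse(actual, predicted), r2(actual, predicted))
```

A lower RMSE and an R2 closer to 1 both indicate a better fit. Comparing the test-set values with the training summary is a quick check for overfitting.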



Step 7: Prediction

The final product is a model that we can use to predict 'Yearly_Amount_Spent' for unlabeled data, that is, for customers whose yearly spend is not yet known.
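Under the hood, a linear regression prediction is just the intercept plus a weighted sum of the features. The sketch below shows that arithmetic in plain Python. In PySpark, calling `model.transform(df)` on a DataFrame that has the 'features' column adds a 'prediction' column computed the same way. The intercept, coefficients, and feature values here are placeholder numbers chosen for easy arithmetic, not fitted results.

```python
def predict(intercept, coefficients, features):
    """Linear model: intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Placeholder model parameters (hypothetical, NOT the trained model's values).
toy_intercept = 100.0
toy_coefficients = [1.0, 2.0, 3.0, 4.0]

# Feature order: Avg__Session_Length, Time_on_App,
# Time_on_Website, Length_of_Membership (made-up values).
new_customer = [30.0, 10.0, 40.0, 5.0]

predicted_spend = predict(toy_intercept, toy_coefficients, new_customer)
print(predicted_spend)
```

The feature values must be supplied in the same order used when the feature vector was assembled. Otherwise each coefficient is multiplied by the wrong attribute.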




