Using Incorta and the PySpark Linear Regression ML Package to Predict eCommerce Customer Spending

August 22, 2020

Project Overview

I got a dataset from kaggle.com.
Assumption: An eCommerce company based in New York City sells clothing online and also offers in-store style and clothing advice sessions. Customers come into the store, meet with a personal stylist, and then go home and order the clothes they want through either the mobile app or the website.

The goal is to predict each customer's 'Yearly Amount Spent'.

Here are the features or attributes collected in the dataset:

'Avg__Session_Length'
'Time_on_App'
'Time_on_Website'
'Length_of_Membership'

Step 1: Upload the CSV file to Incorta

Upload the CSV file to Incorta, and add it to the SparkTesting schema as a file table named Ecommerce_Customer.

Step 2: Read the Ecommerce Customer file

Use PySpark to read the table named SparkTesting.Ecommerce_Customer.

The CSV file loaded into Incorta can be read into a PySpark DataFrame using:

df=read("SparkTesting.Ecommerce_Customer")

Step 3: VectorAssemblerTest

Use VectorAssembler to combine 'Avg__Session_Length', 'Time_on_App', 'Time_on_Website' and 'Length_of_Membership' into a single feature vector column called 'features'.

Step 4: Split data into Training and Testing data sets

The PySpark randomSplit function randomly samples the data, which helps ensure that the training and testing sets are similar. I split the data into two sets: roughly 70% for training and 30% for testing. The proportions are approximate because each row is assigned at random.