The Ecommerce_Customer schema has four variables: Time On App, Time On Website, Length Of Membership, and Yearly Amount Spent. I want to see how these data are distributed. Incorta's Preview function lets me preview the data and shows the minimum and maximum value of each variable. Here are the steps I used to create a histogram in Incorta. First, I used Incorta's bin function to divide the values into levels. Here is the documentation for the bin function: https://docs.incorta.com/4.5/r-bin Here is the result of the bin function. I divided the average session length into 6 levels based on the minimum and maximum values. The minimum value is close to 30 and the maximum value is close to 38, so I decided to use 2 minutes as the interval and created the formula using the bin function. If the length is less than 30, it is labeled 'SLV1'; if the length is greater than 30 but less than 32, it is labeled 'SLV2'; and so on. I'm grouping customers by level
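The bucketing described above can be sketched in plain Python. This is not Incorta's bin function, only an illustration of the same idea: a lower edge of 30, a 2-minute interval, and 6 levels. The function name session_level, the sample values, and the handling of values that fall exactly on a boundary are assumptions made for this sketch.

```python
from collections import Counter


def session_level(avg_session_length, lower=30.0, width=2.0, levels=6):
    """Return a label from 'SLV1' to 'SLV6' for an average session length in minutes.

    Assumed boundaries: below 30 is SLV1, [30, 32) is SLV2, [32, 34) is SLV3,
    [34, 36) is SLV4, [36, 38) is SLV5, and 38 or more is SLV6.
    """
    if avg_session_length < lower:
        return "SLV1"
    # Count how many full intervals the value sits above the lower edge.
    index = int((avg_session_length - lower) // width) + 2
    # Everything past the last edge falls into the top level.
    return f"SLV{min(index, levels)}"


# Hypothetical sample values, not taken from the dataset.
sample_lengths = [29.6, 31.2, 33.0, 33.9, 34.5, 36.1]

# Group customers by level: this is the count a histogram bar would show.
counts = Counter(session_level(value) for value in sample_lengths)

for level in sorted(counts):
    print(level, counts[level])
```

Counting the labels gives the height of each histogram bar, which is what grouping customers by level produces in the dashboard.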
Version control and management are a vital part of data science workflows. Across multiple experiments, it is essential to know what changed and which team member made each update. We can use GitHub to version and manage Jupyter notebooks by following the steps below.

Step 1: Connect to the Incorta environment from the terminal.
$ ssh -i <key file> incorta@<IP address>

Step 2: Find the Jupyter notebook file path with the following commands:
$ cd /
$ find . -name '*.ipynb' -print
We can see that the Jupyter notebooks file path is /home/incorta/Notebooks

Step 3: Go to GitHub and create a new repository.

Step 4: Clone the repository link under the Notebooks directory.
$ git clone https://github.com/SuzieJi/Jupyter-Notebooks
We can now see the folder in Jupyter notebooks.

Step 5: Go to the Git repository directory that we cloned from GitHub.
$ cd <directory>/
Configure Git (first time only):
$ git config --global user.name "xxx"
$ git config --global user.email "xxxxx"
Project Overview
I got a dataset from kaggle.com. Assumption: an eCommerce company based in New York City sells clothing online and also offers in-store style and clothing advice sessions. Customers come into the store, have sessions or meetings with a personal stylist, and then go home and order the clothes they want on either a mobile app or the website. We need to predict 'Yearly Amount Spent'. Here are the features or attributes collected in the dataset: 'Avg__Session_Length', 'Time_on_App', 'Time_on_Website', 'Length_of_Membership'.

Step 1: Upload the CSV file to Incorta
Upload the CSV file to Incorta, and add a file table named Ecommerce_Customer to the schema.

Step 2: Read the Ecommerce Customer file
Use PySpark to read the table named SparkTesting.Ecommerce_Customer. The CSV file loaded into Incorta can be read into PySpark using df=read("SparkTesting.Ecommerce_Customer")

Step 3: VectorAssemblerTest
Use