Suzie's PySpark and Incorta Notes

Showing posts from February, 2021

Using GitHub to version and manage Jupyter notebooks

February 18, 2021

Version control is a vital part of data science workflows. Across multiple experiments, it is essential to know what changed and which team member made each update. We can use GitHub to version and manage Jupyter notebooks by following the steps below.

Step 1: Connect to the Incorta environment from a terminal.

    $ ssh -i <key file> incorta@<IP address>

Step 2: Find the Jupyter notebook file path with the following commands:

    $ cd /
    $ find . -name '*.ipynb' -print

We can see that the Jupyter notebooks are stored under /home/incorta/Notebooks.

Step 3: Go to GitHub and create a new repository.

Step 4: Clone the repository under the Notebooks directory.

    $ git clone https://github.com/SuzieJi/Jupyter-Notebooks

The cloned folder now appears in Jupyter notebooks.

Step 5: Go to the git repository directory that we cloned from GitHub.

    $ cd < directory >/

Configure Git (first time only):

    $ git config --global user.name "xxx"
    $ git config --global user.email "xxxxx...
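The search in Step 2 can also be sketched in Python, which is handy when working from inside a notebook rather than a shell. This is only an illustration: find_notebooks is a hypothetical helper, not part of Incorta or Jupyter, and unlike the find command above it skips the .ipynb_checkpoints copies that Jupyter creates. The demo runs on a throwaway directory instead of the real /home/incorta/Notebooks path.

```python
import tempfile
from pathlib import Path


def find_notebooks(root):
    """Return sorted paths of .ipynb files under root, skipping checkpoint copies."""
    return sorted(
        p
        for p in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )


# Build a small throwaway tree to search (hypothetical file names).
demo_root = Path(tempfile.mkdtemp())
notebooks_dir = demo_root / "Notebooks"
checkpoints_dir = notebooks_dir / ".ipynb_checkpoints"
checkpoints_dir.mkdir(parents=True)
(notebooks_dir / "analysis.ipynb").write_text("{}")
(checkpoints_dir / "analysis-checkpoint.ipynb").write_text("{}")
(demo_root / "notes.txt").write_text("not a notebook")

found = find_notebooks(demo_root)
print([str(p.relative_to(demo_root)) for p in found])
```

On the Incorta host, the same helper could be pointed at the directory found in Step 2.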

Read parquet file via data lake connector in Incorta

February 16, 2021

Incorta can read Parquet files. Here is how to read a Parquet file via the data lake connector in Incorta.

Step 1: Save the Parquet file in the external notebooks (Jupyter notebooks).

Step 2: In Incorta, under external data sources, add a new data source.

Step 3: Select 'data lake - local files' and provide the directory path.

Step 4: Go to the Incorta schema and add a new data lake table.

How to do data profiling in Incorta

February 15, 2021

Sometimes we need a better understanding of our data. We can do data profiling using Spark Python in Incorta.

First, add a new Materialized View in Incorta and select Spark Python. Then there are two methods for data profiling.

Method 1: Use df.describe(). This function provides min, max, count, mean, and stddev, but only for string and numeric data types.

Method 2: Calculate each metric ourselves. Below is the syntax: