How to do data profiling in Incorta

Sometimes we need a better understanding of our data. We can do data profiling using Spark Python in Incorta.

First, add a new Materialized View in Incorta and select Spark Python.



Then, I have two methods to do data profiling.

Method 1:

Using df.describe() 

This function provides count, mean, stddev, min, and max, but only for numeric and string columns. For string columns, mean and stddev come back as null.
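To make the five statistics concrete, here is a minimal plain-Python sketch of what they mean for a single numeric column. It is not Spark code and not how Spark computes them. The function name `describe_column` and the sample values are made up for illustration. It assumes nulls are skipped and that stddev is the sample standard deviation, which is my understanding of what df.describe() reports.

```python
import statistics

def describe_column(values):
    """Hypothetical helper: count, mean, stddev, min, max for one numeric column.

    None stands in for a null and is skipped (assumption: this mirrors how
    describe() treats nulls).
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present) if present else None,
        # Sample standard deviation needs at least two values.
        "stddev": statistics.stdev(present) if len(present) > 1 else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Made-up sample column with one null.
sample = [2, 4, None, 4, 6]
summary = describe_column(sample)
print(summary)
```

Note that the count here is 4, not 5: the null is not counted, so the count alone does not tell you how many nulls a column has. That is one reason to calculate the metrics ourselves in Method 2.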

Method 2:

Calculate each metric ourselves. 


Below is the code:

from pyspark.sql.functions import *  # note: this shadows Python's built-in min and max

# Change this to any table you want to profile.
df = read("canvas.enrollments")

# Method 1: built-in summary statistics
# df_profile = df.describe()

# Method 2: compute each metric per column, cast to string so the rows can be unioned.
# count(c) counts only the non-null values in column c.
countDF = df.select([count(c).cast("string").alias(c) for c in df.columns])
# when() without otherwise() yields null for non-matching rows, so count() sees only the nulls.
nullDF = df.select([count(when(col(c).isNull(), c)).cast("string").alias(c) for c in df.columns])
distinctDF = df.select([countDistinct(c).cast("string").alias(c) for c in df.columns])
minDF = df.select([min(c).cast("string").alias(c) for c in df.columns])
maxDF = df.select([max(c).cast("string").alias(c) for c in df.columns])

# Label each metric row with a leading "summary" column.
countDF = countDF.select(lit("# of Occurrences").alias("summary"), "*")
nullDF = nullDF.select(lit("# of Null").alias("summary"), "*")
distinctDF = distinctDF.select(lit("# of Distinct Values").alias("summary"), "*")
minDF = minDF.select(lit("Min").alias("summary"), "*")
maxDF = maxDF.select(lit("Max").alias("summary"), "*")

# Stack the metric rows into one result and save it.
df_output = nullDF.unionAll(distinctDF).unionAll(countDF).unionAll(minDF).unionAll(maxDF)
save(df_output)
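To show what each row of the Method 2 output represents, here is a small plain-Python sketch that computes the same five metrics over a list of dictionaries. It is only an illustration of the logic, not Incorta or Spark code. The function `profile_rows`, the column names, and the sample rows are all hypothetical.

```python
def profile_rows(rows, columns):
    """Hypothetical helper: per-column null, distinct, non-null, min, max."""
    profile = {}
    for c in columns:
        values = [r.get(c) for r in rows]
        # None stands in for a null; distinct, min and max ignore nulls.
        present = [v for v in values if v is not None]
        profile[c] = {
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "non_null": len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return profile

# Made-up rows for illustration only.
rows = [
    {"id": 1, "state": "active"},
    {"id": 2, "state": None},
    {"id": 3, "state": "active"},
]
result = profile_rows(rows, ["id", "state"])
for column, metrics in result.items():
    print(column, metrics)
```

A column where the distinct count equals the non-null count and the null count is zero, like `id` in this sample, is a candidate key. A column with a very low distinct count, like `state`, is more likely a category or flag.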

 




