Using time series analysis (Part 1)
I got the dataset from Kaggle for practicing time series analysis.
https://www.kaggle.com/felixzhao/productdemandforecasting
The dataset contains historical product demand by product and warehouse between 2011 and 2017. I loaded the data into Incorta and used the Incorta API to read it.
I first profiled the data to get a summary. The records can be categorized by product code, warehouse, and product category, so my plan is to build time series for the different product categories and warehouses (see the sketch after the notebook code below).
Using the pandas time series functions was quite challenging. The DataFrame index needs to be a DateTime field, and I could not use the Spark date or timestamp column directly. It finally worked after I formatted the date as a string in Spark and converted it back to a datetime with pandas.
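As a minimal sketch of that workaround (the full notebook code follows), assuming the Spark date has already been exported as a yyyy-MM-dd string column named Order_Date_Str and the DataFrame converted with toPandas():

import pandas as pd

# Parse the exported date string into a pandas datetime and use it as the index
pdf["pd_Datetime"] = pd.to_datetime(pdf["Order_Date_Str"])
pdf = pdf.set_index(pd.DatetimeIndex(pdf["pd_Datetime"]))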
from incorta import incorta
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import pandas as pd

# Read the historical product demand table through the Incorta API
df = incorta.read("TimeSeriesNotebooks.HISTORICAL_PRODUCT_DEMAND")
# df_profile = df.describe()
# df.summary().show()

# Simple data profiling: row count, null count, distinct count, min and max per column
countDF = df.select([count(c).cast("string").alias(c) for c in df.columns])
nullDf = df.select([count(when(col(c).isNull(), c)).cast("string").alias(c) for c in df.columns])
distinctDf = df.select([countDistinct(c).cast("string").alias(c) for c in df.columns])
minDF = df.select([min(c).cast("string").alias(c) for c in df.columns])
maxDF = df.select([max(c).cast("string").alias(c) for c in df.columns])

# Label each summary row and stack them into one profiling result
countDF = countDF.select(lit("# of Occurrences").alias("summary"), "*")
nullDf = nullDf.select(lit("# of Nulls").alias("summary"), "*")
distinctDf = distinctDf.select(lit("# of Distinct Values").alias("summary"), "*")
minDF = minDF.select(lit("Min").alias("summary"), "*")
maxDF = maxDF.select(lit("Max").alias("summary"), "*")

df_output = nullDf.unionAll(distinctDf).unionAll(countDF).unionAll(minDF).unionAll(maxDF)
df_output.show()
# df_profile.show()

df.schema

# Keep only the order date (formatted as a yyyy-MM-dd string) and the demand value
df = df.withColumnRenamed("Date", "Order_Date")
df = df.select(date_format(col("Order_Date"), 'yyyy-MM-dd').alias("Order_Date_Str"), col("Order_Demand"))
df.schema

# Move to pandas and build a DatetimeIndex from the date string
pdf = df.toPandas()
# pdf.tail()
# pdf.head(3)
pdf.dtypes
pdf['pd_Datetime'] = pd.to_datetime(pdf['Order_Date_Str'] + ' 00:00:00')
pdf.dtypes
pdf = pdf.set_index(pd.DatetimeIndex(pdf['pd_Datetime']))
pdf.head(3)

# Plot the raw demand series
pdf["Order_Demand"].plot(linewidth=0.5);
%matplot plt
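Building on that plot, here is a rough sketch (not part of the original notebook) of the per-category series mentioned above: monthly total demand per product category. The column names Product_Category, Date, and Order_Demand follow the Kaggle file and may need adjusting, and Order_Demand is coerced to numeric in case it is stored as text in the raw data.

# Rough sketch, assumed column names from the Kaggle file: re-read the full
# table so Product_Category is still available
df_cat = incorta.read("TimeSeriesNotebooks.HISTORICAL_PRODUCT_DEMAND")
pdf_cat = df_cat.select(
    date_format(col("Date"), "yyyy-MM-dd").alias("Order_Date_Str"),
    col("Product_Category"),
    col("Order_Demand")).toPandas()

# Build the datetime index; drop rows whose date could not be parsed
pdf_cat["pd_Datetime"] = pd.to_datetime(pdf_cat["Order_Date_Str"])
pdf_cat = pdf_cat.dropna(subset=["pd_Datetime"])
# Coerce demand to numeric in case it arrives as text
pdf_cat["Order_Demand"] = pd.to_numeric(pdf_cat["Order_Demand"], errors="coerce")
pdf_cat = pdf_cat.set_index(pd.DatetimeIndex(pdf_cat["pd_Datetime"]))

# Monthly total demand, one column per product category
monthly = (pdf_cat.groupby("Product_Category")["Order_Demand"]
                  .resample("M").sum()
                  .unstack(level=0))
monthly.plot(linewidth=0.5)
%matplot plt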