Data Science — Journey through Life Cycles. Part 1

Introduction

Everyone is talking about Data Science and its different steps and phases. This article explores the Data Science lifecycle and the steps in each phase, and is a good starting point for beginners on their Data Science journey.

Data Science Lifecycle

By its simplest definition, Data Science is a multi-disciplinary field that combines multiple processes to extract knowledge or useful output from input data. The output may be predictive or descriptive analysis, reports, business intelligence, etc. Like any other project, a Data Science project follows a well-defined lifecycle; CRISP-DM and TDSP are two proven standards. A typical lifecycle consists of the following phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Explore and Preparation
  4. Create and Evaluate Model
  5. Deploy Model and turn out effective output

Business Understanding

Business Understanding is a key phase in any SDLC, but it is even more critical in the Data Science lifecycle. If we misunderstand the business, we end up with the wrong outcome, or with output that predicts well but is not acceptable to the customer. The main steps in this phase are:

Identify Stakeholder(s)

Stakeholders are business analysts or domain experts; their responsibility is to answer the Data Scientist's business queries in any phase of the Data Science lifecycle.

Set Objective

Understand the business problem and determine whether it is suited to an analytical solution; in other words, whether Data Science can address it. To do this, the Data Scientist frames the business objective by asking stakeholders relevant, sharp questions. A separate blog post lists some relevant questions to ask stakeholders.

  1. Identify the Data Science problem type from the business requirement (for example, classification, regression, or clustering).

Identify and Define Target Variable

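Each Data Science project involves either Supervised or Unsupervised learning. In Supervised learning, the data includes a target variable (the outcome to predict), which must be identified and defined up front; in Unsupervised learning, there is no target variable.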
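
As a rough illustration of defining a target variable, the sketch below separates the input features from the target in a small pandas DataFrame. The data set, the column names, and the choice of "churned" as the target are all hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical customer data; "churned" is the outcome the business wants to predict.
df = pd.DataFrame({
    "monthly_spend": [20.0, 55.5, 12.0, 80.0],
    "plan": ["basic", "premium", "basic", "premium"],
    "churned": [1, 0, 1, 0],
})

TARGET = "churned"  # assumed target variable for this example

# Supervised learning: split the input features (X) from the target variable (y).
X = df.drop(columns=[TARGET])
y = df[TARGET]

print(list(X.columns))
print(y.tolist())
```

In an Unsupervised problem there would be no "churned" column to separate, and every column would be treated as an input feature.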

Data Science Project Execution Plan

Use a project management tool such as VSTS or JIRA to create a project execution plan and track each milestone and deliverable across the stages of the Data Science lifecycle.

Data Understanding

The main steps in this phase are:

Collect Proper Data Set

The Data Scientist collects a data set that covers all business objectives and ensures it has the input features required to answer all business questions. Data might be stored in a CSV file, a database, or other formats and storage media. We can either download the entire data set from its source or access it through data streaming using a secured API.
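
The sketch below shows one possible way to read data from a CSV source and from a database using only the Python standard library, and then check that the required input features are present. The CSV content, the table, and the list of required features are hypothetical; an in-memory string and an in-memory SQLite database stand in for real storage.

```python
import csv
import io
import sqlite3

# In-memory stand-in for a CSV file on disk (hypothetical content).
csv_text = "customer_id,monthly_spend\n1,20.0\n2,55.5\n"
rows_from_csv = list(csv.DictReader(io.StringIO(csv_text)))

# In-memory SQLite database standing in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, plan TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "basic"), (2, "premium")],
)
rows_from_db = conn.execute(
    "SELECT customer_id, plan FROM customers ORDER BY customer_id"
).fetchall()
conn.close()

# Check that the CSV data carries the features the business questions need.
required_features = {"customer_id", "monthly_spend"}  # assumed requirement
missing_features = required_features - set(rows_from_csv[0].keys())

print(rows_from_csv)
print(rows_from_db)
print(missing_features)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns would still need converting before analysis.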

Setup Environment for Data

Set up a data hosting environment once the Data Scientist has collected the data set. This environment might be a local computer, the cloud, or on-premises. Examples of cloud storage are Azure Blob Storage and Azure SQL Database. Azure also offers several environments for working with the data:

  • Data Science Virtual Machine (DSVM) / Deep Learning Virtual Machine (IaaS solution): customized virtual machines on Azure with various tools preinstalled and preconfigured
  • Cloud-based Notebook VM
  • ML Studio: UI-based tool from Microsoft (https://studio.azureml.net/)
  • ML Service: PaaS solution

Setup Tools and Package

Set up tools and install packages for data processing. Tools commonly used in Data Science include Python, R, the Azure Machine Learning environment, SQL, RapidMiner, and TensorFlow. Each tool offers multiple packages to process, manipulate, and visualize data; pandas and NumPy are two of the main packages for Python.

Identify Feature Category

Each feature in a data set is broadly classified as either Categorical (typically string values) or Continuous (numeric values). Categorical features are further divided into types such as Ordinal and Binary.
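
A minimal sketch of this classification with pandas is shown below. It assumes the simple rule stated above (numeric columns are Continuous, everything else is Categorical); the helper name `feature_categories` and the sample data are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data set with two numeric and two string features.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [30000.0, 52000.5, 81000.0],
    "city": ["Pune", "Delhi", "Pune"],
    "size": ["small", "large", "medium"],
})


def feature_categories(frame):
    """Label each column 'continuous' (numeric dtype) or 'categorical' (anything else)."""
    return {
        col: "continuous" if is_numeric_dtype(frame[col]) else "categorical"
        for col in frame.columns
    }


categories = feature_categories(df)
print(categories)
```

A dtype check is only a first pass: a numeric code such as a postal code is still Categorical, so the result should be reviewed with domain knowledge.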

Data Explore and Preparation

The main steps in this phase are:

Data Explore

In this step, the Data Scientist becomes familiar with each feature in the data set: identifying its type (Categorical, Continuous, etc.), examining how its values are spread and distributed, and checking whether any relationship exists between two features.

  • Check dispersion measures: how the values of a specific feature are spread and distributed
  • Unique count for each category
  • Bar chart to show the frequency of each individual category
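
The checks above could be sketched with pandas roughly as follows. The sample data is hypothetical, and range, standard deviation, and interquartile range are just three common dispersion measures chosen for the example.

```python
import pandas as pd

# Hypothetical data: one continuous feature and one categorical feature.
df = pd.DataFrame({
    "income": [30.0, 40.0, 50.0, 60.0, 70.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Pune"],
})

# Dispersion measures for a continuous feature.
income_range = df["income"].max() - df["income"].min()
income_std = df["income"].std()  # sample standard deviation (ddof=1)
income_iqr = df["income"].quantile(0.75) - df["income"].quantile(0.25)

# Unique count and per-category frequency for a categorical feature.
unique_cities = df["city"].nunique()
city_counts = df["city"].value_counts()

print(income_range, income_std, income_iqr)
print(unique_cities)
print(city_counts)
```

The `city_counts` series holds exactly the values a bar chart of categories would display, so it can be passed to a plotting library if one is installed.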

Data Preparation

Data exploration can reveal abnormalities such as missing data, extreme values (outliers), and noisy data. These can reduce the accuracy of the Data Science output, so it is recommended to fix them before creating a model.
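
As one possible illustration, the sketch below fills a missing value with the median and flags outliers using the common 1.5 × IQR rule. Both techniques are assumptions chosen for the example, not the only options, and the data is hypothetical; handling noisy data is not covered here.

```python
import pandas as pd

# Hypothetical feature with one missing value and one extreme value.
df = pd.DataFrame({"income": [30.0, 40.0, None, 50.0, 60.0, 1000.0]})

# Missing data: count the gaps, then fill them with the median,
# which is less sensitive to extreme values than the mean.
missing_count = int(df["income"].isna().sum())
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: flag values outside 1.5 * IQR from the quartiles.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["income"] < lower) | (df["income"] > upper)

outliers = df.loc[is_outlier, "income"].tolist()
cleaned = df[~is_outlier]

print(missing_count)
print(outliers)
print(cleaned)
```

Whether an outlier should be removed, capped, or kept depends on the business context, so dropping the row is only one choice.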

Feature Engineering

Feature Engineering is the process of transforming data into features that are easier to interpret, in order to build a better predictive model. It is a crucial step in any Data Science journey and cannot be done well without proper domain knowledge.
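
A small, hypothetical example of Feature Engineering with pandas: deriving an order total and a weekend flag from raw order columns. The column names and the idea that these derived features help the model are assumptions for illustration; in practice the choice comes from domain knowledge.

```python
import pandas as pd

# Hypothetical raw order data.
df = pd.DataFrame({
    "order_date": ["2024-01-06", "2024-01-08"],
    "quantity": [2, 5],
    "unit_price": [10.0, 3.0],
})

# Parse the date strings so date-based features can be derived.
df["order_date"] = pd.to_datetime(df["order_date"])

# Derived feature 1: total value of the order.
df["order_total"] = df["quantity"] * df["unit_price"]

# Derived feature 2: whether the order was placed on a weekend
# (dayofweek is 0 for Monday through 6 for Sunday).
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5

print(df[["order_total", "is_weekend"]])
```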

Categorical Feature Encoding

Most Machine Learning algorithms work on numeric data, not categorical data, so the Data Scientist needs to convert categorical features into numeric form.
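
Two common encoding approaches can be sketched with pandas as follows: an explicit integer mapping for an ordinal feature and one-hot encoding for an unordered feature. The data, the column names, and the ordering in `size_order` are hypothetical.

```python
import pandas as pd

# Hypothetical categorical features.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "size": ["small", "large", "medium"],
})

# Ordinal encoding: map each level to an integer that preserves its order.
size_order = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one 0/1 column per category for an unordered feature.
city_dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

encoded = pd.concat([df[["size_encoded"]], city_dummies], axis=1)
print(encoded)
```

The ordinal mapping suits features with a natural order, while one-hot encoding avoids implying an order between categories such as cities.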

Points of Interest

Becoming familiar with Data Science lingo, such as Supervised learning and Feature Engineering, is one of the most important aspects of the Data Science journey.
