Excel for Data Analyst

Binary Classification

True positive

  • Identity it is boomer
  • Take action

 

False positive

  • Take action

False negative

  • Do nothing

True negative

  • Identify it is seagull
  • Do nothing

Threshold

Information Measures

Measuring Uncertainty

Quantifying the Informational Edge

Probability and Entropy

Entropy of a Guessing Game

New Data and Information Gain

Dependence and Mutual Information

 

 

 

Business Analyst

Business Analyst

Identify relevant business metrics

“local knowledge”

  • Specific business domain the company has

 

Horizontal business metrics

  • Enterprise sales funnel

 

Vertical business metrics

  • Specific industry sector characteristics
  • Real estate and financial services

 

Apply models to analyze metrics

Models can run in Excel

Model

  • Mathematical form to simplify presentation of the real world situation
  • Math formula
  • Simple

Ability to quantify the model used

Different types,

Different measures

Binary classification

->Forecast default rates

->Different binary classifications

->Compare performance of classifications

 

Ability to listen

Interview customers, internal or external

Define project requirements

Reports

  • Collect customer requirements
  • Translate to product features and services
  • Internal use
    • Marketing, sales

Excel skills

  • Identify patterns and trends in business data
  • Make forecast
  • Organize financial information
  • Display conclusion in charts
  • Import big data set
  • Model
  • Run optimization

Presentation skills

Effective

  • Clean
  • Concise
  • Persuasive

Data visualization

Make conclusions and recommendations to non-technique audiences

 

BA vs. BDA

BDA

  • More senior position
  • Higher pay
  • Think flexibly
  • Analyze the business
  • How to find data, collect data
  • Figure the right questions to identify business
  • SQL

 

BA

  • Specific problem to analyze
  • Ready made data set

 

 

Tableau: Visualizations, Logic, and Stories

Storyboarding Your Data Story

Big picture -> details -> big picture

  1. Why? So what?
  2. SMART rec. (1 minute)
  3. What I am going to show
  4. Supporting evidence
  5. Link evidence to SMART rec.
  6. What you could gain
  7. Why? So what?

Appendix

Technical details

White paper

Making your data story come alive

Emotion context

Bring the pride to your presentation.

Closing in an emotional way.

 

Story elements

  • Characters
    • Stakeholders, yourself
  • Location
    • Actual story, picture in their mind
  • Conflict
  • Resolution

 

Tell a real life story to show the story elements

  • Family grouping dinner
  • Show the direction and satisfaction from the story
  • Let the audience the feel the same emotion

 

To start the presentation

  • Story
  • Use an image
  • Device
  • Statistics

 

Find something that is motivated to you!

 

Tips

  • Specific details
    • Add credibility
  • Positive messages -> positive emotions
  • Use large, high-quality pictures

 

Stress-testing Your Story

Present the data story to your stakeholders.

Sanity check

Storyboard

Team culture: the culture of bring forward any errors and that being celebrated.

Identify inconsistencies and logical arguments

Logic fallacies

Choose a class in Logic and Reasoning in Couresa

Overgeneralization and Sample bias

Assume what you are seeing in your dataset is what you would see.

Why

  • Your data is very small or sometimes it’s selected subset (biased).
  • A lot of missing data in your data set

 

It is tempting to think that if you just have a big enough data set, you should be able to overcome most types of sampling bias.

Smartphone owner: wealthier and younger people

Poor decisions and inaccurate predications.

Data shadows: lack of data

Difficulties

  • It can take a lot of detective work to figure out that your sample is biased in the first place

Single-mindedly paranoid about your data quality is actually one of the biggest ways that you can prevent mistakes.

How to avoid

  • Ask a lot of questions about how your data was collected
  • Check how many data points you have in all the groups you’re looking at
  • Split your full data set into three to five random subsets. See if you observe the same effects n each one as you see in the group as a whole

Misinterpretations Due to Lack of Controls

What

Controls: comparisons

Include designed comparison groups that should not have the effect you were looking for in your analyses to make sure the effects you observe are due to the events you think they are due to.

Correlation Does Not Equal Causation

Correlation

Refers to the phenomenon of two things having a tendency to vary together over multiple time points or multiple measurements.

 

Causation

Refers to the phenomenon of one thing happening as the result of the other thing.

Coincidence

When purely by chance two things happen at the same time.

Spurious relationship

When either due to chance or due to an unmeasured variable, but not by direct causality, two tings correlate.

Click-through rates

The percentage of users who click on an advertisement link when they see it.

 

How Correlations Impact Business Decision

Scientific method

Change the variable you think is causing the effect you want and hold all other things constant.

 

AB test

Test on two different groups

 

Tips

  • Every time you see a correlation between two entities related to a business recommendation you wanna make, get in the habit of questioning whether there is some other third or fourth or fifth variable that can explain the relationship you see.
  • Then look for data that allows you to test whether that third or fourth or fifth variable is a better measure of the phenomena you are interested in.
  • Examine whether the correlation you’re basing your business recommendation on exists in other contexts or datasets. The more you can replicate the effect, the less likely the first correlation you saw was due to random chance.
  • Try to come up with different, but complementary angles to assess the causal relationship you’re hypothesizing about.

The likelihood that you will get into trouble inferring causation from correlation, increases as the size of your data sets increase.

Likelihood for getting into trouble increase as the complexity of your data sets increases as well.

When you don’t know why two phenomena are correlated, you don’t know how to predict when their correlation might change.

Data is meant to inform human decision making, not replace it.

Risk, uncertainty

Think of data as a resource to increase the number of good decisions made.

Tools for Conveying Your Data Story

Basic Charts

Use bar charts and line charts as defaults

Purpose is to convey the most critical parts of your data.

Decision making process

Bar charts

Used for comparing measures in different groups or categories.

Show aggregated rather than raw data

 

Line charts

How values and categories vary over time.

Specific, inherent, sequential order

Pie charts

Not good

Scatter plots

Show raw data

Best fit line

Too confusing and overwhelming for non-technical audiences

Statistics background

Tips

Use

  • Bar charts (comparing categories)
  • Line charts (changes over time or an ordered category)
  • Pie charts (4 or fewer categories that add up to 100)

Don’t use

  • Scatter charts (unless you have a technical audience)
  • 3D charts (ever)

 

Visual Perception

Relative differences in position and length

Our eyes and brains don’t have the ability to perceive many of the differences other charts rely on.

Most accurate

  • Positions on a common scale
    • Line, bar
  • Lengths
  • Angles and slopes
  • Area
  • Volume
    • Pie
  • Color saturation or shading
  • Color hue

 

Misinterpretation Caused by Colorbars

Tips

Do not use color to

  • Convey detailed quantitative differences in the values of continuous variables

 

Do use color to

  • Illustrate general patterns
  • Code for different categories of categorical variables
  • Draw attention to something

 

Visual Contrast Directs Where Your Audience Looks

Influence what they look at.

You have already decided what parts are important.

And you goal is to only show your audience they need to make a decision about your recommendation.

Visualization for data analysis should show as much data as possible.

Visualization for persuasion should show selected pieces of data and should direct your audience’s eyes to the precise points of the data that support the arguments you are trying to persuade them of.

Salient

Stand out relative to their neighbors.

The human brain can only handle a limited amount of information at a time.

Our brains developed a focus on important things in the environment and filter out everything else.

Make sure they see only the parts they need to see in order to make a decision about your recommendation.

Putting Compelling Data Visualizations into Persuasive Business Presentations

Formatting Slides to Communicate Date Stories

Maximize the Data-Ink ratio

Data ink: ink that represents the actual data

Non-data ink: everything else

No extra information and text on the page

Understanding at a glance

Put label on the bar charts

 

Fonts

Times new roman, bodoni, garamond

Helvetica, calibri, arial

 

Don’t make your audience do visual math

Split the complex charts

Formatting Presentations to Communicate Date Stories

Visual formatting

Consistency

In 10 minutes

Transition slide

  • 3 dividend, left 2 is title, right 1 is picture

Animation

  • Simple

Tips

  • Check for typos. Then check again
  • Bold is readable than italics and underlining
  • No distorted or fuzzy pictures
  • Use 2-3 colors (but others can be used for highlighting)

Delivering your data story

What you will say

How you will say them

Fast, slow speed

Practice

Body gesture

 

References:

Data Analyst

General Tips

Ask questions, nourish curiosity and embrace the unknown

Start thinking about everything you see as a dependent or independent variable

Dependent variable

  • Phenomenon to be measured
  • Examine it to see if its value is dependent on, or changes in response to, other factors

Independent variable

  • Affect the dependent variables

Relationships : dependent variables and independent variables

Start exploring the advantages of Continuous vs. Discrete variables

Continuous

  • 17%
  • Harder to interpret by eye
  • More detailed information
  • Line graphs

Discrete

  • high, medium, low
  • Easier to understand
  • Often less precise
  • May hide important patterns
  • Bar graphs

Train your skepticism muscles

Really dramatic or surprising effect in your data

Seek Details

Don’t be satisfied with taking your data at face value, be curious about what your data really look like, and what they really mean.

Cherish Precision

Rule out inefficient directions

Predictions, hypotheses, or goals

Precision makes things getting easier.

Listen AND Contribute

Data analysis projects are almost always collaborations!

Listen and speak!

The best practices in data analytics are not necessarily the most common practices in data analytics

Work in team

Understand other people’s perspective

 

Introduction to Data-Centric Roles

Roles and companies as they relate to big data

-> Recognize

  • Business metrics

-> identify

  • Revenues
  • Profitability
  • Risk

-> Track

  • Business metrics
  • Identify

Roles

  • Business analyst
  • Business data analyst
  • Data scientist
    • Skill and experience
    • Started as business data analysts
  • Senior software engineer / Project manager
    • Different training

Companies

  • Strategic consulting: general business
  • Traditional bricks-and-mortar
  • Strategic consulting: information technology
  • Hardware and software companies
  • Digital business

 

References

 

 

 

Stakeholders

Listening to Stakeholders During Elicitation

Elicitation

Ask questions.

Three goals

  1. Determine who your key stakeholders are.
    • Anybody who is affected by your data analysis project or who might have a strong interest in it.
    • Narrow down the list to the stakeholders who will have the greatest potential to affect whether your project succeeds or fails.
  2. Identify independent variables to test
    • Care about what types of changes are easy or hard to do, given the company’s business culture or data architecture.
  3. Determine whether stakeholders agree about problem to be solved

Elicitation meeting questions

  • What has been tried before and how did it turn out?
  • What do you think might improve this business problem?

Stakeholder Expectations Matters

Most companies are not ready to incorporate data analytics into their business practices.

Best practices aren’t necessarily common practices.

Gartner Analytics continuum

  • Descriptive analytics
    • Answer the questions, what happened or what’s happening
    • Pie charts and bouncy bar charts
    • nothing too much more sophisticated that that
  • Diagnostic analytics
    • Answer the questions, why things are happening
    • Look for root causes and doing root cause analysis
    • Fraud detection
  • Predictive analytics
    • What is going to happen?
    • Trajectory of understanding why things are happening, what is happening to develop that trajectory of the likely scenario in the future
    • Forecasting
  • Prescriptive analytics
    • Giving recommendations

Technologies -> skill sets

Pay particular attention to what kinds of analytic tools and results they are willing to work with.

Data visualizations are explicitly designed to be accessible to anyone.

Advanced skills: linear regression analytics, machine learning, heuristic processing

 

References

S.P.A.P.

SPAP: Structure Pyramid Analysis Plans

Structures

Untitled picture.png

Layer 1

S.M.A.R.T. GOAL

In 2 months, analyze archived click-stream data to determine the website changes that will most efficiently increase revenues by 15% on 1 month-to-month basis compared to the same month last year.

Defines your dependent variables

DV1: Total $ spent per transaction

  • Clickstream database, “total spent” field aggregated by SUM over each transaction ID

DV2: Total $ spent per month

  • Clickstream database, “total spend” field aggregated by SUM over data (month)

DV3: Total $ spent per customer

  • Clickstream database, “total spent” field aggregated by SUM over each customer ID

Layer 2 and Layer 3

Independent variables

Do specific demographics disproportionately contribute to revenue?

  • Age
  • Gender
  • Income

Do specific behaviors disproportionately contribute to revenue?

  • Longer time on site
  • More visits to site

Did specific marketing strategies disproportionately contribute to revenue?

  • Promotional emails
  • Facebook ads
  • tweets

Specific analyses to run & graphs to make

SQL

Tableau

Using SPAPs

Before starting analysis, define the value the analysis is supposed to provide, have a plan for what you are going to do.

With hypothesis

Analysis plan

To be more specific, having a clear analysis plan will give you a map to make sure you stay on task and don’t get distracted with data that aren’t relevant to your smart goals, make it easier to track if there are gaps in your thinking, give you a framework to convey your conclusions, give you a straightforward structure to report your progress, and will make it possible for an entire team to be on the same page about what needs to happen next.

Total money spent

  • Per transaction
  • Per month
  • Per customer

Data missing or take a long time to prepare.

Prioritization

Feedback

Analyze relationships between smart metrics and independent variables

  • Tableau
  • Excel

Initial step

Make one or two charts to access every single category specially in layers two and three of your SPAP.

Bar charts, scatter plots, and line charts

To check any of the charts unearth obvious patterns.

Highlight any confirmed or not related category with different symbols.

Use mind map to draw the layers, prioritize, delete not related, highlight any critical

References

 

S.M.A.R.T

Why?

Objective of your project

After you’ve had your meeting, your next job is to synthesize what you heard and draft up a smart statement that reflects the goal of the analysis project.

Designing effective objectives

Specific, measurable, attainable, relevant, time-bound

As a checklist -> same page as your stakeholders

Business metrics

  • Dependent variables -> independent variables

Confirm the metric so that you can think of your project in terms of numbers.

Task: “People aren’t returning back to our site after they log in for the first time. I need you to tell us how to convert first time visitors to returning visitors.”

Specific and Measurable

“Increase the number of returning visitors on a month-by-month basis by at least 15%

compared to the same month during last year”

-> Rows and columns in table

Relevant

“The goal of the project is to determine the website changes that will most efficiently increase revenues by 15% on a month-by-month basis compared to the same month last year.”

Time-bound

“The goal of the project is to, by the end of two months, determine the website changes that will most efficiently increase revenues by 15% on a month-by-month basis compared to the same month last year.”

Attainable

Is the data available?

The goal of the project is to, by the end of two months, analyze archived click-stream data to determine the website changes that will most efficiently increase revenues by 15% on a month-by-month basis compared to the same month last year.

First in 3 months, install a system that will collect and store click-stream data in a cloud-base relational database. Then, the second part could be. By the end of 2 months, after the system has been installed, this data will be analyzed to determine the website changes that will most efficiently increase revenues by 15% on a month-by-month basis compared to the same month last year.

References

 

 

 

Tableau: Basics

Use Data Visualization to Drive Your Analysis

“Science groups and arranges tis truths so as to enable us to take in at one view as much as possible of the general order of the universe.”

“Science is knowledge which we understand so well that we can teach it to a computer…”

“… and if we don’t fully understand something, it is an art to deal with it.”

Basics

Scatter plot

  • Show the variability of distribution

Dimension

  • Independent data
  • Discrete Variables

Measures

  • Continuous Variables

 

Database -> Python -> Tableau/Excel

Rows: aggregation of the data

Columns: how to break the data, what categories to use, divide the data

Show me: best practices plots

Aggregate measures: aggregation rows, COUNT, MEDIAN, AVG

If not check, every single row in dataset will be shown -> show all data points

Filter, Group

Group items by using IDs. Then use grouped dimensions as filter

Group CA and California, Create group

Outliers

Extreme values

Dispersion

Scatterplot

Do not aggregate measures

Filters

Groups

 

Line Graphs and Box Plots

Continuous Variables (Measures)

Values along a continuous line

Discrete Variables (Dimensions)

Date parts

Advantages
    • Best format for computing best fit lines
    • Years are connected on line graphs
    • Easy analysis and labeling of data parts
    • Can see details in a box and whisker graph
Disadvantages
    • Can’t see all details in a single graph
    • Years are NOT connected on line graphs
    • Some unintuitive effects when computing best fit lines
Dependent variables, green pill, linear regression Independent variables, blue pill

Line graph

Time as x-axis

To show how two variables are related to each other

Linear regression

  • P value. How trust is the line result.

Box plot

Maximum, half-way, median, half-way, minimum

Quartiles

  • 1st quartile: 25% below this line
  • 3rd quartile: 25% above this line
  • interquartile range: 50%
  • Box:  50% of data inside it

Outliers

  • Out of quartiles

Dynamic Data Manipulation and Presentation in Tableau

Row-level Calculations

Make a different value for each row of your data.

Get a new column of data

Blending and Aggregation-level Calculations

Connect two different data sources with different level of aggregation

Common field of two data sources

  1. Aggregate the large table first based on some level
  2. Link to the small table  after aggregated

Primary data sources

  • Smallest data source

Second data source

  • Largest data source

Data -> Relationships

  • Custom: mapping fields

Table calculations and parameters

Rank adjusted paid salary in each subgroup

  • Edit table calculation
  • Dimensions
  • Calculation type: rank
  • Compute using: specific dimensions

Parameters

  • Global
  • Used in calculations

Dashboards and story points

Story points

It is a branch of worksheets.

Ordered in a logic sequences

References

Sprint Boot:

Why is Spring?

  • Application framework
    • Template
    • Dependency injection
  • Programming and configuration model
    • Business logic
  • Infrastructure support
    • Connect to database

Problems with Spring

  • Huge framework
  • Multiple setup steps
    • Connect to database
  • Multiple configuration steps
  • Multiple build and deploy steps
    • flexibility, capabilities come at a cost

Abstract these steps

Enter Spring Boot

  • Opinionated
  • Convention over configuration
  • Stand alone
  • Production ready

 

Spring STS

 

Maven

Build dependency management

 

pom.xml

  • declare all dependencies
  • list of dependencies which maven needs to know
  • download from repository

Create starter of project

  • archetype

 

<parent></parent>

Declare this maven project’s parent project

<parent>
<groupId>org.springframework.boot </groupId>
<artifactId>spring-boot-starter-parent </artifactId>
<version>1.5.1.RELEASE </version>
</parent>

 

Creating a Spring Boot project

Maven project

Pom.xml

 

Create a App class

main method

annotation

@SpringBootApplication

SpringApplication.run(CourseApiApp.class, args);

 

Starting Spring Boot

SpringApplication.run(CourseApiApp.class, args);

  • Sets up default configuration
    • Favor convention over configuration
  • Starts Spring application context
    • Container for all your code running on server, service
  • Performs class path scan
    • Plug in class
    • Annotate
    • Class contains an annotation infer controller, service
    • Look at class code, check annotation
  • Starts Tomcat server
    • Access url
    • Stand alone

 

References:

https://www.youtube.com/results?search_query=spring+boot+tutorial