Tuesday, March 31, 2015

What is OLAP

The definition of OLAP from Webopedia says, "Short for Online Analytical Processing, a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, it provides time series and trend analysis views. OLAP often is used in data mining."   What does that mean to you?

When it comes to BI, there are multiple ways to deliver data to the end users.  You can build reports, build a dashboard, or generate OLAP cubes, for example.  The problem with building an OLAP cube is that you are adding another layer of complexity.  That means more overhead in your cycle times: how long it takes to get data from point A to point B, from your source, through the transformations, to the end user.  Overhead might not mean anything to you in the beginning, but think long term when you think about data warehousing.  If you build it right up front, it will pay off in time and stress on the back end.

One day in your future, you will wake up and have a problem with your data.   The source system had issues.   The ETL process broke.   The data was dirty.   You come into the office ready to tackle the day and you have a Severity 1 staring you in the face.  What do you do about it?   


  1. You analyze the issue
  2. You correct the issue
  3. THEN you have to reload the data


Time is of the essence.  If your load takes 6 hours to complete, do you do anything at all?  Do you try to reload the data and then reload the cube?  Hey, it will happen, so what is your answer?

The second issue that I have with the cube is that it takes time to generate your model.  Then you come to the customer.  The customer may not like the way you put the cube together, because assumptions have to be made in order to build it.  You have business rules that might not meet the needs of your new customer.  What do you do now?  Do you build a new cube for every new customer?  My client just changed the way we calculate persistency for the business.  The first year, persistency looked great.  The second year, it tanked because of considerations made for year 1.  So the answer was to change the model.  Now, we don't use cubes for this client, but changing that model and then changing the cube would not be fun.
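For readers unfamiliar with the metric: persistency is, roughly, the share of business written in a cohort that is still in force after some period.  Here is a toy Python sketch of the idea (the field names, data, and 12-month rule are illustrative, not my client's actual model):

```python
# Hypothetical policy records: issue year and months the policy stayed in force.
policies = [
    {"issued": 2013, "months_in_force": 14},
    {"issued": 2013, "months_in_force": 6},
    {"issued": 2014, "months_in_force": 13},
    {"issued": 2014, "months_in_force": 12},
]

def persistency(policies, cohort_year, horizon_months=12):
    """Share of a cohort's policies still in force after the horizon."""
    cohort = [p for p in policies if p["issued"] == cohort_year]
    retained = [p for p in cohort if p["months_in_force"] >= horizon_months]
    return len(retained) / len(cohort)

print(persistency(policies, 2013))  # 0.5
print(persistency(policies, 2014))  # 1.0
```

Change the horizon or the retention rule and every cube built on the old definition has to be remodeled and reloaded, which is exactly the pain point.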

These are my quirks with OLAP.  You may think differently.  I would appreciate the feedback.

Joe
   

Wednesday, March 25, 2015

Where there is no vision, the people will perish

The old biblical saying holds true even in your IT organization.  Someone in your organization has to have a vision for the future of BI in the company.  What happens if there is no vision?  One word for you: "Tangents."  There are a million and one systems that your organization can use.  Most of them have some sort of reporting functionality attached.  These vendors need to refer to the BI roadmap for your organization.  What?  You don't have a roadmap?  Well, you need one.  Realize that if you don't drive your vendors, your vendors will drive you.  Your IT department can only support X number of systems.  The only reason for X+1 is a lack of vision.

Joe, I inherited this lack of vision.  How do I get to goodness?   Here is Joe's happy path:

1) Analyze your current environment.  What are the systems, and who uses them?
2) Ask yourself: which reporting environment do I want to use for ad hoc reports, general reporting, and dashboards?   NOTE: You can have three different tools for the three different needs.
3) Develop a plan to get down to three reporting systems.  Push as much reporting to your data warehouse as possible; after all, that is the purpose of the DW.  Your new favorite word is "sunset," because you're the new hatchet person for sunsetting reporting tools.  Sometimes you need to merge a legacy data warehouse into the EDW.  That depends on your business and where you want to move as an organization.
4) Before you purchase a new system, make it policy that all vendors will report through the data warehouse - bar none.

After you come up with your new vision and roadmap, socialize it to the business.   Be the new evangelist for clean reporting.   Call it "Common Core Reporting."  Feel free to steal that one if you want.


Joe

What makes a good BI Project Manager

The BI Project Manager will most likely be involved at some point in both data warehouse project management and BI project management.   What is the difference?   The BI Project Manager would be involved with dashboard/reporting-type projects, while the Data Warehouse Project Manager would handle infrastructure-related work (OS/database upgrades, infrastructure upgrades, new data streams).  So what makes a good Project Manager in the BI space?

1) Understanding of the data.  There is a difference between a project involving existing data vs. new data.   Existing data sources could involve just a front-end developer, whereas new data sources will involve your ETL guy, developers, QA for the back-end data, QA for the front end, etc.   Your time estimates will be vastly different since the project will be larger in scope.

2) Understanding of the infrastructure.  A data warehouse will involve more infrastructure than your typical database system.  A data warehouse might have an ETL server, a database server, a data mart, a reporting server, etc.    Some environments will have the ETL/DB server combined, but there are still many pieces.   There might also be various feeds providing input from source systems and exporting data to downstream systems.   Depending on the setup, you will need to engage your business resources for those elements when upgrading, for testing purposes.

3) Don't lump all projects together under one work stream.   You can have an ETL guy working on a DB upgrade while your dashboard guy is working on a dashboard upgrade.  Those are two separate projects with different timelines.

4) Reporting or dashboard projects should probably be handled using the Agile methodology, especially if your user community has not provided sufficient requirements.   It just works out better for everyone if the business is involved in the development cycle.

5) Understanding of IT projects in general.  A lot of shops use SharePoint as a document repository.   Although SharePoint and BI are separate entities, there are SharePoint BI projects because these two spaces are merging for general reporting.   You also need to understand the testing cycle: system testing, regression testing, UAT, etc.

6) Understanding of what is a realistic timeline from your developers.  Sometimes developers pad their estimates, which is fine, but your pad should be closer to 20% than 50%-75%.   A good understanding of common developer timelines will help you be successful in the long run.

I hope this helps and would appreciate any feedback.

Monday, March 23, 2015

Pushing BI to the forefront

Your data warehouse is a downstream system, meaning that you ingest data from your source systems.  Their changes are your changes, even when they don't tell you that the changes are coming.   The data warehouse team needs to work hand in hand with the owners of the source systems, let them know how you are implementing the data warehouse, and remind them that you are there.   This might be the point where you meet some resistance.

For my current client, they had their source systems in place, and the BI team was brought in to create a series of executive dashboards for every part of the company.  These dashboards not only help the company gauge how well it is doing, but they are also shared with the board of directors on a monthly basis.  Oftentimes the week before the board views this data is filled with anxiety for both the source systems and our BI team.   When you know the numbers for the business, and a number that should be static is changing from day to day, you have just pointed out a fatal flaw for one of your "business partners."   You have to make sure that your findings don't create a negative relationship for the future.  Your team may win the battle but lose the war.   Keep those doors open.  Maintain a level of openness and transparency.

Here are some tips to keep up a good relationship with your business partners:
1) Get buy-in from the upper levels of the organization.    BI leadership needs to meet with your business leadership to reinforce the importance of your BI roadmap and data quality plan.
2) BI managers need to meet periodically with your business partners about the data.
3) Whenever an issue arises, try to resolve it at the lower levels before escalating to your leadership.
4) Communicate deadlines so the business knows when you need data issues resolved.  You can use a BI newsletter or just emails.    We created a calendar of dates (dates for board meetings, dates the data is mailed, when your data is needed, signoffs, etc.).

Sometimes it's not the business partners that control the changes; it's your IT partners.   Communicate regularly with your IT leadership so that changes on the source systems get communicated downstream.   Perhaps you will want to add BI as a signoff on all change requests.   Remember, always have your ETL processes tweaked to reveal any changes to those source systems.   When all else fails, let the ETL process alert you to any changes to database schemas.   Sometimes you can't help it.  Even though you have alerted IT and the business about the need to notify the BI team, you have changes flow through that break your ETL.  It is always better to know about those changes on day 1 instead of day 10.   Best of luck to you.
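One lightweight way to have the ETL process alert you is to snapshot each source table's column list on every run and diff it against the previous run's snapshot.  A minimal Python sketch of the comparison step (in practice the snapshots would be read from the database's information schema; here they are hard-coded, and the table and column names are hypothetical):

```python
def diff_schema(previous, current):
    """Compare two {table: {column: type}} snapshots and report column drift."""
    changes = []
    for table in current:
        prev_cols = previous.get(table, {})
        for col, col_type in current[table].items():
            if col not in prev_cols:
                changes.append(f"{table}: column {col} added")
            elif prev_cols[col] != col_type:
                changes.append(f"{table}: column {col} changed type to {col_type}")
        for col in prev_cols:
            if col not in current[table]:
                changes.append(f"{table}: column {col} dropped")
    return changes

# Hypothetical snapshots from two consecutive loads.
previous = {"policy": {"policy_id": "int", "status": "varchar"}}
current = {"policy": {"policy_id": "int", "status": "char", "agent_id": "int"}}
print(diff_schema(previous, current))
```

Run the comparison at the start of each load and fail fast, or at least log loudly, whenever the list comes back non-empty.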

Joe

Friday, March 20, 2015

Big Data terms

Analytics and Big Data Glossary

Last updated: 9/24/14
ACID test
A test applied to data for atomicity, consistency, isolation, and durability.
ad hoc reporting
Reports generated for a one-time need.
ad targeting
The attempt to reach a specific audience with a specific message, typically by either contacting them directly or placing contextual ads on the Web.
algorithm
A mathematical formula placed in software that performs an analysis on a set of data.
analytics
Using software-based algorithms and statistics to derive meaning from data.
analytics platform
Software or software and hardware that provides the tools and computational power needed to build and perform many different analytical queries.
anonymization
The severing of links between people in a database and their records to prevent the discovery of the source of the records.
application
Software that is designed to perform a specific task or suite of tasks.
automatic identification and capture (AIDC)
Any method of automatically identifying and collecting data on items, and then storing the data in a computer system. For example, a scanner might collect data about a product being shipped via an RFID chip.
behavioral analytics
Using data about people’s behavior to understand intent and predict future actions.
big data
This term has been defined in many ways, but along similar lines. Doug Laney, then an analyst at the META Group, first defined big data in a 2001 report called “3-D Data Management: Controlling Data Volume, Velocity and Variety.” Volume refers to the sheer size of the datasets. The McKinsey report, “Big Data: The Next Frontier for Innovation, Competition, and Productivity,” expands on the volume aspect by saying that, “’Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
Velocity refers to the speed at which the data is acquired and used. Not only are companies and organizations collecting more and more data at a faster rate, they want to derive meaning from that data as soon as possible, often in real time.
Variety refers to the different types of data that are available to collect and analyze in addition to the structured data found in a typical database. Barry Devlin of 9sight Consulting identifies four categories of information that constitute big data:
1.    Machine-generated data. This includes RFID data, geolocation data from mobile devices, and data from monitoring devices such as utility meters.
2.    Computer log data, such as clickstreams from websites.
3.    Textual social media information from sources such as Twitter and Facebook.
4.    Multimedia social and other information from Flickr, YouTube, and other similar sites.
IDC analyst Benjamin Woo has added a fourth V to the definition: value. He says that because big data is about supporting decisions, you need the ability to act on the data and derive value.
biometrics
The use of technology to identify people by one or more of their physical traits.
brand monitoring
The act of monitoring your brand’s reputation online, typically by using software to automate the process.
business intelligence (BI)
The general term used for the identification, extraction, and analysis of data.
call detail record (CDR) analysis
CDRs contain data that a telecommunications company collects about phone calls, such as time and length of call. This data can be used in any number of analytical applications.
Cassandra
A popular choice of columnar database for use in big data applications. It is an open source database managed by The Apache Software Foundation.
cell phone data
Cell phones generate a tremendous amount of data, and much of it is available for use with analytical applications.
clickstream analytics
The analysis of users’ Web activity through the items they click on a page.
Clojure
Clojure is a dynamic programming language based on LISP that uses the Java Virtual Machine (JVM). It is well suited for parallel data processing.
cloud computing
A broad term that refers to any Internet-based application or service that is hosted remotely.
columnar database or column-oriented database
A database that stores data by column rather than by row. In a row-based database, a row might contain a name, address, and phone number. In a column-oriented database,  all names are in one column, addresses in another, and so on. A key advantage of a columnar database is faster hard disk access.
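The row-versus-column layout is easy to picture in code (a toy illustration, not how a real engine stores pages on disk):

```python
# Row-oriented: each record is stored together.
rows = [
    {"name": "Ann", "city": "Tulsa", "phone": "555-0101"},
    {"name": "Bob", "city": "Omaha", "phone": "555-0102"},
]

# Column-oriented: each attribute is stored together, so a query that
# scans only one attribute touches a single contiguous array.
columns = {
    "name":  ["Ann", "Bob"],
    "city":  ["Tulsa", "Omaha"],
    "phone": ["555-0101", "555-0102"],
}

# "How many distinct cities?" reads one column instead of every row.
print(len(set(columns["city"])))  # 2
```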
competitive monitoring
Keeping tabs of competitors’ activities on the Web using software to automate the process.
complex event processing (CEP)
CEP is the process of monitoring and analyzing all events across an organization’s systems and acting on them when necessary in real time.
Comprehensive Large Array-data Stewardship System (CLASS)
A digital library of historical environmental data from satellites operated by the U.S. National Oceanic and Atmospheric Association (NOAA).
computer-generated data
Any data generated by a computer rather than a human–a log file for example.
concurrency
The ability to execute multiple processes at the same time.
confabulation
The act of making an intuition-based decision appear to be data-based.
content management system (CMS)
Software that facilitates the management and publication of content on the Web.
cross-channel analytics
Analysis that can attribute sales, show average order value, or show lifetime value.
crowdsourcing
The act of submitting a task or problem to the public for completion or solution.
customer relationship management (CRM)
Software that helps businesses manage sales and customer service processes.
dashboard
A graphical reporting of static or real-time data on a desktop or mobile device. The data represented is typically high-level to give managers a quick report on status or performance.
data
A quantitative or qualitative value. Common types of data include sales figures, marketing research results, readings from monitoring equipment, user actions on a website, market growth projections, demographic information, and customer lists.
data access
The act or method of viewing or retrieving stored data.
data aggregation
The act of collecting data from multiple sources for the purpose of reporting or analysis.
data analytics
The application of software to derive information or meaning from data. The end result might be a report, an indication of status, or an action taken automatically based on the information received.
data analyst
A person responsible for the tasks of modeling, preparing, and cleaning data for the purpose of deriving actionable information from it.
data architecture and design
How enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes: the conceptual representation of business entities, the logical representation of the relationships among those entities, and the physical construction of the system to support the functionality.
database
A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system (DBMS).
database administrator (DBA)
A person, often certified, who is responsible for supporting and maintaining the integrity of the structure and content of a database.
database as a service (DaaS)
A database hosted in the cloud and sold on a metered basis. Examples include Heroku Postgres and Amazon Relational Database Service.
database management system (DBMS)
Software that collects and provides access to data in a structured format.
data center
A physical facility that houses a large number of servers and data storage devices. Data centers might belong to a single organization or sell their services to many organizations.
data cleansing
The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and provide more consistency.
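A toy sketch of one cleansing step, removing duplicates and normalizing whitespace and case (real cleansing pipelines handle far more, such as misspellings and missing values):

```python
raw = ["Acme Corp", "acme corp ", "Beta LLC", "Beta LLC"]

def cleanse(values):
    seen, cleaned = set(), []
    for v in values:
        key = v.strip().lower()     # normalize whitespace and case
        if key not in seen:         # drop duplicate entries
            seen.add(key)
            cleaned.append(v.strip())
    return cleaned

print(cleanse(raw))  # ['Acme Corp', 'Beta LLC']
```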
data collection
Any process that captures any type of data.
data custodian
A person responsible for the database structure and the technical environment, including the storage of data.
data-directed decision making
Using data to support making crucial decisions.
data exhaust
The data that a person creates as a byproduct of a common activity–for example, a cell call log or web search history.
data feed
A means for a person to receive a stream of data. Examples of data feed mechanisms include RSS or Twitter.
data governance
A set of processes or rules that ensure the integrity of the data and that data management best practices are met.
data integration
The process of combining data from different sources and presenting it in a single view.
data integrity
The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.
data management
According to the Data Management Association, data management incorporates the following practices needed to manage the full data lifecycle in an enterprise:
  • data governance
  • data architecture, analysis, and design
  • database management
  • data security management
  • data quality management
  • reference and master data management
  • data warehousing and business intelligence management
  • document, record, and content management
  • metadata management
  • contact data management
Data Management Association (DAMA)
A non-profit international organization for technical and business professionals “dedicated to advancing the concepts and practices of information and data management.”
data marketplace
A place where people can buy and sell data online.
data mart
The access layer of a data warehouse used to provide data to users.
data migration
The process of moving data between different storage types or formats, or between different computer systems.
data mining
The process of deriving patterns or knowledge from large data sets.
data model, data modeling
A data model defines the structure of the data for the purpose of communicating between functional and technical people to show data needed for business processes, or for communicating a plan to develop how data is stored and accessed among application development team members.
data point
An individual item on a graph or a chart.
data profiling
The process of collecting statistics and information about data in an existing source.
data quality
The measure of data to determine its worthiness for decision making, planning, or operations.
data replication
The process of sharing information to ensure consistency between redundant sources.
data repository
The location of permanently stored data.
data science
A recent term that has multiple definitions, but generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.
data scientist
A practitioner of data science.
data security
The practice of protecting data from destruction or unauthorized access.
data set
A collection of data, typically in tabular form.
data source
Any provider of data–for example, a database or a data stream.
data steward
A person responsible for data stored in a data field.
data structure
A specific way of storing and organizing data.
data visualization
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.
data virtualization
The process of abstracting different data sources through a single data access layer.
data warehouse
A place to store data for the purpose of reporting and analysis.
de-identification
The act of removing all data that links a person to a particular piece of information.
demographic data
Data relating to the characteristics of a human population.
Deep Thunder
IBM’s weather prediction service that provides weather data to organizations such as utilities, which use the data to optimize energy distribution.
distributed cache
A data cache that is spread across multiple systems but works as one. It is used to improve performance.
distributed object
A software module designed to work with other distributed objects stored on other computers.
distributed processing
The execution of a process across multiple computers connected by a computer network.
document management
The practice of tracking and storing electronic documents and scanned images of paper documents.
Drill
An open source distributed system for performing interactive analysis on large-scale datasets. It is similar to Google’s Dremel, and is managed by Apache.
elasticsearch
An open source search engine built on Apache Lucene.
electronic health records (EHR)
A digitized health record meant to be usable across different health care settings.
enterprise resource planning (ERP)
A software system that allows an organization to coordinate and manage all its resources, information, and business functions.
event analytics
Shows the series of steps that led to an action.
exabyte
One million terabytes, or 1 billion gigabytes of information.
external data
Data that exists outside of a system.
extract, transform, and load (ETL)
A process used in data warehousing to prepare data for use in reporting or analytics.
failover
The automatic switching to another computer or node should one fail.
Federal Information Security Management Act (FISMA)
A US federal law that requires all federal agencies to meet certain standards of information security across its systems.
grid computing
The performing of computing functions using resources from multiple distributed systems. Grid computing typically involves large files and is most often used for multiple applications. The systems that comprise a grid computing network do not have to be similar in design or in the same geographic location.
Hadoop
An open source software library project administered by the Apache Software Foundation. Apache defines Hadoop as “a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.”
HANA
A software/hardware in-memory computing platform from SAP designed for high-volume transactions and real-time analytics.
HBase
A distributed columnar NoSQL database.
high-performance computing (HPC)
HPC systems, also called supercomputers, are often custom built from state-of-the-art technology to maximize compute performance, storage capacity and throughput, and data transfer speeds.
Hive
A SQL-like query and data warehouse engine.
in-database analytics
The integration of data analytics into the data warehouse.
information management
The practice of collecting, managing, and distributing information of all types–digital, paper-based, structured, unstructured.
in-memory database
Any database system that relies on memory for data storage.
in-memory data grid (IMDG)
The storage of data in memory across multiple servers for the purpose of greater scalability and faster access or analytics.
Kafka
LinkedIn’s open-source message system used to monitor activity events on the web.
latency
Any delay in a response or delivery of data from one point to another.
legacy system
Any computer system, application, or technology that is obsolete, but continues to be used because it performs a needed function adequately.
linked data
As described by World Wide Web inventor Tim Berners-Lee, “Cherry-picking common attributes or languages to identify connections or relationships between disparate sources of data.”
load balancing
The process of distributing workload across a computer network or computer cluster to optimize performance.
location analytics
Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses. It allows you to associate geospatial information with datasets.
location data
Data that describes a geographic location.
log file
A file that a computer, network, or application creates automatically to record events that occur during operation–for example, the time a file is accessed.
long data
A term coined by mathematician and network scientist Samuel Arbesman that refers to “datasets that have massive historical sweep.”
machine-generated data
Any data that is automatically created from a computer process, application, or other non-human source.
machine learning
The use of algorithms to allow a computer to analyze data for the purpose of “learning” what action to take when a specific pattern or event occurs.
map/reduce
A general term that refers to the process of breaking up a problem into pieces that are then distributed across multiple computers on the same network or cluster, or across a grid of disparate and possibly geographically separated systems (map), and then collecting all the results and combining them into a report (reduce). Google’s branded framework to perform this function is called MapReduce.
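The map-then-reduce pattern can be illustrated in a few lines of plain Python (real frameworks distribute the two phases across machines; here both run locally on a toy word count):

```python
from collections import Counter
from functools import reduce

documents = ["big data big ideas", "data drives decisions"]

# Map: each "node" turns its document into partial word counts.
partials = [Counter(doc.split()) for doc in documents]

# Reduce: combine the partial results into one report.
totals = reduce(lambda a, b: a + b, partials)
print(totals["data"])  # 2
```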
mashup
The process of combining different datasets within a single application to enhance output–for example, combining demographic data with real estate listings.
massively parallel processing (MPP)
The act of processing a program by breaking it up into separate pieces, each of which is executed on its own processor, operating system, and memory.
master data management (MDM)
Master data is any non-transactional data that is critical to the operation of a business–for example, customer or supplier data, product information, or employee data. MDM is the process of managing that data to ensure consistency, quality, and availability.
metadata
Any data used to describe other data–for example, a data file’s size or date of creation.
MongoDB
An open-source NoSQL database managed by 10gen.
MPP database
A database optimized to work in a massively parallel processing environment.
multi-threading
The act of breaking up an operation within a single computer system into multiple threads for faster execution.
NoSQL
A class of database management system that does not use the relational model. NoSQL is designed to handle large data volumes that do not follow a fixed schema and do not require the relational model.
online analytical processing (OLAP)
The process of analyzing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability for users to select subsets and view them from different perspectives).
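The three operations can be sketched over a small fact table (an illustrative toy, not a real OLAP engine):

```python
from collections import defaultdict

# Tiny fact table: (year, region, product, sales)
facts = [
    (2014, "East", "Widget", 100),
    (2014, "West", "Widget", 150),
    (2014, "East", "Gadget", 80),
    (2015, "East", "Widget", 120),
]

# Consolidation (roll-up): aggregate sales by year.
by_year = defaultdict(int)
for year, region, product, sales in facts:
    by_year[year] += sales
print(dict(by_year))           # {2014: 330, 2015: 120}

# Drill-down: break one year out into its underlying detail rows.
east_2014 = [f for f in facts if f[0] == 2014 and f[1] == "East"]

# Slice and dice: select a subset and view it from another perspective,
# e.g. Widget sales by region across all years.
widget_by_region = defaultdict(int)
for year, region, product, sales in facts:
    if product == "Widget":
        widget_by_region[region] += sales
print(dict(widget_by_region))  # {'East': 220, 'West': 150}
```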
online transactional processing (OLTP)
The process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it.
OpenDremel
The open source version of Google’s BigQuery Java code. It is being integrated with Apache Drill.
Open Data Center Alliance (ODCA)
A consortium of global IT organizations whose goal is to speed the migration to cloud computing.
operational data store (ODS)
A location to gather and store data from multiple sources so that more operations can be performed on it before sending to the data warehouse for reporting.
parallel data analysis
Breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time. Parallel data analysis can occur within the same system or across multiple systems.
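A minimal illustration of the split-run-combine shape using the Python standard library (a real workload would use processes or a cluster; threads here just keep the example self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))

# Break the problem into smaller components...
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# ...run the same algorithm on each component at the same time...
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, chunks))

# ...then combine the partial results.
print(sum(partial_sums))  # 5050
```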
parallel method invocation (PMI)
Allows programming code to call multiple functions in parallel.
parallel processing
The ability to execute multiple tasks at the same time.
parallel query
A query that is executed over multiple system threads for faster performance.
pattern recognition
The classification or labeling of an identified pattern in the machine learning process.
performance management
The process of monitoring system or business performance against predefined goals to identify areas that need attention.
petabyte
One million gigabytes, or 1,000 terabytes.
Pig
A data flow language and execution framework for parallel computation.
predictive analytics
Using statistical functions on one or more datasets to predict trends or future events.
predictive modeling
The process of developing a model that will most likely predict a trend or outcome.
query analysis
The process of analyzing a search query for the purpose of optimizing it for the best possible result.
R
An open source software environment used for statistical computing.
radio-frequency identification (RFID)
A technology that uses wireless communications to send information about an object from one point to another.
real time
A descriptor for events, data streams, or processes that have an action performed on them as they occur.
recommendation engine
An algorithm that analyzes a customer’s purchases and actions on an e-commerce site and then uses that data to recommend complementary products.
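A toy co-occurrence version of the idea (hypothetical order data; production engines use far richer signals and models):

```python
from collections import Counter

# Hypothetical past orders: sets of items bought together.
orders = [
    {"camera", "sd_card"},
    {"camera", "tripod"},
    {"camera", "sd_card", "bag"},
    {"phone", "case"},
]

def recommend(item, orders, n=2):
    """Recommend the items that most often co-occur with `item`."""
    co = Counter()
    for order in orders:
        if item in order:
            co.update(order - {item})
    return [i for i, _ in co.most_common(n)]

# 'sd_card' appears in two camera orders, so it ranks first.
print(recommend("camera", orders))
```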
records management
The process of managing an organization’s records throughout their entire lifecycle, from creation to disposal.
reference data
Data that describes an object and its properties. The object may be physical or virtual.
report
The presentation of information derived from a query against a dataset, usually in a predetermined format.
risk analysis
The application of statistical methods on one or more datasets to determine the likely risk of a project, action, or decision.
root-cause analysis
The process of determining the main cause of an event or problem.
Sawzall
Google’s procedural domain-specific programming language designed to process large volumes of log records.
scalability
The ability of a system or process to maintain acceptable performance levels as workload or scope increases.
schema
The structure that defines the organization of data in a database system.
search
The process of locating specific data or content using a search tool.
search data
Aggregated data about search terms used over time.
Semantic Web
A project of the World Wide Web Consortium (W3C) to encourage the use of a standard format to include semantic content on websites. The goal is to enable computers and other devices to better process data.
semi-structured data
Data that is not structured by a formal data model, but provides other means of describing the data and hierarchies.
sentiment analysis
The application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.
server
A physical or virtual computer that serves requests for a software application and delivers those requests over a network.
smart grid
The smart grid refers to the concept of adding intelligence to the world’s electrical transmission systems with the goal of optimizing energy efficiency. Enabling the smart grid will rely heavily on collecting, analyzing, and acting on large volumes of data.
smart meter
An electrical meter that monitors and reports energy usage and is capable of two-way communication with the utility.
solid-state drive (SSD)
Also called a solid-state disk, a device that uses memory ICs to persistently store data.
software as a service (SaaS)
Application software that is used over the web by a thin client or web browser. Salesforce is a well-known example of SaaS.
storage
Any means of storing data persistently.
Storm
An open-source distributed computation system designed for processing multiple data streams in real time.
structured data
Data that is organized by a predetermined structure.
Structured Query Language (SQL)
A programming language designed specifically to manage and retrieve data from a relational database system.
terabyte
1,000 gigabytes.
text analytics
The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.
transactional data
Data that changes unpredictably. Examples include accounts payable and receivable data, or data about product shipments.
transparency
As more data becomes openly available, the idea of proprietary data as a competitive advantage is diminished.
unstructured data
Data that has no identifiable structure–for example, the text of email messages.
variable pricing
The practice of changing price on the fly in response to supply and demand. It requires real-time monitoring of consumption and supply.
weather data
Real-time weather data is now widely available for organizations to use in a variety of ways. For example, a logistics company can monitor local weather conditions to optimize the transport of goods. A utility company can adjust energy distribution in real time.
Whole Earth Model
An integrated data management system that allows geophysicists, engineers, and financial managers in the oil and gas industry to evaluate the potential of oil and gas fields.
Source: http://data-informed.com/glossary-of-big-data-terms/