Welcome To JPDESIGNS Blog

Friday, 27 February 2015

Introduction to Data Mining

This is an introduction to data mining. I have also recorded it as a podcast, available in the Podcast sidebar.

Data Mining

Business Intelligence (BI) Systems

·    Business intelligence (BI) systems are information systems that assist managers and other professionals:
-To analyse current and past activities
-To predict future events
·    Two broad categories:
-Reporting
-Data mining

Data for BI Systems

·    BI systems obtain data in three ways:
-From the operational database:
Read and process data only
DO NOT insert, modify or delete operational data
-From extracts from the operational database:
The extracted data is held in a BI DBMS
This may be a different DBMS from the operational DBMS
-From data purchased from data vendors

Data Mining Applications

·     Data mining applications are used to:
-Perform what-if analysis
-Make predictions
-Facilitate decision making
·    Data mining applications use sophisticated statistical and mathematical techniques

The Convergence of Disciplines

Data mining draws together several disciplines and trends: statistics and mathematics; artificial intelligence and machine learning; huge databases; data management technology; cheap computer processing and storage; and sophisticated marketing, finance and other business professionals.

Data Mining

·    The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial decisions (Simoudis, 1996)
·    Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data
·    Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing
·    A relatively new technology, but already used in a number of industries:
·    Retail / Marketing
-Identifying buying patterns of customers
-Finding associations among customer demographic characteristics
-Predicting response to mailing campaigns
-Market basket analysis
·    Banking
-Detecting patterns of fraudulent credit card use
-Identifying loyal customers
-Predicting customers likely to change their credit card affiliation
-Determining credit card spending by customer groups
·    Insurance
-Claims analysis
-Predicting which customers will buy new policies
·    Medicine
-Characterizing patient behaviour to predict surgery visits
-Identifying successful medical therapies for different illnesses
·    Mining analogy:
-Large volumes of data are sifted in an attempt to find something worthwhile, just as a mining operation sifts through large amounts of low-grade material to find something of value
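
Market basket analysis, listed under retail above, is one of the easier applications to illustrate. The sketch below is a minimal illustration under my own assumptions, not a production method: the baskets are made-up data and `pair_stats` is a hypothetical helper. It counts how often two items are bought together (support) and how often a basket containing the first item also contains the second (confidence).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

def pair_stats(baskets):
    """Return {(a, b): (support, confidence of a -> b)} for every ordered item pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), together in pair_counts.items():
        support = together / n  # share of all baskets containing both items
        stats[(a, b)] = (support, together / item_counts[a])
        stats[(b, a)] = (support, together / item_counts[b])
    return stats

stats = pair_stats(baskets)
```

Here `stats[("butter", "milk")]` gives the support and confidence for the rule "customers who buy butter also buy milk". Real tools handle larger item sets and far more data, but the counting idea is the same.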

Comparison of DM and DBMS

·    DBMS queries are answered directly from the data held, e.g.
-Last month's sales for each product
-Sales grouped by customer age etc.
-A list of customers who let their policy lapse
·    Data mining infers knowledge from the data held in order to answer queries, e.g.
-What characteristics do customers who let their policies lapse share, and how do they differ from those who renewed their policies?
-Why is the Cleveland division so profitable?
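
The difference between the two kinds of question can be sketched in a few lines. This is only a rough illustration: the `policy` table, its columns and its rows are invented for the example, and the "mining" step is a simple lapse-rate comparison standing in for the statistical techniques a real data mining tool would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (customer TEXT, age_group TEXT, lapsed INTEGER)")
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [
        ("Ann", "18-30", 1),
        ("Bob", "18-30", 1),
        ("Cat", "18-30", 0),
        ("Dan", "31-50", 0),
        ("Eve", "31-50", 0),
        ("Fay", "31-50", 1),
        ("Gus", "51+", 0),
    ],
)

# DBMS-style query: report facts exactly as they are stored.
lapsed_customers = [
    row[0]
    for row in conn.execute(
        "SELECT customer FROM policy WHERE lapsed = 1 ORDER BY customer"
    )
]

# Mining-style question: which characteristic do the lapsed customers share?
# Compare the lapse rate across age groups and pick the highest.
lapse_rate_by_age = dict(
    conn.execute("SELECT age_group, AVG(lapsed) FROM policy GROUP BY age_group")
)
highest_risk_group = max(lapse_rate_by_age, key=lapse_rate_by_age.get)
```

The first result is just a list of names. The second starts to say something about why customers lapse, which is the kind of knowledge data mining is after.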

Data Warehousing

Modern organizations are drowning in data but starving for information. Why?
1.    Information gap: organizations have developed their information systems and supporting databases in a fragmented way over many years, which makes it difficult for managers to locate and use accurate information.
2.    Most systems were developed to support operational processing, with little thought given to the information or analytical tools needed for decision making.
·    Operational processing (transaction processing): captures, stores and manipulates data to support the daily operations of the business
·    Information processing: analysis of data to support decision making
·    Data warehouses bridge the information gap: they consolidate and integrate information from many internal and external sources and arrange it in a meaningful format for making accurate and timely business decisions
·    They support executives, managers and business analysts in making complex decisions through applications such as:
-Analysis of trends
-Target marketing
-Competitive analysis
-Customer relationship management…

Definition

Data warehouse:

A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
·    Subject-oriented:
e.g. customers, patients, students, products

·    Integrated:
Consistent naming conventions, formats, encoding structures; from multiple data sources
·    Time-variant:
Can study trends and changes
·    Non-updatable:
Read-only, periodically refreshed from operational systems; cannot be updated by the end users.

Data Mart

A data warehouse that is limited in scope

Benefits of Data Warehousing

·         Potential high returns on investment
·         Competitive advantage
·         Increased productivity of corporate decision-makers

Data Warehousing Institute Awards

·         Past winners include:
·         Continental Airlines
·         Toyota
·         Bank of America
·         Iowa Department of Revenue

Operational Data Sources

·     The main sources are online transaction processing (OLTP) databases.
·     Other sources include personal databases and spreadsheets, Enterprise Resource Planning (ERP) files, and web usage log files.
·     Because extraction is periodic, the data in the warehouse is not completely current.

What is a Data Mart? (Smaller version)

Provides localization for departments or functions
Reduces demands on the
·         Warehouse
·         Network
Independent data mart data warehousing architecture: each data mart is loaded through its own Extract, Transform, Load (ETL) process
Data marts: Mini-warehouses, limited in scope

The ETL Process

·         Capture/Extract
·         Scrub (data cleansing)
·         Transform
·         Load and index
ETL = Extract, Transform, and Load

Capture/Extract = obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing only the changes that have occurred since the last static extract
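
The two kinds of extract can be sketched as follows. This is a simplified illustration only: the rows are made up, and the `updated_at` column is my assumption about how changes might be tracked; a real source system may use logs or triggers instead.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed change-tracking column.
source_rows = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2015, 2, 1)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2015, 2, 20)},
    {"id": 3, "name": "Cat", "updated_at": datetime(2015, 2, 26)},
]

def static_extract(rows):
    """Snapshot of every selected source row at this point in time."""
    return [dict(row) for row in rows]

def incremental_extract(rows, last_extract_time):
    """Only the rows that changed after the previous extract."""
    return [dict(row) for row in rows if row["updated_at"] > last_extract_time]

snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2015, 2, 15))
```

Note that a timestamp comparison like this cannot see rows that were deleted from the source, which is one reason real incremental extracts are usually more involved.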
Scrub/Cleanse = using pattern recognition and AI techniques to improve data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform = converting data from the format of the operational system to the format of the data warehouse
Record-level: selection (data partitioning), join (data combining), aggregation (data summarization)
Field-level: single-field (from one field to one field) or multi-field (from many fields to one, or from one field to many)
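
The three record-level transformations can be shown on a tiny example. The `sales` rows and `products` lookup below are hypothetical data invented for the sketch, and plain Python stands in for whatever ETL tool would normally do this work.

```python
from collections import defaultdict

# Hypothetical extracted rows.
sales = [
    {"product_id": 1, "region": "North", "amount": 100.0},
    {"product_id": 2, "region": "North", "amount": 50.0},
    {"product_id": 1, "region": "South", "amount": 70.0},
]
products = {1: "Widget", 2: "Gadget"}

# Selection: partition the data by keeping only rows that meet a condition.
north_sales = [row for row in sales if row["region"] == "North"]

# Join: combine each sale with the matching product record.
joined = [{**row, "product_name": products[row["product_id"]]} for row in sales]

# Aggregation: summarize the detail rows into one total per product.
totals = defaultdict(float)
for row in joined:
    totals[row["product_name"]] += row["amount"]
totals = dict(totals)
```

After this runs, `totals` holds one summarized amount per product name, which is the level of detail a warehouse table might store instead of every individual sale.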
Load/Index = placing the transformed data into the warehouse and creating indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
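
The two load modes differ mainly in how much of the target gets rewritten. In the sketch below a plain dictionary keyed by `id` stands in for a warehouse table; the function names and rows are hypothetical. It also simplifies update mode by overwriting a changed row, whereas a warehouse that keeps history would add a new version of the row instead.

```python
def refresh_load(target, source_rows):
    """Refresh mode: discard the target contents and rewrite them in bulk."""
    target.clear()
    target.update({row["id"]: row for row in source_rows})

def update_load(target, changed_rows):
    """Update mode: write only the changed rows, leaving the rest in place."""
    for row in changed_rows:
        target[row["id"]] = row

warehouse = {}
refresh_load(warehouse, [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}])
update_load(warehouse, [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cat"}])
```

After the update, row 1 is untouched, row 2 carries the changed name and row 3 is new. A second call to `refresh_load` would throw all of that away and rebuild the table from whatever rows it is given.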
