Introduction to Data Mining
This is an introduction to data mining. I have also recorded it as a podcast, available in the Podcast sidebar.
Data Mining
Business Intelligence (BI) Systems
· Business intelligence (BI) systems are
information systems that assist managers and other professionals:
-To analyse current and past activities
-To predict future events
· Two broad categories:
-Data mining
Data for BI Systems
· BI systems obtain data in three ways:
-From the operational database:
Read and process data only
DO NOT insert, modify or delete operational data
-From extracts from the operational database:
Data is in a BI DBMS
May be a different DBMS than the operational DBMS
-From data purchased from data vendors
Data Mining Applications
· Data mining applications are used to:
-Perform what-if analysis
-Make predictions
-Facilitate decision making
· Data mining applications use sophisticated
statistical and mathematical techniques
The Convergence of the Disciplines
Statistics and mathematics, artificial intelligence and machine
learning, huge databases, data management technology, cheap computer processing
and storage, and sophisticated marketing, finance and other business professionals all converge in data mining
Data Mining
· The process of extracting valid, previously
unknown, comprehensible, and actionable information from large databases and
using it to make crucial decisions (Simoudis, 1996)
· Involves the analysis of data and the use of
software techniques for finding hidden and unexpected patterns and
relationships in sets of data
· Data mining can provide huge paybacks for
companies that have made a significant investment in data warehousing
· A relatively new technology, but already used
in a number of industries:
· Retail / Marketing
-Identifying buying patterns of customers
-Finding associations among customer
demographic characteristics
-Predicting response to mailing campaigns
-Market basket analysis
· Banking
-Detecting patterns of fraudulent credit
card use
-Identifying loyal customers
-Predicting customers likely to change
their credit card affiliation
-Determining credit card spending by
customer groups
· Insurance
-Claims analysis
-Predicting which customers will buy new policies
· Medicine
-Characterizing patient behaviour to
predict surgery visits
-Identifying successful medical therapies
for different illnesses
· Mining analogy:
-Large volumes of data are sifted in an
attempt to find something worthwhile, just as a mining operation sifts through
large amounts of low-grade material in order to find something of value
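Market basket analysis, listed under Retail / Marketing above, is one of the easier applications to illustrate. The sketch below counts how often pairs of items are bought together and derives the two usual measures, support and confidence. The baskets are made-up data, and `support` and `confidence` are illustrative helper names, not part of any library.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions: each basket is the set of items in one purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "crisps"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]

# How many baskets contain each item, and each (alphabetically sorted) pair.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    pair = tuple(sorted((item_a, item_b)))
    return pair_counts[pair] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

bread_milk_support = support("bread", "milk")
milk_then_bread = confidence("milk", "bread")
bread_then_milk = confidence("bread", "milk")
```

In this toy data every basket with milk also contains bread, while only three of the four bread baskets contain milk, so the rule "milk implies bread" is stronger than the reverse. Real tools apply the same idea to millions of baskets and prune rare combinations first.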
Comparison of DM and DBMS
· A DBMS answers queries based on the data held, e.g.
-Last month's sales for each product
-Sales grouped by customer age, etc.
-List of customers who lapsed their policy
· Data mining infers knowledge from the data held
to answer queries, e.g.
-What characteristics do customers share
who lapsed their policies, and how do they differ from those who renewed their policies?
-Why is the Cleveland division so profitable?
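The difference can be seen on a small example. In the sketch below, the first query simply retrieves what is stored (the list of lapsed customers), while the second part looks for the attribute that best separates lapsed customers from those who renewed. The `customers` table, its columns and its rows are invented for illustration, and comparing lapse rates per attribute value is only a crude stand-in for a real data mining technique such as a decision tree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, age_band TEXT, pays_by TEXT, lapsed INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "18-30", "cash", 1),
        (2, "18-30", "cash", 1),
        (3, "18-30", "direct debit", 0),
        (4, "31-50", "direct debit", 0),
        (5, "31-50", "cash", 1),
        (6, "31-50", "direct debit", 0),
        (7, "51+", "direct debit", 0),
        (8, "51+", "direct debit", 1),
    ],
)

# DBMS-style query: report the data held.
lapsed_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM customers WHERE lapsed = 1 ORDER BY id")
]

# Mining-style question: which attribute best explains lapsing?
def lapse_rates(column):
    """Lapse rate for each value of one attribute column."""
    if column not in {"age_band", "pays_by"}:  # guard the interpolated name
        raise ValueError(column)
    sql = f"SELECT {column}, AVG(lapsed) FROM customers GROUP BY {column}"
    return dict(conn.execute(sql).fetchall())

rates = {column: lapse_rates(column) for column in ("age_band", "pays_by")}

def spread(column):
    """Gap between the highest and lowest lapse rate for this attribute."""
    values = rates[column].values()
    return max(values) - min(values)

most_telling_attribute = max(rates, key=spread)
```

Here the payment method separates the two groups far more sharply than the age band does, which is the kind of previously unknown pattern a data mining tool is meant to surface.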
Data Warehousing
Modern organizations are drowning in data but starving for
information. Why?
1. Information gap: the fragmented way organizations
have developed information systems and supporting databases over many years
makes it difficult for managers to locate and use accurate information.
2. Most systems were developed to support operational
processing, with little thought given to the information or analytical tools
needed for decision making.
· Operational processing – Transaction processing
– captures, stores and manipulates data to support daily operations of business
· Information processing - analysis of data to support decision making
· Bridging the information gap are data warehouses that consolidate and integrate
information from many internal and external sources and arrange it in a
meaningful format for making accurate and timely business decisions
· They support executives, managers and business
analysts in making complex decisions through applications such as
Analysis of trends
Target marketing
Competitive analysis
Customer relationship management…
Data warehouse:
A subject-oriented, integrated, time-variant, non-updatable
collection of data used in support of management decision-making processes
· Subject-oriented:
e.g. customers, patients, students, products
· Integrated:
Consistent naming conventions, formats, encoding structures;
from multiple data sources
· Time-variant:
Can study trends and changes
· Non-updatable:
Read-only, periodically refreshed from operational systems;
cannot be updated by end users.
Data Mart
A data warehouse that is limited in scope
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate
Data Warehousing Institute Awards
Past winners include:
Continental Airlines
Bank of America
Iowa Department of Revenue
Operational Data Sources
· Main sources are online transaction processing
(OLTP) databases.
Other sources include personal databases and
spreadsheets, Enterprise Resource Planning (ERP) files, and web usage logs
What is a Data Mart? (Smaller version)
Provides localization for departments or functions
Reduces demands on the
Independent data mart data warehousing architecture (Extract, Transform, Load)
Data marts: mini-warehouses, limited in scope
The ETL Process
Capture/Extract
Scrub or data cleansing
Transform
Load and index
ETL = Extract, Transform, and Load
Capture/Extract: obtaining a
snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract = capturing a snapshot of the source data at
a point in time
Incremental extract = capturing changes that have occurred
since the last static extract
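The two kinds of extract can be sketched in a few lines. The example assumes each source row carries an `updated_at` timestamp, which is a common way to detect changes but is an assumption here, as are the sample rows and the function names.

```python
from datetime import datetime

# Hypothetical source rows; the updated_at column is an assumption.
source_rows = [
    {"id": 1, "customer": "Ann", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "customer": "Bob", "updated_at": datetime(2024, 1, 20)},
    {"id": 3, "customer": "Cy", "updated_at": datetime(2024, 2, 2)},
]

def static_extract(rows):
    """Static extract: copy every chosen row as it stands right now."""
    return [dict(row) for row in rows]

def incremental_extract(rows, last_extract_time):
    """Incremental extract: copy only rows changed since the last extract."""
    return [dict(row) for row in rows if row["updated_at"] > last_extract_time]

snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2024, 1, 15))
changed_ids = [row["id"] for row in changes]
```

A static extract is typically used for the initial fill of the warehouse, with incremental extracts keeping it current afterwards.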
Scrub/Cleanse: uses pattern
recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field
usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key
generation, merging, error detection/logging, locating missing data
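A minimal scrubbing pass might look like the sketch below. It uses simple hand-written rules (a correction table, date parsing and a duplicate check) rather than the pattern recognition and AI techniques mentioned above, and the records, the `CITY_FIXES` table and the function names are all invented for illustration.

```python
from datetime import datetime

# Hypothetical correction table mapping known spellings to the clean value.
CITY_FIXES = {"clevland": "Cleveland", "cleveland": "Cleveland"}

def parse_date(text):
    """Return a date, or None if the value is missing or not a real date."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None

def scrub(records):
    """Split records into cleaned rows and rows logged as errors."""
    clean, errors, seen = [], [], set()
    for record in records:
        city = CITY_FIXES.get((record.get("city") or "").strip().lower())
        joined = parse_date(record.get("joined"))
        if city is None or joined is None:
            errors.append(record)      # error detection/logging
            continue
        name = record["name"].strip()
        key = (name.lower(), city)
        if key in seen:                # duplicate data
            continue
        seen.add(key)
        clean.append({"name": name, "city": city, "joined": joined})
    return clean, errors

raw = [
    {"name": "Ann Lee", "city": "Clevland", "joined": "2023-01-15"},    # misspelling
    {"name": "ann lee ", "city": "Cleveland", "joined": "2023-01-15"},  # duplicate
    {"name": "Bob Ray", "city": "Cleveland", "joined": "2023-02-30"},   # erroneous date
    {"name": "Cy Dunn", "city": None, "joined": "2023-03-01"},          # missing data
]
clean_rows, error_rows = scrub(raw)
```

Rows that cannot be repaired are logged rather than silently dropped, so they can be corrected at the source.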
Transform = convert data from the format
of the operational system to the format of the data warehouse
Record-level: selection (data partitioning), join (data
combining), aggregation (data summarization)
Field-level: single-field (from one field to one field); multi-field
(from many fields to one, or one field to many)
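The transformation types above can each be shown in one step of a small pipeline. The sketch below uses plain Python structures; the `sales` and `customers` data, the region codes and the field names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical operational data.
sales = [
    {"sale_id": 1, "cust_id": 10, "region_code": "E", "amount": 120.0},
    {"sale_id": 2, "cust_id": 11, "region_code": "W", "amount": 80.0},
    {"sale_id": 3, "cust_id": 10, "region_code": "E", "amount": 50.0},
    {"sale_id": 4, "cust_id": 12, "region_code": "E", "amount": 0.0},
]
customers = {
    10: {"first": "Ann", "last": "Lee"},
    11: {"first": "Bob", "last": "Ray"},
    12: {"first": "Cy", "last": "Dunn"},
}
REGIONS = {"E": "East", "W": "West"}

# Record-level selection: keep only the rows of interest.
selected = [sale for sale in sales if sale["amount"] > 0]

# Record-level join: combine each sale with its customer record.
joined = [{**sale, **customers[sale["cust_id"]]} for sale in selected]

# Field-level transformations:
#   single-field: region_code -> region (one field to one field)
#   multi-field:  first + last -> customer (many fields to one)
rows = [
    {
        "customer": f'{row["first"]} {row["last"]}',
        "region": REGIONS[row["region_code"]],
        "amount": row["amount"],
    }
    for row in joined
]

# Record-level aggregation: summarize sales by region.
totals_by_region = defaultdict(float)
for row in rows:
    totals_by_region[row["region"]] += row["amount"]
totals_by_region = dict(totals_by_region)
```

In a real warehouse these steps are more often written in SQL or an ETL tool, but the categories are the same.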
Load/Index = place transformed data
into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
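The two load modes differ only in how much of the target is rewritten. The sketch below models the warehouse table as a dictionary keyed by `id` and builds a simple index after loading; the function names and sample rows are illustrative assumptions.

```python
from collections import defaultdict

def refresh_load(source_rows):
    """Refresh mode: rebuild the whole target from the source."""
    return {row["id"]: dict(row) for row in source_rows}

def update_load(target, changed_rows):
    """Update mode: write only the changed or new rows into the target."""
    for row in changed_rows:
        target[row["id"]] = dict(row)
    return target

def build_index(target, column):
    """Index step: map each value of `column` to the ids that hold it."""
    index = defaultdict(list)
    for row_id, row in target.items():
        index[row[column]].append(row_id)
    return dict(index)

# Initial fill of the warehouse table.
warehouse = refresh_load([
    {"id": 1, "region": "East", "amount": 120.0},
    {"id": 2, "region": "West", "amount": 80.0},
])

# Later, only the rows that changed in the source are written.
update_load(warehouse, [
    {"id": 2, "region": "West", "amount": 95.0},   # changed row
    {"id": 3, "region": "East", "amount": 50.0},   # new row
])

region_index = build_index(warehouse, "region")
```

Refresh mode is simpler but rewrites everything; update mode pairs naturally with an incremental extract.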