Rodney M. Goodman B.Sc., Ph.D., C.Eng.,
SMIEEE, FIEE.
Data Analysis Using the ITRule Algorithm
Rodney M. Goodman, Dr. John Lindal
The ITRule (Information Theoretic Rule Induction)
Algorithm is a suite of software programs developed at Caltech for
automated expert systems design. ITRule performs an information
theoretic and statistical analysis of any database in order to discover
the useful knowledge that is implicitly buried within that database.
ITRule outputs this information as "rules".
ITRule can be used
to:
- Discover the most informative correlations between the data
variables or "attributes" and express them as probabilistic
rules of the form: IF Stocks>75% AND Assets=Large THEN FundType=Growth
with prob. 0.8
- Produce a ranked list of rules in order of "informational
strength".
- Identify the relative importance of data attributes.
- Perform automated knowledge acquisition for expert systems,
and load the rules directly into a number of expert system shells,
allowing an expert system to be bootstrapped directly from
data with minimal use of human experts.
- Produce executive summaries of the most important knowledge
in the data, for human presentation and discussion.
- Implement an expert system onto a parallel high-speed inference
architecture.
- Perform true probabilistic inference as a self-contained Expert
System. When presented with previously unseen input data
vectors, ITRule predicts the values of any unknown data attributes
via the rules. Inference is Bayesian, with the option of outputting
either a simple classification decision or, much more powerfully,
a probability estimate of each inferred output attribute-value.
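As a rough illustration of what the probability attached to a rule means, the sketch below estimates it as the fraction of examples matching the IF part that also satisfy the THEN part. This is a minimal Python sketch, not ITRule's code: the function `rule_probability`, the row layout, and the toy rows are all assumptions made for illustration, and the page does not say which estimator ITRule actually uses.

```python
from typing import Dict, List

Row = Dict[str, str]  # one database example: attribute name -> (discrete) value


def rule_probability(rows: List[Row], lhs: Dict[str, str],
                     rhs_attr: str, rhs_value: str) -> float:
    """Estimate p(RHS | LHS) as a relative frequency.

    Hypothetical helper: counts the rows that satisfy every LHS condition,
    then returns the share of those rows in which rhs_attr == rhs_value.
    """
    matching = [r for r in rows if all(r.get(a) == v for a, v in lhs.items())]
    if not matching:
        raise ValueError("no rows satisfy the rule's left-hand side")
    hits = sum(1 for r in matching if r.get(rhs_attr) == rhs_value)
    return hits / len(matching)


# Made-up toy data, already quantized (not taken from any real database).
rows = [
    {"Stocks": ">75%", "Assets": "Large", "FundType": "Growth"},
    {"Stocks": ">75%", "Assets": "Large", "FundType": "Growth"},
    {"Stocks": ">75%", "Assets": "Large", "FundType": "Growth"},
    {"Stocks": ">75%", "Assets": "Large", "FundType": "Growth"},
    {"Stocks": ">75%", "Assets": "Large", "FundType": "Growth&Income"},
    {"Stocks": "<=75%", "Assets": "Small", "FundType": "Growth&Income"},
]

# IF Stocks>75% AND Assets=Large THEN FundType=Growth
p = rule_probability(rows, {"Stocks": ">75%", "Assets": "Large"},
                     "FundType", "Growth")
print(p)  # 4 of the 5 matching rows are Growth funds
```

With these toy rows the rule holds in 4 of the 5 examples that match its left-hand side, which corresponds to the "with prob. 0.8" form shown in the list above.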
ITRule has been used to automate expert
system rule generation and analyze data for a wide variety of applications
including:
- Medical Diagnosis.
- Telecommunication Trouble Ticket Analysis and Fault Finding.
- Stock Market and Mutual Fund Analysis.
- DNA Sequence Analysis.
- Real Time Telephone Network Alarms Analysis.
- Mineral Classification using Satellite Synthetic Aperture
Radar Returns.
- Antenna Fault Diagnosis on the JPL Deep Space Network.
- Sonar Returns Signature Analysis.
- Supermarket Product/Sales Analysis.
- Census and Questionnaire Data.
- Database Query Optimization.
Reference: P. Smyth and R.M. Goodman, “An Information Theoretic
Approach to Rule Induction from Databases,” IEEE Transactions
on Knowledge and Data Engineering, Vol. 4, No. 4, pp. 301-316,
August 1992.
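This page does not say how the "informational strength" of a rule is computed. The paper cited above defines a quantity called the J-measure for a rule IF Y=y THEN X=x, and the sketch below assumes that this is the kind of score behind the ranking. The function name `j_measure` and its parameter names are illustrative, not part of ITRule.

```python
import math


def j_measure(p_lhs: float, p_rhs: float, p_rhs_given_lhs: float) -> float:
    """J-measure (in bits) of the rule IF Y=y THEN X=x.

    p_lhs           -- p(y), how often the rule's IF part is satisfied
    p_rhs           -- p(x), prior probability of the conclusion
    p_rhs_given_lhs -- p(x|y), probability of the conclusion when the rule fires

    J = p(y) * [ p(x|y) log2(p(x|y)/p(x))
                 + (1 - p(x|y)) log2((1 - p(x|y)) / (1 - p(x))) ]
    """
    if not 0.0 < p_rhs < 1.0:
        raise ValueError("p_rhs must lie strictly between 0 and 1")

    def term(posterior: float, prior: float) -> float:
        # By convention 0 * log(0) is taken to be 0.
        if posterior == 0.0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    return p_lhs * (term(p_rhs_given_lhs, p_rhs)
                    + term(1.0 - p_rhs_given_lhs, 1.0 - p_rhs))


# Hypothetical numbers: the IF part holds in 20% of examples and raises the
# probability of the conclusion from a prior of 0.4 to 0.8.
j = j_measure(p_lhs=0.2, p_rhs=0.4, p_rhs_given_lhs=0.8)
print(round(j, 3))  # roughly 0.097 bits
```

The score is a product of two factors: how often the rule applies, and how far the rule moves the probability of the conclusion away from its prior. A rule that fires rarely or that barely changes the prior therefore scores low under this measure.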
Using ITRule
- ITRule requires a flat database text file
in a rows and columns format. The columns or fields of the database
are the "attributes" or "variables" of the
domain. Each row is an "example" or instance of the
problem. The entries in the matrix are the "values"
taken by the attributes.
- Attributes (data variables, field names) can be either categorical
or continuous. Categorical (or discrete) attributes take a finite
set of values, for example the attribute "Rank" may
take the values "Private", "Corporal", and
"Sergeant". A continuous attribute takes either integer
values or real number values. For example, the attribute "Temperature"
takes real number values such as 28.45 degrees. An attribute
such as "number of shipments" takes integer number
values such as 28 or 65. Continuous attributes need to be quantized
(made into discrete attributes by splitting into ranges). ITRule
can do this automatically, but it is much better
if "meaningful" ranges of the data can be identified
given the context of the data. For example, within a particular
context it may make sense to split a continuous variable called
"Temperature" into three significant ranges, thus
resulting in a discretized "Temperature" variable
which takes the values "below_zero", "zero_to_boiling",
and "above_boiling".
- ITRule can optionally be told which attributes are "right-hand-side"
(RHS) or "hypothesis" attributes, and which are "left-hand-side"
(LHS) or "data only" attributes. RHS attributes can
appear both in the conclusion part of rules and in the LHS of
other rules. Data only attributes appear only in the LHS of
rules. By default, ITRule treats all variables as RHS attributes.
- ITRule's output can take several forms: printed rules, rules
in a format suitable for a number of standard expert system
shells, rules that can be loaded onto a number of neural network
simulators, or rules for ITRule's own internal probabilistic
Rule Based Network inference mechanism. The rules can then be
used to perform prediction on new examples.
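The manual quantization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page does not specify the file delimiter (comma-separated is assumed here), the cut points of 0 and 100 assume the temperatures are in degrees Celsius, and `quantize_temperature` and the sample rows are hypothetical.

```python
import csv
import io


def quantize_temperature(celsius: float) -> str:
    """Map a continuous temperature onto three named ranges.

    The cut points (0 and 100 degrees Celsius) are an assumption chosen to
    match the range names used in the text.
    """
    if celsius < 0.0:
        return "below_zero"
    if celsius <= 100.0:
        return "zero_to_boiling"
    return "above_boiling"


# Made-up flat database: one row per example, one column per attribute.
FLAT_FILE = """Rank,Temperature
Private,28.45
Corporal,-3.0
Sergeant,120.5
"""

rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))
for row in rows:
    # Replace the continuous value with its discrete range label.
    row["Temperature"] = quantize_temperature(float(row["Temperature"]))

for row in rows:
    print(row["Rank"], row["Temperature"])
```

After this step every attribute in `rows` is categorical, which is the form the text says rule induction needs.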
An Example of ITRule Analysis using a Mutual Funds
Database
As a simple example of using ITRule, we show how
rules can be generated from a database of mutual funds information.
- Figure 1 (pdf) shows a portion
of the raw database of mutual funds. Note there are several
thousand funds in the whole database. Each row represents an
example of a particular mutual fund (the name is omitted). The
column headings are the attributes of interest when thinking
of investing in a mutual fund. An example of a categorical attribute
is "Fund Type" which takes three possible values:
"Growth", "Growth&Income", and "Aggressive
Growth". An example of a continuous attribute is the "Beta"
or "risk" variable of the fund.
- Figure 2 (pdf) shows a derived
database in which all the continuous variables have been quantized.
Some of these quantizations are "obvious", and some
require expert contextual knowledge of the database domain.
For example, the "Beta" or "risk" attribute
is naturally defined as above and below 1, where 1 is the market
risk. On the other hand the Capital Gain attribute has been
quantized to three ranges on expert advice.
- Figure 3 (pdf) shows the rules
output by ITRule ranked in order of information content. The
probability value is a measure of the "reliability"
of the rule, for example, a completely deterministic rule (a
certainty) has probability 1. The strength value is the
information content of the rule relative to that of the most informative
rule. The rules display information of interest to the investor,
and can be used to predict the performance of new funds. For
example, rule 15 states that IF Assets are Low THEN the 5 year
Return of the fund will be below the Standard and Poor's Market
average return with high probability (0.87).
- Figure 4 (pdf)
depicts the rules loaded into an expert system shell (NEXPERT®),
showing how an initial rule base can be rapidly and automatically
prototyped from the data.
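The strength column described for Figure 3 can be read as a simple normalization: each rule's information content divided by that of the most informative rule. The sketch below assumes exactly that reading. The rule names and information values are invented for illustration and are not taken from Figure 3.

```python
from typing import Dict, List, Tuple


def relative_strengths(information: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank rules by information content and scale so the top rule scores 1.0.

    Hypothetical helper: `information` maps a rule label to its information
    content (for example, in bits).
    """
    top = max(information.values())
    if top <= 0.0:
        raise ValueError("the most informative rule must have positive information")
    ranked = sorted(information.items(), key=lambda item: item[1], reverse=True)
    return [(name, value / top) for name, value in ranked]


# Invented information values for three rules.
info = {"rule_b": 0.025, "rule_a": 0.050, "rule_c": 0.010}

ranking = relative_strengths(info)
for name, strength in ranking:
    print(f"{name}: strength {strength:.2f}")
```

Under this reading the first rule in the ranked list always has strength 1, and every other rule's strength says what fraction of that information it carries.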
More information on Rule
Based Networks