Automated Knowledge Capture






Rodney M. Goodman B.Sc., Ph.D., C.Eng., SMIEEE, FIEE.

Data Analysis Using the ITRule Algorithm

Rodney M. Goodman, Dr. John Lindal

The ITRule (Information Theoretic Rule Induction) Algorithm is a suite of software programs developed at Caltech for automated expert systems design. ITRule performs an information theoretic and statistical analysis of any database in order to discover the useful knowledge that is implicitly buried within that database. ITRule outputs this information as "rules".

ITRule can be used to:

- Discover the most informative correlations between the data variables or "attributes" and express them as probabilistic rules of the form:
      IF Stocks>75% AND Assets=Large THEN FundType=Growth with prob. 0.8
- Produce a ranked list of rules in order of "informational strength".
- Identify the relative importance of data attributes.
- Perform automated knowledge acquisition for expert systems, and load the rules directly into a number of expert system shells, so that an expert system can be bootstrapped directly from data with minimal use of human experts.
- Produce executive summaries of the most important knowledge in the data, for human presentation and discussion.
- Implement an expert system on a parallel high-speed inference architecture.
- Perform true probabilistic inference as a self-contained expert system. When presented with previously unseen input data vectors, ITRule predicts the values of any unknown data attributes via the rules. Inference is Bayesian, with the option of outputting either a simple classification decision or, much more powerfully, a probability estimate of each inferred output attribute-value.

ITRule has been used to automate expert system rule generation and analyze data for a wide variety of applications including:

- Medical Diagnosis.
- Telecommunication Trouble Ticket Analysis and Fault Finding.
- Stock Market and Mutual Fund Analysis.
- DNA Sequence Analysis.
- Real Time Telephone Network Alarms Analysis.
- Mineral Classification using Satellite Synthetic Aperture Radar Returns.
- Antenna Fault Diagnosis on the JPL Deep Space Network.
- Sonar Returns Signature Analysis.
- Supermarket Product/Sales Analysis.
- Census and Questionnaire Data.
- Database Query Optimization.

Reference: P. Smyth and R.M. Goodman, “An Information Theoretic Approach to Rule Induction from Databases,” IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 4, pp. 301-316, August 1992.

Using ITRule

ITRule requires a flat database text file in a rows and columns format. The columns or fields of the database are the "attributes" or "variables" of the domain. Each row is an "example" or instance of the problem. The entries in the matrix are the "values" taken by the attributes.

Attributes (data variables, field names) can be either categorical or continuous. Categorical (or discrete) attributes take a finite set of values; for example, the attribute "Rank" may take the values "Private", "Corporal", and "Sergeant". A continuous attribute takes either integer or real number values. For example, the attribute "Temperature" takes real number values such as 28.45 degrees, while an attribute such as "number of shipments" takes integer values such as 28 or 65.

Continuous attributes need to be quantized (made into discrete attributes by splitting into ranges). ITRule can perform this automatically, but it is much better if "meaningful" ranges of the data can be identified given the context of the data. For example, within a particular context it may make sense to split a continuous variable called "Temperature" into three significant ranges, resulting in a discretized "Temperature" variable which takes the values "below_zero", "zero_to_boiling", and "above_boiling".
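The quantization of a continuous attribute into meaningful ranges can be sketched in Python as follows. This is only an illustration of the idea, not ITRule's own code: the function name `quantize`, the cut points (0 and 100, assuming degrees Celsius) and the three sample rows are all assumptions made for the "Temperature" example.

```python
from bisect import bisect_right

# Hypothetical cut points (assumed to be degrees Celsius) and range labels.
# These are illustrative choices, not values taken from ITRule.
TEMPERATURE_CUTS = [0.0, 100.0]
TEMPERATURE_LABELS = ["below_zero", "zero_to_boiling", "above_boiling"]


def quantize(value, cuts, labels):
    """Map a continuous value to the label of the range it falls in.

    `cuts` must be sorted in ascending order. A value equal to a cut
    point is assigned to the range above that cut point.
    """
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cuts, value)]


# A tiny flat database: each row is an example, each key is an attribute.
# The rows are made up for this sketch.
rows = [
    {"Rank": "Private", "Temperature": -12.5},
    {"Rank": "Corporal", "Temperature": 28.45},
    {"Rank": "Sergeant", "Temperature": 140.0},
]

# Replace the continuous attribute by its discretized counterpart.
quantized_rows = [
    {
        **row,
        "Temperature": quantize(
            row["Temperature"], TEMPERATURE_CUTS, TEMPERATURE_LABELS
        ),
    }
    for row in rows
]

for row in quantized_rows:
    print(row["Rank"], row["Temperature"])
```

Keeping the cut points and labels in named lists makes the expert's choice of ranges explicit and easy to revise when the context of the data changes.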
Optionally, ITRule can be told which attributes are "right-hand-side" (RHS) or "hypothesis" attributes, and which are "left-hand-side" (LHS) or "data only" attributes. RHS attributes appear both in the conclusion part of rules and in the LHS of other rules. Data only attributes appear only in the LHS of rules. By default, ITRule treats all variables as RHS attributes.

ITRule's output can take several forms: printed rules, rules in a format suitable for a number of standard expert system shells, rules that can be loaded into a number of neural network simulators, or rules for ITRule's own internal probabilistic Rule Based Network inference mechanism. The rules can then be used to perform prediction on new examples.

An Example of ITRule Analysis using a Mutual Funds Database

As a simple example of using ITRule, we show how rules can be generated from a database of mutual funds information. Figure 1 (pdf) shows a portion of the raw database of mutual funds. Note that there are several thousand funds in the whole database. Each row represents an example of a particular mutual fund (the name is omitted). The column headings are the attributes of interest when considering an investment in a mutual fund. An example of a categorical attribute is "Fund Type", which takes three possible values: "Growth", "Growth&Income", and "Aggressive Growth". An example of a continuous attribute is the "Beta" or "risk" variable of the fund.

Figure 2 (pdf) shows a derived database in which all the continuous variables have been quantized. Some of these quantizations are "obvious", and some require expert contextual knowledge of the database domain. For example, the "Beta" or "risk" attribute is naturally split into above and below 1, where 1 is the market risk. The Capital Gain attribute, on the other hand, has been quantized into three ranges on expert advice.

Figure 3 (pdf) shows the rules output by ITRule, ranked in order of information content.
The probability value is a measure of the "reliability" of the rule; for example, a completely deterministic rule (a certainty) has probability 1. The strength value is the information content of the rule relative to that of the most informative rule. The rules display information of interest to the investor, and can be used to predict the performance of new funds. For example, rule 15 states that IF Assets are Low THEN the 5 year Return of the fund will be below the Standard and Poor's Market average return with high probability (0.87).

Figure 4 (pdf) depicts the rules loaded into an expert system shell (NEXPERT®), showing how an initial rule base can be rapidly and automatically prototyped from the data.
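The probability and strength columns described above can be illustrated with a short Python sketch. It rests on several assumptions: this page does not state the formula behind the ranking, so the sketch uses the J-measure defined in the cited Smyth and Goodman paper; it only enumerates rules with a single condition on the left-hand side; and the eight quantized rows, the attribute names and the helper names (`j_measure`, `first_order_rules`) are hypothetical, not taken from the mutual funds database or from ITRule itself.

```python
from itertools import permutations
from math import log2


def _term(p, q):
    """Return p * log2(p / q), using the convention 0 * log(0) = 0."""
    if p == 0.0:
        return 0.0
    return p * log2(p / q)


def j_measure(p_y, p_x, p_x_given_y):
    """J-measure (in bits) of the rule IF Y=y THEN X=x.

    p_y is the probability that the condition holds, p_x the prior
    probability of the conclusion, p_x_given_y the rule probability.
    """
    j = _term(p_x_given_y, p_x) + _term(1.0 - p_x_given_y, 1.0 - p_x)
    return p_y * j


def first_order_rules(rows, rhs_attributes=None):
    """Enumerate all single-condition rules and rank them by J-measure."""
    n = len(rows)
    attributes = list(rows[0])
    if rhs_attributes is None:  # assumed default: every attribute is RHS
        rhs_attributes = set(attributes)
    rules = []
    for lhs_attr, rhs_attr in permutations(attributes, 2):
        if rhs_attr not in rhs_attributes:
            continue
        for y in sorted({r[lhs_attr] for r in rows}):
            matching = [r for r in rows if r[lhs_attr] == y]
            p_y = len(matching) / n
            for x in sorted({r[rhs_attr] for r in rows}):
                p_x = sum(r[rhs_attr] == x for r in rows) / n
                p_x_given_y = sum(r[rhs_attr] == x for r in matching) / len(matching)
                rules.append({
                    "if": (lhs_attr, y),
                    "then": (rhs_attr, x),
                    "prob": p_x_given_y,
                    "J": j_measure(p_y, p_x, p_x_given_y),
                })
    rules.sort(key=lambda r: r["J"], reverse=True)
    best = rules[0]["J"]
    for r in rules:
        # Strength: information relative to the most informative rule.
        r["strength"] = r["J"] / best if best > 0 else 0.0
    return rules


# Hypothetical, already quantized examples (not the real funds database).
rows = [
    {"Assets": "Low", "Beta": "above_1", "Return5yr": "below_SP"},
    {"Assets": "Low", "Beta": "above_1", "Return5yr": "below_SP"},
    {"Assets": "Low", "Beta": "below_1", "Return5yr": "below_SP"},
    {"Assets": "Low", "Beta": "below_1", "Return5yr": "above_SP"},
    {"Assets": "Large", "Beta": "below_1", "Return5yr": "above_SP"},
    {"Assets": "Large", "Beta": "above_1", "Return5yr": "above_SP"},
    {"Assets": "Large", "Beta": "below_1", "Return5yr": "above_SP"},
    {"Assets": "Large", "Beta": "above_1", "Return5yr": "below_SP"},
]

ranked = first_order_rules(rows)
for rule in ranked[:5]:
    (la, lv), (ra, rv) = rule["if"], rule["then"]
    print(f"IF {la}={lv} THEN {ra}={rv} "
          f"prob {rule['prob']:.2f} strength {rule['strength']:.2f}")
```

Weighting the information term by the probability of the condition means that a rule which is reliable but almost never fires does not automatically outrank a slightly less reliable rule that applies to many examples.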