AIS Overview

 

The Internet

"We are drowning in information but starved of knowledge"
John Naisbitt, Megatrends


Big changes are taking place in the area of information supply and demand. The first big change, which took place quite a while ago, concerns the form in which information is available. In the past, paper was the most frequently used medium for information, and it is still very popular today. However, more and more information is now available through electronic media.

Other aspects of information that have changed rapidly in the last few years are the amount in which it is available, the number of sources, and the ease with which it can be obtained. These developments are expected to continue into the future.

A third important change is related to the supply and demand of information. Until recently the market for information was driven by supply, and it was fuelled by a relatively small group of suppliers that were easily identifiable. This situation is now changing into a market of a very large scale, in which it is becoming increasingly difficult to get a clear picture of all the suppliers.

All these changes have an enormous impact on the information market. One of the most important is the shift from a supply-driven market to a demand-driven one. The number of suppliers has become so large (and will grow even larger in the future) that the question of who is supplying the information has become less important: demand for information is becoming the most important aspect of the information chain.

What's more, information is playing an increasingly important role in our lives, as we are moving towards an information society. [1]  Information has become an instrument, a tool that can be used to solve many problems.

Meeting information demand has become easier on the one hand, but more complicated and difficult on the other. Because of the emergence of information sources such as the world-wide computer network called the Internet (not to mention the increasing volume of corporate intranets), everyone - in principle - has access to a virtually inexhaustible pool of information. One would therefore expect that satisfying information demand has become easier.

The sheer endlessness of the information available through the Internet, which at first glance looks like its major strength, is at the same time one of its major weaknesses. The amount of information at one's disposal is simply too vast: the information being sought is (probably) available somewhere, but often only parts of it can be retrieved, and sometimes nothing can be found at all. To put it more figuratively: the number of needles that can be found has increased, but so has the size of the haystack they are hidden in. Those seeking information are confronted with information overload.

The current, conventional search methods do not seem able to tackle these problems. These methods are based on the principle that it is known which information is available (and which is not) and where exactly it can be found. To make this possible, large information systems such as databases are supplied with (large) indexes to provide the user with this information. With the aid of such an index one can, at all times, look up whether certain information can be found in the database and - if available - where it can be found.

On the Internet (but not just there [2]) this strategy fails completely, for the following reasons:

The dynamic nature of the Internet itself: there is no central supervision of the growth and development of the Internet. Anybody who wants to use it and/or offer information or services on it is free to do so. This has created a situation in which it has become very hard to get a clear picture of the size of the Internet, let alone to estimate the amount of information that is available on or through it;

The dynamic nature of the information on the Internet: information that cannot be found today may become available tomorrow. The reverse happens too: information that was available may suddenly disappear without further notice, for instance because an Internet service has stopped its activities, or because information has been moved to a different, unknown location;

The information and information services on the Internet are very heterogeneous: information on the Internet is offered in many different formats and in many different ways. This makes it very difficult to search for information automatically, because every information format and every type of information service requires its own specific approach.

 

[1] "Information society" or "Information Age" are both terms that are very often used nowadays. The terms are used to denote the period following the "Post-Industrial Age" we are living in right now.
[2] Articles in professional magazines indicate that these problems are not appearing on the Internet only: large companies that own databases with gigabytes of corporate information stored in them (so-called data warehouses), are faced with similar problems. Many managers cannot be sure anymore which information is, and which is not stored in these databases. Combining the stored data to extract valuable information from it (for instance, by discovering interesting patterns in it) is becoming a task that can no longer be carried out by humans alone.

Corporate Intranets

In the initial phase of the Information Revolution [1] (from 1950 to 1970), corporate computer penetration was minimal; in the later phase (from 1990 to the present), however, corporate computer penetration is at a maximum. In addition to the sheer number of computers present in modern corporations, there is an increasing body of historical electronic data available (the so-called Legacy Data). This legacy data is increasingly available, on demand, from corporate intranets, corporate online data warehouses, and corporate portals. Corporate Portals unlock essential information from both structured data in relational databases and legacy systems and unstructured data in documents and graphic files. Corporate Portals provide access to the cumulative knowledge resources of an organization through a single corporate gateway.

A problem arises from the vast quantities of electronic data being made available, along with their disparate origins, heterogeneous nature, and differing formats. It becomes difficult to integrate such radically different data types into any meaningful framework.

Legacy Data comes in a variety of sizes, shapes and descriptions. It may be the output of programs written in-house over the past 30 years, the output of commercial software installed over the past 30 years, or even the output of programs written or purchased within the past few months.

Within the enterprise, it may have many personalities, especially if the corporate culture permits departments to build or buy their own application solutions. Some departments may have grown from departmental systems through mid-range systems like the IBM office systems or network-based printing systems. Others may have selected office systems from companies like Xerox, Datapoint, Data General, and Tandem, while others remained on paper until the advent of the networked PC. And, there are those who are at home on the big iron - developing all of their support applications on the IBM or IBM plug-compatible hosts.

Regardless of the pedigree, legacy data poses challenges. It is the result of the business of getting enterprise information, such as invoices, statements, specifications, policies, reports, and just about every other type of business document, through the application and print process. It may be simple line data, consisting of little more than the actual text to be printed and some page eject commands, or it may have evolved to a more sophisticated form of line data that includes font calls, inserted graphics, calls to forms that overlay the data, and re-organization of the incoming data.
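As a minimal illustration of the simplest case, the Python sketch below splits line data into pages; the sample text is invented, and the use of the ASCII form feed character as the page-eject command is an assumption for illustration only.

    # Minimal sketch: split simple legacy "line data" into pages.
    # The sample text and the use of ASCII form feed (\f) as the
    # page-eject command are assumptions for illustration only.

    sample_line_data = "INVOICE 001\nWidget A   2   10.00\fINVOICE 002\nWidget B   1    5.00\f"

    def split_into_pages(line_data):
        """Return a list of pages, each page being a list of text lines."""
        pages = []
        for raw_page in line_data.split("\f"):
            if raw_page.strip():
                pages.append(raw_page.splitlines())
        return pages

    for number, page in enumerate(split_into_pages(sample_line_data), start=1):
        print("Page", number, page)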

The increasing need for managing heterogeneous data has motivated the international standards community to develop the specification for XML. In just a few short years, XML has become the lingua franca of data exchange. Application designers create elements, formats, and rules that describe the data in their applications so that non-native applications can use them. Programmers work from Document Type Definitions (DTDs) to understand these elements and to create XML files that can be consumed by external applications. They can also create stylesheets (XSL files) that contain the rules for transforming XML data into another format.
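As a small, hedged illustration of this workflow, the Python sketch below defines a hypothetical invoice DTD and document, then reads the XML with the standard library as a "non-native" application might; the element names are assumptions, not part of any published schema, and the XSL transformation step is only noted in a comment.

    # Minimal sketch of the XML workflow described above.
    # The DTD, element names, and data values are hypothetical.
    import xml.etree.ElementTree as ET

    invoice_dtd = """<!ELEMENT invoice (customer, line+)>
    <!ELEMENT customer (#PCDATA)>
    <!ELEMENT line (item, amount)>
    <!ELEMENT item (#PCDATA)>
    <!ELEMENT amount (#PCDATA)>"""          # shown for illustration; not validated here

    invoice_xml = """<invoice>
      <customer>Acme Corp</customer>
      <line><item>Widget A</item><amount>10.00</amount></line>
      <line><item>Widget B</item><amount>5.00</amount></line>
    </invoice>"""

    root = ET.fromstring(invoice_xml)       # a non-native application reading the data
    total = sum(float(line.find("amount").text) for line in root.findall("line"))
    print("Customer:", root.find("customer").text, "Total:", total)

    # Transforming the XML into another format via an XSL stylesheet would
    # typically be done with an external library (for example lxml's XSLT
    # support); that step is omitted here.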

Building Corporate Portals with XML provides a foundation for the exchange of heterogeneous legacy data. A company wishing to use its legacy data for e-business might take the following approach. Provide access to all relational legacy data via a standard access method and/or service. Convert all non-relational legacy data (text and graphics) to a normalized XML database. This database may also house newly created XML content. The conversion process requires more than a superficial re-formatting; it requires the XML to capture both the deep meaning of the content plus the layout and style.
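The sketch below hints at what such a conversion involves: a flat legacy record is turned into XML that keeps both the content and a few layout/style hints. The record layout, element names, and attribute names are hypothetical.

    # Minimal sketch: convert one flat legacy record into XML that keeps
    # both the content ("deep meaning") and a few layout/style hints.
    # The record layout and the attribute names are hypothetical.
    import xml.etree.ElementTree as ET

    legacy_record = "0001|Acme Corp|Widget A|10.00"        # assumed fixed field order

    def record_to_xml(record):
        number, customer, item, amount = record.split("|")
        invoice = ET.Element("invoice", {"number": number})
        ET.SubElement(invoice, "customer", {"font": "Helvetica-Bold"}).text = customer
        line = ET.SubElement(invoice, "line")
        ET.SubElement(line, "amount", {"align": "right"}).text = amount
        ET.SubElement(line, "item").text = item
        return ET.tostring(invoice, encoding="unicode")

    print(record_to_xml(legacy_record))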

Legacy systems are valuable: they hold most of the world's data, including corporate data, yellow pages, bank account information and virtually all legal documents. In recognition of this inherent value, an increasing volume of legacy data is being made available electronically.

 

[1] "Information society" or "Information Age" are both terms that are very often used nowadays. The terms are used to denote the period following the "Post-Industrial Age" we are living in right now.

The Analysis Opportunity

The presence of increasing volumes of electronic corporate, scientific, and government data online creates an irresistible opportunity to analyze the available data for information vital to the business of the enterprise. Given that the data is available, what CEO can resist the urge to know more about the relationship between regional sales patterns and regional demographics? What mortgage banker can resist the urge to better identify fraudulent patterns among current credit applications? What head of state can resist the urge to look for terrorist indications among the vast patterns of Internet communications traffic? As the volume of data grows, the opportunities to profit through data analysis increase, and the urge to analyze becomes irresistible. What follows is just a small sampling of current applications for large volume data analysis.

There are far too many techniques and flavors of large volume data analysis to mention each individually. The best one can do is to provide a broad categorization of current large volume data analysis techniques. Some of the larger areas of data analysis are as follows.

Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation and verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
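A minimal sketch of these three stages on synthetic data is given below, assuming numpy and scikit-learn are available; the variables and the "hidden" pattern are invented purely to make the stages concrete.

    # Minimal sketch of the three data mining stages on synthetic data,
    # assuming numpy and scikit-learn are available.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))                      # three explanatory variables
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # hidden pattern to rediscover

    # Stage 1: initial exploration (simple summaries of the data).
    print("class balance:", np.bincount(y))

    # Stage 2: model building with validation on held-out data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Stage 3: deployment - apply the model to new data to generate predictions.
    new_data = rng.normal(size=(5, 3))
    print("predictions:", model.predict(new_data))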

On-Line Analytic Processing (OLAP)

The term On-Line Analytic Processing - OLAP (or Fast Analysis of Shared Multidimensional Information - FASMI) refers to technology that allows users of multidimensional databases to generate on-line descriptive or comparative summaries ("views") of data and other analytic queries. Note that despite its name, analyses referred to as OLAP do not need to be performed truly "on-line" (or in real-time); the term applies to analyses of multidimensional databases (which may, of course, contain dynamically updated information) through efficient "multidimensional" queries that reference various types of data. OLAP facilities can be integrated into corporate (enterprise-wide) database systems, and they allow analysts and managers to monitor the performance of the business (e.g., various aspects of the manufacturing process, or the numbers and types of completed transactions at different locations) or the market. The final result of OLAP techniques can be very simple (e.g., frequency tables, descriptive statistics, simple cross-tabulations) or more complex (e.g., involving seasonal adjustments, removal of outliers, and other forms of data cleaning). Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more in-depth and often more multidimensional knowledge. In this sense, Data Mining techniques can be considered either a different analytic approach (serving different purposes than OLAP) or an analytic extension of OLAP.
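The sketch below shows what one such "view" might look like: a cross-tabulation of a small, invented transaction table by two dimensions, assuming pandas is available. It is only an illustration of the kind of summary OLAP tools produce, not of any particular OLAP product.

    # Minimal sketch of an OLAP-style "view": a cross-tabulation of a small,
    # invented transaction table by two dimensions, assuming pandas.
    import pandas as pd

    transactions = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "South"],
        "product": ["A", "B", "A", "A", "B"],
        "amount":  [100, 150, 200, 50, 75],
    })

    view = pd.pivot_table(transactions, values="amount",
                          index="region", columns="product", aggfunc="sum")
    print(view)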

Exploratory Data Analysis (EDA)

EDA vs. Hypothesis Testing

As opposed to traditional hypothesis testing, which is designed to verify a priori hypotheses about relations between variables (e.g., "There is a positive correlation between the AGE of a person and his/her RISK TAKING disposition"), exploratory data analysis (EDA) is used to identify systematic relations between variables when there are no (or incomplete) a priori expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns.
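To make the hypothesis-testing side of the contrast concrete, the sketch below tests the a priori hypothesis quoted above on invented AGE and RISK scores, assuming scipy is available; the data values are fabricated for illustration only.

    # Minimal sketch of testing the a priori hypothesis mentioned above
    # (a correlation between AGE and RISK TAKING) on invented data,
    # assuming scipy is available.
    from scipy.stats import pearsonr

    age  = [22, 30, 35, 41, 48, 55, 63]
    risk = [8,  7,  7,  5,  5,  4,  2]      # invented risk-taking scores

    r, p_value = pearsonr(age, risk)
    print("correlation:", round(r, 2), "p-value:", round(p_value, 4))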

Computational EDA techniques

Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets.

Basic statistical exploratory methods. The basic statistical exploratory methods include such techniques as examining distributions of variables (e.g., to identify highly skewed or non-normal distributions, such as bi-modal patterns), reviewing large correlation matrices for coefficients that meet certain thresholds, or examining multi-way frequency tables (e.g., "slice by slice", systematically reviewing combinations of levels of control variables).
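One of these techniques, scanning a correlation matrix for coefficients above a chosen threshold, is sketched below on invented data, assuming numpy and pandas are available.

    # Minimal sketch: scan a correlation matrix for coefficients that exceed
    # a chosen threshold, using pandas on invented data.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    data = pd.DataFrame(rng.normal(size=(200, 4)), columns=["v1", "v2", "v3", "v4"])
    data["v4"] = data["v1"] * 0.8 + rng.normal(scale=0.3, size=200)   # planted relation

    threshold = 0.5
    corr = data.corr()
    for a in corr.columns:
        for b in corr.columns:
            if a < b and abs(corr.loc[a, b]) >= threshold:
                print(a, b, round(corr.loc[a, b], 2))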

Multivariate exploratory techniques. Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: Cluster Analysis, Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling, Log-linear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear (e.g., Logit) Regression, Correspondence Analysis, Time Series Analysis, and Classification Trees.
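As a small, hedged example of one listed technique, the sketch below applies k-means Cluster Analysis to synthetic two-dimensional data, assuming numpy and scikit-learn are available.

    # Minimal sketch of one listed technique, Cluster Analysis, using k-means
    # on synthetic two-dimensional data, assuming scikit-learn is available.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    cloud_a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
    cloud_b = rng.normal(loc=(3, 3), scale=0.5, size=(50, 2))
    X = np.vstack([cloud_a, cloud_b])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster sizes:", np.bincount(kmeans.labels_))
    print("cluster centers:", kmeans.cluster_centers_.round(2))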

Neural Networks. Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data.
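A minimal sketch of this idea is given below: a small feed-forward network learns to predict one variable from others on synthetic data, assuming numpy and scikit-learn are available; the "rule" being learned is invented for illustration.

    # Minimal sketch: a small feed-forward neural network learning to
    # predict one variable from others, on synthetic data.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(3)
    X = rng.uniform(-1, 1, size=(400, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)            # a simple nonlinear rule to learn

    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    net.fit(X[:300], y[:300])                          # "learning" from existing data
    print("accuracy on unseen data:", net.score(X[300:], y[300:]))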

Genetic Programming. Genetic and Evolutionary Programming are analytic techniques modeled after the (hypothesized) evolutionary processes seen in the breeding of plants and animals using survival of the fittest criteria for continued breeding. These techniques are achieving an increasingly impressive list of successes. In essence, Genetic Programming reduces Software Lambdas (or other types of computer programs) to genetic codes (from which the original program can be regenerated). All Software Lambdas in the pool are given a common task with a commonly measured objective. The genetic codes from the best performers are mixed together, using techniques modeled after those found in nature, to produce new genetic children. These child Software Lambdas are then introduced to the pool, and the process is repeated. After thousands of breeding generations, new Software Lambdas emerge which outperform all others in the pool.
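The breeding loop described above is sketched below, reduced to a tiny genetic algorithm over bit-string "genetic codes" in plain Python. Evolving real program trees (genetic programming proper) works the same way but with far more elaborate encodings; the genome length, pool size, and toy objective here are illustrative assumptions.

    # Minimal sketch of the breeding loop described above, reduced to a tiny
    # genetic algorithm over bit-string "genetic codes".
    import random

    random.seed(0)
    GENOME_LENGTH, POOL_SIZE, GENERATIONS = 20, 30, 40

    def fitness(genome):                 # common task with a common measured objective
        return sum(genome)               # toy objective: maximize the number of 1 bits

    def breed(parent_a, parent_b):
        cut = random.randrange(1, GENOME_LENGTH)         # crossover
        child = parent_a[:cut] + parent_b[cut:]
        if random.random() < 0.1:                        # occasional mutation
            i = random.randrange(GENOME_LENGTH)
            child[i] = 1 - child[i]
        return child

    pool = [[random.randint(0, 1) for _ in range(GENOME_LENGTH)] for _ in range(POOL_SIZE)]
    for _ in range(GENERATIONS):
        pool.sort(key=fitness, reverse=True)
        best = pool[: POOL_SIZE // 2]                    # survival of the fittest
        pool = best + [breed(random.choice(best), random.choice(best))
                       for _ in range(POOL_SIZE - len(best))]

    print("best fitness after breeding:", fitness(max(pool, key=fitness)))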

Other Analysis Techniques

Obviously there are far too many to mention. Some of the more important (though by no means an exhaustive list) are: natural language processing, graph traversal algorithms, pattern matching in picture data, and pattern matching in music data. There is no dominant technique for large volume data analysis. The only preparation guaranteed to support unrestricted data analysis is to have a Rapid Application Development (RAD) environment for the creation of large volume data analysis Software Lambdas (or other computer programs).

Software Lambdas

Software Lambdas are executable objects which act as small building blocks for creating ever larger programs with increasing sophistication. Lambdas differ from old-style functions in several important ways: they combine executable code with their own persistent data, they can be stored in and served from a database repository, and they can be transmitted across networks and composed into larger Lambdas.

These differences from old-style functions allow the rapid development of larger and larger programs with ever increasing functionality.

The next wave of technological innovation must integrate linked organizations and multiple application platforms. Developers must construct unified information management systems that use the world wide web and advanced software technologies. Software Lambdas, one of the most exciting new developments in computer software technology, can be used to quickly and easily build integrated enterprise systems. The idea of having a software Lambda that can perform complex tasks on our behalf is intuitively appealing. The natural next step is to use multiple software Lambdas that communicate and cooperate with each other to solve complex problems and implement complex systems. Software Lambdas provide a powerful new method for implementing these next-generation information systems.

Nowhere is this more apparent than with modern Genetic Programming technology which can produce millions of complex analytic software Lambdas. None of these software Lambdas have a human author. They are all authored by machines. There are no human experts to guide in their deployment, to explain how they operate, or to describe the analytic techniques which they employ. Furthermore, just keeping track of the experimental results, of millions of analytic software Lambdas, can be a real housekeeping chore. It becomes rapidly obvious that a database capable of storing and serving analytic software Lambdas is necessary.

Analytic Information Server (AIS) is a database system designed to store and serve millions of analytic software Lambdas across the Internet, intranets, WANs, and LANs. Analytic Information Server is also capable of storing and serving vast quantities of XML data upon which these analytic software Lambdas perform their complicated analyses.

AIS is a powerful database environment combining Lambda technology and Internet server technology. It comes with a rich set of re-usable Lambda Libraries, a fast Lisp compiler, a built-in Lambda JavaScript compiler, an interactive Lambda debugger, and a persistent Lambda repository with a flexible database schema. AIS is designed to store and serve software Lambdas which specialize in large volume data analysis, where microchip-level execution speed is a priority and multiple-gigabyte repositories are commonplace.

Any software Lambda served from AIS can be deployed directly over TCP/IP connections to any Internet browser on the World Wide Web, and can be deployed from any software product (MS Word, Excel, Project, etc.) or user client application that can open a TCP/IP socket connection (which is to say, almost all existing legacy applications).
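A generic sketch of such a client-side socket connection is shown below in Python. The host name, port, and request text are placeholders for illustration; they are not the actual AIS deployment protocol.

    # Generic sketch of a client opening a TCP/IP socket connection to a server.
    # The host, port, and request text below are placeholders for illustration;
    # they are not the actual AIS deployment protocol.
    import socket

    HOST, PORT = "ais.example.com", 8080               # hypothetical server address

    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        sock.sendall(b"run my-analytic-lambda\n")      # hypothetical request
        reply = sock.recv(4096)
        print(reply.decode("utf-8", errors="replace"))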