AIS Overview

 

The Internet

"We are drowning in information but starved of knowledge"
John Naisbitt, author of Megatrends


Big changes are taking place in the area of information supply and demand. The first big change, which took place quite a while ago, concerns the form in which information is available. In the past, paper was the most frequently used medium for information, and it is still very popular today. However, more and more information is now available through electronic media.

Other aspects of information that have changed rapidly in the last few years are the amount in which it is available, the number of sources, and the ease with which it can be obtained. These developments are expected to continue into the future.

A third important change is related to the supply and demand of information. Until recently the market for information was driven by supply, and it was fuelled by a relatively small group of easily identifiable suppliers. This situation is now changing into a market on such a large scale that it is becoming increasingly difficult to get a clear picture of all the suppliers.

All these changes have an enormous impact on the information market. One of the most important changes is the shift from a supply-driven to a demand-driven market. The number of suppliers has become so high (and will grow even higher in the future) that the question of who is supplying the information has become less important: demand for information is becoming the most important aspect of the information chain.

What's more, information is playing an increasingly important role in our lives, as we are moving towards an information society. [1]  Information has become an instrument, a tool that can be used to solve many problems.

Meeting information demand has become easier on the one hand, but more complicated and difficult on the other. Because of the emergence of information sources such as the world-wide computer network called the Internet (not to mention the increasing volume of corporate intranets), everyone - in principle - can have access to a virtually inexhaustible pool of information. One would therefore expect that satisfying information demand has become easier.

The sheer endlessness of the information available through the Internet, which at first glance looks like its major strength, is at the same time one of its major weaknesses. The amount of information at one's disposal is simply too vast: the information being sought is (probably) available somewhere, but often only parts of it can be retrieved, and sometimes nothing can be found at all. To put it more figuratively: the number of needles that can be found has increased, but so has the size of the haystack they are hidden in. Those inquiring after information are confronted with an information overload.

The current, conventional search methods do not seem able to tackle these problems. These methods are based on the principle that it is known which information is available (and which is not) and where exactly it can be found. To make this possible, large information systems such as databases are supplied with (large) indexes to provide the user with this information. With the aid of such an index one can, at all times, look up whether certain information can or cannot be found in the database and, if available, where it can be found.
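As a minimal sketch of this index-based principle (the keywords and storage locations below are hypothetical), the index either says exactly where a piece of information is stored, or it shows with certainty that it is not in the database at all:

    # A minimal sketch of the index-based lookup principle described above;
    # the keywords and storage locations are hypothetical.
    index = {
        "invoice": ["db1/orders/1997", "db2/billing/overview"],
        "policy":  ["db1/hr/handbook"],
    }

    def lookup(keyword):
        # Either the index tells us exactly where the information is stored ...
        locations = index.get(keyword)
        if locations:
            return locations
        # ... or it tells us with certainty that it is not in this database.
        return "not available in this database"

    print(lookup("invoice"))
    print(lookup("weather"))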

On the Internet (but not just there [2]) this strategy fails completely, the reasons for this being:

The dynamic nature of the Internet itself: there is no central supervision of the growth and development of the Internet. Anybody who wants to use it and/or offer information or services on it is free to do so. This has created a situation in which it has become very hard to get a clear picture of the size of the Internet, let alone estimate the amount of information that is available on or through it;

The dynamic nature of the information on the Internet: information that cannot be found today may become available tomorrow. The reverse happens too: information that was available may suddenly disappear without further notice, for instance because an Internet service has stopped its activities, or because the information has been moved to a different, unknown location;

The information and information services on the Internet are very heterogeneous: information on the Internet is offered in many different formats and in many different ways. This makes it very difficult to search for information automatically, because every information format and every type of information service requires its own search method.

 

[1] "Information society" or "Information Age" are both terms that are very often used nowadays. The terms are used to denote the period following the "Post-Industrial Age" we are living in right now.
[2] Articles in professional magazines indicate that these problems are not appearing on the Internet only: large companies that own databases with gigabytes of corporate information stored in them (so-called data warehouses), are faced with similar problems. Many managers cannot be sure anymore which information is, and which is not stored in these databases. Combining the stored data to extract valuable information from it (for instance, by discovering interesting patterns in it) is becoming a task that can no longer be carried out by humans alone.

Corporate Intranets

In the initial phase of the Information Revolution [1] (from 1950 to 1970), corporate computer penetration was minimal; however, in the later phase of the Information Revolution (from 1990 to the present) corporate computer penetration is at a maximum. In addition to the sheer number of computers present in modern corporations, there is an increasing body of historical electronic data available (the so-called Legacy Data). This legacy data is increasingly available, on demand, from corporate intranets, corporate online data warehouses, and corporate portals. Corporate Portals unlock essential information from both structured data in relational databases and legacy systems and unstructured data in all documents and graphic files. Corporate Portals provide access to the cumulative knowledge resources of an organization through a single corporate gateway.

A problem arises from the vast quantities of electronic data being made available, their disparate origins, heterogeneous nature, and differing formats. It becomes difficult to integrate radically different data types into any meaningful framework.

Legacy Data comes in a variety of sizes, shapes and descriptions. It may be the output of programs written in-house over the past 30 years, the output of commercial software installed over the past 30 years, or even the output of programs written or purchased within the past few months.

Within the enterprise, it may have many personalities, especially if the corporate culture permits departments to build or buy their own application solutions. Some departments may have grown from departmental systems through mid-range systems like the IBM office systems or network-based printing systems. Others may have selected office systems from companies like Xerox, Datapoint, Data General, and Tandem, while others remained on paper until the advent of the networked PC. And, there are those who are at home on the big iron - developing all of their support applications on the IBM or IBM plug-compatible hosts.

Regardless of the pedigree, legacy data poses challenges. It is the result of the business of getting enterprise information, such as invoices, statements, specifications, policies, reports, and just about every other type of business document, through the application and print process. It may be simple line data, consisting of little more than the actual text to be printed and some page eject commands, or it may have evolved to a more sophisticated form of line data that includes font calls, inserted graphics, calls to forms that overlay the data, and re-organization of the incoming data.

The increasing need for managing heterogeneous data has motivated the international standards community to develop the specification for XML. In just a few short years, XML has become the lingua franca of data exchange. Application designers create elements, formats, and rules that describe the data in their applications so that non-native applications can use them. Programmers work from Document Type Definitions (DTDs) to understand these elements and to create XML files that can be consumed by external applications. They can also create stylesheets (XSL files) that contain the rules for transforming XML data into another format.
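As a small illustration of these ideas (assuming the widely used lxml library; the element names and stylesheet are invented for this sketch), an XSL stylesheet can transform an XML fragment into another format:

    # A minimal sketch, assuming the third-party lxml library: an XSL
    # stylesheet transforms a (hypothetical) XML invoice fragment into HTML.
    from lxml import etree

    xml_doc = etree.XML(
        "<invoice><customer>ACME</customer><total>120.50</total></invoice>")

    xsl_doc = etree.XML("""
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/invoice">
        <p>Invoice for <xsl:value-of select="customer"/>:
           <xsl:value-of select="total"/></p>
      </xsl:template>
    </xsl:stylesheet>""")

    transform = etree.XSLT(xsl_doc)
    print(str(transform(xml_doc)))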

Building Corporate Portals with XML provides a foundation for the exchange of heterogeneous legacy data. A company wishing to use its legacy data for e-business might take the following approach. Provide access to all relational legacy data via a standard access method and/or service. Convert all non-relational legacy data (text and graphics) to a normalized XML database. This database may also house newly created XML content. The conversion process requires more than a superficial re-formatting; it requires the XML to capture both the deep meaning of the content plus the layout and style.
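A hedged sketch of one small step in such a conversion, using only the Python standard library and an invented line-data record, might look as follows; a real conversion would, as noted above, also have to capture layout and style information.

    # A minimal sketch of normalizing a legacy line-data record into XML;
    # the record layout and element names are invented for illustration.
    import xml.etree.ElementTree as ET

    legacy_line = "INV001|ACME Corp|120.50"      # hypothetical legacy record
    number, customer, total = legacy_line.split("|")

    invoice = ET.Element("invoice", attrib={"number": number})
    ET.SubElement(invoice, "customer").text = customer
    ET.SubElement(invoice, "total", attrib={"currency": "USD"}).text = total

    print(ET.tostring(invoice, encoding="unicode"))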

Legacy systems are valuable: they hold most of the world's data, including corporate data, yellow pages, bank account information and virtually all legal documents. In recognition of this inherent value, an increasing volume of legacy data is being made available electronically.

 

[1] "Information society" or "Information Age" are both terms that are very often used nowadays. The terms are used to denote the period following the "Post-Industrial Age" we are living in right now.

Search Engines and Agents

The Analysis Opportunity

The presence of increasing volumes of electronic corporate, scientific, and government data online creates an irresistible opportunity to analyze the available data, looking for information vital to the business of the enterprise. Given that the data is available, what CEO can resist the urge to know more about the relationship between regional sales patterns and regional demographics? What mortgage banker can resist the urge to better identify fraudulent patterns among current credit applications? What head of state can resist the urge to look for terrorist indications among the vast patterns of Internet communications traffic? As the volume of data grows, the opportunities to profit through data analysis increase, and the urge to analyze becomes irresistible. What follows is just a small sampling of current applications for large volume data analysis.

There are far too many techniques and flavors of large volume data analysis to mention each individually. The best one can do is to provide a broad categorization of current large volume data analysis techniques. Some of the larger areas of data analysis are as follows.

Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation and verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
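As a minimal sketch of these three stages (using the scikit-learn and pandas libraries; the file names and the "churned" target column are hypothetical), a predictive data-mining run might look like this:

    # A minimal sketch of the three data-mining stages described above;
    # file names and the "churned" target column are hypothetical.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Stage 1: initial exploration
    data = pd.read_csv("customers.csv")
    print(data.describe())                 # ranges, means, missing values

    # Stage 2: model building with validation and verification
    X = data.drop(columns=["churned"])
    y = data["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model = DecisionTreeClassifier(max_depth=5)
    model.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Stage 3: deployment - apply the model to new data to generate predictions
    new_data = pd.read_csv("new_customers.csv")
    print(model.predict(new_data))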

On-Line Analytic Processing (OLAP)

The term On-Line Analytic Processing - OLAP (or Fast Analysis of Shared Multidimensional Information - FASMI) refers to technology that allows users of multidimensional databases to generate on-line descriptive or comparative summaries ("views") of data and other analytic queries. Note that despite its name, analyses referred to as OLAP do not need to be performed truly "on-line" (or in real-time); the term applies to analyses of multidimensional databases (that may, obviously, contain dynamically updated information) through efficient "multidimensional" queries that reference various types of data. OLAP facilities can be integrated into corporate (enterprise-wide) database systems and they allow analysts and managers to monitor the performance of the business (e.g., such as various aspects of the manufacturing process or numbers and types of completed transactions at different locations) or the market. The final result of OLAP techniques can be very simple (e.g., frequency tables, descriptive statistics, simple cross-tabulations) or more complex (e.g., they may involve seasonal adjustments, removal of outliers, and other forms of cleaning the data). Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more in-depth and often more multidimensional knowledge. In this sense, Data Mining techniques could be considered to represent either a different analytic approach (serving different purposes than OLAP) or as an analytic extension of OLAP.
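The kind of multidimensional "view" described above can be sketched minimally with the pandas library (the transactions file and its column names are hypothetical):

    # A minimal sketch of an OLAP-style summary: completed transaction
    # amounts cross-tabulated by location and quarter (hypothetical data).
    import pandas as pd

    transactions = pd.read_csv("transactions.csv")

    view = pd.pivot_table(transactions,
                          values="amount",
                          index="location",
                          columns="quarter",
                          aggfunc="sum")
    print(view)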

Exploratory Data Analysis (EDA)

EDA vs. Hypothesis Testing

As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables (e.g., "There is a positive correlation between the AGE of a person and his/her RISK TAKING disposition"), exploratory data analysis (EDA) is used to identify systematic relations between variables when there are no (or not complete) a priori expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns.

Computational EDA techniques

Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets.

Basic statistical exploratory methods. The basic statistical exploratory methods include such techniques as examining distributions of variables (e.g., to identify highly skewed or non-normal, such as bi-modal patterns), reviewing large correlation matrices for coefficients that meet certain thresholds (see example above), or examining multi-way frequency tables (e.g., "slice by slice" systematically reviewing combinations of levels of control variables).
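A minimal sketch of these basic methods, using the pandas library and a hypothetical survey data set, might look as follows:

    # A minimal sketch of basic statistical exploratory methods; the file
    # and column names are hypothetical.
    import pandas as pd

    data = pd.read_csv("survey.csv")

    # Examine the distribution of a variable (summary statistics, skewness)
    print(data["age"].describe())
    print("skewness:", data["age"].skew())

    # Review a correlation matrix for coefficients that meet a threshold
    corr = data.select_dtypes("number").corr()
    strong = corr[(corr.abs() > 0.7) & (corr.abs() < 1.0)]
    print(strong.dropna(how="all"))

    # Examine a multi-way frequency table, "slice by slice"
    print(pd.crosstab(index=[data["region"], data["gender"]],
                      columns=data["risk_taking"]))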

Multivariate exploratory techniques. Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: Cluster Analysis, Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling, Log-linear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear (e.g., Logit) Regression, Correspondence Analysis, Time Series Analysis, and Classification Trees.
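As one small illustration from this list, a cluster analysis can be sketched with the scikit-learn library (the data set and its columns are hypothetical):

    # A minimal sketch of cluster analysis as a multivariate exploratory
    # technique; file and column names are hypothetical.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv("customers.csv")
    features = StandardScaler().fit_transform(data[["income", "spending"]])

    data["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(features)
    print(data.groupby("cluster")[["income", "spending"]].mean())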

Neural Networks. Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data.
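A minimal sketch of such a network, using scikit-learn's multi-layer perceptron on its built-in iris data set, shows the basic pattern of learning from existing data and then predicting new observations:

    # A minimal sketch of a neural network: learn from existing data, then
    # predict previously unseen observations.
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
    net.fit(X_train, y_train)              # the "learning" process
    print("accuracy on unseen observations:", net.score(X_test, y_test))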

Genetic Programming. Genetic and Evolutionary Programming are analytic techniques modeled after the (hypothesized) evolutionary processes seen in the breeding of plants and animals using survival of the fittest criteria for continued breeding. These techniques are achieving an increasingly impressive list of successes. In essence, Genetic Programming reduces Software Agents (or other types of computer programs) to genetic codes (from which the original program can be regenerated). All Software Agents in the pool are given a common task with a commonly measured objective. The genetic codes from the best performers are mixed together, using techniques modeled after those found in nature, to produce new genetic children. These child Software Agents are then introduced to the pool, and the process is repeated. After thousands of breeding generations, new Software Agents emerge which outperform all others in the pool.
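The breeding loop described above can be sketched minimally as follows (the "genetic code" here is just a bit string and the objective is deliberately trivial; a real Genetic Programming system would encode and regenerate whole agents):

    # A minimal sketch of the breeding loop: score a pool of genetic codes on
    # a common objective, mix the best performers, and repeat for many
    # generations. The bit-string encoding and objective are illustrative only.
    import random

    def fitness(code):
        return sum(code)                   # common measured objective

    def crossover(a, b):
        point = random.randint(1, len(a) - 1)
        return a[:point] + b[point:]

    def mutate(code, rate=0.01):
        return [1 - bit if random.random() < rate else bit for bit in code]

    pool = [[random.randint(0, 1) for _ in range(32)] for _ in range(50)]

    for generation in range(200):
        pool.sort(key=fitness, reverse=True)
        parents = pool[:10]                # best performers in the pool
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(len(pool) - len(parents))]
        pool = parents + children          # new generation enters the pool

    print("best performer:", fitness(max(pool, key=fitness)))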

Other Analysis Techniques

Obviously there are far too many to mention. Some of the more important ones (by no means an exhaustive list) are: natural language processing, graph traversal algorithms, pattern matching in picture data, and pattern matching in music data. There is no dominant technique for large volume data analysis. The only preparation guaranteed to support unrestricted data analysis is to have a Rapid Application Development (RAD) environment for the creation of large volume data analysis Software Agents (or other computer programs).

Sharing Software Agents

The next wave of technological innovation must integrate linked organizations and multiple application platforms. Developers must construct unified information management systems that use the world wide web and advanced software technologies. Software agents, one of the most exciting new developments in computer software technology, can be used to quickly and easily build integrated enterprise systems. The idea of having a software agent that can perform complex tasks on our behalf is intuitively appealing. The natural next step is to use multiple software agents that communicate and cooperate with each other to solve complex problems and implement complex systems. Software agents provide a powerful new method for implementing these next-generation information systems.

As the number and complexity of data analysis techniques increases, the original problem of sharing vast quantities of heterogeneous data is now accompanied by the additional problem of sharing large numbers of analytic software agents and of deciding which agent to use. At first, while there are only a few dozen C or Java software agents (all hand-written by human experts), sharing and choosing between analytic agents can be handled through cooperation between human experts. However, as the number and complexity of analytic software agents increases, this becomes very difficult.

Nowhere is this more apparent than with modern Genetic Programming technology which can produce millions of complex analytic software agents. None of these software agents have a human author. They are all authored by machines. There are no human experts to guide in their deployment, to explain how they operate, or to describe the analytic techniques which they employ. Furthermore, just keeping track of the experimental results, of millions of analytic software agents, can be a real housekeeping chore. It becomes rapidly obvious that a database capable of storing and serving analytic software agents is necessary.

Agent Information Server (AIS) is a database system designed to store and serve millions of analytic software agents across the Internet, intranets, WANs, and LANs. Agent Information Server is also capable of storing and serving vast quantities of XML data upon which these analytic software agents perform their complicated analyses.

AIS is a powerful database environment combining Agent technology and Internet server technology. It comes with a rich set of re-usable Agent Libraries, a fast Lisp compiler, a built-in Agent JavaScript compiler, an interactive agent debugger, and a persistent agent repository with a flexible database schema. AIS is designed to store and serve software agents that specialize in large volume data analysis, where microchip-level execution speed is a priority and multiple-gigabyte repositories are commonplace.

Any software agent served from AIS can be deployed directly over TCP/IP connections to any Internet browser on the World Wide Web, and can be deployed from any software product (MS Word, Excel, Project, etc.) or user client application that can open a TCP/IP socket connection (which is to say almost all existing legacy applications).
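The wire protocol is not described in this overview; purely as a hypothetical sketch, a client opening such a socket connection might look like this (the host name, port, and message format below are assumptions for illustration only):

    # Hypothetical sketch only: the server address and the request format are
    # assumptions, not the documented AIS protocol.
    import socket

    HOST, PORT = "ais.example.com", 8080   # assumed server address

    with socket.create_connection((HOST, PORT)) as conn:
        # Ask the server to run a (hypothetical) stored analytic agent.
        conn.sendall(b"runAgent salesPatternAnalyzer\n")
        print(conn.recv(4096).decode())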

What is an Agent?

Software Agents and Intelligent Software Agents are popular research objects these days in fields such as psychology, sociology and computer science. Agents are most intensely studied in the discipline of Artificial Intelligence (AI). Strangely enough, the question of what exactly an agent is has only very recently been addressed seriously. Due to the irrational exuberance of product marketing executives, there is considerable confusion as to what a software agent really is. What is marketing hyperbole and what is real technology?

"It is in our best interests, as pioneers of this technology, to stratify the technology in such a way that it is readily marketable to consumers. If we utterly confuse consumers about what agent technology is (as is the case today) then we'll have a hard time fully developing the market potential."
J. Williams on the Software Agents Mailing List

Because the term "agent" is currently used by many parties in many different ways, it has become difficult for users to make a good estimate of what the possibilities of agent technology are. At this moment, there is every appearance that there are more definitions than there are working examples of systems that could be called agent-based.

Agent producers that make unjustified use of the term agent to designate their products cause users to conclude that agent technology as a whole has not much to offer. That is - obviously - a worrying development:

"In order to survive for the agent, there must be something that really distinguishes agents from other programs, otherwise agents will fail. Researchers, the public and companies will no longer accept things that are called agent and the market for agents will be very small or even not exist."
Wijnand van de Calseyde on the Software Agents Mailing List

On the other hand, the description of agent capabilities should not be too rose-colored either.

Not everybody is that thrilled about agents. Especially in the field of computer science, a point of criticism often heard about agents is that they are not really a new technique, and that anything that can be done with agents "can just as well be done in C". According to these critics, agents are nothing but the latest hype.

The main points of criticism can be summarized as follows:

Particularly by researchers in the field of AI, these points of criticism are answered with the following counter arguments:

The 'pros' and 'cons' with regard to agents as they are mentioned here are by no means complete, and should be seen merely as an illustration of the general discussion about agents. What it does show is why it is necessary (in several respects) to have a definition of the concept "intelligent software agent" that is as clear and as precise as possible. It also shows that there is probably a long way to go before we arrive at such a definition - if we can come to such a definition at all.

 

[1] Unfortunately that question opens up the old AI can-of-worms about definitions of intelligence. E.g., does an intelligent entity necessarily have to possess emotions, self-awareness, etcetera, or is it sufficient that it performs tasks for which we currently do not possess algorithmic solutions?
[2] The 'opposite' can be said as well: in many cases the individual agents of a system aren't that intelligent at all, but the combination and co-operation of them leads to the intelligence and smartness of an agent system.
[3] These researchers see a paradigm shift from those who build intelligent systems and consequently grapple with problems of knowledge representation and acquisition, to those who build distributed, not particularly, intelligent systems, and hope that intelligence will emerge in some sort of Gestalt fashion. The knowledge acquisition problem gets solved by being declared to be a 'non-problem'.

What is an Intelligent Agent?

"An agent is a software thing that knows how to do things that you could probably do yourself if you had the time."
Ted Selker of the IBM Almaden Research Centre (quote taken from [JANC95])

We will not come to a rock-solid formal definition of the concept "intelligent agent". Given the multiplicity of roles agents can play, this is quite impossible and even very impractical. On the Software Agents Mailing List, however, a possible informal definition of an intelligent software agent was given:

 
"A piece of software which performs a given task using information gleaned from its environment to act in a suitable manner so as to complete the task successfully. The software should be able to adapt itself based on changes occurring in its environment, so that a change in circumstances will still yield the intended result."
(with thanks to G.W. Lecky-Thompson for this definition)

Instead of the formal definition, a list of general characteristics of agents will be given. Together these characteristics give a global impression of what an intelligent agent "is".  [1]

Weak Concept of Intelligent Agent

The first group of characteristics presented here is connected to the weak notion of the concept "intelligent agent". Most scientists currently agree that an agent should possess most, if not all, of these characteristics.

Thus, a simple way of conceptualizing an intelligent agent is as a kind of UNIX-like software process that exhibits the properties listed above. A clear example of an agent that meets the weak notion of an intelligent agent is the so-called softbot ('software robot'). This is an intelligent agent that is active in a software environment (for instance the previously mentioned UNIX operating system).

Strong Concept of Intelligent Agent

The second group of characteristics, connected to the strong notion of the concept "intelligent agent", does not go without saying for everybody.

For some researchers - particularly those working in the field of AI - the term intelligent agent has a stronger and more specific meaning than that sketched out in the weak notion. These researchers generally take an intelligent agent to be a computer system that, in addition to having the properties identified previously, is either conceptualized or implemented using concepts that are more usually applied to humans. For example, it is quite common in AI to characterize an agent using mentalistic notions, such as knowledge, belief, intention, and obligation [6]. Some AI researchers have gone further and considered emotional agents [7].

Another way of giving agents human-like attributes is to represent them visually, using techniques such as a cartoon-like graphical icon or an animated face [8]. Research into this matter [9] has shown that, although agents are pieces of software code, people like to deal with them as if they were dealing with other people (regardless of the type of agent interface being used).

Agents that fit the stronger notion of intelligent agent usually have one or more of the following characteristics: [10]

Although no single agent possesses all of these abilities, there are several prototype agents in the AI research community that possess quite a few of them. At this moment no consensus has been reached about the relative importance (weight) of each of these characteristics in the agent as a whole. What most scientists do agree on is that it is these kinds of characteristics that distinguish intelligent agents from ordinary programs.

[1] See [WOOL95] for a more elaborate overview of the theoretical and practical aspects of agents.
[2] See: Castelfranchi, C. (1995). Guarantees for autonomy in cognitive agent architecture. In Wooldridge, M. and Jennings, N. R., ed., Intelligent Agents: Theories, Architectures, and Languages (LNAI Volume 890), page 56-70. Springer-Verlag: Heidelberg, Germany.
[3] See: Genesereth, M. R. and Ketchpel, S. P. (1994). Software Agents. Communications of the ACM, 37(7): page 48-53.
[4] Note that the kind of reactivity that is displayed by agents, is beyond that of so-called (UNIX) daemons. Daemons are system processes that continuously monitor system resources and activities, and become active once certain conditions (e.g. thresholds) are met. As opposed to agents, daemons react in a very straight-forward way, and they do not get better in reacting to certain conditions.
[5] Analogous to the "sleep" state in a UNIX system, where a process that has no further tasks to be done, or has to wait for another process to finish, goes into a sleep state until another process wakes it up again.
[6] See: Shoham, Y. Agent-oriented programming. Artificial Intelligence, 60(1): page 51-92, 1993.
[7] See, for instance, Bates, J. The role of emotion in believable agents. Communications of the ACM, 37(7): page 122-125, 1994.
[8] See: Maes, P. Agents that reduce work and information overload. Communications of the ACM, 37(7): page 31-40, 1994.
[9] See, for instance, Norman, D. How Might People Interact with Agents. Communications of the ACM, July 1994.
[10] This list is far from complete. There are many other characteristics of agents that could have been added to this list. The characteristics that are mentioned here are there for illustrative purposes and should not be interpreted as an ultimate enumeration.
[11] See: White, J. E. Telescript technology: The foundation for the electronic marketplace. White paper, General Magic Inc., 1994.
[12] See: Rosenschein, J. S. and Genesereth, M. R. Deals among rational agents. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence (IJCAI-85), page 91-99, Los Angeles, United States, 1985.
[13] See: Galliers, J. R. A Theoretical Framework for Computer Models of Cooperative Dialogue, Acknowledging Multi-Agent Conflict. PhD thesis, page 49-54, Open University, Great Britain, 1994.
[14] See: Eichmann, D. Ethical Web Agents. Proceedings of the Second International World-Wide Web Conference. Chicago, United States, October 1994.