Tuesday, November 22, 2005

Smart Searching - Military Information Technology---DIA

New technology is helping defense intelligence analysts sort through huge volumes of data.
By Cheryl Gerber

The unique requirements of defense intelligence analysts are refining search technology down from mass production, with its vast and sometimes trivial outcomes, to more guided, dynamic navigation able to produce results that are both inclusive and relevant.

As one of the largest collectors of information on the planet, the Defense Intelligence Agency (DIA) is responsible for amassing and analyzing all sources of human intelligence in the field from all information types in a multitude of languages.

"This forces us to deal with huge volumes of data. It's an enormous challenge," said a senior DIA official.

The task is indeed a massive one. Sources of intelligence in the field include feeds from UAVs, intelligence, surveillance and reconnaissance data from a vast array of sensors and overhead platforms, signal intelligence, satellites, film and video, not to mention all the data from the open source world. "We need to manage all that data and make it available as quickly as possible to analysts," the DIA official said.

In addition, the agency manages the global infrastructure for the Department of Defense Intelligence Information Systems (DoDIIS) and the combined information technology of multiple agencies on the Joint Worldwide Intelligence Communications System (JWICS), and it must coordinate the process of managing and sharing top-secret information across multiple agencies and commands. "All data goes into a top-secret network and we share results at that level or the secret level, or onto a PDA at secret level or below," the official said.

To conduct research and analysis effectively, DIA relies on a broad inventory of technology tools from such companies as Endeca Technologies, Basis Technology, Inxight Software, Insightful, Attensity, Convera, NetOwl and ClearForest.

"We use them all in ways that support semantic search," the DIA official said.

Semantic search improves traditional Web searching by linking and defining information in a way that provides more effective discovery. It reaches beyond today's single hyperlink connection into many different kinds of relations between resources. One way to accomplish this advanced search method is by using eXtensible Markup Language (XML), the standard for defining data elements on the Web.

Entity Extraction

The tools from the abovementioned companies provide technology that supports multilingual, integrated knowledge sharing through entity extraction (specific information retrieval), text mining and text analysis to achieve enhanced information discovery through the machine understanding of the semantic meaning of text. Entity extraction and semantic search produce discovery in the process rather than merely producing search results.

However, the technology cannot deliver discovery effectively unless it is highly scalable. In addition, the defense intelligence community tends to use extremely complex query strings that can run pages long to search through terabytes of data. Most search engines cannot handle such large queries under the load of terabytes.

To address the scalability issue, the DIA purchased an enterprise license of Convera Technologies' RetrievalWare server software, a highly scalable search and knowledge discovery platform for unstructured as well as structured data.

"Our software platform is designed to be extremely scalable. It allows users to discover information by organizing content based on multiple taxonomies or classifications," said Jon Lieberman, Convera federal senior account manager.

RetrievalWare contains a component called categorization and dynamic classification. "Our software allows users to drill down into vast data stores and classify information by, for example, region, chemical agents and known terrorist organizations," said Lieberman.

Another such tool, Endeca Technologies' ProFind platform, provides next-generation search with guided navigation that integrates, discovers and analyzes data through bolt-on modules for intelligence analysis or knowledge management. The information retrieval system presents results and refinement options to help users find content most relevant to their unique needs.

"The onus is not on the intelligence analyst to ask the next question to determine a valid query refinement. Endeca dynamically shows you all the next refinements, all the next navigations you can take at any point in any data set," explained Greg Fairbank, manager, Endeca Federal Industry Solutions.

"It's more of a math problem than of exploration. We frame the search so you're not overloaded with information. It's the lopping off of all the dead ends at any point in a data set. And we discover new intelligence we'd never know to ask by using the system," Fairbank said.

Contained within Endeca is Basis Technology's Rosette Linguistics Platform server, used by DIA's National Media Exploitation Center (NMEC) to process and extract information in the native script of multiple languages. The NMEC uses various modules in Rosette such as the language identifier, which can recognize 40 different languages, and the entity extraction module, which then extracts information about the document.

"Entities are key pieces of information, such as people, places, phone numbers and addresses," said Carl Hoffman, chief executive officer of Basis Technology. Rosette has become an essential part of building a large scale multilingual search engine, he said, adding, "Most Web and enterprise search engines today use Rosette at the core of their technology."

Multiple Languages

Managing multiple languages is an essential part of the worldwide search process in the intelligence community. "When you are dealing with Web search, language identification is the first step toward knowing whether to process the document, analyze it further or pass it to the appropriate analyst," said Bill Ray, Basis vice president of sales.
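
As a rough illustration of that first routing step, the sketch below guesses the dominant writing system of a text from Unicode character names and picks a destination queue. It identifies scripts, not languages, and it is not how Rosette works; the function names and the queue table are hypothetical, invented only for this example.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script name among the letters in text, or None."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode character names begin with the script,
            # e.g. "ARABIC LETTER ALEF" or "CYRILLIC SMALL LETTER A".
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


# Hypothetical routing table: which analyst queue handles which script.
QUEUES = {
    "ARABIC": "arabic-desk",
    "CYRILLIC": "russian-desk",
    "LATIN": "general-desk",
}


def route(text):
    """Pick a queue for the text; unknown or letterless text is left unrouted."""
    return QUEUES.get(dominant_script(text), "unidentified")
```

A real identifier would have to separate languages that share a script, such as French and German, which typically calls for statistical models rather than a character lookup.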

"People in the field receive new intelligence every day in multiple languages, and their use of Rosette allows them to turn that information into actionable intelligence," Hoffman said.

When U.S. forces are working with intelligence analysts overseas, they frequently find computers in buildings, which they must search for needed information. Using Rosette modules, they are able to search foreign file formats in hard drives, CD-ROMs and floppy disks to extract bank account numbers, cell phone numbers, e-mail addresses or nicknames in foreign languages.
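
The kind of extraction described here can be pictured with a deliberately simple sketch that pulls e-mail addresses and phone numbers out of plain text using regular expressions. This is an assumption-laden illustration, not the Rosette approach: the patterns are simplified, they handle only Latin-script digits and addresses, and the function name is hypothetical.

```python
import re

# Simplified patterns for illustration only; real extractors are far more robust.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# An optional "+", a digit, at least seven digits or separators, then a final digit.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_entities(text):
    """Return the e-mail addresses and phone numbers found in text."""
    return {
        "email": EMAIL.findall(text),
        "phone": PHONE.findall(text),
    }


sample = "Contact ali@example.org or call +1 202-555-0147 tomorrow"
print(extract_entities(sample))
```

Names, nicknames and places cannot be captured by patterns like these, which is why the products described in this article rely on linguistic analysis rather than pattern matching alone.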

"We were the first In-Q-Tel company whose software was deployed inside the National Geospatial-Intelligence Agency," said Hoffman.

In-Q-Tel is the venture capital fund, largely funded by the intelligence community, that seeks to stimulate new technologies through private sector entrepreneurs. It has invested in Basis and licenses Rosette modules.

Transliteration can become a problem when each government agency has a different standard and there is not enough interagency agreement. Rosette modules allow interoperability between different government agency transliteration standards.

Competing transliteration standards may be resolved, however, by the Intelligence Community Metadata Working Group (ICMWG), which is charged with establishing standards for the tagging of all data used by DIA systems.

"We are strong proponents of that, since it will allow us to work interoperably," said the DIA official. "We are now going backwards and tagging legacy data on all legacy systems and everything we currently produce, so we don't have this problem in the future."

Endeca leverages the tagging of legacy data by making guided navigation out of metadata. The metadata that is created becomes a series of navigations. "We figure out all the valid intersections in metadata and how to hop from one valid intersection to another," said Fairbank.
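
One way to picture guided navigation over tagged records is the small sketch below. Given the metadata values an analyst has already selected, it lists only the further refinements that still lead to at least one record, so dead ends never appear as options. It is a toy illustration under stated assumptions, not Endeca's algorithm; the records, field names and function names are all invented.

```python
from collections import Counter

# Hypothetical tagged records; fields and values are invented for illustration.
RECORDS = [
    {"region": "Europe", "language": "French", "type": "report"},
    {"region": "Europe", "language": "German", "type": "video"},
    {"region": "Asia", "language": "Chinese", "type": "report"},
    {"region": "Asia", "language": "Chinese", "type": "video"},
]


def matching(records, selection):
    """Return the records that carry every field/value pair in the selection."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in selection.items())
    ]


def refinements(records, selection):
    """Map each unselected field to the values (with counts) that remain valid."""
    hits = matching(records, selection)
    remaining = sorted({field for r in hits for field in r} - set(selection))
    return {
        field: dict(Counter(r[field] for r in hits if field in r))
        for field in remaining
    }


print(refinements(RECORDS, {"region": "Asia"}))
```

With "Asia" selected, "German" and "French" are never offered, because no record sits at that intersection of metadata. A production system would precompute or index these intersections rather than rescan every record on each click.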

DIA uses an in-house metadata extraction tagging system.

DIA also uses Insightful Corp.'s InFact search engine to help analysts search and extract meaningful intelligence from unstructured data sources. The search engine identifies the subject, verb and object of each sentence and organizes the data in a way that makes it easier to analyze data relationships across documents. This process allows analysts to uncover and understand obscure patterns and activities linking terrorist organizations, geographical locations, money transactions and many other entities.

"We've been collaborating with many DoD groups within the past few years and, as a result, we've been improving our product and tailoring it to the needs of our DoD customers," said Giovanni Marchisio, Insightful's vice president of engineering and research for text analysis and search. "We worked in partnership with a select group of DIA super users and high-end statistics analysts who defined how the product should work," he said.

"Let's say you want to search all the news in the world for helicopter crashes that have occurred within the last three years. You want to find where they have occurred and what day of the week they occurred. InFact gives you back a table that shows what days of the week the helicopter crashes occurred. The system is intelligent enough to understand the linguistic attributes and syntactic roles in grammatical sentences," Marchisio said.
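
The final tabulation step in that example can be sketched as follows, assuming the hard part, extracting each event and its date from the text, has already been done. The event list is invented and the function name is hypothetical; nothing here reflects how InFact itself is implemented.

```python
from collections import Counter
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical events already extracted from news text: (event, location, date).
EVENTS = [
    ("helicopter crash", "location A", date(2005, 11, 22)),
    ("helicopter crash", "location B", date(2005, 11, 29)),
    ("helicopter crash", "location C", date(2005, 11, 21)),
]


def crashes_by_weekday(events):
    """Count events per day of the week, keyed by English day name."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return Counter(DAYS[when.weekday()] for _, _, when in events)


for day, count in crashes_by_weekday(EVENTS).most_common():
    print(f"{day:<10}{count}")
```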

Query Language

Many of the search engines familiar to consumers, such as Google, are based on Boolean logic to perform a search with complex, long queries. But the Boolean language is not as expressive as it could be in the kind of query one can pose. "The InFact Query Language (IQL) can express in three words what would take 20 lines using Boolean language," said Marchisio.

Although Insightful offers a four-hour course on using IQL, it is currently working to increase usability in order to eliminate the need for the course and to reach a broader audience among those who are not all super users or might not have the time to learn how to use the technology.

The product provides flexibility by allowing individual choice of Web browser. "InFact is based on a client/server architecture. The server side hosts the search service and does the preprocessing and indexing of the documents. You can use any Web browser on any platform on the client side. It could be Windows, Linux, Unix Solaris or Apple," said Marchisio.

DIA also uses Inxight Software's overarching SmartDiscovery Analysis Server, which understands documents in multiple languages and, through its visualization tools, reveals relationships and trends. SmartDiscovery evolved from text analysis and visualization technology that came from Xerox's Palo Alto Research Center. The company's LinguistX platform is now used in Yahoo's search offering, said Paul Battaglia, president of Inxight Federal Systems Group.

DIA purchased five SmartDiscovery modules: search, summarization, categorization and taxonomy management, entity extraction, and fact extraction. The agency also purchased SmartDiscovery ThingFinder and visualization software, called Inxight StarTree, in an enterprise license that covers 10 of the Unified Combatant Commands and the five new Regional Service Centers established by the recent reorganization under DIA.

ThingFinder is a text analysis application that automatically identifies and tags more than 30 entity types. The software can extract entities in 20 languages, which are then displayed visually through hyperbolic tree visualization software known as StarTree.

SmartDiscovery was architected for massively parallel processing, using XML for integration and Web services-based application programming interfaces (API). But speed on a large scale is a major technology challenge, and the company is hoping to discover solutions through an Air Force Research Lab award of $2 million to investigate ways to process multiple terabytes of information, said Battaglia.

Basis Technology also works with BrightPlanet, which has integrated Rosette into its Deep Query Manager (DQM), a management platform and deep navigational search engine used by knowledge workers to assist the intelligence community in conducting multilingual searches and harvests.

Several agencies, including the Office of the Secretary of Defense (OSD), use BrightPlanet's DQM, which can export content in XML to other applications and integrate the software's functionality with other programs using a BrightPlanet API. Version 5.0 of DQM has the ability to harvest and process documents from deep web, conventional surface web and internal file system sources, in up to 140 different foreign languages in more than 370 file formats.

"You don't want to spook the people whose data you are extracting. They can look at their own traffic volumes and where it's coming from. You'd much rather that they are not aware you are looking. So you only want to touch those sites once. In order to do that, you need a sophisticated harvesting technique, not a brute force approach," said Duncan Witte, president and chief operating officer of BrightPlanet.

Complexity and scalability are the most challenging aspects when exploiting search technology in the defense intelligence community, Witte said. While BrightPlanet got started with search and harvest technology, he said, users also "have to be able to organize, manage and distribute the huge volume of information as well. You need various specialties that allow collaboration with teammates and effective distribution of information."

DIA maintains a steady push toward technology improvement. "We try to do the best we can with the volumes. In-house we have a lot of expertise on search algorithms and text analysis. But we need to do a better job of combing through the massive volumes of information to find that which is interesting and nontrivial in a way that leads to knowledge discovery. We need better information retrieval through machine understanding of the semantic meaning of text, regardless of language," the DIA official said.
