Special Event: Command, Control, and Interoperability Center for
Advanced Data Analysis (CCICADA)-wide Research Retreat
March 7-8, 2010
Morgan State University, Baltimore, MD
- Organizers:
- Ed Hovy, CCICADA/USC, hovy at isi.edu
- Jack Jarmon, CCICADA/DIMACS, jjarmon at dimacs.rutgers.edu
- Asamoah Nkwanta, Morgan State University, asamoah.nkwanta
at morgan.edu
- Bill Pottenger, CCICADA/Rutgers, billp at
dimacs.rutgers.edu
- Fred Roberts, CCICADA/DIMACS, froberts at
dimacs.rutgers.edu
- Dan Roth, CCICADA/University of Illinois at
Urbana-Champaign, danr at uiuc.edu
- Guoping Zhang, Morgan State University, guoping.zhang at
morgan.edu
Presented under the auspices of the Homeland Security Command,
Control, and Interoperability Center for Advanced Data Analysis
(CCICADA).
Abstracts:
Earl R. Barnes, Morgan State
Talk Title: A Graph Partitioning Problem for
Disease Control
We consider a graph in which nodes represent
individuals and edges
correspond to pairs of individuals that are in frequent contact with
each other. Assume that a certain number of individuals become
infected with a communicable disease. Our problem is to find the least
number of edges that must be cut to isolate the infected individuals
from a certain percentage of the population. This is a graph
partitioning problem with constraints. We obtain bounds on the number
of edges that must be cut to isolate the infected individuals from a
certain percentage of the population. The bounds depend on the
eigenvalues of the adjacency matrix of the graph.
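The eigenvalue bounds themselves are not reproduced here, but the quantity they bound is easy to state concretely: for a partition encoded as a +/-1 indicator vector x, the number of edges cut equals x^T L x / 4, where L = D - A is the graph Laplacian. A minimal sketch in Python (the example graph and names are illustrative, not from the talk):

```python
import numpy as np

def cut_size(adjacency: np.ndarray, side: np.ndarray) -> int:
    """Number of edges crossing a partition given as a +/-1 label vector.

    Uses the identity cut(x) = x^T L x / 4 with L = D - A.
    """
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return int(side @ laplacian @ side) // 4

# Path graph 0-1-2-3: isolating "infected" node 0 cuts exactly one edge.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(cut_size(A, np.array([1, -1, -1, -1])))  # 1
```

Minimizing this quadratic form over constrained +/-1 vectors is the hard combinatorial problem; relaxing x to real vectors is what brings the eigenvalues of the graph's matrices into play.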
Smriti Bhagat, Rutgers
Talk Title: Hone: Automatically Watching Across Information Networks
An important security task is to watch a set of candidate personas
online. This is challenging because (a) one often has only a small
number of identifiers (email addresses, telephone numbers, etc.) for a
persona and must discover the rest; (b) one often knows identifiers in
only a few of the information networks (say, email) while personas can
reside in many others (chatrooms, social networks); and, most
importantly, (c) personas are dynamic: the persons behind them can
quickly drop old personas or adopt new ones in any of the myriad
information networks. Some of these networks, e.g., blogs and Twitter,
leave a publicly crawlable trail of communication. Others, such as
email, Facebook, and VoIP, are more private. As a result, systems that
watch personas not only have to identify multiple identities across
networks, a known challenge in Intelligence Data Analysis, but must
also adapt quickly, an aspect that has received limited attention in
research and requires working beyond the limitations of information
available on the web.
We present our system Hone. It adopts the unique approach of
simultaneously monitoring real-time packet-level IP traffic as well as
performing targeted analysis of application-level data. It combines
real-time streaming analysis of packet-level data with traditional
crawl and information-retrieval analysis of the web. The result is a
system that starts with a rudimentary watch list and successively
hones in on the multiple personas of the watch-list entities as new
associations are discovered, tracking them as they change. This
presents an analyst with a complete view of a suspect's online
communications and enables tracking the suspect in an automatic and
dynamic way.
Joint work with S. Muthukrishnan and Narus Inc.
Congxing Cai, USC/ISI
Talk Title: Integration and summarization of multiple media
This talk describes our work on the integration and summarization of
multiple media for strategic analysis. We demonstrate how to link
and present different media for analysis. When the amount of
associated textual data is large, it is organized and summarized
before display. A hierarchical summarization framework, conditioned on
the small space available for display, has been fully implemented.
Eugene Fink, CMU
Talk Title: Machine learning methods for cybersecurity
We are working on machine learning tools for
improving
cyber defenses through automated discovery of unexpected patterns in
system behavior, network traffic, and other attack indicators. These
tools will support proactive analysis and early detection of attacks;
automated adaptation of defenses to the needs and usage patterns of
individual users; and sharing related knowledge and experience among
multiple users. They will complement the standard defenses by adding a
"layer of armor" that detects novel threats.
Cibin George, Rutgers
Poster Title: Higher order feature associations for
classification
Higher Order Naive Bayes (HONB, Ganiz et al., 2009)
exploits higher
order associations between features for classification purposes. We
first confirm the utility of HONB at low sample sizes using the
Wilcoxon signed rank test. We then present a new class of graph
sampling algorithms that exploit higher order associations. We
empirically demonstrate that second order path counts in document
relation graphs can be successfully leveraged to reduce the sample
size without significantly impacting classification performance.
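To make "second order path counts" concrete: link two terms when they co-occur in some document; second-order associations between terms i and j are then two-step paths i -> k -> j in that co-occurrence graph. A minimal illustrative sketch (not the HONB implementation):

```python
import numpy as np

def second_order_paths(doc_term: np.ndarray) -> np.ndarray:
    """Count length-2 paths between terms in the term co-occurrence graph.

    doc_term: binary documents-by-terms matrix.
    """
    cooc = (doc_term.T @ doc_term) > 0   # terms sharing at least one document
    np.fill_diagonal(cooc, False)        # ignore self-loops
    B = cooc.astype(int)
    return B @ B                         # (B @ B)[i, j] = # of paths i -> k -> j

# Terms 0 and 2 never co-occur, yet are linked by a second-order path via term 1.
X = np.array([[1, 1, 0],
              [0, 1, 1]])
```

Such higher-order links are exactly what first-order (direct co-occurrence) statistics miss at small sample sizes.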
Emilie Hogan, Rutgers
Poster Title: View Discovery in OLAP Databases Through Statistical
Combinatorial Optimization
In many projects so much data is being collected
that it quickly becomes unmanageable. When the data being collected is
multidimensional, an Online Analytical Processing (OLAP) database can
be used for storage and analysis. The capability of OLAP database
software systems to handle data complexity comes at a high price for
analysts, presenting them a combinatorially vast space of views of a
relational database. To get the most out of the amount of data that we
have there needs to be a way to reveal areas of the data that analysts
may otherwise never find.
We responded to the need to deploy technologies that
will allow users to guide themselves to areas of local structure. We
did this by casting the space of "views" of an OLAP database as a
combinatorial object of all projections and subsets (i.e., a lattice),
and "view discovery" as a search process over that lattice. We equipped
the view lattice with statistical information theoretical measures
sufficient to support a combinatorial optimization process. We outlined
"hop-chaining" as a particular view discovery algorithm over this
object, wherein users are guided across a permutation of the dimensions
by searching for successive two-dimensional views, pushing seen
dimensions into an increasingly large background filter in a
"spiraling" search process. For testing purposes we have applied our
algorithm to the database of summary statistics for radiation portal
monitors at US ports.
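As a toy stand-in for one step of this search (the actual hop-chaining work uses richer statistical information-theoretic measures), candidate two-dimensional views can be enumerated and scored by joint entropy, preferring the most concentrated view; all names below are illustrative:

```python
import itertools
import math
from collections import Counter

def view_entropy(records, dims):
    """Joint entropy (bits) of the projection of records onto dims."""
    counts = Counter(tuple(r[d] for d in dims) for r in records)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_2d_view(records, dimensions):
    """Pick the pair of dimensions whose 2-D view is most concentrated."""
    return min(itertools.combinations(dimensions, 2),
               key=lambda pair: view_entropy(records, pair))

records = [{'a': 1, 'b': 2, 'c': 0},
           {'a': 1, 'b': 2, 'c': 1}]
```

Hop-chaining repeats such a selection step while pushing already-seen dimensions into the background filter.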
Cindy Hui, RPI
Talk Title: Simulating the Diffusion of Warnings in Large Dynamic
Networks
Warning systems play an important role in informing
the at-risk population of potential dangers during hazardous events.
These systems are also used to provide information on protective
measures to promote safety in the community. In addition to a
technologically reliable system, it is important to understand and make
use of the social communication network in communities to spread the
warnings to a larger audience and to help ensure that the people at
risk will act on the information they receive. This project involves
formulating an axiomatic framework for modeling the diffusion of
warnings in dynamic social networks through the concept of trust. The
network is dynamic in that individuals may leave the network and
disrupt the flow of information as warnings are being diffused.
We assess the framework by modeling the 2007 San Diego
Firestorms, in particular the diffusion of the Reverse911 evacuation
warnings sent during the event. We generate a hypothetical social
network of San Diego County with one million household nodes. We
configure the parameters and map the process using multiple data
sources relevant to the event. We use the model to examine how social
group structure, distribution of trust, and existence of weak ties
affect the spread of evacuation warnings.
Advisors: William A. Wallace, Malik Magdon-Ismail, and
Mark Goldberg
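The trust-based diffusion idea can be sketched as a threshold model: a household accepts (and relays) a warning once the summed trust it places in already-warned neighbors crosses a threshold. The rule, names, and parameters below are illustrative, not the authors' axiomatic framework:

```python
def diffuse_warning(in_neighbors, trust, thresholds, seeds, rounds=100):
    """Deterministic trust-threshold diffusion over a directed network.

    in_neighbors: node -> list of nodes it hears from
    trust: (sender, receiver) -> trust weight
    thresholds: node -> total trust needed before acting on the warning
    seeds: initially warned nodes (e.g., Reverse911 recipients)
    """
    warned = set(seeds)
    for _ in range(rounds):
        newly = {v for v in in_neighbors if v not in warned and
                 sum(trust[(u, v)] for u in in_neighbors[v] if u in warned)
                 >= thresholds[v]}
        if not newly:
            break
        warned |= newly
    return warned
```

Dropping a node mid-simulation (individuals leaving the network) would simply remove it from in_neighbors between rounds, disrupting downstream diffusion.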
Ming Ji, University of Illinois at Urbana-Champaign
Talk Title: Mining Hidden Communities in Heterogeneous
Information Networks
Information networks, composed of large numbers of objects linking to
each other, are ubiquitous in real life. Common examples include
telephone account networks linked by calls, co-author networks and
paper citation networks extracted from bibliographic data, webpage
networks interconnected by hyperlinks in the World Wide Web, etc.
Discovering hidden communities of special interest in information
networks, with the help of prior knowledge for part of the objects,
has recently attracted substantial interest. Current work on hidden
community discovery mainly focuses on homogeneous information
networks, i.e., networks composed of a single type of object, as
mentioned above. But in real life, it is more natural to mine hidden
communities in heterogeneous information networks composed of multiple
types of objects. For instance, the blogosphere can be viewed as a
heterogeneous information network composed of blogs, users, and terms.
Given prior knowledge for a few terms and users, we can precisely
detect online communities with special interests. In fact,
applications like terrorist email detection, fraud detection, and
research community discovery can all be cast as community mining
problems on heterogeneous information networks.
In this work, we try to mine hidden communities with special interests
on heterogeneous information networks directly, which has hardly been
explored so far. Given prior knowledge about which community some of
the objects belong to, we solve the problem by predicting communities
for all types of the remaining objects. A novel graph-based
regularization framework is proposed to model the link structure in
heterogeneous information networks with arbitrary network schemas and
numbers of object/link types. Specifically, we explicitly
differentiate the multi-typed link information by incorporating it
into different relation graphs. Efficient computational schemes are
then introduced to solve the corresponding optimization problem.
Experiments on the DBLP data set show that our algorithm significantly
improves prediction accuracy over existing state-of-the-art methods.
This work has been submitted to KDD 2010.
Joint work with Yizhou Sun, Marina Danilevsky, Jing Gao, and Jiawei
Han, University of Illinois at Urbana-Champaign.
Darja Krusevskaja, Rutgers
Poster Title: Inferring Multi-Relationships
We consider entities that are connected to each other in information
networks. What characterizes the relationship between entities? Common
analysis techniques address the presence or absence of a link between
pairs of entities, or its strength. However, the nature of a link is
more versatile in information networks, in particular ones of interest
to the security and intelligence communities.
A link can sometimes be characterized by several types of
relationships between entities; e.g., two persons can be friends,
co-workers, and classmates at the same time. We might also be
interested in the strengths of the different types of relationships.
In networks where multiple types of connections are possible, we might
be interested in the top few relationship types that describe the
relation best in terms of their strength.
We propose two sets of algorithms for the problem of predicting
relationship types and strengths from the graph structure. One is
derived from similar algorithms used for inferring labels for nodes,
rather than edges, in a graph; this is accomplished using the dual
graph, so that the edge-inference problem becomes a node-inference
problem. The other algorithm is based on the intuition that the
strength of an edge should reflect the stationary probability of a
suitable random walk from one endpoint of that edge to the other, and
vice versa. Both algorithms have been implemented, and preliminary
tests have been done with the Newman co-authorship dataset from 2003.
Our experimental results show good correlation between true
relationship types and our predictions.
Joint Work with CCICADA PI S. Muthukrishnan.
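The dual-graph construction mentioned above, in which edge inference becomes node inference, amounts to building the line graph: each original edge becomes a node, adjacent to every edge it shares an endpoint with. A minimal sketch, not the authors' code:

```python
import itertools
from collections import defaultdict

def line_graph(edges):
    """Dual (line) graph: nodes are the original edges; two such nodes are
    adjacent iff the original edges share an endpoint."""
    incident = defaultdict(list)
    for edge in edges:
        for endpoint in edge:
            incident[endpoint].append(edge)
    adjacency = {edge: set() for edge in edges}
    for shared_edges in incident.values():
        for e1, e2 in itertools.combinations(shared_edges, 2):
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency
```

Any node-labeling algorithm run on this graph then assigns relationship types to the original edges.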
Yue Lu, University of Illinois at Urbana-Champaign
Talk Title: Statistical Topic Models for Large-Scale Opinion
Integration
Project Scope: In homeland security applications, there is often a
need to gather and integrate scattered opinions about an entity such as
a person, an organization or a policy. Thanks to Web 2.0 technology
which has enabled more and more people to freely express their
opinions, the Web has become an extremely valuable source for analyzing
views and opinions. However, with the current technologies, it is still
difficult for people to integrate and digest all opinions relevant to a
specific topic. In this work we study how to automatically generate a
structured summary for any given topic by integrating opinions from
different kinds of resources about the topic, such as well-written news
articles, database tables, opinions scattering in blogspaces and
forums; the goal is to help people digest and exploit a large number of
scattered opinions in a general way.
Recent Progress: We have studied how to automatically integrate
opinions expressed in a well-written article with lots of opinions
scattering in various sources such as blogspaces and forums. We
proposed semi-supervised topic models to solve the problem in a
principled way. The models can be used to integrate a well written
review with opinions in an arbitrary text collection about any topic to
potentially support many interesting applications in multiple domains.
We have already obtained interesting opinion integration results on all
US presidents and Hurricane Katrina.
Future Plans: We will address a more general setup of the problem:
integrating opinions in an arbitrary text collection with a set of
well-written articles instead of a single one. We also plan to
investigate many other resources, such as existing databases about
people or events. Ultimately, the effectiveness of our research results
will be demonstrated by developing a toolkit to facilitate broad
applications of opinion integration.
Relevance to listed research areas: Our research is in the area of
"Advanced Data Analysis and Visualization" and it can potentially
support many interesting applications, such as in the areas of "Social
and Behavioral Sciences" and "Human Factors".
Carlos T. Murray, Morgan State
Talk Title: A Practical Application of Oracle Apex
Software
In this presentation I will discuss how Oracle Apex
software is used to make the management of financial transactions
easier for any business with an automated checking system. The topics
covered in this application include database structures and Structured
Query Language (SQL), which require a background in data structures
and in problem solving using C++ and Perl programming.
Helene Nguewou, Morgan State University
Poster Title: A Perl Implementation of a Contact-waiting Time Metric
for HIV-RNA Folding Prediction: Are There National Security
Implications?
This work focuses on creating a user-friendly
prediction tool for RNA folding kinetics in Perl language. The
contact-waiting time (CWT) metric is applied to certain HIV sequences
in order to calculate their folding rates. The CWT metric will be
converted from MATLAB to Perl language and will be tested on sequences
in HIV databases. The purpose of creating the Perl implementation is to
make the CWT more widely available and easy to use as a bioinformatics
tool. In addition, are there national security implications as a result
of HIV sequence prediction? At least 39 million people now infected
with the virus are expected to die in the next 5-10 years. This
depletion of elite workers and professionals constitutes a threat to
homeland security, which will then be at greater risk of civil
disturbance, conflict, and disorder. The disparity of access to
retroviral drugs increases the widening life-expectancy gap between
poor countries and Western countries. As a result of this, there is
increasing concern that nations highly infected with HIV might engage
in bioterrorist acts against the United States. The lack of an
effective and affordable vaccine against the virus makes this threat
even more conceivable. Therefore, HIV research efforts are of high
importance.
Joint work with Asamoah Nkwanta.
Bill Pottenger, Rutgers
Talk Title: Higher Order Learning
In order to recognize and capture inherent
semantics in data we have developed an approach to feature space
transformations termed Higher Order Learning (HOL) that has proven
effective in both discriminative and generative settings (Ganiz, 2008;
Ganiz et al., 2006; Ganiz, Pottenger and George, 2010; Li et al., 2007;
Li et al., 2005; Ganiz, Lytkin and Pottenger, 2009; Menon and
Pottenger, 2009). We have successfully leveraged the power of
data representations based on HOL in a variety of problem domains.
HOL-based methods significantly outperform the traditional approaches
in text classification (Ganiz, 2008; Ganiz, Pottenger and George, 2010;
Ganiz, Lytkin and Pottenger, 2009) and in association rule mining (Li
et al., 2007; Li et al., 2005). HOL-based models have also proven
effective in network event and anomaly detection (Ganiz, 2008; Ganiz et
al., 2006; Menon and Pottenger, 2009) on time series data from the
Border Gateway Protocol, the backbone protocol in the Internet routing
infrastructure. In more recent work, HOL was successfully
applied to threat detection in streaming data in a defense setting
(Pottenger, 2009).
Warren B. Powell, Princeton
Talk Title: Optimal Learning for Homeland Security
Optimal learning addresses the challenge of
collecting information
quickly when observations are time consuming and expensive. While
optimal policies for collecting information are computationally
intractable, the knowledge gradient policy has proven to be
particularly powerful. We have adapted this concept to both offline
(guiding laboratory research) and online (observe as you go) problems.
We have extended applications from the usual domain of a finite number
of discrete alternatives to problems on graphs (learning about
networks), to learning about large numbers of subsets, learning
multidimensional continuous surfaces and, most recently, learning
general, nonconvex functions. I will discuss different types of
applications within homeland security.
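For the simplest setting, independent normal beliefs with known measurement noise, the knowledge gradient of measuring an alternative has the standard closed form sigma_tilde * (zeta * Phi(zeta) + phi(zeta)). A minimal sketch of that computation (variable names are ours, and this is not code from the talk):

```python
import math

def kg_factor(z):
    """f(z) = z * Phi(z) + phi(z) for the standard normal."""
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z * cdf + pdf

def knowledge_gradient(means, variances, noise):
    """KG value of measuring each alternative under independent normal
    beliefs with known measurement noise variance."""
    values = []
    for x, (mu, var) in enumerate(zip(means, variances)):
        sigma_tilde = var / math.sqrt(var + noise)   # std-dev reduction
        best_other = max(m for y, m in enumerate(means) if y != x)
        zeta = -abs(mu - best_other) / sigma_tilde
        values.append(sigma_tilde * kg_factor(zeta))
    return values
```

Measuring the argmax of these values is the KG policy; with equal means it prefers the most uncertain alternative, which is the exploration-exploitation trade-off made explicit.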
Andrew Rodriguez, Rutgers
Talk Title: Graph evolution over time
The purpose of this project is to study how
graphs evolve over time. Understanding the normal
parameters of computer and communication networks, for example, allows
for pattern and anomaly detection in such environments. If activity is
detected that produces network characteristics or structures that
deviate significantly from the norm, it can be flagged as an
abnormality, potentially helping with the detection of fraud, spam,
and denial of service attacks, for example. The problem of anomaly
detection has been approached from various angles, including
artificial intelligence, machine learning, and state machine modeling.
This work begins by using change-point detection for anomaly
detection.
This work is done in collaboration with Dr. Tin
Kam Ho, Dr. Jin Cao,
Dr. Harald Steck, and Dr. Ayou Chen of Bell Labs.
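The talk does not fix a particular detector; as one standard, minimal sketch of change-point detection, a one-sided CUSUM statistic flags the first time the cumulative upward deviation of a monitored network statistic exceeds a threshold (all parameters illustrative):

```python
def cusum_alarm(series, target_mean, drift, threshold):
    """One-sided CUSUM: return the index of the first alarm, else None.

    drift is the per-step allowance subtracted before accumulating, so
    small fluctuations around target_mean never raise an alarm.
    """
    s = 0.0
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target_mean - drift))
        if s > threshold:
            return i
    return None

# A level shift at index 10 is flagged as soon as evidence accumulates.
traffic = [0.0] * 10 + [5.0] * 5
```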
Dan Roth, UIUC
Talk Title: Making Sense of Unstructured Data
Recent studies have shown that over 85% of the
information organizations deal with is unstructured - the vast majority
of which is text in different forms. A multitude of techniques has to
be used in order to enable intelligent access to this information and
to support transforming it to forms that allow sensible use of the
information. The fundamental issue that all these techniques have to
address is that of semantics - there is a need to move toward
understanding the text at an appropriate level, beyond the word level,
in order to support access, knowledge extraction and synthesis.
I will discuss some of our research in these
directions, focusing on tools and products rather than techniques. The
talk will address several dimensions of text understanding that can
facilitate access to information and extraction of knowledge from
unstructured text, transforming it to forms that are useful to
different users in different settings, and integrating it along
multiple dimensions and with existing institutional resources.
Gilna Samuel, Morgan State
Poster Title: A Data Mining Approach to Predicting
Optimal Regimens
in Cancer Treatments
Cancer is one of the leading causes of death. Many
new drugs are
currently being manufactured and tested in clinical trials to combat
this disease. A database collected by a group of researchers at
Massachusetts Institute of Technology (MIT) stores significant
information, taken from journal articles on clinical trials for
cancer. The database is updated weekly and more data is added
continually to further development and enhancement of this
database. The aim of this project is to build models that predict
survival rates of patients undergoing a variety of cancer
treatments. This was accomplished by analyzing the information
gathered in this database by applying data mining methods involving
regression models and visual algorithms. Initial results from the
various data mining methods will be presented for breast cancer,
non-small-cell lung cancer, and gastric cancer. In each case, the most
effective regimens were those with higher overall survival rates and
lower toxicity rates. At this stage of the research, more
emphasis was placed on finding optimal algorithms to compare the
results from the database. The results were limited since the database
used has a relatively small number of clinical trials. A larger
database would give more accurate results and the clinical trials
could be grouped by cancer stage, in addition to cancer type. Thus,
the combination of drugs would be both stage and type specific and
could be used in future clinical trials to create more effective
cancer regimens.
Warren Scott, Princeton
Poster Title: An Optimal Learning Approach to Finding an Outbreak of
a Disease
We describe an optimal learning policy to
sequentially decide on
locations of a city to test for an outbreak of a particular disease.
We use Gaussian process regression to model the level of the disease
throughout the city, and then use the correlated knowledge gradient,
which implicitly uses exploration versus exploitation concepts, to
choose where to test next. The correlated knowledge gradient policy
is a general framework that can be used to find the maximum of an
expensive function with noisy observations.
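The correlated knowledge gradient itself is involved; the Gaussian process regression piece, modeling the disease level as a smooth function of location, is easy to sketch. The RBF kernel and parameters below are assumptions for illustration, not the poster's actual model:

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential (RBF) kernel between 1-D location arrays."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of a GP with RBF kernel and observation noise."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    return rbf(x_test, x_train) @ np.linalg.solve(K, y_train)
```

The posterior covariance (omitted here) is what the correlated knowledge gradient consults when deciding where in the city to test next.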
Yizhou Sun, UIUC
Talk Title: iTopicModel: Information Network-Integrated Topic
Modeling
Project Scope: Document networks, i.e., networks
associated with text information, are becoming increasingly popular due
to the ubiquity of Web documents, blogs, online social networks and
various kinds of online data. Topic analysis that considers both text
information and link information for such networks will not only
improve the quality of topic models, but also help us understand the
individual objects in the network more clearly.
Recent Progress: In this paper, we propose a novel
topic modeling framework for document networks, which builds a unified
generative topic model that is able to consider both text and structure
information for documents. A graphical model is proposed to describe
the generative model. On the top layer, we define a novel multivariate
Markov Random Field for topic distribution random variables for each
document, to model the dependency relationships among documents over
the network structure. On the bottom layer, we follow the traditional
topic model to model the generation of text for each document. A joint
distribution function for both the text and structure of the documents
is thus provided. A solution to estimate this topic model is given.
Some important practical issues in real applications are also
discussed, including how to decide the topic number. We apply the model
on two real datasets, DBLP and Cora, and the experiments show that this
model is more effective in comparison with the state-of-the-art topic
modeling algorithms.
Relevance to Research Area: All sorts of online
networks are ubiquitous now, due to the rapid development of the
Internet. It is impossible to understand such huge networks merely via
human labor. iTopicModel can easily detect the major topics in these
networks and associate each individual object to these topics. Even for
those objects with no text information but only structural information,
their topics can be easily predicted. This study will be extremely
useful for risk analysis on online social networks.
Future Plans: In the future work, we will study how to
combine different networks to further improve the quality of topic
models and to better understand individual objects in these networks.
We will report more technical details on this progress if the abstract
is selected for poster in the coming DHS conference.
Sam Tannouri and Ahlam Tannouri, Morgan State
Talk Title: Convoluted graph visualization using splines to render
edges
Visual analytics technologies combined with computational tools and
analytic reasoning form a platform for detecting, analyzing, and
responding to threats in the cyber and physical worlds. They handle a
huge amount of data, which is a very difficult task, especially when
one is concerned with extracting critical information. The complex and
convoluted nature of these data sets makes them very difficult to
illustrate clearly in a visually aesthetic form. To alleviate the
congested visual representation of such data sets, we propose the use
of convoluted graphs that render their edges using splines.
Configurations of these splines are studied and interactively
illustrated.
Stephen Tratz, USC/ISI
Talk Title: Noun-Noun Compounds for Text Analysis
The automatic interpretation of noun-noun compounds
is an important
subproblem within many natural language processing applications and is
an area of increasing interest. The problem is difficult, with
disagreement regarding the number and nature of the relations, low
inter-annotator agreement, and limited annotated data. We present a
novel taxonomy of relations that integrates previous relations, a
large dataset annotated according to these relations, and a supervised
classification method for automatic noun compound interpretation.
Yuancheng Tu, UIUC
Talk Title: Aspect Guided Text Categorization with Unobserved Labels
Project Scope: Categorizing text into one of multiple
possible categories is an archetypical multi-class classification
problem with many important applications in knowledge management and
information access, ranging from email classification to author
identification to sentiment analysis. Studies in this area develop a
principled way to analyze massive amounts of information from texts or
other sources and quickly and reliably determine whether they hold a
desired property, such as being written by a known entity or
representing a suspicious event. As such, this work has broad
applications to DHS and national security. In this project, we propose
a novel method for text categorization with a large label space that,
most importantly, can handle the case in which some of the labels were
not observed in the training data. Our method exhibits great potential
to correctly predict unobserved labels, which traditional multiclass
classification methods cannot handle at all.
Recent Progress: We have developed a novel multi-class
classification method that exploits the structure and meaning of the
label space and can thereby classify even with respect to labels that
were not previously observed in the training data. The key insight is the
introduction of intermediate aspect variables that encode properties of
the labels. Aspect variables serve as a joint representation for
observed and unobserved labels. This way the classification problem can
be viewed as a structure learning problem with natural constraints on
assignments to the aspect variables. We solve the problem as a
constrained optimization problem over multiple learners and show
significant improvement in classifying short sentences into a large
label space of categories, including previously unobserved categories.
Future Plans: One perspective for further development
of this work is to find more constraints in the inference procedure and
to experiment with incorporating the constraint based inference within
or after the learning. Another way to enhance the system is to develop
an interactive protocol to induce prediction feedback as a way to
quickly adapt to new labels.
Relevance to listed research areas: Automatic text
classification methods are an essential component in developing the
ability to interpret in a meaningful way properties of massive amounts
of information as a way to support decision making with respect to it.
Abdul-Aziz Yakubu, Howard University
Talk Title: Discrete-Time Epidemic Models
We use a periodically forced SIS epidemic model with disease-induced
mortality to study the combined effects of seasonal trends and death
on the extinction and persistence of discretely reproducing
populations. We introduce the epidemic threshold parameter, R0, for
predicting disease dynamics in periodic environments. Typically, R0 < 1
implies disease extinction. However, in the presence of disease-induced
mortality, we show that a tiny infective population can drive an
otherwise persistent population with R0 > 1 to extinction. Furthermore,
we obtain conditions for the persistence of the total population. In
addition, we use the Beverton-Holt recruitment function to show that
the infective population exhibits a period-doubling bifurcation route
to chaos while the disease-free susceptible population lives on a
2-cycle (non-chaotic) attractor.
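A toy one-step update in this spirit, with Beverton-Holt recruitment feeding the susceptible class; the functional forms and parameter values below are illustrative only, not the model analyzed in the talk:

```python
def sis_step(S, I, beta=0.1, gamma=0.3, mu=0.1, d=0.2, r=2.0, K=100.0):
    """One step of a toy discrete-time SIS model.

    beta: transmission rate, gamma: recovery fraction, mu: natural death,
    d: disease-induced death, r/K: Beverton-Holt recruitment parameters.
    """
    N = S + I
    recruits = r * N / (1 + N / K) if N > 0 else 0.0      # Beverton-Holt
    new_infections = beta * S * I / N if N > 0 else 0.0
    S_next = recruits + (1 - mu) * (S - new_infections + gamma * I)
    I_next = (1 - mu - d) * (new_infections + (1 - gamma) * I)
    return S_next, I_next
```

With these sub-threshold defaults the infection dies out while the total population persists; the talk's results concern the richer periodically forced regimes, including the period-doubling route to chaos.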
Guoping Zhang, Morgan State University
Talk Title: Radon transform and its applications
One of the major inventions of the last century is the CT scanner
(computerized tomography). Cormack and Hounsfield received the 1979
Nobel Prize in Medicine for their work on computed axial tomography.
The CT scanner can be used to reconstruct X-ray absorption in the
interior of structures, such as patients or machine parts with
possible internal fractures.
What is common to the development of all types of scanners is that
they have, to some extent, been based on the Radon transform. The
crucial idea for image reconstruction is that the Radon transform
provides a natural link between the image function of the object and
the measurements of a machine such as a CT scanner or MRI.
In this talk, I will give a brief introduction to the Radon transform,
the generalized Radon transform, and their relations with microlocal
analysis. I also want to discuss its potential applications to DHS
research. A main goal of my talk is to seek potential future
collaborators among CCICADA members through our discussion during the
retreat.
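Numerically, the Radon transform just integrates (sums) the image along straight lines. A minimal discrete illustration using only the 0-, 45-, and 90-degree families of lines (a toy of our own, not from the talk):

```python
import numpy as np

def axis_projections(image):
    """0- and 90-degree projections: sums along rows and along columns."""
    return image.sum(axis=1), image.sum(axis=0)

def diagonal_projection(image):
    """45-degree projection: sums along the diagonals."""
    rows, cols = image.shape
    return np.array([np.trace(image, offset=k)
                     for k in range(-rows + 1, cols)])

img = np.array([[1.0, 2.0],
                [3.0, 4.0]])
```

Every projection sums to the same total mass; recovering the image from many such projections (e.g., by filtered backprojection) is the inversion problem a CT scanner solves.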
Mianwei Zhou, UIUC
Talk Title: Data-oriented Content Query System: Searching
for Data into Text on the Web
With the ever growing richness of the Web, people
nowadays are no longer satisfied with finding interesting documents to
read. Instead, we are becoming increasingly interested in the various
fine granularity information units, e.g., movie release date, book
price, which appear within the content of Web documents. We are
witnessing several emerging Web-based search applications towards
exploiting such rich data on the Web, such as:
1. Web-based Information Extraction (WIE). With the
richness and redundancy of the Web, WIE tries to rely on simple phrase
patterns (e.g. "X is the capital of Y") to harvest numerous facts
online.
2. Typed-Entity Search (TES). Several efforts were
tried to search entities inside the Web pages (e.g. searching the phone
number of Amazon's customer service). Such techniques often rely on
extracting data types of interest and then matching the extracted
information based on proximity patterns.
With so many ad hoc efforts exploiting Web content, there is a
pressing need to distill their essential capabilities; thus we propose
the concept of the Data-oriented Content Query System (DoCQS).
DoCQS aims at generally supporting "content querying"
for finding data over the Web. In DoCQS, we utilize the relational
model for modeling Web data, and propose the corresponding SQL-style
language CQL for content querying. Based on DoCQS, Web-based
applications can easily access the Web data as if writing SQL for
querying a database, avoiding a lot of repetitious work. For efficient
processing, we design novel index structures and query processing
algorithms. We evaluate our proposal over two concrete domains of
realistic Web corpora, demonstrating that our query language is rather
flexible and expressive, and our query processing is efficient with
reasonable index overhead.
Document last modified on March 5, 2010.