Special Event: Command, Control, and Interoperability Center for Advanced Data Analysis (CCICADA)-wide Research Retreat

March 7-8, 2010
Morgan State University, Baltimore, MD

Organizers:
Ed Hovy, CCICADA/USC, hovy at isi.edu
Jack Jarmon, CCICADA/DIMACS, jjarmon at dimacs.rutgers.edu
Asamoah Nkwanta, Morgan State University, asamoah.nkwanta at morgan.edu
Bill Pottenger, CCICADA/Rutgers, billp at dimacs.rutgers.edu
Fred Roberts, CCICADA/DIMACS, froberts at dimacs.rutgers.edu
Dan Roth, CCICADA/University of Illinois at Urbana-Champaign, danr at uiuc.edu
Guoping Zhang, Morgan State University, guoping.zhang at morgan.edu
Presented under the auspices of the Homeland Security Center for Command, Control, and Interoperability Center for Advanced Data Analysis (CCICADA).

Abstracts:

Earl R. Barnes, Morgan State

Talk Title: A Graph Partitioning Problem for Disease Control

We consider a graph in which nodes represent individuals and edges correspond to pairs of individuals in frequent contact with each other. Assume that a certain number of individuals become infected with a communicable disease. Our problem is to find the least number of edges that must be cut to isolate the infected individuals from a certain percentage of the population. This is a graph partitioning problem with constraints. We obtain bounds on the number of edges that must be cut; the bounds depend on the eigenvalues of the adjacency matrix of the graph.
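The connection between eigenvectors and small cuts can be illustrated with a standard spectral partition. The sketch below is not the bound from the talk; it is a minimal, assumed example using the Fiedler vector of the graph Laplacian on a made-up contact network.

```python
import numpy as np

# Toy contact network: two loosely connected triangles.
# An edge means two individuals are in frequent contact.
edges = [(0, 1), (1, 2), (0, 2),      # triangle 1
         (3, 4), (4, 5), (3, 5),      # triangle 2
         (2, 3)]                      # single bridge edge
n = 6

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Graph Laplacian L = D - A; its spectrum controls cut sizes.
L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)

# Partition by the sign of the Fiedler vector (2nd-smallest eigenvalue).
fiedler = eigvecs[:, 1]
side = fiedler >= 0

cut = sum(1 for i, j in edges if side[i] != side[j])
print("cut edges:", cut)  # the bridge edge alone separates the triangles
```

Here the spectral split recovers the single bridge edge, the smallest cut isolating one triangle from the other.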


Smriti Bhagat, Rutgers

Talk Title: Hone: Automatically Watching Across Information Networks

An important security task is to watch a set of candidate personas online. This is challenging because (a) one often has only a small number of identifiers (email addresses, telephone numbers, etc.) for a persona and must discover the rest; (b) one often knows identifiers in only a few of the information networks (say, email), while personas can reside in many others (chatrooms, social networks, and so on); and, most importantly, (c) personas are dynamic: the people behind them can quickly drop old personas or adopt new ones in any of the myriad information networks. Some of these networks, e.g., blogs and Twitter, leave a publicly crawlable trail of communication. Others, such as email, Facebook, and VoIP, are more private. As a result, systems that watch personas not only have to identify multiple identities across networks, a known challenge in intelligence data analysis, but also have to adapt quickly, an aspect that has received limited attention in research and that requires working beyond the limitations of information available on the web.

We present our system, Hone. It adopts the unique approach of simultaneously monitoring real-time packet-level IP traffic and performing targeted analysis of application-level data. It combines real-time streaming analysis of packet-level data with traditional crawl and information-retrieval analysis of the web. The result is a system that starts with a rudimentary watch list, successively hones in on the multiple personas of the watch-list entities as new associations are discovered, and tracks them as they change. This presents an analyst with a complete view of a suspect's online communications and enables tracking the suspect in an automatic and dynamic way.

Joint work with S. Muthukrishnan and Narus Inc.


Congxing Cai, USC/ISI

Talk Title: Integration and summarization of multiple media

This talk describes our work on the integration and summarization of multiple media for strategic analysis. We demonstrate how to link and present different media for analysis. When the amount of associated textual data is large, it is organized and summarized before display. A hierarchical summarization framework, conditioned on the small space available for display, has been fully implemented.


Eugene Fink, CMU

Talk Title: Machine learning methods for cybersecurity

We are working on machine learning tools for improving cyber defenses through automated discovery of unexpected patterns in system behavior, network traffic, and other attack indicators. These tools will support proactive analysis and early detection of attacks; automated adaptation of defenses to the needs and usage patterns of individual users; and sharing related knowledge and experience among multiple users. They will complement the standard defenses by adding a "layer of armor" that detects novel threats.


Cibin George, Rutgers

Poster Title: Higher order feature associations for classification

Higher Order Naive Bayes (HONB, Ganiz et al., 2009) exploits higher order associations between features for classification purposes. We first confirm the utility of HONB at low sample sizes using the Wilcoxon signed rank test. We then present a new class of graph sampling algorithms that exploit higher order associations. We empirically demonstrate that second order path counts in document relation graphs can be successfully leveraged to reduce the sample size without significantly impacting classification performance.


Emilie Hogan, Rutgers

Poster Title: View Discovery in OLAP Databases Through Statistical Combinatorial Optimization

In many projects, so much data is being collected that it quickly becomes unmanageable. When the data being collected is multidimensional, an Online Analytical Processing (OLAP) database can be used for storage and analysis. The capability of OLAP database software systems to handle data complexity comes at a high price for analysts, presenting them with a combinatorially vast space of views of a relational database. To get the most out of the data we have, there needs to be a way to reveal areas of the data that analysts might otherwise never find.

We responded to the need to deploy technologies that allow users to guide themselves to areas of local structure. We did this by casting the space of "views" of an OLAP database as a combinatorial object of all projections and subsets (i.e., a lattice), and "view discovery" as a search process over that lattice. We equipped the view lattice with statistical information-theoretic measures sufficient to support a combinatorial optimization process. We outlined "hop-chaining" as a particular view discovery algorithm over this object, wherein users are guided across a permutation of the dimensions by searching for successive two-dimensional views, pushing seen dimensions into an increasingly large background filter in a "spiraling" search process. For testing purposes, we have applied our algorithm to the database of summary statistics for radiation portal monitors at US ports.


Cindy Hui, RPI

Talk Title: Simulating the Diffusion of Warnings in Large Dynamic Networks

Warning systems play an important role in informing the at-risk population of potential dangers during hazardous events. These systems are also used to provide information on protective measures to promote safety in the community. In addition to a technologically reliable system, it is important to understand and make use of the social communication network in communities to spread warnings to a larger audience and to help ensure that the people at risk will act on the information they receive. This project involves formulating an axiomatic framework for modeling the diffusion of warnings in dynamic social networks through the concept of trust. The network is dynamic in that individuals may leave the network and disrupt the flow of information as warnings are being diffused.

We assess the framework by modeling the 2007 San Diego Firestorms, in particular the diffusion of the Reverse911 evacuation warnings sent during the event. We generate a hypothetical social network of San Diego County with one million household nodes. We configure the parameters and map the process using multiple data sources relevant to the event. We use the model to examine how social group structure, distribution of trust, and existence of weak ties affect the spread of evacuation warnings.

Advisors: William A. Wallace, Malik Magdon-Ismail, and Mark Goldberg


Ming Ji, University of Illinois at Urbana-Champaign

Talk Title: Mining Hidden Communities in Heterogeneous Information Networks

Information networks, composed of large numbers of objects linking to each other, are ubiquitous in real life. Common examples include telephone account networks linked by calls, co-author networks and paper citation networks extracted from bibliographic data, webpage networks interconnected by hyperlinks in the World Wide Web, etc. Discovering hidden communities of special interest in information networks with the help of prior knowledge about some of the objects has recently attracted substantial interest. Current work on hidden community discovery mainly focuses on homogeneous information networks, i.e., networks composed of a single type of object, as mentioned above. But in real life, it is more natural to mine hidden communities in heterogeneous information networks composed of multiple types of objects. For instance, the blogosphere can be viewed as a heterogeneous information network composed of blogs, users, and terms. Given prior knowledge about a few terms and users, we can precisely detect online communities with special interests. In fact, applications like terrorist email detection, fraud detection, and research community discovery can all be cast as community mining problems on heterogeneous information networks.

In this work, we mine hidden communities with special interests on heterogeneous information networks directly, which has hardly been explored so far. Given prior knowledge of the community membership of some of the objects, we solve the problem by predicting communities for all types of the remaining objects. A novel graph-based regularization framework is proposed to model the link structure in heterogeneous information networks with arbitrary network schemas and numbers of object/link types. Specifically, we explicitly differentiate the multi-typed link information by incorporating it into different relation graphs. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves prediction accuracy over existing state-of-the-art methods. This work has been submitted to KDD 2010.

Joint work with Yizhou Sun, Marina Danilevsky, Jing Gao and Jiawei Han, University of Illinois at Urbana-Champaign
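The core idea of graph-based regularization can be sketched on a single homogeneous graph (the framework above extends this to multiple relation graphs). Assuming the standard objective min_f f'Lf + mu*||f - y||^2, the closed-form solution propagates the known labels smoothly over the links; the toy network and seed labels below are invented for illustration.

```python
import numpy as np

# Minimal sketch of graph-based regularization on a toy homogeneous
# network (the abstract's framework handles heterogeneous, multi-typed
# links). Objective: min_f  f^T L f + mu * ||f - y||^2, closed form:
#   f* = mu * (L + mu*I)^{-1} y
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]   # a 6-node chain
n, mu = 6, 1.0

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Prior knowledge: node 0 is in community "+", node 5 in community "-".
y = np.zeros(n)
y[0], y[5] = 1.0, -1.0

f = mu * np.linalg.solve(L + mu * np.eye(n), y)
labels = np.where(f >= 0, "+", "-")
print(labels)  # smooth scores assign the left half to +, the right to -
```

The regularizer trades off fitting the seed labels against smoothness along edges, so unlabeled nodes inherit the community of their nearest seed.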


Darja Krusevskaja, Rutgers

Poster Title: Inferring Multi-Relationships

We consider entities that are connected to each other in information networks. What characterizes the relationship between entities? Common analysis techniques address the presence or absence of a link between pairs of entities, or its strength. However, the nature of a link is more versatile in information networks, in particular those of interest to the security and intelligence communities.

A link can sometimes be characterized by several types of relationships between entities; e.g., two persons can be friends, co-workers, and classmates at the same time. We may also be interested in the strengths of the different types of relationships. In networks where multiple types of connections are possible, we might be interested in the top few relationship types that best describe the relation in terms of their strength.

We propose two sets of algorithms for the problem of predicting relationship types and strengths from the graph structure. One is derived from similar algorithms used for inferring labels for nodes, rather than edges, in a graph; this is accomplished using the dual graph, so that the edge inference problem becomes a node inference problem. The other algorithm is based on the intuition that the strength of an edge should reflect the stationary probability of a suitable random walk from one endpoint of that edge to the other and vice versa. Both algorithms have been implemented, and preliminary tests have been done with the Newman co-authorship dataset from 2003. Our experimental results show good correlation between true relationship types and our predictions.
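The dual-graph trick can be made concrete with a small sketch: edges of the original graph become nodes of the line graph, so any node-labeling heuristic applies to edges. The people, labels, and single propagation step below are invented for illustration and are much simpler than the algorithms above.

```python
from collections import defaultdict
from itertools import combinations

# Sketch of the dual-graph idea: each edge of the original graph becomes
# a node of the line graph; two such nodes are adjacent iff the edges
# share an endpoint. Edge-label inference then becomes node-label
# inference. All data below is made up.
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]
known = {("ann", "bob"): "friend", ("bob", "cat"): "coworker",
         ("ann", "cat"): "coworker"}

incident = defaultdict(list)
for e in edges:
    for v in e:
        incident[v].append(e)
line_graph = defaultdict(set)
for v, es in incident.items():
    for e1, e2 in combinations(es, 2):
        line_graph[e1].add(e2)
        line_graph[e2].add(e1)

# One round of majority-vote propagation to the unlabeled edge.
for e in edges:
    if e not in known:
        votes = [known[nb] for nb in line_graph[e] if nb in known]
        guess = max(set(votes), key=votes.count)
        print(e, "->", guess)
```

Here the unlabeled ("cat", "dan") edge borrows the majority label of the edges it touches in the line graph.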

Joint Work with CCICADA PI S. Muthukrishnan.


Yue Lu, University of Illinois at Urbana-Champaign

Talk Title: Statistical Topic Models for Large-Scale Opinion Integration

Project Scope: In homeland security applications, there is often a need to gather and integrate scattered opinions about an entity such as a person, an organization, or a policy. Thanks to Web 2.0 technology, which has enabled more and more people to freely express their opinions, the Web has become an extremely valuable source for analyzing views and opinions. However, with current technologies, it is still difficult for people to integrate and digest all opinions relevant to a specific topic. In this work, we study how to automatically generate a structured summary for any given topic by integrating opinions about it from different kinds of resources, such as well-written news articles, database tables, and opinions scattered across blogs and forums; the goal is to help people digest and exploit a large number of scattered opinions in a general way.

Recent Progress: We have studied how to automatically integrate opinions expressed in a well-written article with the many opinions scattered across various sources such as blogs and forums. We proposed semi-supervised topic models to solve the problem in a principled way. The models can be used to integrate a well-written review with opinions in an arbitrary text collection about any topic, potentially supporting many interesting applications in multiple domains. We have already obtained interesting opinion integration results on all US presidents and on Hurricane Katrina.

Future Plans: We will address a more general setup of the problem: integrating opinions in an arbitrary text collection with a set of well-written articles instead of a single one. We also plan to investigate many other resources, such as existing databases about people or events. Ultimately, the effectiveness of our research results will be demonstrated by developing a toolkit to facilitate broad applications of opinion integration.

Relevance to listed research areas: Our research is in the area of "Advanced Data Analysis and Visualization" and it can potentially support many interesting applications, such as in the areas of "Social and Behavioral Sciences" and "Human Factors".


Carlos T. Murray, Morgan State

Talk Title: A Practical Application of Oracle Apex Software

In this presentation, I will discuss how Oracle Apex software is used to make the management of financial transactions easier for any business with an automated checking system. The topics covered in this application include database structures and Structured Query Language, which requires a background in data structures, as well as problem solving using C++ and Perl programming.


Helene Nguewou, Morgan State University

Poster Title: A Perl Implementation of a Contact-waiting Time Metric for HIV-RNA Folding Prediction: Are There National Security Implications?

This work focuses on creating a user-friendly prediction tool for RNA folding kinetics in the Perl language. The contact-waiting time (CWT) metric is applied to certain HIV sequences in order to calculate their folding rates. The CWT metric will be converted from MATLAB to Perl and will be tested on sequences in HIV databases. The purpose of creating the Perl implementation is to make the CWT more widely available and easy to use as a bioinformatics tool. In addition, are there national security implications as a result of HIV sequence prediction? At least 39 million people now infected with the virus are expected to die in the next 5-10 years. This depletion of elite workers and professionals constitutes a threat to homeland security, as affected regions will be at greater risk of civil disturbance, conflict, and disorder. The disparity of access to retroviral drugs increases the widening life-expectancy gap between poor countries and Western countries. As a result, there is increasing concern that nations heavily affected by HIV might engage in bioterrorist acts against the United States. The lack of an effective and affordable vaccine against the virus makes this threat even more conceivable. Therefore, HIV research efforts are of high importance.

Joint work with Asamoah Nkwanta.


Bill Pottenger, Rutgers

Talk Title: Higher Order Learning

In order to recognize and capture inherent semantics in data we have developed an approach to feature space transformations termed Higher Order Learning (HOL) that has proven effective in both discriminative and generative settings (Ganiz, 2008; Ganiz et al., 2006; Ganiz, Pottenger and George, 2010; Li et al., 2007; Li et al., 2005; Ganiz, Lytkin and Pottenger, 2009; Menon and Pottenger, 2009). We have successfully leveraged the power of data representations based on HOL in a variety of problem domains. HOL-based methods significantly outperform the traditional approaches in text classification (Ganiz, 2008; Ganiz, Pottenger and George, 2010; Ganiz, Lytkin and Pottenger, 2009) and in association rule mining (Li et al., 2007; Li et al., 2005). HOL-based models have also proven effective in network event and anomaly detection (Ganiz, 2008; Ganiz et al., 2006; Menon and Pottenger, 2009) on time series data from the Border Gateway Protocol, the backbone protocol in the Internet routing infrastructure. In more recent work, HOL was successfully applied to threat detection in streaming data in a defense setting (Pottenger, 2009).


Warren B. Powell, Princeton

Talk Title: Optimal Learning for Homeland Security

Optimal learning addresses the challenge of collecting information quickly when observations are time consuming and expensive. While optimal policies for collecting information are computationally intractable, the knowledge gradient policy has proven to be particularly powerful. We have adapted this concept to both offline (guiding laboratory research) and online (observe as you go) problems. We have extended applications from the usual domain of a finite number of discrete alternatives to problems on graphs (learning about networks), to learning about large numbers of subsets, learning multidimensional continuous surfaces and, most recently, learning general, nonconvex functions. I will discuss different types of applications within homeland security.


Andrew Rodriguez, Rutgers

Talk Title: Graph evolution over time

The purpose of this project is to study how graphs evolve over time. Understanding the normal parameters of computer and communication networks, for example, allows for pattern and anomaly detection in such environments. If activity is detected that produces network characteristics or structures deviating significantly from the norm, it can be flagged as an abnormality, potentially helping with the detection of fraud, spam, and denial-of-service attacks. The problem of anomaly detection has been approached from various angles, including artificial intelligence, machine learning, and state machine modeling. This work begins by using change-point detection for anomaly detection.
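A minimal change-point detector conveys the flavor of this approach. The CUSUM sketch below is one standard instance, not the method of the project; the traffic numbers and thresholds are illustrative.

```python
# A minimal one-sided CUSUM change-point detector, a simple instance of
# the change-point methods mentioned above (parameters are illustrative).
def cusum(series, target, drift, threshold):
    """Return the first index where the positive CUSUM statistic
    exceeds `threshold`, or None if no change is detected."""
    s = 0.0
    for t, x in enumerate(series):
        s = max(0.0, s + (x - target - drift))
        if s > threshold:
            return t
    return None

# Packet counts per minute: stable around 10, then a sustained jump
# (e.g. the onset of a denial-of-service flood).
traffic = [10, 11, 9, 10, 10, 11, 30, 32, 31, 33]
alarm = cusum(traffic, target=10.0, drift=1.0, threshold=25.0)
print("change detected at index:", alarm)  # index 7, just after the jump
```

The statistic accumulates only sustained positive deviations, so isolated fluctuations around the baseline never trigger the alarm.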

This work is done in collaboration with Dr. Tin Kam Ho, Dr. Jin Cao, Dr. Harald Steck, and Dr. Ayou Chen of Bell Labs.


Dan Roth, UIUC

Talk Title: Making Sense of Unstructured Data

Recent studies have shown that over 85% of the information organizations deal with is unstructured - the vast majority of which is text in different forms. A multitude of techniques has to be used in order to enable intelligent access to this information and to support transforming it to forms that allow sensible use of the information. The fundamental issue that all these techniques have to address is that of semantics - there is a need to move toward understanding the text at an appropriate level, beyond the word level, in order to support access, knowledge extraction and synthesis.

I will discuss some of our research in these directions, focusing on tools and products rather than techniques. The talk will address several dimensions of text understanding that can facilitate access to information and extraction of knowledge from unstructured text, transforming it to forms that are useful to different users in different settings, and integrating it along multiple dimensions and with existing institutional resources.


Gilna Samuel, Morgan State

Poster Title: A Data Mining Approach to Predicting Optimal Regimens in Cancer Treatments

Cancer is one of the leading causes of death. Many new drugs are currently being manufactured and tested in clinical trials to combat this disease. A database collected by a group of researchers at the Massachusetts Institute of Technology (MIT) stores significant information taken from journal articles on clinical trials for cancer. The database is updated weekly, and more data is added continually to further its development and enhancement. The aim of this project is to build models that predict survival rates of patients undergoing a variety of cancer treatments. This was accomplished by analyzing the information gathered in the database using data mining methods involving regression models and visual algorithms. Initial results from the various data mining methods will be presented for breast cancer, non-small-cell lung cancer, and gastric cancer. In each case, the identified drugs had the highest overall survival rates and lowest toxicity rates and were thus the most effective. At this stage of the research, more emphasis was placed on finding optimal algorithms to compare the results from the database. The results were limited since the database used has a relatively small number of clinical trials. A larger database would give more accurate results, and the clinical trials could be grouped by cancer stage in addition to cancer type. Thus, the combination of drugs would be both stage- and type-specific and could be used in future clinical trials to create more effective cancer regimens.


Warren Scott, Princeton

Poster Title: An Optimal Learning Approach to Finding an Outbreak of a Disease

We describe an optimal learning policy to sequentially decide on locations of a city to test for an outbreak of a particular disease. We use Gaussian process regression to model the level of the disease throughout the city, and then use the correlated knowledge gradient, which implicitly uses exploration versus exploitation concepts, to choose where to test next. The correlated knowledge gradient policy is a general framework that can be used to find the maximum of an expensive function with noisy observations.
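For the simpler case of independent normal beliefs, the knowledge gradient has a well-known closed form, nu_KG(x) = s_tilde(x) * (z*Phi(z) + phi(z)); the correlated variant used in this poster extends it with a covariance model. The sketch below implements only the independent case, with invented means and variances.

```python
import math

# Sketch of the knowledge-gradient policy for independent normal
# beliefs (the poster uses the correlated extension; this is the basic
# uncorrelated case). nu_KG(x) = s_tilde(x) * f(zeta), with
# f(z) = z * Phi(z) + phi(z).

def phi(z):            # standard normal pdf
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):            # standard normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def kg_values(mu, sigma2, noise2):
    vals = []
    for x in range(len(mu)):
        best_other = max(m for i, m in enumerate(mu) if i != x)
        # Predictive change in the posterior mean after one measurement.
        s_tilde = sigma2[x] / math.sqrt(sigma2[x] + noise2)
        zeta = -abs(mu[x] - best_other) / s_tilde
        vals.append(s_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return vals

# Three candidate test locations: similar estimated disease levels, but
# location 2 is least explored, so measuring it is most informative.
mu = [1.0, 1.1, 1.05]
sigma2 = [0.1, 0.1, 2.0]
vals = kg_values(mu, sigma2, noise2=0.5)
print("measure alternative", vals.index(max(vals)))
```

The policy balances exploration and exploitation automatically: high posterior variance raises the learning value even when the estimated mean is not the current best.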


Yizhou Sun, UIUC

Talk Title: iTopicModel: Information Network-Integrated Topic Modeling

Project Scope: Document networks, i.e., networks associated with text information, are becoming increasingly popular due to the ubiquity of Web documents, blogs, online social networks, and various kinds of online data. Topic analysis that considers both text information and link information in such networks will not only improve the quality of topic models but also help us understand the individual objects in the network more clearly.

Recent Progress: In this paper, we propose a novel topic modeling framework for document networks, which builds a unified generative topic model able to consider both text and structure information for documents. A graphical model is proposed to describe the generative process. On the top layer, we define a novel multivariate Markov Random Field over the topic distribution random variables of each document, to model the dependency relationships among documents over the network structure. On the bottom layer, we follow the traditional topic model to model the generation of text for each document. A joint distribution function for both the text and structure of the documents is thus provided, and a method to estimate this topic model is given. Some important practical issues in real applications are also discussed, including how to decide the number of topics. We apply the model to two real datasets, DBLP and Cora, and the experiments show that this model is more effective than state-of-the-art topic modeling algorithms.

Relevance to Research Area: All sorts of online networks are now ubiquitous due to the rapid development of the Internet. It is impossible to understand such huge networks merely through human labor. iTopicModel can easily detect the major topics in these networks and associate each individual object with these topics. Even for objects with no text information but only structural information, topics can be easily predicted. This study will be extremely useful for risk analysis on online social networks.

Future Plans: In future work, we will study how to combine different networks to further improve the quality of topic models and to better understand individual objects in these networks. We will report more technical details on this progress if the abstract is selected for a poster at the coming DHS conference.


Sam Tannouri and Ahlam Tannouri, Morgan State

Talk Title: Convoluted graph visualization using splines to render edges

Visual analytics technologies combined with computational tools and analytic reasoning form a platform for detecting, analyzing, and responding to threats in the cyber and physical worlds. They handle a huge amount of data, which is a very difficult task, especially when one is concerned with extracting critical information. The complex and convoluted nature of these data sets makes them very difficult to illustrate clearly in a visually aesthetic form. To alleviate the congested visual representation of these data sets, we propose the use of convoluted graphs that render their edges using splines. Configurations of these splines are studied and interactively illustrated.


Stephen Tratz, USC/ISI

Talk Title: Noun-Noun Compounds for Text Analysis

The automatic interpretation of noun-noun compounds is an important subproblem within many natural language processing applications and is an area of increasing interest. The problem is difficult, with disagreement regarding the number and nature of the relations, low inter-annotator agreement, and limited annotated data. We present a novel taxonomy of relations that integrates previous relations, a large dataset annotated according to these relations, and a supervised classification method for automatic noun compound interpretation.


Yuancheng Tu, UIUC

Talk Title: Aspect Guided Text Categorization with Unobserved Labels

Project Scope: Categorizing text into one of multiple possible categories is an archetypal multi-class classification problem with many important applications in knowledge management and information access, ranging from email classification to author identification to sentiment analysis. Studies in this area develop principled ways to analyze massive amounts of information from texts or other sources and to quickly and reliably determine whether they hold a desired property, such as being written by a known entity or representing a suspicious event. As such, it has broad applications to DHS and national security. In this project, we propose a novel method for text categorization that works with a large label space and, most importantly, can handle the case in which some of the labels were not observed in the training data. Our method exhibits great potential to correctly predict unobserved labels, which traditional multiclass classification methods cannot handle at all.

Recent Progress: We have developed a novel multi-class classification method that exploits the structure and meaning of the label space and can thereby classify even with respect to labels that were not previously observed in the training data. The key insight is the introduction of intermediate aspect variables that encode properties of the labels. Aspect variables serve as a joint representation for observed and unobserved labels. In this way, the classification problem can be viewed as a structure learning problem with natural constraints on assignments to the aspect variables. We solve it as a constrained optimization problem over multiple learners and show significant improvement in classifying short sentences into a large label space of categories, including previously unobserved categories.

Future Plans: One direction for further development of this work is to find more constraints for the inference procedure and to experiment with incorporating constraint-based inference within or after learning. Another way to enhance the system is to develop an interactive protocol that induces prediction feedback as a way to quickly adapt to new labels.

Relevance to listed research areas: Automatic text classification methods are an essential component in developing the ability to interpret, in a meaningful way, properties of massive amounts of information as a way to support decision making with respect to it.


Abdul-Aziz Yakubu, Howard University

Talk Title: Discrete-Time Epidemic Models

We use a periodically forced SIS epidemic model with disease-induced mortality to study the combined effects of seasonal trends and death on the extinction and persistence of discretely reproducing populations. We introduce the epidemic threshold parameter, R0, for predicting disease dynamics in periodic environments. Typically, R0<1 implies disease extinction. However, in the presence of disease-induced mortality, we show that a tiny infective population can drive an otherwise persistent population with R0>1 to extinction. Furthermore, we obtain conditions for the persistence of the total population. In addition, we use the Beverton-Holt recruitment function to show that the infective population exhibits a period-doubling bifurcation route to chaos while the disease-free susceptible population lives on a 2-cycle (non-chaotic) attractor.
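The basic ingredients of such models can be sketched in a few lines. The update rule below is a generic discrete-time SIS step with Beverton-Holt recruitment and disease-induced mortality, not the exact periodically forced model of the talk, and all parameter values are invented.

```python
import math

# A generic discrete-time SIS step with Beverton-Holt recruitment and
# disease-induced mortality -- an illustrative sketch, not the exact
# periodically forced model of the talk. Parameters are made up.
def step(S, I, a=2.0, b=0.01, s=0.8, beta=3.0, gamma=0.2, d=0.3):
    N = S + I
    recruit = a * N / (1 + b * N)             # Beverton-Holt recruitment
    escape = math.exp(-beta * I / N)          # Poisson-contact escape prob.
    S_next = recruit + s * S * escape + s * (1 - d) * gamma * I
    I_next = s * S * (1 - escape) + s * (1 - d) * (1 - gamma) * I
    return S_next, I_next

S, I = 99.0, 1.0
for _ in range(200):
    S, I = step(S, I)
print("infective fraction after 200 steps:", I / (S + I))
```

Iterating the map and varying the recruitment or transmission parameters is how one observes the extinction, persistence, and period-doubling regimes discussed above.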


Guoping Zhang, Morgan State University

Talk Title: Radon transform and its applications

One of the major inventions of the last century is the CT scanner (computerized tomography). Cormack and Hounsfield received the 1979 Nobel Prize in Medicine for their work on computed axial tomography. The CT scanner can be used to reconstruct X-ray absorption in the interior of structures, such as patients or machine parts with possible internal fractures.

What is common to the development of all types of scanners is that they have, to some extent, been based on the Radon transform. The crucial idea for image reconstruction is that the Radon transform provides a natural link between the image function of the object and the measurements of machines such as CT and MRI scanners.

In this talk, I will give a brief introduction to the Radon transform, the generalized Radon transform, and their relations with microlocal analysis. I also want to discuss potential applications to DHS research. A primary goal of my talk is to seek potential future collaborators among CCICADA members through our discussions during the retreat.
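A crude numerical sketch makes the transform concrete: for each angle, sum the image along lines perpendicular to that direction. The binning scheme and phantom below are illustrative assumptions, far simpler than the filtered projections a real scanner uses.

```python
import numpy as np

# A toy discrete Radon transform: each projection bins pixel values by
# their signed distance r = x*cos(theta) + y*sin(theta) from the center.
# Real scanners invert such line-integral data to reconstruct the image.
def radon(image, thetas, n_bins):
    n = image.shape[0]
    c = (n - 1) / 2.0
    ys, xs = np.mgrid[0:n, 0:n]
    xs, ys = xs - c, ys - c
    sinogram = np.zeros((len(thetas), n_bins))
    for k, th in enumerate(thetas):
        r = xs * np.cos(th) + ys * np.sin(th)
        bins = np.clip(((r + c) / (2 * c) * (n_bins - 1)).round().astype(int),
                       0, n_bins - 1)
        np.add.at(sinogram[k], bins.ravel(), image.ravel())
    return sinogram

# A centered square "phantom". Every pixel lands in exactly one bin per
# angle, so each projection conserves the total mass of the image.
img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0
sino = radon(img, thetas=np.linspace(0, np.pi, 4, endpoint=False), n_bins=32)
print(sino.sum(axis=1))  # each row sums to the total mass, 64.0
```

The mass-conservation property checked in the comment is a discrete analogue of the fact that every projection of a density integrates to the same total.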


Mianwei Zhou, UIUC

Talk Title: Data-oriented Content Query System: Searching for Data into Text on the Web

With the ever-growing richness of the Web, people nowadays are no longer satisfied with finding interesting documents to read. Instead, we are becoming increasingly interested in the various fine-granularity information units, e.g., movie release dates or book prices, which appear within the content of Web documents. We are witnessing several emerging Web-based search applications that exploit such rich data on the Web, such as:

1. Web-based Information Extraction (WIE). Given the richness and redundancy of the Web, WIE relies on simple phrase patterns (e.g., "X is the capital of Y") to harvest numerous facts online.

2. Typed-Entity Search (TES). Several efforts have tried to search for entities inside Web pages (e.g., searching for the phone number of Amazon's customer service). Such techniques often rely on extracting data types of interest and then matching the extracted information based on proximity patterns.

With so many ad hoc efforts exploiting Web contents, there is a pressing need to distill their essential capabilities; thus we propose the concept of a Data-oriented Content Query System (DoCQS).

DoCQS aims to support general "content querying" for finding data on the Web. In DoCQS, we use the relational model to represent Web data and propose a corresponding SQL-style language, CQL, for content querying. Based on DoCQS, Web-based applications can easily access Web data as if writing SQL to query a database, avoiding much repetitious work. For efficient processing, we design novel index structures and query processing algorithms. We evaluate our proposal on two concrete domains of realistic Web corpora, demonstrating that our query language is flexible and expressive, and that our query processing is efficient with reasonable index overhead.


Document last modified on March 5, 2010.