Abstracts
Title: An Algorithm for Partitioning the Nodes of a Graph
Project: Controlling The Spread of Diseases
Consider a population of individuals, some of whom interact with each other. We model this population as a graph where the nodes represent individuals, and two nodes are connected by an edge if the individuals interact with each other. We consider the problem of separating the nodes into a specified number of groups, of given sizes, in such a way that the number of edges connecting individuals in different groups is minimized. This model may be helpful for controlling the spread of contagious diseases. Mathematically, our problem can be formulated as a quadratic assignment problem. It is therefore a nonlinear integer programming problem and is NP-hard. We propose an algorithm that produces a local solution by solving a sequence of linear assignment problems. In this talk we describe the algorithm and report on some experiments with small graphs.
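As a rough illustration of how such a scheme could work, the following Python sketch starts from a random partition that respects the given group sizes and repeatedly solves a linear assignment problem (via scipy.optimize.linear_sum_assignment) that reassigns nodes to group "slots" so as to reduce the number of edges crossing between groups, stopping at a local solution. The cost model, initialization, and stopping rule are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def partition_by_assignment(adj, group_sizes, max_iters=50, seed=0):
    """adj: symmetric 0/1 adjacency matrix; group_sizes: sizes summing to the node count."""
    n = adj.shape[0]
    num_groups = len(group_sizes)
    rng = np.random.default_rng(seed)
    # One "slot" per node position; slot_group[j] is the group that slot j belongs to.
    slot_group = np.repeat(np.arange(num_groups), group_sizes)
    assign = slot_group[rng.permutation(n)]          # random initial partition
    deg = adj.sum(axis=1)

    def cut_edges(a):
        return int((adj * (a[:, None] != a[None, :])).sum() // 2)

    best = cut_edges(assign)
    for _ in range(max_iters):
        # neighbors_in[i, g] = number of i's neighbors currently placed in group g
        neighbors_in = np.stack(
            [adj[:, assign == g].sum(axis=1) for g in range(num_groups)], axis=1)
        # cost of placing node i in group g: edges of i that would then cross groups
        cost = deg[:, None] - neighbors_in
        rows, cols = linear_sum_assignment(cost[:, slot_group])   # one column per slot
        new_assign = np.empty(n, dtype=int)
        new_assign[rows] = slot_group[cols]
        new_cut = cut_edges(new_assign)
        if new_cut >= best:                          # no further improvement: local solution
            break
        assign, best = new_assign, new_cut
    return assign, best
```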
The graphs we have experimented with are built from military data collected in Afghanistan, which is being analyzed to determine the habits of certain populations. We have selected small subsets of this data, of up to 500 individuals, to test the feasibility of our algorithm. The results we are getting are very promising.
Title: Continuous monitoring of events from twitter and news articles
Emergencies invariably require crisis managers to field numerous reports, such as 911 calls for service, radio reports from police and fire teams, medical reporting, and updates from infrastructure support teams. CCICADA technologies for Event and Entity Resolution will be applied to help managers handle the confusion. We will deploy technology to monitor events in the general area surrounding the simulated incident, constantly looking for new events. Our technology will monitor text streams of various kinds, including online news and tweets, to identify and track the topics related to the current stage of events. These topics can be used to integrate different types of information sources for decision making, such as geospatial imagery and maps.
The technology first collects and preprocesses the Twitter stream and online news articles appearing in simulated time. The real-time tweet stream is analyzed against its average distribution to identify important bursty events. The events are then monitored to discover the topic words related to the current stage of the event. These identified topics are used to correlate the social discussions with official media news and to visualize the current stage of the event. We will not be able to solve all these problems definitively, but we will demonstrate the capabilities of some state-of-the-art solutions.
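As a rough illustration of the burst-detection step, the sketch below flags time windows whose tweet volume deviates sharply from a trailing average; the window length, baseline length, and threshold are assumptions, not the deployed system's settings.

```python
# Hedged sketch: flag "bursty" time windows in a tweet stream by comparing each
# window's volume to a trailing average. Baseline length and threshold k are assumptions.
from collections import deque

def detect_bursts(window_counts, baseline_len=24, k=3.0):
    """window_counts: iterable of tweet counts per fixed time window (e.g. per minute)."""
    baseline = deque(maxlen=baseline_len)
    bursts = []
    for t, count in enumerate(window_counts):
        if len(baseline) == baseline_len:
            mean = sum(baseline) / baseline_len
            var = sum((c - mean) ** 2 for c in baseline) / baseline_len
            if count > mean + k * (var ** 0.5):
                bursts.append(t)                 # window t deviates from the recent average
        baseline.append(count)
    return bursts

# Example: a quiet stream with a spike at window 30
counts = [10] * 30 + [80] + [12] * 10
print(detect_bursts(counts))   # -> [30]
```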
Title: A penalized regression approach in detection of nuclear materials in the shipments to the U.S.
Large volumes of shipping containers entering the US each day are highly vulnerable vehicles for illegally delivering nuclear and radiological weapons or materials into the US. A multilayer approach has been adopted as a standard measure for counteraction in current inspection practice. The first layer of inspection begins with collecting information at an overseas point of embarkation in a variety of customs forms, which we call "manifest" information. In this paper, a penalized linear regression model is constructed to reveal abnormality in the manifest data and further identify high-risk vehicles. Manifest data have special features that require special considerations to improve modeling and prediction power. For instance, most of the information contained in the manifest data is given by categorical variables. These categorical variables are usually represented by dummy variables that form natural groups. For example, one item contained in manifest data is content, which is classified into 16 categories such as food, toys, and chemicals. Some contents may be more likely to hide nuclear materials than others, so when assigning a risk score to a container, we may need to take its content into consideration. Therefore, not only is it important to identify influential categorical variables, but it is equally important to identify which categories contribute to the impact on the risk scores. In addition, some of the categorical variables are highly correlated, and this could cause problems in model fitting. Nevertheless, the group structures of variables provide an important source of regularization which can properly handle these issues and help model building. In this talk, we propose a novel penalized regression method to discover and model the dependence of the risk score on the container information reported in the manifest data.
Penalized regression is a modern statistical learning technique that has been proven to be very effective for model selection, especially when the number of variables considered is large. The proposed method not only incorporates the correlation patterns among the variables, leading to the underlying group structure, but also performs variable selection at both the group and within-group levels. The method is based on a penalty function associated with the correlation structures of the explanatory variables, in addition to the usual penalty on the coefficients. It can be shown that our method has a general grouping effect and achieves the oracle property in the sense of removing unimportant variables. The method is applied to the high-dimensional, mixed-type, noisy manifest data with simulated outcomes, and the results show that this approach facilitates detection of suspicious cargoes and reduces false alarms.
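For illustration only, the sketch below fits a plain group-lasso style penalized regression by proximal gradient descent, so that whole groups of dummy-coded categorical variables can be selected or dropped together. The proposed method's penalty also exploits the correlation structure among the variables and is not reproduced here.

```python
# Hedged sketch: group-lasso style penalized regression via proximal gradient descent.
# Only illustrates group-level selection of dummy-coded categorical variables; it is
# not the correlation-aware penalty proposed in the talk.
import numpy as np

def group_lasso(X, y, groups, lam=0.1, step=None, n_iter=500):
    """groups: list of column-index arrays, one per categorical variable's dummy block."""
    n, p = X.shape
    beta = np.zeros(p)
    if step is None:
        step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n                # gradient of the squared-error loss
        beta = beta - step * grad
        for g in groups:                               # block soft-thresholding per group
            w = np.sqrt(len(g))                        # standard group-size weight
            norm_g = np.linalg.norm(beta[g])
            if norm_g > 0:
                beta[g] *= max(0.0, 1.0 - step * lam * w / norm_g)
    return beta
```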
Title: Exploiting Thread Structures to Improve Smoothing of Language Models for Forum Post Retrieval
As Web 2.0 has prospered over the last decade, there are nowadays more and more ways to publish information on the Web. Among them, online forums and discussion boards are of great importance and are widely used, for several reasons. For one thing, it is much easier for users to post content on forums than to compose web pages: the infrastructure of forums allows users to focus on the content of a post instead of putting effort into designing its presentation. For another, users are able to interact with each other in forums while they publish their opinions. This makes the web content lively, and people are therefore more inclined to look to forum posts for information. As more and more forums become available online, forum post retrieval becomes an important task. BoardTracker, one of the most popular forum search engines, has more than 32,000 forums indexed. A number of forum search engines have been built in recent years, but despite their dedication to forum search, few serious studies have been done toward this end.
Due to many unique characteristics of forum data, forum post retrieval is different from traditional document retrieval and web search, raising interesting research questions about how to optimize the accuracy of forum post retrieval. In this project, we study how to exploit the thread structures of forums to improve retrieval accuracy in the language modeling framework. Specifically, we propose and study two different schemes for smoothing the language model of a forum post based on the thread containing the post. We explore several different variants of the two schemes to exploit thread structures in different ways. We create the first forum post test data set and evaluate these smoothing methods using this data set. The experimental results show that the proposed methods for leveraging forum threads to improve estimation of document language models are effective, and they outperform existing smoothing methods for the forum post retrieval task.
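As a minimal sketch of thread-based smoothing (not necessarily either of the two schemes studied), the following estimates a post's language model with two-stage Dirichlet smoothing: the post is smoothed with its thread, and the thread with the whole collection. The smoothing parameters are assumptions.

```python
# Hedged sketch: two-stage Dirichlet smoothing of a forum post's language model,
# first with its containing thread and then with the whole collection.
from collections import Counter

def smoothed_post_lm(post_tokens, thread_tokens, collection_tokens,
                     mu_thread=500, mu_coll=2000):
    post, thread, coll = Counter(post_tokens), Counter(thread_tokens), Counter(collection_tokens)
    post_len, thread_len, coll_len = sum(post.values()), sum(thread.values()), sum(coll.values())

    def p_coll(w):
        return coll[w] / coll_len

    def p_thread(w):
        # thread model smoothed with the collection model
        return (thread[w] + mu_coll * p_coll(w)) / (thread_len + mu_coll)

    def p_post(w):
        # post model smoothed with the (already smoothed) thread model
        return (post[w] + mu_thread * p_thread(w)) / (post_len + mu_thread)

    return p_post
```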
Title: Virtual Worlds and Human Behavior: A case study in health risks
Current medical theories explaining rates of high-risk behavior, especially in young adults, attribute it to irrational assumptions of "invincibility" to the risks. Public health efforts have therefore attempted to debunk these impressions by focusing explicitly on the risks, assuming that if only individuals could be convinced of their own susceptibility, they would avoid adopting risky behaviors.
Based on health-related behaviors observed within the virtual world of Whyville, we propose an alternative hypothesis: because of strong social bonds, certain populations may count poor medical outcomes among the benefits, rather than the costs, of risky behaviors. We will describe the virtual outbreak and the observed reactions that led us to propose this hypothesis, and will then discuss how this new perspective could translate into alternative public health efforts to curb high-risk behaviors among teens. Lastly, we will discuss how such virtual world settings provide access to insights into human behavior (such as those presented) that are otherwise inaccessible to scientific investigation.
Title: Identifying Relevant Literature on Chemical Terrorism using Machine Learning
With the current concerns about terrorism, there is increased interest in identifying the kinds of chemicals that could be used by terrorists and their effects in a terrorism context. However, developing an encompassing resource of all relevant information, geared specifically towards chemical terrorism, is not currently feasible, since much of the scientific work is carried out in many different industrial and governmental laboratories. Our objective in this paper is to propose an information-mining method for identifying the relevant literature on chemical agents that have been or could potentially be used by terrorists. We mine PubMed, a National Library of Medicine repository containing over 20 million articles covering more than 5000 journals, which includes the most important articles on the medical effects of chemicals used in terrorism. However, searching with the string "chemical terrorism" produces many results that are not primarily relevant to chemical terrorism: of 292 articles obtained from PubMed and annotated, only 197 were found by human experts to be relevant to the topic. The purpose of this article is twofold: a) to investigate the use of a boosting algorithm in a supervised learning context to identify relevant articles, and b) to automatically generate critical feature vectors and measure their efficacy.
The results of this work would help develop future criteria for automatically selecting relevant articles, as they are published, for examination by human researchers.
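For illustration, a supervised boosting classifier over TF-IDF features could be assembled as below; the specific boosting variant and features used in the study are not detailed here, so AdaBoost over unigram TF-IDF is only a stand-in.

```python
# Hedged sketch: a boosting classifier over TF-IDF features for separating articles
# relevant to chemical terrorism from irrelevant ones. Illustrative stand-in only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def build_relevance_classifier():
    return make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=5000),
        AdaBoostClassifier(n_estimators=200),
    )

# abstracts: list of article abstracts; labels: 1 = relevant, 0 = not relevant
# clf = build_relevance_classifier()
# print(cross_val_score(clf, abstracts, labels, cv=5).mean())
```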
Title: Higher Order Latent Dirichlet Allocation
Entity matching, or resolution, is the discovery of records that refer to the same underlying object. In our research we are developing an Entity Matching System (EMS) that addresses entity resolution by leveraging multiple approaches. The EMS includes the following functionality: (1) an open-source framework; (2) support for NIEM/GJXDM to interface with current and future NIJ, DHS, and related projects; (3) multiple entity matching solutions, including traditional pairwise unsupervised algorithms, a Bayesian Logistic Regression algorithm, and algorithms using topic modeling; and (4) ground truth data for evaluating the effectiveness of entity matching.
The EMS is built on the Blackbook framework, a semantic web support system. Blackbook is designed to provide analysts with an easy-to-use tool to access valuable data. As a web-services framework, Blackbook supports access to both local and remote data sources. Blackbook is based on the Resource Description Framework (RDF); everything in Blackbook, including the data, is stored in RDF. To interoperate with NIEM/GJXDM, the EMS includes a GJXDM converter.
To date, three different algorithms have been developed and incorporated into the EMS in support of entity resolution. The first is a pairwise unsupervised algorithm based on q-grams. The second is similar but resolves numeric data. The third is a supervised algorithm named Bayesian Online Extensible Regression (BOXER). These three algorithms have been integrated into Blackbook and tested on a ground truth data set drawn from the Global Terrorism Database (www.start.umd.edu/gtd) and the Worldwide Incident Tracking System (wits.nctc.gov). Another important evaluation was conducted on identity theft data from the Internet Crime Complaint Center (http://www.ic3.gov), using the Correlated Topic Model to match modus operandi.
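A minimal sketch of q-gram-based pairwise matching, in the spirit of the first algorithm, is shown below; the value of q and the decision threshold are illustrative assumptions.

```python
# Hedged sketch: pairwise unsupervised matching of entity names by the overlap of
# their character q-grams. The padding, q, and threshold are example choices.
def qgrams(s, q=3):
    s = f"{'#' * (q - 1)}{s.lower()}{'#' * (q - 1)}"   # pad so string ends contribute grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=3):
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)                  # Jaccard overlap of q-gram sets

def match(a, b, threshold=0.6):
    return qgram_similarity(a, b) >= threshold

print(qgram_similarity("Al-Qaida", "Al Qaeda"))
```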
In addition we are conducting research into the extension of probabilistic graphical models for topic modeling. Using both Latent Dirichlet Allocation (LDA) and Correlated Topic Model (CTM), we introduced additional latent information drawn from the data sets being modeled based on Higher Order Learning [1]. To date a higher-order version of LDA has been developed and evaluated on synthetic and benchmark data. The results show that Higher-order LDA outperforms LDA using Gibbs sampling most of the time.
1. Ganiz, M.C., Lytkin, N.I., and Pottenger, W.M. Leveraging Higher Order Dependencies between Features for Text Classification. ECML/PKDD 2009, pp. 375-390.
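For reference, a standard (non-higher-order) LDA baseline can be fitted with scikit-learn as sketched below; the higher-order variant described above adds latent higher-order dependencies and is not reproduced here.

```python
# Hedged sketch: a plain LDA topic-model baseline; not the Higher-order LDA itself.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(documents, n_topics=10):
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)               # per-document topic mixtures
    terms = vectorizer.get_feature_names_out()
    top_words = [[terms[i] for i in comp.argsort()[-10:][::-1]]
                 for comp in lda.components_]             # ten strongest words per topic
    return doc_topics, top_words
```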
Weave (WEb based Analysis and Visualization Environment) is a web-based visual analytics platform developed by the Institute for Visualization and Perception Research at UMass Lowell. Studies have shown that collaboration is an extremely effective technique for solving complex interdisciplinary problems, yet among current visualization systems there is a dearth of collaborative frameworks. Systems exist today that support asynchronous collaboration; Weave, in contrast, seeks to provide sessioned synchronous collaboration between users in different locations.
Weave users come from a variety of backgrounds and accordingly have access to different levels of resources; equipment, level of expertise, and data may all affect a user's usage scenario. For collaboration to be effective, ease and speed of data import and dissemination are critical. Weave therefore provides a variety of methods for the user to handle the import and dissemination of data in accordance with his or her usage scenario, and it deals with both local and distributed data.
Title: Linking Textual Content to Location to Provide Geospatial Intelligence
We have developed a web-based application that uses automatic techniques to integrate information about specific points of interest for a given area of geospatial imagery and maps. The result is a seamless and intuitive integration of knowledge and geography providing geospatial intelligence for better decisions faster. The resulting system, called GeoXray, makes it easy for a range of users to make smart, location-based decisions based on a fusion of imagery and maps with textual content such as news, blogs, tweets, internal documents and reports, and a range of other data.
In this talk I will describe the underlying technology to support the rapid and accurate linking of textual content to location. In other approaches to this problem, the individual geospatial references are linked to separate locations and the types of locations are limited. In contrast, we exploit the entire context in a document to determine the most likely geospatial focus of the document and provide the ability to link to a much wider range of geospatial locations. I will also present our recent work that allows users to import and update features in the background knowledgebase and supports the import of various types of documents, such as intelligence reports. I will describe the various APIs and user interfaces, which allow other groups both to use and to integrate with the GeoXray technology.
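As a simplified sketch of the idea of using a document's whole context, the following pools evidence from every place mention in a document to pick a single geospatial focus and then resolves each mention within that focus. The gazetteer format and scoring are assumptions, not the GeoXray implementation.

```python
# Hedged sketch: estimate a document's geospatial focus by aggregating all candidate
# place mentions instead of resolving each mention independently.
from collections import defaultdict

def geospatial_focus(mentions, gazetteer):
    """mentions: place names found in one document.
    gazetteer: dict name -> list of (location_id, containing_region, prior_weight)."""
    region_scores = defaultdict(float)
    for name in mentions:
        for _, region, prior in gazetteer.get(name, []):
            region_scores[region] += prior              # pool evidence from the whole document
    if not region_scores:
        return None, {}
    focus = max(region_scores, key=region_scores.get)
    resolved = {}
    for name in mentions:                               # disambiguate within the chosen focus
        candidates = [c for c in gazetteer.get(name, []) if c[1] == focus]
        if candidates:
            resolved[name] = max(candidates, key=lambda c: c[2])[0]
    return focus, resolved

gazetteer = {
    "Springfield": [("Springfield, IL", "Illinois", 0.4), ("Springfield, MA", "Massachusetts", 0.5)],
    "Decatur":     [("Decatur, IL", "Illinois", 0.5), ("Decatur, GA", "Georgia", 0.5)],
}
print(geospatial_focus(["Springfield", "Decatur"], gazetteer))
# -> ('Illinois', {'Springfield': 'Springfield, IL', 'Decatur': 'Decatur, IL'})
```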
Title: Fully Dynamic Connectivity in a Streaming Model
Finding the connected components of an undirected graph is a fundamental problem of graph theory, and many simple algorithms exist when the graph is static (e.g. BFS or DFS). However, if the graph is dynamic in nature, the problem becomes much more interesting. Many fast algorithms for such a problem have been found for incremental, decremental, and fully-dynamic graphs over the past 30 years, and improvements are still being made.
More recently, researchers have considered maintaining connected components in the stream model, where edge insertions and deletions arrive over time in a stream and storage space is a limiting factor. This models online monitoring of processes such as financial transactions, or sequential access of massive data from, for example, a simulation stored in tertiary memory. Though the incremental case, when modeled in a stream model called W-Stream [2], has been solved up to a logarithmic factor of optimality [1], the fully dynamic case remains unresolved.
We present a heuristic algorithm which attempts to solve the fully-dynamic connectivity problem in a slightly stronger, more flexible stream model than that of W-Stream. Furthermore, our algorithm allows for indefinitely long streams, whereas all previous partially-dynamic streaming algorithms relied on the stream being finite.
Acknowledgment: This is ongoing research with Jonathan Berry and Cynthia Phillips of Sandia National Labs.
[1] C. Demetrescu, I. Finocchi, and A. Ribichini, Trading off space for passes in graph streaming problems, ACM Trans. Algorithms, 6 (2009), pp. 1-17.
[2] M. Ruhl, Efficient Algorithms for New Computational Models, PhD thesis, Massachusetts Institute of Technology, 2003.
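For contrast with the fully dynamic setting, the sketch below handles only the easy, insert-only (incremental) case with a union-find structure over a stream of edge insertions, using space proportional to the number of vertices; supporting deletions, as our work targets, requires a fundamentally different approach.

```python
# Hedged sketch: incremental (insert-only) connectivity over an edge stream with
# union-find. Illustrates the incremental case only, not the fully dynamic problem.
class StreamingConnectivity:
    def __init__(self):
        self.parent = {}

    def _find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path halving
            v = self.parent[v]
        return v

    def insert_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv                           # merge the two components

    def connected(self, u, v):
        return self._find(u) == self._find(v)

cc = StreamingConnectivity()
for u, v in [("a", "b"), ("c", "d"), ("b", "c")]:          # edge-insertion stream
    cc.insert_edge(u, v)
print(cc.connected("a", "d"))   # True
```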
Title: Anomaly Detection in IP networks using Optimal Monitoring Interval Length
Our anomaly detection system uses a network traffic representation called a protocol graph. A protocol graph is a graph-based representation of network traffic observed on a particular Internet protocol over some interval of time. In our protocol graphs, the vertices represent hosts using a particular protocol and edges represent interaction between those hosts.
The advantage of using protocol graphs over other detection methods is that the generated graphs contain host relationship information that is reflected in the resulting graph structure. We hypothesize that time series analysis of protocol graphs is a viable method for detecting various types of attacks. Even if an attacker has a list of vulnerable targets, the attacker does not know how other parties in the network normally interact with each other, and consequently the attacker risks detection by disrupting measurable attributes of the corresponding protocol graph. Further, we hypothesize that we can improve detection rates by intelligently deciding on an optimal monitoring interval length (OMIL).
The prevailing approach in time series analysis is to choose a monitoring interval length that appeals to human standards (e.g., 60 seconds, 10 minutes) rather than one that will optimally detect anomalies. The criteria for optimality depend on the network administrator, but they can and should be made concrete. For example, the network administrator may wish to maximize the true positive rate subject to a constraint on computation time. In previous anomaly detection work, authors go to great lengths to justify the detection methods but not the monitoring interval length (MIL) used, which plays an important role in the process. Additionally, it is not well established whether methods work equally well independent of the MIL used.
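As a small illustration of the setup, the sketch below builds one protocol graph per monitoring interval from flow records and emits a time series of simple graph attributes that a detector could threshold; the interval length is exactly the parameter we propose to optimize, and the attributes shown are examples rather than the full set we monitor.

```python
# Hedged sketch: one protocol graph per monitoring interval, plus a time series of
# example graph attributes (vertex count, edge count, largest component size).
import networkx as nx

def protocol_graph_series(flows, interval_len):
    """flows: iterable of (timestamp, src_host, dst_host) for a single protocol."""
    graphs = {}
    for ts, src, dst in flows:
        bucket = int(ts // interval_len)                 # which monitoring interval this flow falls in
        graphs.setdefault(bucket, nx.Graph()).add_edge(src, dst)
    series = []
    for bucket in sorted(graphs):
        g = graphs[bucket]
        largest = max(nx.connected_components(g), key=len)
        series.append((bucket, g.number_of_nodes(), g.number_of_edges(), len(largest)))
    return series
```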
Title: Can you believe what you read?: Trustworthiness of Online Content
Decision makers and citizens are highly influenced these days by information they get from online resources: news, online encyclopedias, blogs and forums, other social interactions, and product reviews. With the advent of social media, emergent and time-sensitive news is reported and followed through cell phones, Twitter, and forums. The ease of publishing on the Web, on the other hand, has also allowed nefarious sources to openly express their views and opinions as if they were facts. With such an abundance of data, can you believe all that you read online? Can you identify what information is trustworthy and what is not? We explore these questions and believe that advanced text analytics, involving evidence retrieval, natural language understanding, and trust modeling, can help separate the wheat from the chaff. We explore the use of community knowledge (forums, blogs) and differential trust levels in sources (experts vs. laymen, commercial vs. federal) to model the trustworthiness of information. We do this by grouping semantically entailed information, retrieving evidence to support or oppose a claim, and aggregating this evidence based on source reliability. This, we believe, is a prerequisite for downstream text analysis applications. We specifically focus on the medical domain, where malicious behavior as well as commercial interests can exploit the vulnerability of a large population of patients, as was evident in the FDA's intervention during the recent Swine Flu outbreak.
Recent Progress: We built a medical trustworthiness model that aims to rank reliable treatments for a disease above unreliable or recalled drugs, especially for chronic ailments such as cancer and arthritis that do not have a single well-known treatment and are prone to quackery. We explore the use of community knowledge, such as health forums and discussion boards, and model trustworthiness based on the sentiment for or against a treatment. In separate work, we built a classifier to detect and rank reliable medical websites based on page- and site-level features. We plan to continue exploring different models of trustworthiness in medical and other domains. Current research focuses on validating a claim in a "fact finder"-like application by retrieving textual evidence from news articles.
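As a toy illustration of reliability-weighted evidence aggregation, the sketch below scores a claim from supporting and opposing evidence weighted by source reliability; the weights, sources, and sign convention are invented for the example.

```python
# Hedged sketch: score a claim by aggregating retrieved evidence, weighting each
# supporting (+1) or opposing (-1) piece by the reliability of its source.
def claim_trust_score(evidence, source_reliability):
    """evidence: list of (source, stance) with stance +1 (supports) or -1 (opposes)."""
    score, total_weight = 0.0, 0.0
    for source, stance in evidence:
        w = source_reliability.get(source, 0.1)    # unknown sources get little weight
        score += w * stance
        total_weight += w
    return score / total_weight if total_weight else 0.0

reliability = {"fda.gov": 0.9, "health_forum": 0.3, "vendor_site": 0.2}
evidence = [("fda.gov", -1), ("health_forum", +1), ("vendor_site", +1)]
print(claim_trust_score(evidence, reliability))    # negative: the claim looks untrustworthy
```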
Title: Machine learning methods for cybersecurity
We are working on machine learning tools for improving cyber defenses through early discovery of attack indicators and through learning from human experts. These tools will support proactive analysis of attacks; automated adaptation of defenses to the needs and usage patterns of individual users; and sharing knowledge among users. They will complement the standard defenses by adding a "layer of armor" that detects novel threats.
We have so far developed a mechanism that helps inexperienced users utilize available Internet security tools and have integrated it into web browsers. We have built a system that learns the user's web-browsing patterns, guides the user through security decisions, clarifies related issues, and helps adjust security settings to the user's needs. We are now building a crowd-sourcing module that will strengthen security by transferring experience from experts to novices, thus helping to detect novel threats that may not be known to automated tools.
Title: Optimal Evacuation and Resource Routing for Crisis Situations in Urban Centers
We will discuss a model that helps predict optimal strategies for converting locations into temporary medical facilities, given the explicit spatial layout of an urban center. The model also helps show which routing strategies are best suited for getting people to the necessary facilities. We will also cover the applicability of the model to various crisis situations, including extreme heat events and outbreaks of infectious disease.
Title: Visual Analytics Science and Technology (VAST)
The purpose of the Visual Analytics Science and Technology (VAST) Challenge is to increase awareness of difficult visual analytics problems in a variety of areas and to improve evaluation methodologies for complex visual analytics systems. The VAST 2010 Challenge consisted of three mini-challenges: one involving text records and an investigation into arms dealing, one involving hospitalization records and the characterization of a pandemic spread, and one involving genetic sequences and tracing the mutations of a disease. The grand challenge combined all data sets, and the task was to investigate any possible linkage between the illegal arms dealing and the pandemic outbreak. The scenario created for this mini-challenge focused on a hypothetical pandemic outbreak involving a fictitious, rapidly mutating virus. Outbreaks require the rapid deployment of limited and temporary resources to mitigate potential damage to society and loss of life. Authorities and health care professionals use decision support and bioinformatics tools to assess the situation as it develops in order to properly allocate limited resources. Our goal was to see which tools would be used for the solutions and whether they would support both insight and scale. As demonstrated in previous challenges, the design of real-world scenarios with accompanying synthetic datasets enabled participants to evolve their visual analytics tools to solve these scenarios. In this paper we describe the scenario, the generation of the data set and its issues, as well as some of the innovative solutions.
Title: Using what we know about biomechanics to explore the nonlinear dynamics of iris deformation
Here we present a novel approach that uses what we know about biomechanics to explore the nonlinear dynamics of iris deformation. Current iris recognition systems and algorithms at most assume that dilation is linear. Furthermore, research on iris deformation does not take into account the mechanical properties of the iris tissue or the cause of deformation, which is the iris musculature. In our work, we explore the tissue mechanics of the iris region. By examining the mechanics, we are able to obtain a complete understanding of the dynamics and to disprove the current linear assumption.
Title: Applications of Confluent Graph Visualization Engine to Data Analysis and Information Extraction
Besides being a powerful graph visualization tool, the engine can collect database information in XML format, convert it to regular text format, store it in appropriate arrays, and make it ready for use. We have experimented with numerous applications. Its database search engine was used to process data files provided by the vaccine group to identify relationships. The engine was extended with an extra module that runs, animates, and replays the representation of telephone communications between groups of people in chronological order. The visualization simplifies the interpretation of the sequence of phone calls between group members and helps identify different clusters.
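A simplified view of the XML-to-array conversion step might look like the sketch below; the element and attribute names are invented for the example and do not reflect the engine's actual schema.

```python
# Hedged sketch: pull records out of a hypothetical XML export into plain Python
# lists ready for further processing (e.g. building the call-sequence animation).
import xml.etree.ElementTree as ET

def xml_calls_to_arrays(xml_text):
    """Parse a phone-call log like <calls><call from="A" to="B" time="3"/>...</calls>."""
    root = ET.fromstring(xml_text)
    callers, callees, times = [], [], []
    for call in root.findall("call"):
        callers.append(call.get("from"))
        callees.append(call.get("to"))
        times.append(int(call.get("time")))
    return callers, callees, times

sample = '<calls><call from="A" to="B" time="3"/><call from="B" to="C" time="7"/></calls>'
print(xml_calls_to_arrays(sample))   # (['A', 'B'], ['B', 'C'], [3, 7])
```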
We are working on converting the software to parallel processing using CUDA in order to handle very large databases.
Title: Discovering Connections between and within DHS Centers of Excellence
The current project, COE Explorer, is a joint effort between CCICADA and VACCINE. The goal of the project is to allow people to interactively learn about the personnel, activities, capabilities, and interests of DHS Centers of Excellence (COE), and eventually also other related organizations. It includes two principal components:
We at CCICADA-USC/ISI are building the data analytics component. This component takes as input either text (such as project descriptions or reports) or structured information (such as lists of center members, project titles, etc.) and applies natural language processing technology (including Conditional Random Fields for information extraction and Latent Dirichlet Allocation for topic modeling). It produces as output a set of database tables containing discovered connections between center members or projects that have similar research topics or interests. The results are displayed by the visualization component built at VACCINE. Our initial data describes our two linked centers (CCICADA and VACCINE).
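As a small illustration of the final step, the sketch below turns per-member topic mixtures (such as those produced by a topic model over project descriptions) into a table of discovered connections between members with similar interests; the similarity measure and threshold are assumptions.

```python
# Hedged sketch: derive a connection table from per-member topic mixtures using
# cosine similarity; threshold and vectors are illustrative.
import numpy as np

def topic_connections(member_topics, threshold=0.7):
    """member_topics: dict name -> topic-mixture vector (1D numpy array)."""
    names = list(member_topics)
    rows = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, vb = member_topics[a], member_topics[b]
            sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if sim >= threshold:
                rows.append((a, b, round(sim, 3)))   # one row per discovered connection
    return rows
```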
Once the initial system is completed, we will broaden its capabilities to handle other DHS Centers of Excellence as well, and eventually other appropriate government organizations at the request of DHS. We will investigate the possibility of helping users discover correspondences between DHS centers and government agencies. This will support the alignment of DHS researchers with likely counterparts in government agencies, potentially leading to new collaboration opportunities and interesting policy options.
Title: Shape-free detection of hazardous materials and its application to counter-terrorism
The Radon transform is a cornerstone of modern image processing. Using the Radon transform, we can detect the location and shape of hidden objects. This algorithm is built into almost all medical imaging programs. It takes advantage of the difference in contrast between the objects and their surroundings. However, it does not provide any information about the physical parameters (such as density, dielectric constant, etc.) of the objects. For diagnostic purposes in medical imaging it is very useful and powerful, but for counter-terrorism purposes it needs to be modified to be applicable, because hazardous materials can be hidden in any shape.
We revisited the Radon transform and found that it can detect the physical parameters of materials if the mathematical model is modified correspondingly. We assume that some hazardous materials can be distinguished by a few physically measurable parameters. Using a known reference material and the regular Radon transform, we are able to design a new algorithm that detects the physical parameters of hidden objects.
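As an illustrative reading of this idea (not the modified algorithm itself), the sketch below reconstructs an image from its projections with the ordinary Radon and inverse Radon transforms (scikit-image's radon and iradon) and calibrates the reconstructed values against a known reference material, so that a shape-independent physical parameter of the hidden object can be estimated.

```python
# Hedged sketch: estimate a material's attenuation coefficient (a shape-independent
# physical parameter) from projection data by ordinary filtered back-projection plus
# calibration against a known reference material placed in the scene.
import numpy as np
from skimage.transform import radon, iradon

def estimate_attenuation(image, object_mask, reference_mask, reference_mu):
    theta = np.linspace(0.0, 180.0, 180, endpoint=False)
    sinogram = radon(image, theta=theta)                   # projections (the measurements)
    recon = iradon(sinogram, theta=theta)                  # ordinary filtered back-projection
    scale = reference_mu / recon[reference_mask].mean()    # calibrate with the reference material
    return scale * recon[object_mask].mean()               # estimated parameter of the hidden object

# Toy phantom: a reference disc with known mu = 1.0 and a hidden blob with mu = 2.5
yy, xx = np.mgrid[:128, :128]
ref_mask = (yy - 40) ** 2 + (xx - 40) ** 2 < 10 ** 2
obj_mask = (yy - 90) ** 2 + (xx - 85) ** 2 < 14 ** 2
image = np.zeros((128, 128))
image[ref_mask], image[obj_mask] = 1.0, 2.5
print(estimate_attenuation(image, obj_mask, ref_mask, reference_mu=1.0))  # roughly 2.5
```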
A new device based on the above-mentioned algorithm could be developed and used to detect hazardous materials carried by terrorists, no matter where and in what shape the materials have been hidden. In the future we will explore further applications of this algorithm to homeland security.