Title: Integrating and Summarizing Web Pages, Structured Information, and Maps for Strategic Analysis on Multiple Media
Geospatial imagery and maps show geometric relations among entities, but they cannot present other kinds of knowledge, such as the temporal, topical, and other conceptual relations and entities that are typically contained in text (both in structured form, such as tables and lists, and in unstructured free form) and that are needed to support varied search and decision tasks. We describe the integration of maps and geospatial images with potentially large amounts of textual data, and ways to organize this material for optimal display. Case studies show how users interact with text in an integration context, and initial eye-tracking results are presented to illustrate and verify the observations from the case studies.
Title: Detection of Nuclear Materials in Shipments to the U.S.
Detonation of a nuclear weapon on U.S. soil is among the most feared types of terrorist attack, and standardized shipping containers are vulnerable vehicles for delivering nuclear and radiological weapons or materials. As a countermeasure, the Department of Homeland Security (DHS) is moving toward inspecting all containers entering the U.S. Inspection starts with a real-time screening based on the cargo's manifest data and radiation scanning data, followed, when warranted, by a stringent and expensive physical inspection. There are two competing priorities: 1) to detect any illicit nuclear materials and 2) to move cargo as quickly as possible through the ports of entry to reduce waiting costs. We aim to improve the efficiency of the inspection process. We employ modern data and text mining techniques to process the massive and often unclear manifest data. Penalized regression models are then used to select important predictors and assign a risk score to each container. Furthermore, we combine the information from manifest data and radiation portal data via the confidence distribution method under a new meta-analysis framework. By utilizing the available information intelligently and collectively, we can increase the likelihood of detecting suspicious cargo and minimize false alarms.
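As a hedged illustration of the penalized-regression step (a minimal sketch, not the study's actual pipeline: the manifest snippets, labels, and penalty setting below are invented), an L1-penalized logistic regression selects predictive manifest terms and emits a per-container risk score:

```python
# Toy sketch: L1-penalized logistic regression as a manifest risk scorer.
# All data below are invented placeholders; real manifests and secondary
# inspection outcomes are assumed in the actual study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

manifests = [
    "machine parts steel bolts",
    "plastic toys assorted dolls",
    "mixed scrap metal unspecified",
    "machine parts steel fasteners",
    "dense scrap metal shielding unspecified",
    "plastic toys games",
]
labels = [0, 0, 1, 0, 1, 0]  # 1 = flagged at secondary inspection (toy labels)

vec = CountVectorizer(binary=True)
X = vec.fit_transform(manifests)

# The L1 penalty drives uninformative coefficients to zero, which is the
# predictor-selection step; predicted probabilities serve as risk scores.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

risk_scores = clf.predict_proba(X)[:, 1]
print(dict(zip(vec.get_feature_names_out(), clf.coef_[0].round(2))))
```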
Title: Design and Deployment of a Mobile Sensor Network for the Surveillance of Nuclear Materials in Metropolitan Areas
Nuclear attacks are among the most devastating terrorist attacks, with severe loss of human life as well as damage to infrastructure. It is increasingly vital to have sophisticated nuclear surveillance and detection systems deployed in major U.S. cities to deter such threats. In this paper, we outline a robust mobile sensor network and develop statistical algorithms and models to provide consistent and pervasive surveillance of nuclear materials in major cities. Specifically, the network consists of a large number of vehicles, such as taxicabs and police cars, on which nuclear sensors and Global Positioning System (GPS) tracking devices are installed. Real-time readings of the sensors are processed at a central surveillance center, where mathematical and statistical analyses are performed. We use simulations to evaluate the effectiveness and detection power of such a network.
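As a hedged illustration of how such a simulation study might be set up (every parameter here is invented: a 10 km square city, vehicles at uniform random positions, Poisson sensor counts with an inverse-square source term plus background), the sketch below estimates the network's empirical detection power:

```python
# Toy simulation of a vehicle-mounted sensor network's detection power.
import numpy as np

rng = np.random.default_rng(0)
n_vehicles, n_runs = 200, 1000
background, strength = 5.0, 50.0     # mean counts per interval (assumed)
# Per-sensor alarm threshold calibrated to a very low background false-alarm rate.
threshold = np.quantile(rng.poisson(background, 100_000), 0.9999)

detections = 0
for _ in range(n_runs):
    pos = rng.uniform(0, 10, size=(n_vehicles, 2))   # vehicle positions (km)
    source = rng.uniform(0, 10, size=2)              # hidden source location
    d2 = ((pos - source) ** 2).sum(axis=1)
    counts = rng.poisson(background + strength / (1.0 + d2))
    detections += (counts > threshold).any()         # any sensor alarms

print("empirical detection power:", detections / n_runs)
```

A real study would also calibrate the network-wide false-alarm rate (200 sensors per interval inflate it well beyond the per-sensor rate) and fuse readings at the surveillance center rather than thresholding sensors independently.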
Title: Recognizing, Tracking, and Deduplicating Events in News
With the enormous amount of information available online in international, national, and local news sources, it is essential to be able to automatically detect and track events that could have implications for societal stability and, consequently, for homeland security.
These tasks are especially important in support of homeland security because analysis of chains of events can uncover and predict possible consequences of events early enough to act on them. An example of an event chain: a march by an ethnic minority to protest food rationing due to climatic changes; the killing of demonstrators by an opposing ethnic group; a riot stimulated by the killings; excessive brutality in suppressing the riot; the emergence of organized ethnic violence; a mass migration to more fertile regions; ethnic conflict involving the migrants; and so on. In this research, we are developing natural language processing and information extraction tools to allow the automatic recognition of events, the tracking of chains of events, and the relating and de-duplicating of events across different news sources. Two of the key challenges of event detection are dealing with a large number of event types, often in a hierarchical taxonomy, and recognizing multiple events in a single text fragment. Our event recognition model, in its preliminary stage, follows the framework of semantic role labeling to classify events and identify their participants, which, in turn, supports event tracking and de-duplication.
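As a toy, hedged stand-in for the event-type classification step (not the authors' semantic-role-labeling model; the snippets, labels, and taxonomy below are invented), a multiclass text classifier over trigger contexts looks like this:

```python
# Toy event-type classifier over short trigger contexts (invented data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

snippets = [
    "protesters marched against food rationing",
    "police killed two demonstrators during the march",
    "a riot broke out after the killings",
    "thousands migrated to more fertile regions",
    "residents marched to demand lower prices",
    "clashes erupted as migrants arrived",
]
event_types = ["PROTEST", "KILLING", "RIOT", "MIGRATION", "PROTEST", "CONFLICT"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(snippets, event_types)

# A fragment mentioning several triggers would be classified once per
# trigger context, which is how multiple events in one fragment arise.
print(model.predict(["a violent riot followed the protest march"]))
```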
Title: Finding and Describing Objects in Broad Domains
We propose an approach to find and describe objects within broad domains. We introduce a new dataset that provides annotation for sharing models of appearance and correlation across categories. We use it to learn part and category detectors. These serve as the visual basis for an integrated model of objects. We describe objects by the spatial arrangement of their attributes and the interactions between them. Using this model, our system can find animals and vehicles that it has not seen and infer attributes, such as function and pose. Our experiments demonstrate that we can more reliably locate and describe both familiar and unfamiliar objects, compared to a baseline that relies purely on basic category detectors.
Title: Higher Order Learning
Real-world machine learning problems have to deal with large amounts of data, and training a classifier on such large data sets is prohibitively expensive. This is normally addressed by learning from a sample of the data; however, accurate estimation of model parameters from small samples is challenging. Traditional classification algorithms assume that the instances of the training data are independent and identically distributed (i.i.d.). However, recent research on the use of higher-order associations in text classification exploits interdependencies between instances to obtain higher levels of accuracy than traditional classification approaches. Unlike approaches that assume data instances are independent, the novel Bayesian classification framework named Higher Order Naive Bayes (HONB) leverages co-occurrence relations between feature values across different instances. We developed a novel data-driven space transformation that allows any classifier operating in vector spaces to take advantage of these higher-order co-occurrence relations. Results obtained on several benchmark text corpora demonstrate that higher-order approaches achieve significant improvements in classification accuracy over the baseline (first-order) methods.
HONB outperforms Naive Bayes (NB) and SVM in text classification, especially on small sample sizes, where traditional classification algorithms fail. We also evaluated four graph-sampling algorithms that exploit second-order associations, comparing them to random sampling on very small sample sizes. On a number of benchmark data sets, we empirically demonstrate that second-order path counts in document relation graphs can be leveraged to reduce the training sample size without significantly affecting classification performance. These new graph-sampling techniques have many applications in learning when higher-order techniques are employed to sample from large data sets.
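A hedged sketch of the general idea behind a higher-order space transformation, in our own notation rather than the exact HONB construction: terms A and B are second-order associated if both co-occur with some term C, even in different documents, and documents can be re-represented through these chained associations.

```python
# Illustrative second-order co-occurrence transformation (toy data).
import numpy as np

X = np.array([  # binary document-term matrix (5 docs x 6 terms)
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])

C = X.T @ X              # first-order term co-occurrence counts
np.fill_diagonal(C, 0)   # drop trivial self co-occurrence
S = C @ C                # second-order: length-2 paths between terms
X_ho = X @ S             # documents in the higher-order feature space

# X_ho can be fed to any vector-space classifier (e.g., SVM), which is
# the sense in which such a transformation is classifier-agnostic.
print(X_ho)
```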
Title: Dynamic Networks, Social Complexity, and Disease Risks
We explore the effects of familial bonds on the evolution of social complexity using individual-based network models. Additionally, we investigate how certain individual behaviors play a role in this evolution, as well as how they may hinder the spread of an epidemic disease in a society.
Title: Advancing Visual Analytics Evaluation through Competitions
The VAST Challenge is a competition aimed at providing visual analytics problems made from synthetic datasets with embedded ground truth. Submissions to the Challenge consist of reports of the situation including answers to specific questions (who, what, when, where) and a description of the process used to arrive at that assessment. These submissions are evaluated using both quantitative accuracy ratings and subjective ratings from visualization experts and professional analysts. The success of this competition and the growing need for evaluating complex interactive systems warrants the development of cyberinfrastructure services that facilitate the following activities:
1. Developing a collection of datasets with ground truth and analytic problem descriptions
2. Supporting the online management and judging of analytic challenges using both quantitative and qualitative measures
3. Supporting self-assessment by researchers based on both qualitative and quantitative measures
4. Developing measures of complexity for the combination of datasets and analytic tasks, to measure progress in tools for supporting analysis
The final goal of the competition is to serve as a test-bed for providing generalized approaches, metrics, and technologies for the evaluation of other complex, highly interactive systems used in analytical reasoning.
Title: Decision Making Using High Dimensional Observational Data
It is often too dangerous or expensive to actively gather new data points, so decisions must be made using existing observational data. Traditional stochastic optimization methods assume that it is possible to obtain noisy samples from the loss function infinitely often. Additionally, the observations we do have may involve covariates significantly different from those we currently observe. We address these issues with a two-step approach. First, we present a flexible new method for high-dimensional regression using Dirichlet process mixtures of generalized linear models. Second, we build on this method to solve scalar resource allocation problems using observational data. Possible applications include threat monitoring and decision-based risk mitigation.
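A rough sketch in the spirit of the first step, not the authors' implementation: for a one-dimensional Gaussian response, a Dirichlet process Gaussian mixture fit to the joint (x, y) yields a flexible conditional-mean estimate. The data, truncation level, and settings below are invented for illustration.

```python
# DP-mixture-style regression via a joint Gaussian mixture (illustrative).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 400)
y = np.where(x < 0, -2 + 0.5 * x, 1 + 2 * x) + rng.normal(0, 0.3, 400)

gmm = BayesianGaussianMixture(
    n_components=10,  # truncation level of the DP
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full", max_iter=500, random_state=0,
).fit(np.column_stack([x, y]))

def predict(x0):
    """E[y | x0]: mixture of per-component conditional Gaussian means."""
    means, covs, w = gmm.means_, gmm.covariances_, gmm.weights_
    dens = np.array([w[k] * np.exp(-0.5 * (x0 - means[k, 0]) ** 2 / covs[k, 0, 0])
                     / np.sqrt(covs[k, 0, 0]) for k in range(len(w))])
    cond = means[:, 1] + covs[:, 0, 1] / covs[:, 0, 0] * (x0 - means[:, 0])
    return float((dens / dens.sum()) @ cond)

print(predict(-1.0), predict(2.0))
```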
Title: Simulating the Diffusion of Warnings
In the event of a natural or technological disaster or other hazardous event, warning systems are important for alerting the at-risk population to potential dangers and providing precautionary information to promote safety. It is essential to make use of the social communication network in communities to spread the warnings, so that information can reach a larger audience and people at risk will act on the information they receive. This project involves formulating an axiomatic framework for modeling the diffusion of warnings in dynamic social networks through the concept of trust. The network is dynamic in that individuals may leave the network and disrupt the flow of information as warnings are being diffused. We assess the framework by modeling the 2007 San Diego Firestorms, in particular the diffusion of the Reverse911 evacuation warnings sent during the event. We configure the parameters and map the process using multiple data sources relevant to the event. We use the model to examine how social group structure, the distribution of trust, and the existence of weak ties affect the spread of evacuation warnings. The results show the value of dynamic social network analysis and simulation in studying diffusive processes and community response to warnings.
Faculty advisors: William A. Wallace, Malik Magdon-Ismail, and Mark Goldberg, Rensselaer Polytechnic Institute
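A toy, hedged sketch of such a diffusion simulation (the graph model, trust distribution, dropout rate, and seed are all invented; the study's axiomatic framework is richer than this):

```python
# Trust-weighted warning diffusion on a dynamic small-world network.
import random
import networkx as nx

random.seed(0)
G = nx.watts_strogatz_graph(500, k=6, p=0.1)   # clustered ties plus weak ties
trust = {e: random.uniform(0.1, 0.9) for e in G.edges}

warned = {0}                                   # node(s) receiving the warning
for step in range(20):
    # Dynamic network: a few nodes drop out each step, disrupting flow.
    G.remove_nodes_from(random.sample(sorted(G.nodes), 2))
    newly = set()
    for u in warned & set(G.nodes):
        for v in G.neighbors(u):
            w = trust.get((u, v), trust.get((v, u), 0.0))
            if v not in warned and random.random() < w:
                newly.add(v)                   # warning passed along a trusted tie
    warned |= newly

print("fraction warned:", len(warned) / 500)
```

Varying the trust distribution, the rewiring probability p (weak ties), and the community structure in such a sketch mirrors the comparisons described above.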
Title: Preposition Sense Disambiguation Using Linguistically Motivated Features
Prepositions, though often ignored, can have a number of different senses, just like nouns or verbs. Disambiguating these senses can help to identify the relation between a preposition's arguments and, in turn, to disambiguate those, too. This is useful for information extraction applications and machine translation. We present a supervised classification approach for preposition sense disambiguation. Instead of using a fixed window size, we derive features from the syntactically related phrases surrounding the preposition. Evaluating on the SemEval 2007 preposition sense disambiguation datasets, we report an accuracy that outperforms the best participating system in the SemEval task.
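A minimal sketch of syntactically motivated feature extraction, using spaCy as a stand-in dependency parser (the paper's parser and feature set may differ; the sense labels that these features would be paired with come from the SemEval data):

```python
# Features from syntactically related phrases, not a fixed word window.
import spacy  # assumes the en_core_web_sm model has been downloaded

nlp = spacy.load("en_core_web_sm")

def prep_features(sentence, prep_text):
    doc = nlp(sentence)
    for tok in doc:
        if tok.text == prep_text and tok.dep_ == "prep":
            obj = next((c for c in tok.children if c.dep_ == "pobj"), None)
            return {
                "governor": tok.head.lemma_,   # head governing the preposition
                "governor_pos": tok.head.pos_,
                "complement": obj.lemma_ if obj else None,
            }

print(prep_features("She arrived in the morning", "in"))
# Dicts like these can be vectorized (e.g., with sklearn's DictVectorizer)
# and fed to any supervised classifier.
```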
Title: A Practical Differentially Private Random Decision Tree Classifier
In this paper, we study the problem of constructing private classifiers using decision trees within the framework of differential privacy. We first construct privacy-preserving ID3 decision trees using differentially private sum queries. Our experiments show that for many data sets a reasonable privacy guarantee can only be obtained via this method at a steep cost in prediction accuracy. We then present a differentially private decision tree ensemble algorithm using the random decision tree approach. We demonstrate experimentally that our approach yields good prediction accuracy even when the size of the dataset is small. We also present a differentially private algorithm for the situation in which new data is periodically appended to an existing database. Our experiments show that our differentially private random decision tree classifier handles data updates in a way that maintains the same level of privacy guarantee.
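The differentially private sum query underlying the first construction can be illustrated with the standard Laplace mechanism; the sketch below is generic (leaf labels and budget are invented), not the paper's exact algorithm:

```python
# Laplace mechanism for differentially private counts, e.g., the class
# counts stored at the leaves of a random decision tree.
import numpy as np

rng = np.random.default_rng(0)

def private_count(true_count, epsilon, sensitivity=1.0):
    """Adding/removing one record changes a count by at most 1."""
    return true_count + rng.laplace(scale=sensitivity / epsilon)

leaf_counts = {("leaf_3", "positive"): 42, ("leaf_3", "negative"): 7}
epsilon = 1.0  # privacy budget spent on this set of queries
noisy = {k: private_count(v, epsilon) for k, v in leaf_counts.items()}
print(noisy)
```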
Title: Confidence-Based Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech
We investigate the impact of automatic speech recognition errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature-weighting calculation that provides significant robustness under various recognition error conditions. For our experiments, we take conversations from the Fisher corpus and produce 1-best and lattice outputs using a recognizer tuned to run at various speeds. We use SVM classifiers to perform topic identification on the output. We observe that classifiers incorporating confidence information are significantly more robust to errors than those treating the output as unweighted text.
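A hedged sketch of what a confidence-weighted TF-IDF can look like (the paper's exact weighting may differ; the conversations below are invented): raw term counts are replaced by sums of per-word recognizer confidences, so likely misrecognitions contribute less.

```python
# Confidence-weighted TF-IDF over ASR output (toy data).
import math
from collections import defaultdict

# Each conversation: list of (hypothesized word, ASR confidence) pairs.
convs = [
    [("stock", 0.9), ("market", 0.8), ("crash", 0.4)],
    [("stock", 0.7), ("soup", 0.3), ("recipe", 0.9)],
    [("market", 0.95), ("prices", 0.85)],
]

df = defaultdict(int)
tf = []
for conv in convs:
    counts = defaultdict(float)
    for word, conf in conv:
        counts[word] += conf          # confidence-weighted term frequency
    tf.append(counts)
    for word in counts:
        df[word] += 1

N = len(convs)
tfidf = [{w: c * math.log(N / df[w]) for w, c in counts.items()}
         for counts in tf]
print(tfidf[0])  # feature vector for the first conversation (for an SVM)
```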
Title: Schrödinger's Cat and Epidemiological Modeling: Human Behavior and the Estimation of Etiological Parameters from Reported Outbreaks
To analyze past outbreaks and predict the course of future ones, epidemiological models are validated by the agreement of predicted spread patterns with observed incidence data. Mathematical models of infectious disease dynamics focus primarily on two basic parameters governing pathogen spread in the population: the rate of transmission ($\beta$) and the rate of recovery ($\gamma$). Unless explicit etiological analysis is available (e.g., through exhaustive contact tracing, which is only very rarely possible, especially for emerging diseases of immediate concern), these parameter values are inferred from the model by finding the values that provide the best fit to the observed data. This is how we estimate the critical threshold $R_0$ (the reproductive number), from which we estimate the relative potential impact of outbreaks of each disease. However, these models operate on the fundamental and implicit assumption that disease detection occurs uniformly throughout the course of an outbreak. We present a demonstration, discussion, and mathematical analysis of how this implicit assumption of constant sensitivity in reported disease incidence can drastically impact the accuracy of the estimated transmission parameters.
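A small numerical illustration of the point (parameters invented, not fitted to any real outbreak): if detection sensitivity ramps up during an outbreak, the reported curve grows faster than the true one, inflating the early growth rate and hence the inferred $\beta$.

```python
# Deterministic SIR with a time-varying case-detection probability.
import numpy as np

beta, gamma, N = 0.30, 0.10, 1_000_000   # true R0 = beta/gamma = 3
S, I = N - 1.0, 1.0
true_inc, reported_inc = [], []
for t in range(120):
    new_inf = beta * S * I / N
    S, I = S - new_inf, I + new_inf - gamma * I
    detect = min(1.0, 0.1 + 0.015 * t)   # sensitivity rises with awareness
    true_inc.append(new_inf)
    reported_inc.append(detect * new_inf)

def growth_rate(series, t0=10, t1=40):
    """Slope of log-incidence over the early exponential phase."""
    y = np.log(np.array(series[t0:t1]))
    return np.polyfit(np.arange(t0, t1), y, 1)[0]

# Early on, r ~ beta - gamma, so a naive estimate is beta_hat = r + gamma.
print("beta_hat from true incidence:    ", growth_rate(true_inc) + gamma)
print("beta_hat from reported incidence:", growth_rate(reported_inc) + gamma)
```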
Title: Emerging Epidemics in Virtual Worlds
Virtual worlds present unique opportunities to study human reactions to, and estimations of, risk in an environment that can be thoroughly quantified and used to validate both theoretical and observational models. Here we describe previous work examining the emergence of, and response to, an infectious disease outbreak in the popular online game World of Warcraft.
Title: Topic Models for Integrating and Analyzing Opinions in Blog Articles
In homeland security applications, there is often a need to gather and integrate scattered opinions about an entity such as a person, an organization, or a policy. Thanks to Web 2.0 technology, which has enabled more and more people to freely express their opinions, the Web has become an extremely valuable source for mining people's opinions. In this work, we study how to automatically integrate the opinions expressed in a well-written article with the many opinions scattered across various sources such as blogspaces and forums. We formally define this new integration problem and propose to use semi-supervised topic models to solve it in a principled way. Experiments on integrating opinions about two quite different topics (an event and a political figure) show that the proposed method is effective for both topics and can generate useful integrated opinion summaries. The proposed method is quite general: it can be used to integrate a well-written review with opinions in an arbitrary text collection about any topic, potentially supporting many interesting applications in multiple domains.
Title: A Report on RNA Secondary Structure Prediction of the HIV-1 Molecule Using a Lattice Walk Approach: Are There National Security Implications?
This work focuses on predicting a more stable RNA secondary structure of the HIV-1 molecule. The lattice walk approach is applied to the SL1 and SL2 domains of the HIV-1 RNA sequence, both hairpin structures that are important for genomic packaging of viral RNA. We investigate the existing bijection between RNA secondary structures and lattice walks by constructing various lattice walks and analyzing their associated RNA secondary structures, yielding a mathematical model that predicts more stable HIV-1 RNA sequences. Are there national security implications as a result of HIV sequence prediction? The pandemic scale of HIV has only two parallels in history: the 1918 flu pandemic and the fourteenth-century Black Death. At least 39 million people now infected with the virus are expected to die in the next 5-10 years. This depletion of elite workers and professionals constitutes a threat to homeland security, as affected regions will be at greater risk of civil disturbance, conflict, and disorder. The disparity in access to retroviral drugs increases the widening life-expectancy gap between poor countries and Western countries. As a result, there is increasing concern that nations highly affected by HIV might engage in bioterrorist acts against the United States. The lack of an effective and affordable vaccine against the virus makes this threat even more conceivable. Therefore, HIV research efforts are of high importance.
Title: An Optimal Learning Approach to Finding an Outbreak of a Disease
We describe an optimal learning policy for sequentially deciding which locations in a city to test for an outbreak of a particular disease. We use Gaussian process regression to model the level of the disease throughout the city, and then use the correlated knowledge gradient, which implicitly balances exploration and exploitation, to choose where to test next. The correlated knowledge gradient policy is a general framework that can be used to find the maximum of an expensive function with noisy observations.
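A compact, hedged sketch of the two ingredients (the kernel, noise level, grid, and prior observations below are assumptions, and the paper's exact formulation may differ): Gaussian process regression over candidate sites plus a Monte Carlo correlated knowledge gradient to pick the next test location.

```python
# GP posterior over candidate sites + Monte Carlo correlated knowledge gradient.
import numpy as np

rng = np.random.default_rng(0)
locs = np.linspace(0, 10, 50)[:, None]        # candidate test sites

def rbf(A, B, ls=1.5):
    return np.exp(-0.5 * ((A - B.T) / ls) ** 2)

noise = 0.1                                   # observation noise variance
X_obs = np.array([[2.0], [7.5]])              # sites already tested
y_obs = np.array([0.3, 1.2])                  # noisy disease-level readings

K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
Ks = rbf(locs, X_obs)
mu = Ks @ np.linalg.solve(K, y_obs)           # posterior mean on the grid
Sigma = rbf(locs, locs) - Ks @ np.linalg.solve(K, Ks.T)

def kg(i, n_mc=2000):
    # sigma_tilde: how the whole posterior mean moves if site i is tested.
    s = Sigma[:, i] / np.sqrt(Sigma[i, i] + noise)
    z = rng.standard_normal(n_mc)
    # Expected gain in the maximum of the posterior mean (the KG value).
    return np.mean(np.max(mu[:, None] + s[:, None] * z[None, :], axis=0)) - mu.max()

next_site = locs[int(np.argmax([kg(i) for i in range(len(locs))]))]
print("next location to test:", next_site)
```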
Title: Lexical Entailment for Privacy Protection in Medical Records
HIPAA has mandated that people's medical and other health information is private and should be protected. The Privacy Rule, a federal law, stipulates that the information doctors, nurses, and other health care providers put in medical records, as well as the care or treatment received by patients, should remain private. However, most "anonymized" samples of medical records show that the treatment the patient receives, in terms of medications and their dosages, is not hidden or anonymized. In this poster, we present work on using lexical substitution as a method for preserving the privacy of medical records. Our approach uses lexical substitutions in which one expression can be substituted for another while preserving or entailing the original meaning. We applied lexical substitutions to strings denoting dosages of medication. A grammar was developed and run on a sample of 2000 medical records. To assess the effectiveness of our approach in protecting privacy while maintaining clustering results, we ran the K-means clustering method on the original sample and then on the modified sample and compared the results. The preliminary results show that the approach is very promising.
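A minimal sketch of the dosage-substitution idea (the regex here is a stand-in for the grammar developed in the study, and the record is invented): a dose string is replaced by a coarser expression that the original entails, hiding the exact value.

```python
# Regex "grammar" that generalizes medication dosages in free text.
import math
import re

DOSE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|units?)\b", re.IGNORECASE)

def generalize(match):
    value, unit = float(match.group(1)), match.group(2).lower()
    lo = 10 ** math.floor(math.log10(value)) if value > 0 else 0
    return f"{lo}-{10 * lo} {unit}"  # order-of-magnitude range, not exact dose

record = "Continue metformin 500 mg twice daily and insulin 12 units at night."
print(DOSE.sub(generalize, record))
# -> Continue metformin 100-1000 mg twice daily and insulin 10-100 units at night.
```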
Title: iTopicModel: Information Network-Integrated Topic Modeling
Document networks, i.e., networks associated with text information, are becoming increasingly popular due to the ubiquity of Web documents, blogs, and various kinds of online data. In this paper, we propose a novel topic modeling framework for document networks, which builds a unified generative topic model able to consider both the text and the structure information of documents. A graphical model is proposed to describe the generative process. On the top layer of this graphical model, we define a novel multivariate Markov random field over the per-document topic distribution variables to model the dependency relationships among documents over the network structure. On the bottom layer, we follow the traditional topic model to model the generation of text for each document. A joint distribution function for both the text and the structure of the documents is thus provided. We give a solution for estimating this topic model by maximizing the log-likelihood of the joint probability. Some important practical issues in real applications are also discussed, including how to choose the number of topics and how to choose a good network structure. We apply the model to two real datasets, DBLP and Cora, and the experiments show that this model is more effective than state-of-the-art topic modeling algorithms.
This work has been accepted by the 2009 International Conference on Data Mining (ICDM'09).
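As a hedged illustration of the kind of objective such a two-layer model implies (the notation is ours, not taken from the paper): writing $\theta_i$ for the topic distribution of document $i$, $\phi_k$ for the word distribution of topic $k$, and $E$ for the network's edge set, a joint log-likelihood of the form

$$\log p(\mathcal{W}, \Theta \mid G) \;=\; -\lambda \sum_{(i,j)\in E} d(\theta_i, \theta_j) \;+\; \sum_i \sum_{w \in d_i} \log \sum_k \theta_{ik}\,\phi_{kw} \;-\; \log Z(\lambda)$$

couples an MRF layer (the first term, penalizing topic disagreement $d(\theta_i,\theta_j)$ between linked documents, with $Z$ the MRF normalizer) with the standard topic-model text likelihood (the second term); estimation then maximizes this joint log-likelihood over $\Theta$ and $\Phi$.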
Title: DAPA-V10: Discovery and Analysis of Patterns and Anomalies in Volatile Time-Evolving Networks
We address the problems of finding patterns and detecting anomalous behavior in volatile time-evolving networks (such as communication networks). Our approach, the DAPA-V10 algorithm, first identifies persistent patterns over time, which provide a basis for expected normal behavior in the network. For very volatile networks, the task of identifying persistent patterns in the data is highly non-trivial and important in its own right. The patterns are then used to detect anomalous behavior on both a local and global scale. Finally, we evaluate the effectiveness of our approach with experiments on the Enron email dataset.
Title: Predicting Spatio-Temporal Risks from Insecticide Resistance and Vector-borne Disease Threats
Chemical insecticide controls are routinely used in the US and elsewhere in the world to control the risk of mosquito populations transmitting diseases such as West Nile Virus, dengue, malaria and chikungunya, yet mosquitoes are known to have evolved resistance to the limited number of available insecticides. My research focuses on spatial risk assessment modeling to predict which regions and insecticide-based control strategies pose threats to the health of human and wildlife populations.