Abstracts
Title: Malaria Incidence and Anopheles Mosquito Density in Irrigated and Adjacent Non-Irrigated Villages of Niono in Mali
In this paper, we extend the mathematical model framework of Dembele et al. and use it to study malaria transmission dynamics and control in irrigated and non-irrigated villages of Niono in Mali. As case studies, we use our "fitted" model to show that, in support of the survey studies of Dolo et al. and others, the mosquito density in the irrigated villages of Niono is much higher than that of the adjacent non-irrigated villages. Many entomological studies have observed higher densities of mosquitoes in irrigated villages than in adjacent areas without irrigation, and our "fitted" model supports these observations. Moreover, there are more malaria cases in the non-irrigated villages than in the adjacent irrigated villages. In addition, we use the extended "fitted" model to determine the drug administration protocols that lead to the fewest first episodes of malaria in both the irrigated and the adjacent non-irrigated villages of Niono during the wet season.
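For illustration, the sketch below shows how mosquito density enters this kind of transmission model. It uses a classical Ross-Macdonald-type system rather than the extended Dembele et al. model of the paper, and all parameter values are hypothetical.

# Illustrative only: a classical Ross-Macdonald-type system, not the extended
# Dembele et al. model used in the paper. The mosquito-to-human ratio m is the
# quantity that differs between irrigated and non-irrigated villages.
import numpy as np
from scipy.integrate import solve_ivp

def ross_macdonald(t, y, m, a=0.3, b=0.5, c=0.5, r=0.01, g=0.1):
    """y = (x, z): infected fractions of humans and mosquitoes."""
    x, z = y
    dx = m * a * b * z * (1.0 - x) - r * x   # humans: infection minus recovery
    dz = a * c * x * (1.0 - z) - g * z       # mosquitoes: infection minus death
    return [dx, dz]

for label, m in [("non-irrigated", 2.0), ("irrigated", 20.0)]:
    sol = solve_ivp(ross_macdonald, (0, 365), [0.01, 0.01], args=(m,),
                    t_eval=np.linspace(0, 365, 366))
    print(f"{label:14s} m={m:5.1f}  human prevalence after 1 year: {sol.y[0, -1]:.3f}")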
Title: The Lessons of Large Disasters for Humanitarian Logistics: Research Needs
Extreme events pose serious logistical challenges to emergency and aid organizations active in preparation, response and recovery operations, as the disturbances they bring about turn normal conditions into chaos. This is particularly true in the case of catastrophic events. Under these conditions, delivering the critical supplies urgently required becomes an extremely difficult task because of the severe damage to the physical and virtual infrastructures and the very limited, or non-existent, transportation capacity. In this context, the recovery process is made more difficult by the prevailing lack of knowledge about the nature and challenges of post-disaster humanitarian logistics. This presentation is based on the quick response fieldwork conducted by the author and his colleagues on the largest disasters of recent times, including the 2011 Tohoku disasters in Japan, the 2010 Port-au-Prince earthquake, and Hurricane Katrina in 2005. The presentation discusses the important lessons that ought to be learned from these disasters, and the role of industrial engineers and civil society.
Title: Dynamic Modeling for Arctic Resource Allocation
Resource allocation in the Arctic is a persistent and complex challenge that is at the center of many Coast Guard missions, including navigational safety, oil spill response, search and rescue, and traffic management. The Alaskan Arctic, comprising the Chukchi and Beaufort seas, is an immense, seasonally variable waterway with very little infrastructure along its 2,191 nautical-mile shoreline. Renewed interest in resource exploration, specifically in the Chukchi, has led to increased traffic and the potential for year-round off-shore drilling. The Arctic is an environmentally sensitive area with little commercial, maritime or safety infrastructure and great distances to cover in the case of a maritime casualty, personnel casualty or oil spill incident. Given the absence of shore-based infrastructure, hyper-inflated costs due to climate extremes, remoteness of operations, and the vast distances involved, long-range planning for oil spill response is required. A large-scale dynamic network expansion problem with stochastic scenario considerations is proposed as a means to assess oil spill response resource allocation policies in the High North. The model specifically focuses on addressing the task lists required for all potential spills in order to optimize an objective of weighted task completion times based on allocated resource positions. Stochastic programming solution methods are employed, and observations and solution performance results are discussed.
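As a rough illustration of the weighted task-completion-time objective described above, the following sketch evaluates one hypothetical allocation of response resources against a small set of stochastic spill scenarios; all positions, speeds, weights, and probabilities are invented for the example and are not part of the model itself.

# A toy evaluation of one allocation policy against stochastic spill scenarios,
# illustrating the weighted task-completion-time objective described above.
# All locations, speeds, and weights are hypothetical.
import math

def completion_time(resource_pos, task_pos, transit_speed_kts=12.0, service_hrs=6.0):
    """Hours to reach the task site plus on-scene service time."""
    dist_nm = math.dist(resource_pos, task_pos)
    return dist_nm / transit_speed_kts + service_hrs

def expected_weighted_completion(allocation, scenarios):
    """Average over scenarios of the weighted sum of task completion times."""
    total = 0.0
    for prob, tasks in scenarios:
        total += prob * sum(w * min(completion_time(r, pos) for r in allocation)
                            for pos, w in tasks)
    return total

# Hypothetical staging points (x, y in nautical miles) and two spill scenarios,
# each a (probability, [(task_position, weight), ...]) pair.
allocation = [(0.0, 0.0), (400.0, 150.0)]
scenarios = [(0.7, [((120.0, 40.0), 3.0), ((150.0, 60.0), 1.0)]),
             (0.3, [((500.0, 200.0), 5.0)])]
print("expected weighted completion time (hrs):",
      round(expected_weighted_completion(allocation, scenarios), 1))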
Title: SemRel
SemRel is a generative model for text and can be used to perform automatic categorization of text through topic modeling. It builds upon LDA, which introduced the idea of modeling words as draws from latent categories. By inferring the categories that compose any given document, we can infer its content and perform automatic categorization within large corpora of documents. LDA has since been extended by PAM, which adds a hierarchical category system, and by Type-LDA, which replaced the feature space of words with one consisting of semantic relations. SemRel is novel in its combination of both the hierarchical and relational aspects of these models, and in its ability to guarantee an arbitrary level of differential privacy. The differential privacy limits the amount of information about the analyzed documents that can be inferred from SemRel's output, making it safer, in a formal sense, to condense sensitive documents into a summary of their content in terms of categories.
One part of my research is focused on testing SemRel's performance relative to the older model Type-LDA on the task of automatically classifying Wikipedia articles. This has been successful; on tests of model fit on a dataset consisting of 3785 Wikipedia articles, SemRel consistently outperformed Type-LDA with high confidence, confirming earlier results.
Two ongoing subjects of research involve establishing a usable relationship between SemRel's output and the human categorization of the Wikipedia articles, and testing the differential privacy aspect of SemRel. By matching SemRel's statistical classification of the articles to the existing system of classification created by human Wikipedia editors, it would be possible to predictively classify future articles according to meaningful human categories, without requiring a human to read and classify them. Finally, the differential privacy component of SemRel needs to be implemented and then tested to ensure that it preserves good performance.
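SemRel's privacy mechanism is not spelled out here, so the sketch below only illustrates the general principle using the standard Laplace mechanism: per-category document counts are perturbed with noise calibrated to the sensitivity of the counts before release.

# Generic differential-privacy sketch, not SemRel's actual mechanism: release
# per-category document counts with Laplace noise of scale 1/epsilon, which
# gives epsilon-DP when each document contributes to exactly one category.
import numpy as np

def dp_category_counts(true_counts, epsilon, rng=np.random.default_rng(0)):
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(true_counts))
    return np.maximum(np.array(true_counts, dtype=float) + noise, 0.0)

true_counts = [120, 45, 9, 310]          # documents per inferred category
print(dp_category_counts(true_counts, epsilon=0.5))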
Title: The Role of Session History in Visual Analytics
Weave is a web-based analysis and visualization environment that was specifically designed to simplify the process of exploring and visualizing data. It is open source and currently in use in several multi-partnered alliances in metro regions including Atlanta, Boston, Chicago, Columbus, Grand Rapids, Lowell, San Antonio, San Diego, and Seattle; several states such as Arizona, Connecticut, Rhode Island, South Florida, and Utah; and government agencies such as the Mass Dept of Early Education and Care, the US Dept of Labor, and the CDC. Each has a variety of custom needs, and as a result Weave has been developed to be a powerful but flexible toolset. Weave was designed with a session history-based architecture that records every action taken during data visualization and analysis. This step-by-step process, or "session history", can be edited, saved, shared and evaluated by others. We have built a plug-in capability in Weave which allows other external web tools, whether or not written by us, to harness Weave's session history and to have two-way communication with Weave. We'll show applications to graphs, text visualization and analyses, and storytelling.
Title: Applying Ranking and Selection Procedures to Network Mitigation
This research looks at how existing ranking and selection procedures can be applied to the problem of network mitigation. In network mitigation, we examine how an infrastructure system can be strengthened before an expected disruption occurs. Infrastructure systems are modeled with a directed graph consisting of nodes and arcs, following a typical network flow structure. In the event of a disruption, arcs in the infrastructure network are damaged and unable to carry sufficient flow to meet demand. In this research, we focus on how strengthening activities or mitigation can be performed in order to improve the expected restoration. Specifically, the question we seek to answer is how to find the best mitigation plan with statistical certainty.
Ranking and selection (R&S) procedures are methods designed to determine, with a given level of statistical significance, which system is the best, where best refers to the system with the highest level of some performance measure. R&S procedures are well suited to problems where the number of systems is large and simulation to determine the performance measure is expensive. In our research, the competing systems are arc mitigation plans. The performance measure we use is the improvement in the weighted met demand in the network over the restoration horizon, which is measured as the objective function of an integrated network design and scheduling (INDS) problem. In fact, we must solve two INDS problems to find this improvement: one where damage occurs and mitigation has been performed, and one where damage occurs and mitigation has not been performed. Since solving the INDS problems requires finding the solution to an integer programming problem, we have a high computational cost for finding the performance measure of an arc mitigation plan. Even for moderately sized networks, considering all possible single-arc plans, or all possible two-arc plans, etc., can result in a very large number of systems as well. We discuss how R&S procedures, combined with information gathered about the network, can be applied to determine the best mitigation plan, the difficulties we encountered, and the areas we see for future work.
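The sketch below illustrates only the expensive inner loop described above: estimating a plan's performance measure as the difference of two restoration objectives under common damage realizations, followed by a simple sample-mean comparison. The INDS solve is replaced by a placeholder, and the actual R&S procedure used in this work is not reproduced.

# Sketch of the performance-measure estimation loop for competing mitigation
# plans; restoration_objective() is a stand-in for the INDS integer program.
import random
import statistics

def restoration_objective(damage, mitigated_arcs):
    """Placeholder for the INDS objective: counts damaged, unmitigated arcs,
    standing in for unmet weighted demand over the restoration horizon."""
    return sum(1.0 for arc in damage if arc not in mitigated_arcs)

def sample_improvement(plan, rng, n_arcs=20, damage_prob=0.3):
    # One disruption realization, shared by both runs (common random numbers).
    damage = {a for a in range(n_arcs) if rng.random() < damage_prob}
    return restoration_objective(damage, set()) - restoration_objective(damage, plan)

def estimate(plan, n_reps=200, seed=1):
    rng = random.Random(seed)
    obs = [sample_improvement(plan, rng) for _ in range(n_reps)]
    half_width = 1.96 * statistics.stdev(obs) / n_reps ** 0.5
    return statistics.fmean(obs), half_width

for plan in [{0, 1}, {5, 6}, {0, 9}]:
    mean, hw = estimate(plan)
    print(f"plan {sorted(plan)}: estimated improvement {mean:.2f} +/- {hw:.2f}")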
Title: Collective Tweet Wikification based on Semi-supervised Graph Regularization
Wikification for tweets aims to automatically identify each concept mention in a tweet and link it to a concept referent in a knowledge base (e.g., Wikipedia). Wikification is a particularly useful task for short messages such as tweets because it allows a reader to easily grasp the related topics and enriched information from the KB. From a system-to-system perspective, wikification has demonstrated its usefulness in a variety of applications, including coreference resolution, classification, and user interest discovery. Due to the shortness of a tweet, a collective inference model incorporating global evidence from multiple mentions and concepts is more appropriate than a non-collective approach that links one mention at a time. In addition, it is challenging to generate sufficient high-quality labeled data for supervised models at low cost, because of the unlinkability and ambiguity of mentions, as well as the determination of prominent mentions. To tackle these challenges, we propose a novel semi-supervised graph regularization model to incorporate both local and global evidence from multiple tweets through three fine-grained relations: local compatibility, coreference, and semantic relatedness. In order to identify semantically related mentions for collective inference, we detect meta path-based semantic relations through social networks. Compared to the state-of-the-art supervised model trained from 100% labeled data, our proposed approach achieves comparable performance with 31% labeled data and obtains 5% absolute F1 gain with 50% labeled data.
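The exact objective of the proposed model is not reproduced here, but generic semi-supervised graph regularization of the kind it builds on has the familiar closed form f = (I + lambda*L)^{-1} y, sketched below on a toy mention graph; the affinity matrix stands in for the local-compatibility, coreference, and relatedness relations.

# Generic semi-supervised graph regularization sketch (not the paper's exact
# objective): fit scores f that stay close to the few labels y while being
# smooth over the mention/concept graph, via f = (I + lam * L)^{-1} y.
import numpy as np

def graph_regularize(W, y, lam=1.0):
    """W: symmetric affinity matrix; y: labels (0 for unlabeled nodes)."""
    D = np.diag(W.sum(axis=1))
    L = D - W                     # unnormalized graph Laplacian
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) + lam * L, y)

# Tiny example: nodes 0 and 4 are labeled (+1 / -1); edges encode the
# fine-grained relations between mentions.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
print(np.round(graph_regularize(W, y), 3))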
Title: Both Wide and Deep: Volumetric and Non-Linear Estimation of the FEMA payouts for flooding events
"[Both] Wide and deep my grave will be; With the wild goose grasses growing over me" Tarrytown (http://www.reveries.com/folkden/tarrytown.html)
FEMA must plan and budget for flood recovery and remediation, a source of considerable fiscal uncertainty. While one cannot predict the weather, immediately after a flood event substantial information is available about the height of waterways, and, in particular, the height above flood level. A typical economic regression analysis seeks to associate the payout, for various flood events, with the peak height of the flood, or the peak above flood level. Such analyses, as is customary in economic studies, present numbers with seven or eight digits. The reality is, of course, far less certain.
The work presented here looks for a more plausible kind of physical link between the flood event, and the economic harm that it causes. Each riverbed has a cross sectional profile. The volume of water required to rise one foot above flood level depends on that cross section. If the walls of the river are very steep, little additional water is needed. If they are gently sloped the same flood height represents a much larger volume of water. Similarly, a very brief flooding event puts a much smaller volume of water into the affected buildings and other structures, and might be expected to cause correspondingly less harm.
We present a numerical procedure for estimating the total volume of excess water associated with a flood event. We also present a piecewise linear, threshold-based model relating the volumes of flood water to the FEMA payouts associated with a number of events in the Raritan Basin. The model was tested for four communities; it achieves excellent predictive behavior against historical data and can be used to relate the hydrological model results directly to FEMA payout records. Many possible explanatory variables were considered, and the most effective was found to be the aggregated quantity of water above flood level during the time directly associated with the flooding event that caused the claims and payouts. The results thus indicate that volumetric estimation of costs is generally more accurate than existing methods of analysis.
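A minimal sketch of that explanatory variable is given below: the stage record is clipped at flood level and integrated over the event window with the trapezoid rule. The synthetic hydrograph and flood stage are invented for illustration, and converting the result to a true volume would require the cross-sectional profile discussed above.

# Sketch of the explanatory variable described above: aggregate the water above
# flood stage over the duration of the event. Stage series and flood stage are
# hypothetical placeholders.
import numpy as np

def excess_stage_hours(hours, stage_ft, flood_stage_ft):
    """Time integral of stage above flood level (ft*hours); multiplying by an
    estimated cross-sectional spreading area converts this to a volume."""
    excess = np.clip(np.asarray(stage_ft, dtype=float) - flood_stage_ft, 0.0, None)
    return np.trapz(excess, hours)

hours = np.arange(0, 48, 1.0)                              # 48-hour event
stage = 10.0 + 6.0 * np.exp(-((hours - 20.0) / 8.0) ** 2)  # synthetic hydrograph
print(f"excess stage-hours: {excess_stage_hours(hours, stage, flood_stage_ft=12.0):.1f}")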
Title: Trust in Medical Websites
The use of the Web as a reference to locate and validate medical information has been growing. A recent report shows that more than 77% of internet users use general-purpose search engines, such as Google or Bing, to look up specific diseases, treatments or procedures, and that 67% of them believe that the online health information is reliable and trustworthy. However, the internet has also become a worrisome source for the propagation of fake online pharmacies, sham hospitals and medical schools. We present a novel method for re-ranking webpages based on the website names in order to increase not only their precision but also their trustworthiness. Our re-ranking approach aims at capturing and returning only those websites that are consistently retrieved across search engines, and it takes advantage of the fact that the life span of fake websites is relatively short compared to legitimate ones. Preliminary testing has shown that the re-ranking yields websites that are more relevant to the user query than those of the general-purpose search engines.
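The scoring used in the method is not specified here; as one simple illustration of "consistently retrieved across search engines," the sketch below keeps only sites returned by every engine and orders them by reciprocal-rank fusion. Engine names and result lists are hypothetical.

# Illustrative re-ranking sketch (not the authors' exact formula): keep only
# sites returned by every engine and score them by reciprocal-rank fusion.
def rerank(results_by_engine):
    """results_by_engine: dict engine -> ordered list of site names."""
    common = set.intersection(*(set(r) for r in results_by_engine.values()))
    scores = {site: sum(1.0 / (r.index(site) + 1)
                        for r in results_by_engine.values())
              for site in common}
    return sorted(scores, key=scores.get, reverse=True)

results = {
    "engineA": ["mayoclinic.org", "example-pharmacy.biz", "nih.gov"],
    "engineB": ["nih.gov", "mayoclinic.org", "webmd.com"],
}
print(rerank(results))   # sites retrieved by both engines, best-ranked first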
Title: Knowledge Resources for Cybersecurity
Cybersecurity is a fragmented research field. Most research focuses on a specific issue in a subfield of cybersecurity. This complicates forming a consistent body of knowledge about cybersecurity. Moreover, the lack of a systematic view of what constitutes cybersecurity hinders education and inhibits the development of effective curricula for teaching this important subject. This project aims at building an overview of cybersecurity and using data analysis techniques to extend and complement the information available.
The knowledge resources are a taxonomy and a website discussion portal. The taxonomy presents cybersecurity in a hierarchy of concepts. Each concept has a short description and possibly some links to further information. The web portal gives users the ability to comment on the security of websites by using a browser plugin. The web portal provides links to relevant entries in the taxonomy, connecting concrete discussions with the corresponding educational material.
Furthermore, the project aims at developing information extraction methods to provide more comprehensive information in the taxonomy. In particular, the goal is to extract summaries about a concept from relevant articles, as well as concrete reports of incidents pertaining to the concept. Such methods help provide a broader coverage of each concept in the taxonomy. I would present preliminary results and ideas on what information to extract and how to approach this problem.
In conclusion, I would like to present this taxonomy and web portal in a short talk and also give a demo. The project aims at providing information resources about Cybersecurity. The current research topic of the project is information extraction of web resources in order to enhance the scope and content of the taxonomy.
Title: Modeling Microtext with Character n-grams and Higher Order Learning
Microtext classification is an exciting yet challenging research area due to the lack of information inherent in microtext. Microtext modeling offers unprecedented data analysis opportunities due to the popularity and abundance of microtext available through social media sources. Microtext is generated through text messages as well as certain forms of social media (e.g., Twitter), etc.
Prior work has studied microtext from diverse perspectives, including modeling microtext generated during natural disasters. Previous work utilized Higher Order Naive Bayes, a novel variant of Naive Bayes, to model microtext messages using word stems. Although this previous work demonstrated statistically significant improvements over existing approaches, there was still room for improvement. As a result, this effort focuses on modeling microtext gathered during the 2010 Haitian earthquake using Higher Order Naive Bayes. Moreover, this research demonstrates how to further improve the utility of Higher Order Learning methods by leveraging a different feature space, namely, character n-grams.
An automated preprocessing framework was developed to extract character n-grams from the original messages. Higher Order Naive Bayes was then utilized with character n-grams as attributes to classify microtext messages. This work has uncovered an interesting correlation between the performance of Higher Order Naive Bayes and how closely the distribution of the features used for learning follows a Zipfian distribution. Using attributes whose distribution is approximately Zipfian tends to increase the performance of Higher Order Naive Bayes relative to that of existing Bayesian approaches. These results may help explain why Higher Order Naive Bayes performs better in certain scenarios, and they are an important step towards understanding the theoretical underpinnings of Higher Order Learning.
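The sketch below illustrates two ingredients of this analysis: character n-gram extraction and a rough check of how Zipfian the resulting feature distribution is, via the slope of a log-log rank-frequency fit. The example messages are invented, and the classifier itself is not reproduced.

# Sketch: extract character n-grams and check how closely their rank-frequency
# curve follows a Zipfian (power-law) distribution via a log-log slope fit.
from collections import Counter
import numpy as np

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def zipf_slope(tokens):
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope   # close to -1 suggests an approximately Zipfian distribution

messages = ["water needed near the port", "medical help needed downtown",
            "road blocked near the bridge"]
tokens = [g for m in messages for g in char_ngrams(m)]
print(f"estimated Zipf slope: {zipf_slope(tokens):.2f}")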
Title: Events in Social Media
With the rapid development of social media and social networks, they have become an important channel of information dissemination, and recently there has been increasing interest in event detection in social media. Benefiting from its real-time nature, social media can be used as a sensor to gather up-to-date information about the state of the world. For example, during Hurricane Sandy people used Twitter to share evacuation information about their own neighborhoods. Such posts become extremely valuable if we can automatically detect and understand the underlying events efficiently. However, identifying and understanding events in social media is a challenging problem, mainly because of the speed and volume of the data; in Twitter, for example, users send over 400 million tweets per day. The second challenge comes from the informal nature of social media. For example, each tweet has a length limitation of 140 characters, so its context is usually short, incomplete and noisy, and a single tweet usually cannot provide a complete picture of the corresponding events. Furthermore, social media consists of social networks and information networks, so understanding an event requires multi-dimensional information. We intend to tackle three types of new and unique challenges for events in social media: (1) expand short contexts by linking social media to news articles and clustering social media by nearest neighbor search; (2) handle the overwhelming amount of data through topic detection and first story detection; (3) propose a new event representation that introduces a new event attribute through Event Prominence Computing, new link edges between entities through Community/Leader Role Detection, and new link edges between an entity and an event mention through Sentiment Analysis.
Title: Joint Modeling for Information Extraction
The goal of information extraction (IE) is to extract information structures of entity mentions and their interactions, such as relations and events, from unstructured documents. The task is often artificially broken down into several subtasks, and different types of facts are extracted in isolation. Errors in upstream components are propagated to the downstream classifiers, often resulting in compounding errors. However, the various entity mentions and their interactions in the information structure are inter-dependent, and the structures should comply with various soft and hard constraints. In this study, we aim to improve single-document IE with a novel joint framework that extracts multiple components together in a single model. Consider the following sentences with the ambiguous word "fired": "A cameraman died when an American tank fired on the hotel." "He has fired his air defense chief." Knowing that "tank" is very likely to be an Instrument argument of Attack events, the correct event type of "fired" in the first sentence is obviously Attack. Likewise, "air defense chief" is a job title, so an event argument classifier is likely to label it as an Entity argument for an End-Position trigger, indicating that the second "fired" is a trigger of an End-Position event. Following these intuitions, we consider IE to be a structured prediction problem over multiple components. By jointly predicting the information structures, we aim to capture the interactions among multiple tasks and exploit global features of the graph structures. Structured perceptron with inexact search is a natural choice for this purpose. We have conducted experiments on two tasks: joint extraction of event triggers and arguments, and joint extraction of entity mentions and relations. Both demonstrated the efficacy of the proposed framework and achieved state-of-the-art performance.
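For readers unfamiliar with the learning algorithm named above, the sketch below shows the core structured-perceptron update on a toy version of the trigger-labeling example; in the real system, decoding is an inexact beam search over full joint structures rather than the trivial enumeration used here, and the features are far richer.

# Minimal structured-perceptron sketch in the spirit described above: decode
# the highest-scoring structure (a trivial enumeration standing in for beam
# search) and update weights on the feature difference when it is wrong.
from collections import defaultdict

def score(weights, features):
    return sum(weights[f] * v for f, v in features.items())

def train(examples, decode, featurize, epochs=5):
    """examples: list of (input, gold_structure) pairs."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for x, gold in examples:
            pred = decode(x, weights, featurize)        # inexact search in practice
            if pred != gold:
                for f, v in featurize(x, gold).items():
                    weights[f] += v                      # promote gold features
                for f, v in featurize(x, pred).items():
                    weights[f] -= v                      # demote predicted features
    return weights

# Toy task: jointly label a trigger word as Attack vs End-Position.
def featurize(x, y):
    return {f"word={x['trigger']}:label={y}": 1.0, f"arg={x['arg']}:label={y}": 1.0}

def decode(x, weights, featurize):
    return max(["Attack", "End-Position"], key=lambda y: score(weights, featurize(x, y)))

examples = [({"trigger": "fired", "arg": "tank"}, "Attack"),
            ({"trigger": "fired", "arg": "chief"}, "End-Position")]
w = train(examples, decode, featurize)
print(decode({"trigger": "fired", "arg": "tank"}, w, featurize))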
Title: ACCAM Simulation: US Coast Guard Air Station Simulator
We present a model and discrete event simulation of USCG Air Stations, accounting for the mission demands and maintenance procedures pertaining to USCG aircraft. The simulation provides aircraft availability distributions and mission performance metrics based on varying input scenarios, including changes in the number of stationed aircraft and maintenance targets. The Air Station model is novel in its relatively simple, easily tunable, renewal process treatment of maintenance procedures, mitigating the need for the modeling of complex maintenance subprocesses and the resulting statistical estimation of numerous parameters. The simulation also models mission requirements such as Search and Rescue that are stochastic in time and space. Simulations are consistent with historical data and offer insights into hypothetical scenarios.
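As an aside on the renewal-process treatment mentioned above, the toy alternating renewal simulation below shows how drawing up-times and maintenance durations from simple distributions yields an availability estimate without modeling maintenance subprocesses. The distributions and parameters are hypothetical and are not those of the ACCAM model.

# Toy alternating renewal process for one aircraft: exponential time to the
# next maintenance event and lognormal maintenance duration (both parameter
# choices hypothetical), yielding a long-run availability estimate.
import random

def simulate_availability(horizon_hrs=10_000.0, mean_up=250.0,
                          mu_log=2.0, sigma_log=0.5, seed=7):
    rng = random.Random(seed)
    t, up_time = 0.0, 0.0
    while t < horizon_hrs:
        up = rng.expovariate(1.0 / mean_up)          # hours until maintenance
        down = rng.lognormvariate(mu_log, sigma_log) # maintenance duration
        up_time += min(up, horizon_hrs - t)
        t += up + down
    return up_time / horizon_hrs

print(f"estimated availability: {simulate_availability():.3f}")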
Title: Collaborative Training Tools for Emergency Restoration of Critical Infrastructure Systems
Large-scale disasters can produce profound disruptions in the fabric of critical infrastructure systems such as water, telecommunications and electric power. The work of post-disaster infrastructure restoration typically requires close collaboration across these sectors. Yet the technological means to support collaborative training for these activities lag far behind training needs. This presentation motivates and describes the design and implementation of a multi-layered system for use in cross-organizational, scenario-based training for emergency infrastructure restoration. Ongoing evaluation studies are described in order to suggest directions for further work.
Title: Global Optimization Model for the USCG Aviation Air Stations
United States Coast Guard (USCG) aircraft fleet allocations to USCG aviation Air Stations have long been dictated by assumptions of operational response, aircraft availability, Search and Rescue response, and planned missions. With the increasing demand that aviation aircraft support forward-deployed surface forces, the USCG would like the means to identify the optimal assignment of aircraft at USCG Air Stations. This requires a previously lacking modeling capability to comprehensively analyze response and mission demands at the USCG Air Station level. This model could provide USCG decision makers with alternatives for aircraft allocations, optimized to operational as well as logistical capabilities and predicted mission needs. The goal of this project was to minimize aircraft fleet operational costs subject to performance targets of various types. The model can be thought of as two-stage. The first stage, prior work called the ACCAM Simulation, is a simulation of each USCG Air Station that generates performance metrics based on the specific station and what is termed a "scenario." Each scenario was determined by a large number of relevant Air Station attributes, including the USCG Air Station, the allowable aircraft at a given station, the station's operational level, the station's historical SAR mission response, the station's deployment requirements by time period, and the station's other mission requirements. The scenario was also determined by aircraft information such as the aircraft type's historical maintenance processes, including both scheduled and unscheduled maintenance. The second stage of the model, the ACCAM Global Optimization Model to be presented, is an optimization model over the set of station-specific scenarios to determine the optimal deployment assignments, operational levels, and aircraft allocation among all USCG Air Stations under the current infrastructure. The optimization model can also be used to demonstrate the potential efficiencies of proposed infrastructural changes, such as the introduction of a deployment center, a depot of aircraft that can be deployed to USCG Air Stations for specific missions. This model falls under the Coastal Operation Analytical Suite of Tools (COAST) Aviation Capability and Capacity Assignment Module (ACCAM). The model, termed the ACCAM Global Optimization Model (GOM), or ACCAM GOM, was a joint effort between researchers at CCICADA/Rutgers University and the USCG.
Title: Optimal Boat Allocations with Sharing
The United States Coast Guard (USCG) often allocates resources, such as boats, to stations for durations of a year or longer. In prior work, CCICADA developed a model to help the USCG study boat allocations that may potentially reduce costs while meeting various requirements. This naturally led to the question: how much better can we do if stations could "share" boats? For practical purposes, we consider the following notion of boat sharing. The calendar year would be split into several time periods (e.g., three time periods of four months each), and each boat would be assigned to a station in each time period. If a boat is assigned to two or more different stations during the year, the boat is said to be shared between those stations. If it is assigned to the same station in each of the time periods, it is not shared. In our current work, we developed a model that would allow the USCG to investigate boat allocations where stations may share boats. The model has two potential objectives: minimizing the number of boats needed and minimizing the total cost (given additional cost inputs). The model would find an optimal boat assignment plan (with sharing) that minimizes the number of boats (or cost) given station and mission requirements and additional boat sharing constraints.
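A stripped-down sketch of the boats-minimizing variant is shown below using the open-source PuLP modeling library; the station requirements and period structure are hypothetical, and the actual CCICADA model includes mission and sharing constraints that are omitted here.

# Minimal sketch of the sharing model's core idea with PuLP (pip install pulp):
# minimize the number of boats needed when each boat may serve a different
# station in each time period. Requirements below are hypothetical.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

stations, periods, boats = ["A", "B", "C"], [1, 2, 3], range(6)
need = {("A", 1): 2, ("A", 2): 1, ("A", 3): 1,
        ("B", 1): 1, ("B", 2): 2, ("B", 3): 1,
        ("C", 1): 1, ("C", 2): 1, ("C", 3): 2}

prob = LpProblem("boat_sharing", LpMinimize)
assign = {(b, s, t): LpVariable(f"x_{b}_{s}_{t}", cat=LpBinary)
          for b in boats for s in stations for t in periods}
used = {b: LpVariable(f"u_{b}", cat=LpBinary) for b in boats}

prob += lpSum(used.values())                                         # boats needed
for b in boats:
    for t in periods:
        prob += lpSum(assign[b, s, t] for s in stations) <= used[b]  # one station per period
for s in stations:
    for t in periods:
        prob += lpSum(assign[b, s, t] for b in boats) >= need[s, t]  # meet requirements

prob.solve()
print("boats needed:", int(value(prob.objective)))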
Title: Monitor Social Buzz on the Map
With the widespread use of social networks, a topic, once it starts to get attention from a sufficient number of people, will soon become a buzz in the online world. These social buzzes enable us to "listen" to what's going on in our surroundings. By pinpointing these buzzes to where they occur, we can better understand and visualize our homeland, by seeing in real time what's happening and what's coming up, what's fun and what's concerning, what's expected and what's unusual, thus contributing to our better living and security. Therefore, in this project, we aim to build a system, SocialBuzz, which collects such social buzzes from the Twitter feed and visualizes them on a map. The visualization represents the information extracted from tweets; e.g., we could use different icons to annotate the events happening in a place, and color the map by moods like popular and safe (SocialBuzz-KeepSafe) or by categories like fun, food, etc. (SocialBuzz-WhatsUp).
We will present our approach used in monitoring tweets and the entity recognition and resolution techniques used in resolving their locations. The demonstration will include social buzzes caught in the city of Champaign pinpointed on a map, which reflect the views of the people on security and lifestyle.
Title: User Oriented Cybersecurity
Every day millions of people from all corners of the world access the Internet. They come in contact with content in various forms such as documents, advertisements, webpages, email, etc. As the number of Internet users grows, so does the content. But the content that people access includes product advertisements, links to viruses, and sophisticated phishing attacks which sometimes fool people into providing their personal and financial information; sometimes it comes as a combination of these different types. Thus, even though most of the content that users access is harmless or merely irritating, a small portion of it may be extremely dangerous. Traditional isolation-based methods are sometimes ineffective because users might not know the risks involved; e.g., users sometimes get tricked into accessing content with a catchy title which may have been isolated. There is thus a need for an automated system that is able to categorize the different kinds of content accessed by ordinary users in order to make their Internet content access more secure. Such categorizations may be on the basis of the semantics of the content or of the risk level that the system associates with the content.
Spam email, for example, represents an important facet of content in the digital world. Approximately 78% of the more than 180 billion emails generated daily is spam content, and 50% of these actually reach the end users. We have worked on the categorization of different types of spam email based on factors such as the sender's intent and the popular types of malicious email content. We are now working on extending our work to other forms of content such as webpages and advertisements. We use natural language processing techniques combined with text classifiers such as support vector machines and logistic regression to build our models. We hope that our efforts will not only help create an automated system for content categorization but also help other cybersecurity stakeholders who are interested in identifying and studying specific forms of dangerous content.
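A minimal version of such a text-classification pipeline is sketched below with scikit-learn, using TF-IDF features and logistic regression; the example emails and category labels are invented and do not reflect the project's actual taxonomy or training data.

# Illustrative pipeline (hypothetical labels and toy data): TF-IDF features
# with a logistic-regression classifier for categorizing email content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = ["Verify your bank account now to avoid suspension",
          "Cheap meds shipped overnight, no prescription",
          "Meeting moved to 3pm, agenda attached",
          "Quarterly report draft for your review"]
labels = ["phishing", "pharmacy_spam", "legitimate", "legitimate"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(emails, labels)
print(clf.predict(["Your account will be locked unless you confirm your password"]))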
Title: Modeling the Impact of Patron Screening at an NFL Stadium
The Department of Homeland Security identifies stadium safety as a crucial component of risk mitigation in the US. Patron screening poses difficult trade-offs for security officials: rigorous screening prevents weapons from entering the structure, but it also creates lines that become security hazards and may be infeasible if patrons are to get into the stadium within a few minutes of the beginning of the game. In order to quantitatively inform venues about how different screening procedures will affect them, we developed a patron screening model together with security personnel at a National Football League (NFL) stadium. Our model specifically addresses the speed of screening using different procedures: walk-through magnetometers, wandings, and patdowns. We then created a real-time simulation of queue formation using the physical arrangement of gates at the stadium. This allowed for analysis based on different patron screening procedures and configurations. We validated our model and simulation using ticket scan data and security director experience. Our approach is generic enough for any stadium and has been used to explore inspection protocols at multiple venues, including an NBA arena. We also successfully demonstrated our work to NFL Security.
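As a much-simplified illustration of the trade-off described above, the toy single-lane queue below compares hypothetical service times for the three procedures under the same Poisson arrival stream; the real model uses the stadium's gate layout and validated arrival and service data.

# Toy queue sketch (all rates hypothetical): patrons arrive as a Poisson stream
# at one gate and are screened by a single lane; compare procedures by the
# longest wait observed over an hour.
import random

def simulate_gate(arrival_rate_per_min, service_secs, horizon_min=60.0, seed=3):
    rng = random.Random(seed)
    t, free_at, worst_wait = 0.0, 0.0, 0.0
    while t < horizon_min * 60.0:
        t += rng.expovariate(arrival_rate_per_min / 60.0)   # next arrival (secs)
        start = max(t, free_at)
        worst_wait = max(worst_wait, start - t)
        free_at = start + rng.expovariate(1.0 / service_secs)
    return worst_wait / 60.0   # minutes

for name, secs in [("walk-through magnetometer", 4.0), ("wanding", 15.0), ("patdown", 25.0)]:
    print(f"{name:26s} worst wait: {simulate_gate(8.0, secs):6.1f} min")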
Title: Crowd Simulation on 3D Surfaces at Bus Terminals
One of the miracles of nature is how groups of people and animals move with such elegance. Even though there is no central controlling intelligence, crowds naturally create efficient and precise movements. Understanding how these emergent phenomena work and how to simulate them has been the seminal question of the growing research area of crowd simulation. As a fascinating interdisciplinary science, crowd simulation involves entomology, zoology, human psychology and sociology, computer science, and mathematics. Currently, crowd simulation research creates believable reproductions of natural crowd movements. However, one of the open questions in the field is how to simulate crowds constrained to 3D surfaces. Crowds constrained to 3D surfaces lack the mathematical simplicity of 2D surfaces. They also lack the freedom of unconstrained 3D movement, such as that seen in flocking birds. Starting with an introduction to crowds and crowd simulation, this talk works up to these complex issues in 3D crowd simulation. I specifically address the work of CCICADA with the Port Authority Bus Terminal of New York and New Jersey. The PABT needs our technology to help predict traffic flow in the world's busiest bus terminal. This talk will end with my results in 3D crowd simulation and a look at exciting future opportunities in this area.
Title: An Approach to Modeling Bed Availability in Shelters for UAC
Several models based on queueing theory are presented in order to approach the problem of determining the appropriate size of a shelter contracted by the Office of Refugee Resettlement (ORR), with the aim of providing the level of service required to improve the conditions of unaccompanied alien children (UAC) who are apprehended by Border Patrol in their attempt to enter the United States.
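One of several possible queueing sketches consistent with this framing is shown below: beds are treated as servers in an M/M/c queue, lengths of stay as service times, and the Erlang-C formula gives the probability that an arriving child must wait for a bed. The arrival rate and mean stay are hypothetical, and the models presented in the talk may differ.

# One possible queueing sketch (parameters hypothetical): beds as servers in an
# M/M/c queue, with the Erlang-C probability that an arriving child must wait.
from math import factorial

def erlang_c_wait_prob(arrivals_per_day, mean_stay_days, beds):
    a = arrivals_per_day * mean_stay_days          # offered load (in beds)
    if a >= beds:
        return 1.0                                  # unstable: waiting is certain
    top = (a ** beds / factorial(beds)) * beds / (beds - a)
    bottom = sum(a ** k / factorial(k) for k in range(beds)) + top
    return top / bottom

for beds in (60, 70, 80):
    p = erlang_c_wait_prob(arrivals_per_day=2.0, mean_stay_days=30.0, beds=beds)
    print(f"{beds} beds -> P(wait) = {p:.3f}")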
Title: Analysis Interface of Prostitution Networks
The main purpose of this tool is to analyze a daily-updated dataset of online postings for escort services and find underlying patterns and existing relations between different postings and posters. To do so, we have developed a system that constantly analyzes the scraped data and is able to detect relations between postings and/or posters. The system tracks such relations and continuously builds and increments several graphs. Each graph represents an active prostitution network (also referred to as a cluster or group). One important characteristic of our clustering method is that we are very strict about the criteria for linking posts and posters. As a consequence, any detected link in the data will almost certainly represent a true connection in real life. The system also includes a cluster merge feature, meaning that every time we obtain a new information node that is linked to two different existing networks, the system will merge them and consider them as a new, wider network, while keeping a record of how the merge occurred. In short, the system is constantly aggregating and analyzing new data and incorporating the extracted information into graphs of prostitution networks. All cluster and merging information is processed offline and stored in a database that allows fast construction of any existing cluster. Such structured storage of clusters offers an opportunity for many possible analyses. So far, we have focused our analysis on the migration patterns of different clusters across the country. Each cluster is comprised of different posters and associated postings. Since both the location and the creation timestamp of each posting are extracted, we are able to map and track in real time how each cluster is moving around the country. The system is also integrated with the Google Maps API, allowing a clear visualization of the clusters and their migrations across the country. Moreover, the system also contains features such as name extraction, name-based age estimation (using data about the popularity of names for newborn girls in the USA in past years), database querying, and posting similarity.
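The cluster-merge behavior described above maps naturally onto a union-find (disjoint-set) structure; the sketch below shows that step with invented post identifiers and linking evidence, and is not the system's actual implementation.

# Sketch of the cluster-merge step with a union-find structure: when a new
# posting links two existing networks, their clusters are merged while the
# linking evidence is recorded.
class Clusters:
    def __init__(self):
        self.parent, self.merge_log = {}, []

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def link(self, a, b, evidence):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
            self.merge_log.append((ra, rb, evidence))       # how the merge occurred

c = Clusters()
c.link("post1", "post2", "same phone number")
c.link("post3", "post4", "same photo hash")
c.link("post2", "post3", "same email address")              # merges the two networks
print(c.find("post4") == c.find("post1"), c.merge_log)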
Title: Sustainable Human Environments
Rapidly growing urban environments present new and evolving challenges: Growing needs for energy and water, impacts on greenhouse gases, public health, safety, security. As rapid city expansion continues, mathematical scientists can play key roles in shaping sustainable living environments in collaboration with scientists from many fields. This talk will review four key themes of creating sustainable human environments: The role of data in "smart cities"; anthropogenic biomes (urban ecosystems); security; and urban planning for a changing environment. Specifically, the talk will explore some mathematical sciences challenges in "smart" traffic management; in the theoretical notion of a "compact" city; in approaches to safety and security at sports venues; and in urban planning for climate events. (The talk will also provide an introduction to the worldwide activity known as Mathematics of Planet Earth, in which this topic plays a central role.)
Title: The Probability of "Exploit": Predictive Analytics & Agile Security Management
Making zero-risk products, from a security perspective, could mean zero profits. Zero risk likely means zero functionality. Accepting the probability of some loss in pursuit of gain is the essence of doing business, but how much security risk is acceptable in that pursuit? This talk will demonstrate threat-based "Probabilistic Graphical Modeling" and explore how this innovation can be used to optimize security activity in relationship to rapid application development.
Title: NewsNetExplorer: Automatic Construction and Exploration of News Information Networks
News data is one of the most abundant and familiar data sources. News data can be systematically utilized and explored by database, data mining, NLP and information retrieval researchers to demonstrate to the general public the power of advanced information technology. In our view, news data contains rich, inter-related and multi-typed data objects, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived and explored with such an information network if we systematically develop effective and scalable data-intensive information network analysis technologies. By further developing a set of information extraction, information network construction, and information network mining methods, we extract types, topical hierarchies and other semantic structures from news data and construct a semi-structured news information network, NewsNet. The schema of the constructed network is shown in Figure 1. Further, we develop a set of news information network exploration and mining mechanisms that explore news in a multi-dimensional space, which include (i) OLAP-based operations on the hierarchical dimensional and topical structures and rich text, such as cell summary, single-dimension analysis, and promotion analysis, (ii) a set of network-based operations, such as similarity search and ranking-based clustering, and (iii) a set of hybrid, or network-OLAP, operations, such as entity ranking at different granularity levels. These form the basis of our proposed NewsNetExplorer system. Our demo not only provides insightful recommendations through the NewsNet exploration system but also helps us gain insight into how to perform effective information extraction, integration and mining in large unstructured datasets.
Title: Visualization of Flocking Algorithms for Planning Evacuations in Case of Major Disasters
Flocking Clustering Algorithms can be used in the exploration of big data sets pertaining to social interactions. The uncovering of underlying structures, the identification of unusual patterns and the detection of possible outliers can have major implications for planning evacuations in case of major disasters.
Visualization applications of the flocking algorithms involve simulations of large numbers of particles. The computation of interactions between these particles is based on nearest neighbor or collision-pair groupings, which increase rapidly as the number of particles increases.
The large number of interactions that must be computed is efficiently implemented in parallel on the GPU. A simulated disturbance is launched which splits the multi-particle set into animated clusters exhibiting the flocking behavior.
Interactive visualization techniques will be presented with sets of synthetic data, revealing several underlying structures and highlighting flocking behavior similar to that seen in human crowds.
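A minimal, CPU-only flocking step is sketched below to make the per-particle neighbor computation concrete; the weights and neighborhood radius are arbitrary, and the GPU implementation described above parallelizes exactly this kind of per-particle loop.

# Minimal 2D boids-style flocking step (parameters hypothetical), the kind of
# per-particle neighbor computation that the GPU implementation parallelizes.
import numpy as np

def flock_step(pos, vel, radius=1.5, dt=0.1,
               w_cohesion=0.01, w_alignment=0.05, w_separation=0.1):
    new_vel = vel.copy()
    for i in range(len(pos)):
        d = np.linalg.norm(pos - pos[i], axis=1)
        nbrs = (d < radius) & (d > 0)
        if nbrs.any():
            new_vel[i] += w_cohesion * (pos[nbrs].mean(axis=0) - pos[i])    # move toward neighbors
            new_vel[i] += w_alignment * (vel[nbrs].mean(axis=0) - vel[i])   # match their heading
            new_vel[i] += w_separation * (pos[i] - pos[nbrs].mean(axis=0)) / d[nbrs].min()  # avoid crowding
    return pos + dt * new_vel, new_vel

rng = np.random.default_rng(0)
pos, vel = rng.uniform(0, 10, (200, 2)), rng.normal(0, 1, (200, 2))
for _ in range(100):
    pos, vel = flock_step(pos, vel)
print("mean speed after 100 steps:", round(float(np.linalg.norm(vel, axis=1).mean()), 2))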
Title: Social Media during Extreme Events
Natural disasters have ravaged the world for centuries, with one of the deadliest earthquakes occurring as early as May 115 in the Roman Empire, when some 260,000 lives were lost. The difference between then and now is the existence of technology that allows us to predict natural disasters and engage in the warning response process to get people to safety. Centuries ago, people viewed natural disasters as acts of God and a form of punishment and lacked an understanding that damage and lost lives can be prevented. Now, dedicated officials set up processes for responding to disasters and executing recovery efforts. Officials obtain information on the projected or current occurrence of a natural disaster, which allows them to determine the affected population and infrastructure and issue appropriate warning information regarding protective action. Traditionally, this process entailed a one-to-many communication method, in which information was issued by emergency managers using sirens, radio, TV, and door-to-door notification. The emergence of new technologies has provided people with easy access to data beyond what was previously available. This data, however, must be processed into information that is understandable and useful for decision-making. A variety of new channels are now capable of delivering such information in various forms to the public in an instant. Information is no longer delivered in a one-to-many manner; it has been transformed into a many-to-many communications environment. Information leaves a source or multiple sources, and then it gets searched, organized, used, stored, and re-distributed through the digital environment. People are now able to create their own content to share with a wide population, as well as validate the information using multiple and/or trusted sources, and finally disseminate the information. This information cycle leaves a digital footprint that can be traced and evaluated.
The proposed research doesn't aspire to explain all aspects of the warning response process but simply seeks to explain how social media, i.e. Twitter, facilitates and/or impedes the warning response process. Researchers have made significant leaps in analyzing social media in the context of extreme events; however, the research hasn't provided evidence that social media, in fact, enables more people to evacuate. Moreover, current research takes the behaviors demonstrated on social media by users at face value and doesn't seek to connect the stated behaviors (i.e. behavioral intents) to their manifestations (i.e. behavior). This research develops a methodology to automatically extract the behavioral intents stated on Twitter and link them to their real-life manifestations in order to provide evidence that people do what they say they do on Twitter. Additionally, prior research has shown that people use Twitter to obtain, confirm, and share emergency-relevant information; however, researchers have not proven that information obtained on Twitter impacts a person's warning response decision to take preventative action. The research uses a social psychology theoretical basis to study the components that lead to the formation of intent, specifically intent to take the prescribed action. As emergency managers want to improve their utilization of social media, i.e. Twitter, they seek to facilitate better public response. Emergency managers want to avoid redundant actions and wasting of resources, and providing information on Twitter that doesn't facilitate action is wasteful. This research aims to allow emergency managers to reduce redundancy and to facilitate response through social media.
Title: US Coast Guard Air Station Simulator
The US Coast Guard needs an intuitive software package that provides them with capabilities to simulate random air station events. It will provide the capability to analyze response and mission demand, in an effort to allocate their resources (aircraft) more efficiently. By allocating their resources more efficiently, they can complete their missions, such as search and rescue and law enforcement, with a higher level of success. In attempting to simulate random occurrences of events, we made use of two statistical distributions, the Poisson distribution and the Exponential distribution, and we also did some research on queueing theory.
This approach was taken because these distributions describe the frequency of events when the average rate of occurrence is known, as well as the time between events in a Poisson process.
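The sketch below shows how the two distributions fit together in such a simulation: exponentially distributed gaps between events produce Poisson-distributed daily counts. The arrival rate used is hypothetical.

# Sketch of the two distributions mentioned above: exponential gaps between
# events produce Poisson-distributed daily counts (the rate is hypothetical).
import random

def daily_case_counts(rate_per_day, n_days, seed=42):
    rng = random.Random(seed)
    counts, t = [0] * n_days, rng.expovariate(rate_per_day)
    while t < n_days:
        counts[int(t)] += 1
        t += rng.expovariate(rate_per_day)     # exponential inter-arrival time
    return counts

counts = daily_case_counts(rate_per_day=1.5, n_days=365)
print("mean cases/day:", round(sum(counts) / len(counts), 2))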
Title: Discovering Opinion Networks in Social Media
The growth of social media data offers great opportunities for understanding the diversity of people's opinions and potential cultural conflicts, which are important in many DHS applications, particularly in detecting early signs of potential conflicts and understanding the impact of national policies. In this paper, we study how to discover opposing opinion networks automatically from forum discussions. Online forums have been growing dramatically in recent years with increasingly more participants. People discuss, and often also debate, all kinds of issues, and often form ad hoc opposing opinion networks, consisting of subsets of users who are strongly against each other on some topic. To discover such networks, we propose to use signals from both textual content (e.g., who says what) and social interactions (e.g., who talks to whom), which are both abundant in online forums. We also design an optimization formulation to combine all the signals in an unsupervised way. We created a data set by manually annotating forum data on five controversial topics, and our experimental results show that the proposed optimization method outperforms several baselines and existing approaches, demonstrating the power of combining both text analysis and social network analysis in analyzing and generating opposing opinion networks.
Title: High Performance Numerical Solutions of Heat and Mass Transfer Simulation in Capillary Porous Media Using Programmable General Purpose Graphics Processing Units
Heat and mass transfer simulation plays an important role in various engineering and industrial applications. To analyze the physical behavior of thermal and moisture movement phenomena, we can model it as a set of coupled partial differential equations. However, obtaining numerical solutions to the heat and mass transfer equations is a very time-consuming process, especially if the domain under consideration is discretized into a fine grid.
In this research work, therefore, one of the acceleration techniques developed in the graphics community, which exploits a general purpose graphics processing unit (GPGPU), is applied to the numerical solution of the heat and mass transfer equations. Implementing the simulation on a GPGPU makes GPGPU computing power available for the most time-consuming part of the simulation and calculation. The nVidia Compute Unified Device Architecture (CUDA) programming model provides a straightforward means of describing inherently parallel computations. This research work improves the computational performance of numerically solving the heat and mass transfer equations on a GPGPU. We implement the numerical solutions utilizing the highly parallel computation capability of a GPGPU with nVidia CUDA. We simulate heat and mass transfer with first boundary and initial conditions on a cylindrical geometry using the CUDA platform on an nVidia Quadro FX 4800 and compare its performance with an optimized CPU implementation on a high-end Intel Xeon CPU. It is expected that the GPGPU can perform the heat and mass transfer simulation accurately while significantly accelerating the numerical calculation. The GPGPU implementation is therefore a promising approach to accelerating heat and mass transfer simulation.
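For orientation, the NumPy sketch below performs one explicit finite-difference step for a single 2D diffusion equation on the CPU; the coupled heat-and-moisture system on a cylindrical geometry treated in this work is more involved, but the CUDA kernels parallelize the same kind of stencil update over the grid. The grid size, time step, and initial condition are illustrative only.

# CPU/NumPy reference for one explicit finite-difference diffusion step on a
# 2D grid; a CUDA kernel would evaluate the same stencil, one thread per cell.
import numpy as np

def diffuse_step(u, alpha=1.0, dx=0.01, dt=2e-5):
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u) / dx ** 2
    u_new = u + alpha * dt * lap
    u_new[0, :] = u_new[-1, :] = u_new[:, 0] = u_new[:, -1] = 0.0   # fixed boundary
    return u_new

u = np.zeros((128, 128))
u[54:74, 54:74] = 100.0                     # hot block as the initial condition
for _ in range(500):
    u = diffuse_step(u)
print("peak temperature after 500 steps:", round(float(u.max()), 2))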
Title: SWOT Analysis
SWOT Analysis is an analysis method that has been developed to help organizations evaluate opportunities and functions within their particular market space. SWOT stands for Strengths, Weaknesses, Opportunities, and Threats. Strengths and Weaknesses are viewed in terms of internal characteristics such as the existence or absence of qualities, capabilities and resources. Opportunities and Threats capture external environmental and other factors such as customer needs and shifting trends and conditions. This poster will provide an overview of the SWOT analysis, its definitions, form and potential utility as an organizational tool to help CCICADA in planning for and carrying out its activities into the future.
Title: Knowing Where to Knock: Cost Optimization for Experiments that Evaluate Algorithms for Processing Sensor Data
The consultant's bill arrives: $10,000. Outraged, the foreman demands an itemized account, which the consultant gladly provides: "Hitting machine with hammer: $1. Knowing where to hit the machine: $9,999." (Quoted at http://www.vividillumination.com/Lighting_Consulting/Lighting_Consulting.html)
The Domestic Nuclear Detection Office (DNDO) is a jointly staffed agency within the Department of Homeland Security. DNDO is the primary entity in the U.S. government for implementing domestic nuclear detection efforts for a managed and coordinated response to radiological and nuclear threats. Scanning of vehicles requires coordination of sensor systems (which may be multi-channel gamma radiation detectors, or imaging systems) and data processing algorithms. These algorithms must provide high screening performance, by detecting almost all threats, while holding down the number of false alarms. False alarms generate double costs: the costs of added inspection, and the costs imposed on national commerce by the delay of harmless shipments. Many algorithms have been developed, and they must be evaluated in a fair way. DNDO conducts experiments, using both simulated data and real loads of cargo containing hidden radiation sources. These experiments are time-consuming and costly. In each experiment data from the sensors is presented as input to all of the algorithms, and their outputs are compared to the ground truth.
Since the goal of the experiments is to evaluate and to rank the algorithms, the research presented here takes an end-to-end view of these experiments. This viewpoint identifies the primacy of finding experimental configurations that reveal differences between algorithms. The search for these configurations involves two distinct analyses. First, the methods of Combinatorial Experimental Design are used to cover all possible levels and pairs of levels of all the variables deemed to be important. This provides a table of configurations. This table must then be reviewed by subject matter experts, who are tasked to assign both a value and a cost to each of the configurations presented. The values are unlikely to be simply sums of "part costs" associated with the levels of the individual parameters. When the cost data are available, the second stage of the analysis can be formulated as a set selection problem. The overall goal is to optimize the combined value of the selected subset of configurations, with the budget as a constraint. We present a formulation as a Binary Programming problem, which is NP-complete. In addition, the value of a set of configurations may not be the sum of the values of the individual configurations. Heuristics are explored, and some simulated results are presented.
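A deliberately simplified, knapsack-style version of the set selection stage is sketched below in PuLP: maximize total configuration value under a budget constraint. It treats values as additive, which (as noted above) the real problem does not guarantee, and the configuration values and costs are hypothetical.

# Minimal knapsack-style version of the configuration-selection problem with
# PuLP: maximize total value under a budget. Values and costs are hypothetical.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

configs = {"cfg1": (8, 120), "cfg2": (5, 60), "cfg3": (9, 150),
           "cfg4": (3, 40), "cfg5": (6, 90)}          # name -> (value, cost $K)
budget = 250

prob = LpProblem("config_selection", LpMaximize)
pick = {c: LpVariable(f"pick_{c}", cat=LpBinary) for c in configs}
prob += lpSum(configs[c][0] * pick[c] for c in configs)            # total value
prob += lpSum(configs[c][1] * pick[c] for c in configs) <= budget  # budget constraint
prob.solve()
print("selected:", [c for c in configs if pick[c].value() == 1])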
Title: An Approximate Differential Privacy Mechanism for a Class of Graph Problems
We consider a differentially private algorithm for community discovery in a graph. A (non-private) problem of community discovery, known as MinDisagree, is formulated as a $0$-$1$ integer program. A natural LP relaxation with integer rounding always gives a feasible $0$-$1$ solution to this problem. The exponential mechanism is a general method to yield random solutions ensuring differential privacy for an optimization problem. We use a hit-and-run random walk in a convex polytope, where a random move is chosen according to an exponential distribution with respect to the objective function value and the polytope is the feasible set of the LP, to realize the exponential mechanism for private MinDisagree. The original exponential mechanism is known to yield poor solutions on average. We propose a variant of the exponential mechanism which ensures $(\varepsilon,\delta)$-approximate differential privacy, a relaxation of $\varepsilon$-differential privacy, and yields improved solutions on average, whose objective function values are always better than some threshold while admitting approximation level $\delta$. By modifying the convex polytope, the same hit-and-run random walk can be used to realize our variant of the mechanism for approximately private MinDisagree.
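For reference, the standard exponential mechanism that the walk above samples from, written for a minimization objective $f$ over the feasible polytope $P$ with sensitivity $\Delta f$, assigns density
\[
  \Pr[\,\text{output } x\,] \;\propto\; \exp\!\left(-\frac{\varepsilon\, f(x)}{2\,\Delta f}\right), \qquad x \in P,
\]
which is log-concave when $f$ is linear, so hit-and-run sampling applies; the $(\varepsilon,\delta)$ variant proposed here restricts the same walk to a modified polytope.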
Title: Entity-centric Document Filtering
The immense scale of the Web has turned it into a huge repository storing information about various types of entities (e.g., persons, locations, companies, etc.). Much of the information sought on the Web nowadays involves retrieving information related to a particular entity. For example:
Scenario 1: Knowledge Base Acceleration. Knowledge bases such as Wikipedia and Freebase grow very quickly on the Web; however, it is a heavy burden for editors to digest a huge amount of new information every day and keep knowledge bases up to date. To reduce Wikipedia editors' burden, NIST proposed the TREC Knowledge Base Acceleration (TREC-KBA) problem to automatically recommend relevant documents for a Wikipedia entity based on the content of its Wikipedia page.
Scenario 2: Business Intelligence. For a company, automatically collecting useful user experiences about its product entities is helpful for future quality improvement. Moreover, from a crisis management perspective, constantly monitoring user opinions on the Web also helps the company detect potential or emergent crises in a timely manner.
Scenario 3: Celebrity Tracking. Nowadays, the success of microblogs is largely due to their courtship of celebrity users, as a lot of people are interested in tracking the activities of their favorite celebrities (such as movie stars and securities analysts) on microblogs every day. For the same reason, it would be promising to design a system that can automatically track interesting event updates of celebrities from other Web data.
In order to facilitate such entity-centric retrieval operations, in this project we study the entity-centric document filtering task: given an entity represented by its identification page (e.g., a Wikipedia page or a product description page), how to correctly identify its relevant documents. In particular, we are interested in learning an entity-centric document filter based on a small number of training entities, such that the filter can predict document relevance for a large set of unseen entities at query time.
Towards characterizing the relevance of a document, the problem boils down to learning keyword importance for query entities. Since the same keyword will have very different importance for different entities, the challenge lies in how to appropriately transfer the keyword importance learned from training entities to query entities. Based on the insight that keywords sharing some similar "properties" should have similar importance for their respective entities, we propose a novel concept of meta-feature to map keywords from different entities. To realize the idea of meta-feature-based feature mapping, we develop and contrast two different models, LinearMapping and BoostMapping. Experiments on three different datasets confirm the effectiveness of our proposed models, which show significant improvement compared with four state-of-the-art baseline methods.
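The sketch below conveys the meta-feature idea in its simplest form: each (entity, keyword) pair is described by entity-independent meta-features, and a regressor trained on training entities predicts keyword importance for unseen query entities. The specific meta-features and the plain linear model here are illustrative and are not the paper's LinearMapping or BoostMapping designs.

# Illustrative version of the meta-feature idea (not the paper's exact models):
# describe each (entity, keyword) pair with entity-independent meta-features
# and regress keyword importance on them, so the mapping transfers to unseen
# query entities.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical meta-features: [appears in entity page?, TF-IDF in entity page,
# keyword is a named entity?]; targets are keyword-importance scores observed
# for training entities.
X_train = np.array([[1, 0.80, 1], [1, 0.40, 0], [0, 0.05, 0], [1, 0.65, 1]])
y_train = np.array([0.9, 0.5, 0.1, 0.8])

mapper = LinearRegression().fit(X_train, y_train)

# Keywords of a previously unseen query entity, described by the same meta-features.
X_query = np.array([[1, 0.70, 1], [0, 0.10, 0]])
print(np.round(mapper.predict(X_query), 2))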
Document last modified on April 30, 2014.