Special Event: Command, Control, and Interoperability Center for
Advanced Data Analysis (CCICADA)-wide Research Retreat
March 7-8, 2010
Morgan State University, Baltimore, MD
- Organizers:
- Ed Hovy, CCICADA/USC, hovy at isi.edu
- Jack Jarmon, CCICADA/DIMACS, jjarmon at dimacs.rutgers.edu
- Asamoah Nkwanta, Morgan State University, asamoah.nkwanta
at morgan.edu
- Bill Pottenger, CCICADA/Rutgers, billp at
dimacs.rutgers.edu
- Fred Roberts, CCICADA/DIMACS, froberts at
dimacs.rutgers.edu
- Dan Roth, CCICADA/University of Illinois at
Urbana-Champaign, danr at uiuc.edu
- Guoping Zhang, Morgan State University, guoping.zhang at
morgan.edu
Presented under the auspices of the Homeland Security Command,
Control, and Interoperability Center for Advanced Data Analysis
(CCICADA).
Abstracts:
Earl R. Barnes, Morgan State
Talk Title: A Graph Partitioning Problem for
Disease Control
We consider a graph in which nodes represent
individuals and edges
correspond to pairs of individuals that are in frequent contact with
each other. Assume that a certain number of individuals become
infected with a communicable disease. Our problem is to find the least
number of edges that must be cut to isolate the infected individuals
from a certain percentage of the population. This is a graph
partitioning problem with constraints. We obtain bounds on the number
of edges that must be cut to isolate the infected individuals from a
certain percentage of the population. The bounds depend on the
eigenvalues of the adjacency matrix of the graph.
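The eigenvalue bounds themselves are not reproduced here, but the quantity they bound is easy to state concretely: for a partition encoded as a +/-1 indicator vector x, the number of edges cut equals x^T L x / 4, where L = D - A is the graph Laplacian. A minimal sketch in Python (the example graph and names are illustrative, not from the talk):

```python
import numpy as np

def cut_size(adjacency: np.ndarray, side: np.ndarray) -> int:
    """Number of edges crossing a partition given as a +/-1 label vector.

    Uses the identity cut(x) = x^T L x / 4 with L = D - A.
    """
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return int(side @ laplacian @ side) // 4

# Path graph 0-1-2-3: isolating "infected" node 0 cuts exactly one edge.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(cut_size(A, np.array([1, -1, -1, -1])))  # 1
```

Minimizing this quadratic form over constrained +/-1 vectors is the hard combinatorial problem; relaxing x to real vectors is what brings the eigenvalues of the graph's matrices into play.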
Smriti Bhagat, Rutgers
Talk Title: Hone: Automatically Watching Across Information Networks
An important security task is to watch a set of candidate personas
online. This is challenging because (a) one often has only a small
number of identifiers (email addresses, telephone numbers, etc.) for a
persona and must discover the rest; (b) one often knows identifiers in
only a few of the information networks (say, email) while personas can
reside in many others (chatrooms, social networks); and, most
importantly, (c) personas are dynamic: the persons behind them can
quickly drop old personas or adopt new ones in any of the myriad
information networks. Some of these networks, e.g., blogs and Twitter,
leave a publicly crawlable trail of communication. Others, such as
email, Facebook, and VoIP, are more private. As a result, systems that
watch personas not only have to identify multiple identities across
networks, a known challenge in Intelligence Data Analysis, but must
also adapt quickly, an aspect that has received limited attention in
research and requires working beyond the limitations of information
available on the web.
We present our system Hone. It adopts the unique approach of
simultaneously monitoring real-time packet-level IP traffic as well as
performing targeted analysis of application-level data. It combines
real-time streaming analysis of packet-level data with traditional
crawl and information-retrieval analysis of the web. The result is a
system that starts with a rudimentary watch list and successively
hones in on the multiple personas of the watch-list entities as new
associations are discovered, tracking them as they change. This
presents an analyst with a complete view of a suspect's online
communications and enables tracking the suspect in an automatic and
dynamic way.
Joint work with S. Muthukrishnan and Narus Inc.
Congxing Cai, USC/ISI
Talk Title: Integration and summarization of multiple media
This talk describes our work on the integration and summarization of
multiple media for strategic analysis. We demonstrate how to link
and present different media for analysis. When the amount of
associated textual data is large, it is organized and summarized
before display. A hierarchical summarization framework, conditioned on
the small space available for display, has been fully implemented.
Eugene Fink, CMU
Talk Title: Machine learning methods for cybersecurity
We are working on machine learning tools for
improving
cyber defenses through automated discovery of unexpected patterns in
system behavior, network traffic, and other attack indicators. These
tools will support proactive analysis and early detection of attacks;
automated adaptation of defenses to the needs and usage patterns of
individual users; and sharing related knowledge and experience among
multiple users. They will complement the standard defenses by adding a
"layer of armor" that detects novel threats.
Cibin George, Rutgers
Poster Title: Higher order feature associations for
classification
Higher Order Naive Bayes (HONB, Ganiz et al., 2009)
exploits higher
order associations between features for classification purposes. We
first confirm the utility of HONB at low sample sizes using the
Wilcoxon signed rank test. We then present a new class of graph
sampling algorithms that exploit higher order associations. We
empirically demonstrate that second order path counts in document
relation graphs can be successfully leveraged to reduce the sample
size without significantly impacting classification performance.
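To make "second order path counts" concrete: link two terms when they co-occur in some document; second-order associations between terms i and j are then two-step paths i -> k -> j in that co-occurrence graph. A minimal illustrative sketch (not the HONB implementation):

```python
import numpy as np

def second_order_paths(doc_term: np.ndarray) -> np.ndarray:
    """Count length-2 paths between terms in the term co-occurrence graph.

    doc_term: binary documents-by-terms matrix.
    """
    cooc = (doc_term.T @ doc_term) > 0   # terms sharing at least one document
    np.fill_diagonal(cooc, False)        # ignore self-loops
    B = cooc.astype(int)
    return B @ B                         # (B @ B)[i, j] = # of paths i -> k -> j

# Terms 0 and 2 never co-occur, yet are linked by a second-order path via term 1.
X = np.array([[1, 1, 0],
              [0, 1, 1]])
```

Such higher-order links are exactly what first-order (direct co-occurrence) statistics miss at small sample sizes.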
Emilie Hogan, Rutgers
Poster Title: View Discovery in OLAP Databases Through Statistical
Combinatorial Optimization
In many projects so much data is being collected
that it quickly becomes unmanageable. When the data being collected is
multidimensional, an Online Analytical Processing (OLAP) database can
be used for storage and analysis. The capability of OLAP database
software systems to handle data complexity comes at a high price for
analysts, presenting them a combinatorially vast space of views of a
relational database. To get the most out of the amount of data that we
have there needs to be a way to reveal areas of the data that analysts
may otherwise never find.
We responded to the need to deploy technologies that
will allow users to guide themselves to areas of local structure. We
did this by casting the space of "views" of an OLAP database as a
combinatorial object of all projections and subsets (i.e., a lattice),
and "view discovery" as a search process over that lattice. We equipped
the view lattice with statistical information theoretical measures
sufficient to support a combinatorial optimization process. We outlined
"hop-chaining" as a particular view discovery algorithm over this
object, wherein users are guided across a permutation of the dimensions
by searching for successive two-dimensional views, pushing seen
dimensions into an increasingly large background filter in a
"spiraling" search process. For testing purposes we have applied our
algorithm to the database of summary statistics for radiation portal
monitors at US ports.
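As a toy stand-in for one step of this search (the actual hop-chaining work uses richer statistical information-theoretic measures), candidate two-dimensional views can be enumerated and scored by joint entropy, preferring the most concentrated view; all names below are illustrative:

```python
import itertools
import math
from collections import Counter

def view_entropy(records, dims):
    """Joint entropy (bits) of the projection of records onto dims."""
    counts = Counter(tuple(r[d] for d in dims) for r in records)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_2d_view(records, dimensions):
    """Pick the pair of dimensions whose 2-D view is most concentrated."""
    return min(itertools.combinations(dimensions, 2),
               key=lambda pair: view_entropy(records, pair))

records = [{'a': 1, 'b': 2, 'c': 0},
           {'a': 1, 'b': 2, 'c': 1}]
```

Hop-chaining repeats such a selection step while pushing already-seen dimensions into the background filter.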
Cindy Hui, RPI
Talk Title: Simulating the Diffusion of Warnings in Large Dynamic
Networks
Warning systems play an important role in informing
the at-risk population of potential dangers during hazardous events.
These systems are also used to provide information on protective
measures to promote safety in the community. In addition to a
technologically reliable system, it is important to understand and make
use of the social communication network in communities to spread the
warnings to a larger audience and to help ensure that the people at
risk will act on the information they receive. This project involves
formulating an axiomatic framework for modeling the diffusion of
warnings in dynamic social networks through the concept of trust. The
network is dynamic in that individuals may leave the network and
disrupt the flow of information as warnings are being diffused.
We assess the framework by modeling the 2007 San Diego
Firestorms, in particular the diffusion of the Reverse911 evacuation
warnings sent during the event. We generate a hypothetical social
network of San Diego County with one million household nodes. We
configure the parameters and map the process using multiple data
sources relevant to the event. We use the model to examine how social
group structure, distribution of trust, and existence of weak ties
affect the spread of evacuation warnings.
Advisors: William A. Wallace, Malik Magdon-Ismail, and
Mark Goldberg
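The trust-based diffusion idea can be sketched as a threshold model: a household accepts (and relays) a warning once the summed trust it places in already-warned neighbors crosses a threshold. The rule, names, and parameters below are illustrative, not the authors' axiomatic framework:

```python
def diffuse_warning(in_neighbors, trust, thresholds, seeds, rounds=100):
    """Deterministic trust-threshold diffusion over a directed network.

    in_neighbors: node -> list of nodes it hears from
    trust: (sender, receiver) -> trust weight
    thresholds: node -> total trust needed before acting on the warning
    seeds: initially warned nodes (e.g., Reverse911 recipients)
    """
    warned = set(seeds)
    for _ in range(rounds):
        newly = {v for v in in_neighbors if v not in warned and
                 sum(trust[(u, v)] for u in in_neighbors[v] if u in warned)
                 >= thresholds[v]}
        if not newly:
            break
        warned |= newly
    return warned
```

Dropping a node mid-simulation (individuals leaving the network) would simply remove it from in_neighbors between rounds, disrupting downstream diffusion.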
Ming Ji, University of Illinois at Urbana-Champaign
Talk Title: Mining Hidden Communities in Heterogeneous
Information Networks
Information networks, composed of large numbers of objects linking to
each other, are ubiquitous in real life. Common examples include
telephone account networks linked by calls, co-author networks and
paper citation networks extracted from bibliographic data, webpage
networks interconnected by hyperlinks in the World Wide Web, etc.
Discovering hidden communities of special interest in information
networks, with the help of prior knowledge for part of the objects,
has recently attracted substantial interest. Current work on hidden
community discovery mainly focuses on homogeneous information
networks, i.e., networks composed of a single type of object, as
mentioned above. But in real life, it is more natural to mine hidden
communities in heterogeneous information networks composed of multiple
types of objects. For instance, the blogosphere can be viewed as a
heterogeneous information network composed of blogs, users, and terms.
Given prior knowledge for a few terms and users, we can precisely
detect online communities with special interests. In fact,
applications like terrorist email detection, fraud detection, and
research community discovery can all be cast as community mining
problems on heterogeneous information networks.
In this work, we try to mine hidden communities with special interests
on heterogeneous information networks directly, which has hardly been
explored so far. Given prior knowledge about which community some of
the objects belong to, we solve the problem by predicting communities
for all types of the remaining objects. A novel graph-based
regularization framework is proposed to model the link structure in
heterogeneous information networks with arbitrary network schemas and
numbers of object/link types. Specifically, we explicitly
differentiate the multi-typed link information by incorporating it
into different relation graphs. Efficient computational schemes are
then introduced to solve the corresponding optimization problem.
Experiments on the DBLP data set show that our algorithm significantly
improves prediction accuracy over existing state-of-the-art methods.
This work has been submitted to KDD 2010.
Joint work with Yizhou Sun, Marina Danilevsky, Jing Gao, and Jiawei
Han, University of Illinois at Urbana-Champaign.
Darja Krusevskaja, Rutgers
Poster Title: Inferring Multi-Relationships
We consider entities that are connected to each other in information
networks. What characterizes the relationship between entities? Common
analysis techniques address the presence or absence of a link between
pairs of entities, or its strength. However, the nature of a link is
more versatile in information networks, in particular ones of interest
to the security and intelligence communities.
A link can sometimes be characterized by several types of
relationships between entities; e.g., two persons can be friends,
co-workers, and classmates at the same time. We might also be
interested in the strengths of the different types of relationships.
In networks where multiple types of connections are possible, we might
be interested in the top few relationship types that describe the
relation best in terms of their strength.
We propose two sets of algorithms for the problem of predicting
relationship types and strengths from the graph structure. One is
derived from similar algorithms used for inferring labels for nodes,
rather than edges, in a graph; this is accomplished using the dual
graph, so that the edge-inference problem becomes a node-inference
problem. The other algorithm is based on the intuition that the
strength of an edge should reflect the stationary probability of a
suitable random walk from one endpoint of that edge to the other, and
vice versa. Both algorithms have been implemented, and preliminary
tests have been done with the Newman co-authorship dataset from 2003.
Our experimental results show good correlation between true
relationship types and our predictions.
Joint Work with CCICADA PI S. Muthukrishnan.
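The dual-graph construction mentioned above, in which edge inference becomes node inference, amounts to building the line graph: each original edge becomes a node, adjacent to every edge it shares an endpoint with. A minimal sketch, not the authors' code:

```python
import itertools
from collections import defaultdict

def line_graph(edges):
    """Dual (line) graph: nodes are the original edges; two such nodes are
    adjacent iff the original edges share an endpoint."""
    incident = defaultdict(list)
    for edge in edges:
        for endpoint in edge:
            incident[endpoint].append(edge)
    adjacency = {edge: set() for edge in edges}
    for shared_edges in incident.values():
        for e1, e2 in itertools.combinations(shared_edges, 2):
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency
```

Any node-labeling algorithm run on this graph then assigns relationship types to the original edges.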
Yue Lu, University of Illinois at Urbana-Champaign
Talk Title: Statistical Topic Models for Large-Scale Opinion
Integration
Project Scope: In homeland security applications, there is often a
need to gather and integrate scattered opinions about an entity such as
a person, an organization or a policy. Thanks to Web 2.0 technology
which has enabled more and more people to freely express their
opinions, the Web has become an extremely valuable source for analyzing
views and opinions. However, with the current technologies, it is still
difficult for people to integrate and digest all opinions relevant to a
specific topic. In this work we study how to automatically generate a
structured summary for any given topic by integrating opinions from
different kinds of resources about the topic, such as well-written news
articles, database tables, opinions scattering in blogspaces and
forums; the goal is to help people digest and exploit a large number of
scattered opinions in a general way.
Recent Progress: We have studied how to automatically integrate
opinions expressed in a well-written article with lots of opinions
scattering in various sources such as blogspaces and forums. We
proposed semi-supervised topic models to solve the problem in a
principled way. The models can be used to integrate a well written
review with opinions in an arbitrary text collection about any topic to
potentially support many interesting applications in multiple domains.
We have already obtained interesting opinion integration results on all
US presidents and Hurricane Katrina.
Future Plans: We will address a more general setup of the problem:
integrating opinions in an arbitrary text collection with a set of
well-written articles instead of a single one. We also plan to
investigate many other resources, such as existing databases about
people or events. Ultimately, the effectiveness of our research results
will be demonstrated by developing a toolkit to facilitate broad
applications of opinion integration.
Relevance to listed research areas: Our research is in the area of
"Advanced Data Analysis and Visualization" and it can potentially
support many interesting applications, such as in the areas of "Social
and Behavioral Sciences" and "Human Factors".
Carlos T. Murray, Morgan State
Talk Title: A Practical Application of Oracle Apex
Software
In this presentation I will discuss how Oracle Apex
software is used to make the management of financial transactions
easier for any business with an automated checking system. The topics
covered in this application include database structures and Structured
Query Language (SQL), which require a background in data structures
and in problem solving using C++ and Perl programming.
Helene Nguewou, Morgan State University
Poster Title: A Perl Implementation of a Contact-waiting Time Metric
for HIV-RNA Folding Prediction: Are There National Security
Implications?
This work focuses on creating a user-friendly
prediction tool for RNA folding kinetics in Perl language. The
contact-waiting time (CWT) metric is applied to certain HIV sequences
in order to calculate their folding rates. The CWT metric will be
converted from MATLAB to Perl language and will be tested on sequences
in HIV databases. The purpose of creating the Perl implementation is to
make the CWT more widely available and easy to use as a bioinformatics
tool. In addition, are there national security implications as a result
of HIV sequence prediction? At least 39 million people now infected
with the virus are expected to die in the next 5-10 years. This
depletion of elite workers and professionals constitutes a threat to
homeland security, which will then be at greater risk of civil
disturbance, conflict, and disorder. The disparity of access to
retroviral drugs increases the widening life-expectancy gap between
poor countries and Western countries. As a result of this, there is
increasing concern that nations highly infected with HIV might engage
in bioterrorist acts against the United States. The lack of an
effective and affordable vaccine against the virus makes this threat
even more conceivable. Therefore, HIV research efforts are of high
importance.
Joint work with Asamoah Nkwanta.
Bill Pottenger, Rutgers
Talk Title: Higher Order Learning
In order to recognize and capture inherent
semantics in data we have developed an approach to feature space
transformations termed Higher Order Learning (HOL) that has proven
effective in both discriminative and generative settings (Ganiz, 2008;
Ganiz et al., 2006; Ganiz, Pottenger and George, 2010; Li et al., 2007;
Li et al., 2005; Ganiz, Lytkin and Pottenger, 2009; Menon and
Pottenger, 2009). We have successfully leveraged the power of
data representations based on HOL in a variety of problem domains.
HOL-based methods significantly outperform the traditional approaches
in text classification (Ganiz, 2008; Ganiz, Pottenger and George, 2010;
Ganiz, Lytkin and Pottenger, 2009) and in association rule mining (Li
et al., 2007; Li et al., 2005). HOL-based models have also proven
effective in network event and anomaly detection (Ganiz, 2008; Ganiz et
al., 2006; Menon and Pottenger, 2009) on time series data from the
Border Gateway Protocol, the backbone protocol in the Internet routing
infrastructure. In more recent work, HOL was successfully
applied to threat detection in streaming data in a defense setting
(Pottenger, 2009).
Warren B. Powell, Princeton
Talk Title: Optimal Learning for Homeland Security
Optimal learning addresses the challenge of
collecting information
quickly when observations are time consuming and expensive. While
optimal policies for collecting information are computationally
intractable, the knowledge gradient policy has proven to be
particularly powerful. We have adapted this concept to both offline
(guiding laboratory research) and online (observe as you go) problems.
We have extended applications from the usual domain of a finite number
of discrete alternatives to problems on graphs (learning about
networks), to learning about large numbers of subsets, learning
multidimensional continuous surfaces and, most recently, learning
general, nonconvex functions. I will discuss different types of
applications within homeland security.
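For the simplest setting, independent normal beliefs with known measurement noise, the knowledge gradient of measuring an alternative has the standard closed form sigma_tilde * (zeta * Phi(zeta) + phi(zeta)). A minimal sketch of that computation (variable names are ours, and this is not code from the talk):

```python
import math

def kg_factor(z):
    """f(z) = z * Phi(z) + phi(z) for the standard normal."""
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z * cdf + pdf

def knowledge_gradient(means, variances, noise):
    """KG value of measuring each alternative under independent normal
    beliefs with known measurement noise variance."""
    values = []
    for x, (mu, var) in enumerate(zip(means, variances)):
        sigma_tilde = var / math.sqrt(var + noise)   # std-dev reduction
        best_other = max(m for y, m in enumerate(means) if y != x)
        zeta = -abs(mu - best_other) / sigma_tilde
        values.append(sigma_tilde * kg_factor(zeta))
    return values
```

Measuring the argmax of these values is the KG policy; with equal means it prefers the most uncertain alternative, which is the exploration-exploitation trade-off made explicit.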
Andrew Rodriguez, Rutgers
Talk Title: Graph evolution over time
The purpose of this project is to study how
graphs evolve over time. Understanding the normal
parameters of computer and communication networks, for example, allows
for pattern and anomaly detection in such environments. If activity is
detected that produces network characteristics or structures that
deviate significantly from the norm, it can be flagged as an
abnormality, potentially helping with the detection of fraud, spam,
and denial of service attacks, for example. The problem of anomaly
detection has been approached from various angles, including
artificial intelligence, machine learning, and state machine modeling.
This work begins by using change-point detection for anomaly
detection.
This work is done in collaboration with Dr. Tin
Kam Ho, Dr. Jin Cao,
Dr. Harald Steck, and Dr. Ayou Chen of Bell Labs.
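The talk does not fix a particular detector; as one standard, minimal sketch of change-point detection, a one-sided CUSUM statistic flags the first time the cumulative upward deviation of a monitored network statistic exceeds a threshold (all parameters illustrative):

```python
def cusum_alarm(series, target_mean, drift, threshold):
    """One-sided CUSUM: return the index of the first alarm, else None.

    drift is the per-step allowance subtracted before accumulating, so
    small fluctuations around target_mean never raise an alarm.
    """
    s = 0.0
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target_mean - drift))
        if s > threshold:
            return i
    return None

# A level shift at index 10 is flagged as soon as evidence accumulates.
traffic = [0.0] * 10 + [5.0] * 5
```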
Dan Roth, UIUC
Talk Title: Making Sense of Unstructured Data
Recent studies have shown that over 85% of the
information organizations deal with is unstructured - the vast majority
of which is text in different forms. A multitude of techniques has to
be used in order to enable intelligent access to this information and
to support transforming it to forms that allow sensible use of the
information. The fundamental issue that all these techniques have to
address is that of semantics - there is a need to move toward
understanding the text at an appropriate level, beyond the word level,
in order to support access, knowledge extraction and synthesis.
I will discuss some of our research in these
directions, focusing on tools and products rather than techniques. The
talk will address several dimensions of text understanding that can
facilitate access to information and extraction of knowledge from
unstructured text, transforming it to forms that are useful to
different users in different settings, and integrating it along
multiple dimensions and with existing institutional resources.
Gilna Samuel, Morgan State
Poster Title: A Data Mining Approach to Predicting
Optimal Regimens
in Cancer Treatments
Cancer is one of the leading causes of death. Many
new drugs are
currently being manufactured and tested in clinical trials to combat
this disease. A database collected by a group of researchers at
Massachusetts Institute of Technology (MIT) stores significant
information, taken from journal articles on clinical trials for
cancer. The database is updated weekly and more data is added
continually to further development and enhancement of this
database. The aim of this project is to build models that predict
survival rates of patients undergoing a variety of cancer
treatments. This was accomplished by analyzing the information
gathered in this database by applying data mining methods involving
regression models and visual algorithms. Initial results from the
various data mining methods will be presented for breast cancer,
non-small-cell lung cancer, and gastric cancer. In each case, the most
effective regimens were those with higher overall survival rates and
lower toxicity rates. At this stage of the research, more
emphasis was placed on finding optimal algorithms to compare the
results from the database. The results were limited since the database
used has a relatively small number of clinical trials. A larger
database would give more accurate results and the clinical trials
could be grouped by cancer stage, in addition to cancer type. Thus,
the combination of drugs would be both stage and type specific and
could be used in future clinical trials to create more effective
cancer regimens.
Warren Scott, Princeton
Poster Title: An Optimal Learning Approach to Finding an Outbreak of
a Disease
We describe an optimal learning policy to
sequentially decide on
locations of a city to test for an outbreak of a particular disease.
We use Gaussian process regression to model the level of the disease
throughout the city, and then use the correlated knowledge gradient,
which implicitly uses exploration versus exploitation concepts, to
choose where to test next. The correlated knowledge gradient policy
is a general framework that can be used to find the maximum of an
expensive function with noisy observations.
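The correlated knowledge gradient itself is involved; the Gaussian process regression piece, modeling the disease level as a smooth function of location, is easy to sketch. The RBF kernel and parameters below are assumptions for illustration, not the poster's actual model:

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential (RBF) kernel between 1-D location arrays."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of a GP with RBF kernel and observation noise."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    return rbf(x_test, x_train) @ np.linalg.solve(K, y_train)
```

The posterior covariance (omitted here) is what the correlated knowledge gradient consults when deciding where in the city to test next.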
Yizhou Sun, UIUC
Talk Title: iTopicModel: Information Network-Integrated Topic
Modeling
Project Scope: Document networks, i.e., networks
associated with text information, are becoming increasingly popular due
to the ubiquity of Web documents, blogs, online social networks and
various kinds of online data. Topic analysis that considers both text
information and link information for such networks will not only
improve the quality of topic models, but also help us understand the
individual objects in the network more clearly.
Recent Progress: In this paper, we propose a novel
topic modeling framework for document networks, which builds a unified
generative topic model that is able to consider both text and structure
information for documents. A graphical model is proposed to describe
the generative model. On the top layer, we define a novel multivariate
Markov Random Field for topic distribution random variables for each
document, to model the dependency relationships among documents over
the network structure. On the bottom layer, we follow the traditional
topic model to model the generation of text for each document. A joint
distribution function for both the text and structure of the documents
is thus provided. A solution to estimate this topic model is given.
Some important practical issues in real applications are also
discussed, including how to decide the topic number. We apply the model
on two real datasets, DBLP and Cora, and the experiments show that this
model is more effective in comparison with the state-of-the-art topic
modeling algorithms.
Relevance to Research Area: All sorts of online
networks are ubiquitous now, due to the rapid development of the
Internet. It is impossible to understand such huge networks merely via
human labor. iTopicModel can easily detect the major topics in these
networks and associate each individual object to these topics. Even for
those objects with no text information but only structural information,
their topics can be easily predicted. This study will be extremely
useful for risk analysis on online social networks.
Future Plans: In the future work, we will study how to
combine different networks to further improve the quality of topic
models and to better understand individual objects in these networks.
We will report more technical details on this progress if the abstract
is selected for poster in the coming DHS conference.
Sam Tannouri and Ahlam Tannouri, Morgan State
Talk Title: Convoluted graph visualization using splines to render
edges
Visual analytics technologies combined with computational tools and
analytic reasoning form a platform for detecting, analyzing, and
responding to threats in the cyber and physical worlds. They handle a
huge amount of data, which is a very difficult task, especially when
one is concerned with extracting critical information. The complex and
convoluted nature of these data sets makes them very difficult to
illustrate clearly in a visually aesthetic form. To alleviate the
congested visual representation of such data sets, we propose the use
of convoluted graphs that render their edges using splines.
Configurations of these splines are studied and interactively
illustrated.
Stephen Tratz, USC/ISI
Talk Title: Noun-Noun Compounds for Text Analysis
The automatic interpretation of noun-noun compounds
is an important
subproblem within many natural language processing applications and is
an area of increasing interest. The problem is difficult, with
disagreement regarding the number and nature of the relations, low
inter-annotator agreement, and limited annotated data. We present a
novel taxonomy of relations that integrates previous relations, a
large dataset annotated according to these relations, and a supervised
classification method for automatic noun compound interpretation.
Yuancheng Tu, UIUC
Talk Title: Aspect Guided Text Categorization with Unobserved Labels
Project Scope: Categorizing text into one of multiple
possible categories is an archetypical multi-class classification
problem with many important applications in knowledge management and
information access, ranging from email classification to author
identification to sentiment analysis. Studies in this area develop a
principled way to analyze massive amounts of information from texts or
other sources and quickly and reliably determine whether they hold a
desired property, such as being written by a known entity or
representing a suspicious event. As such, this work has broad
applications to DHS and national security. In this project, we propose
a novel method for text categorization with a large label space that,
most importantly, can handle the case in which some of the labels were
not observed in the training data. Our method exhibits great potential
to correctly predict unobserved labels, which traditional multiclass
classification methods cannot handle at all.
Recent Progress: We have developed a novel multi-class
classification method that exploits the structure and meaning of the
label space and can thereby classify even with respect to labels that
were not previously observed in the training data. The key insight is the
introduction of intermediate aspect variables that encode properties of
the labels. Aspect variables serve as a joint representation for
observed and unobserved labels. This way the classification problem can
be viewed as a structure learning problem with natural constraints on
assignments to the aspect variables. We solve the problem as a
constrained optimization problem over multiple learners and show
significant improvement in classifying short sentences into a large
label space of categories, including previously unobserved categories.
Future Plans: One perspective for further development
of this work is to find more constraints in the inference procedure and
to experiment with incorporating the constraint based inference within
or after the learning. Another way to enhance the system is to develop
an interactive protocol to induce prediction feedback as a way to
quickly adapt to new labels.
Relevance to listed research areas: Automatic text
classification methods are an essential component in developing the
ability to interpret in a meaningful way properties of massive amounts
of information as a way to support decision making with respect to it.
Abdul-Aziz Yakubu, Howard University
Talk Title: Discrete-Time Epidemic Models
We use a periodically forced SIS epidemic model with disease-induced
mortality to study the combined effects of seasonal trends and death
on the extinction and persistence of discretely reproducing
populations. We introduce the epidemic threshold parameter, R0, for
predicting disease dynamics in periodic environments. Typically, R0 < 1
implies disease extinction. However, in the presence of disease-induced
mortality, we show that a tiny infective population can drive an
otherwise persistent population with R0 > 1 to extinction. Furthermore,
we obtain conditions for the persistence of the total population. In
addition, we use the Beverton-Holt recruitment function to show that
the infective population exhibits a period-doubling bifurcation route
to chaos while the disease-free susceptible population lives on a
2-cycle (non-chaotic) attractor.
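A toy one-step update in this spirit, with Beverton-Holt recruitment feeding the susceptible class; the functional forms and parameter values below are illustrative only, not the model analyzed in the talk:

```python
def sis_step(S, I, beta=0.1, gamma=0.3, mu=0.1, d=0.2, r=2.0, K=100.0):
    """One step of a toy discrete-time SIS model.

    beta: transmission rate, gamma: recovery fraction, mu: natural death,
    d: disease-induced death, r/K: Beverton-Holt recruitment parameters.
    """
    N = S + I
    recruits = r * N / (1 + N / K) if N > 0 else 0.0      # Beverton-Holt
    new_infections = beta * S * I / N if N > 0 else 0.0
    S_next = recruits + (1 - mu) * (S - new_infections + gamma * I)
    I_next = (1 - mu - d) * (new_infections + (1 - gamma) * I)
    return S_next, I_next
```

With these sub-threshold defaults the infection dies out while the total population persists; the talk's results concern the richer periodically forced regimes, including the period-doubling route to chaos.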
Guoping Zhang, Morgan State University
Talk Title: Radon transform and its applications
One of the major inventions of the last century is the CT scanner
(computerized tomography). Cormack and Hounsfield received the 1979
Nobel Prize in Medicine for their work on computed axial tomography.
The CT scanner can be used to reconstruct X-ray absorption in the
interior of structures, such as patients or machine parts with
possible internal fractures.
What is common to the development of all types of scanners is that
they have, to some extent, been based on the Radon transform. The
crucial idea for image reconstruction is that the Radon transform
provides a natural link between the image function of the object and
the measurements of a machine such as a CT scanner or MRI.
In this talk, I will give a brief introduction to the Radon transform,
the generalized Radon transform, and their relations with microlocal
analysis. I also want to discuss its potential applications to DHS
research. A main goal of my talk is to seek potential future
collaborators among CCICADA members through our discussion during the
retreat.
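Numerically, the Radon transform just integrates (sums) the image along straight lines. A minimal discrete illustration using only the 0-, 45-, and 90-degree families of lines (a toy of our own, not from the talk):

```python
import numpy as np

def axis_projections(image):
    """0- and 90-degree projections: sums along rows and along columns."""
    return image.sum(axis=1), image.sum(axis=0)

def diagonal_projection(image):
    """45-degree projection: sums along the diagonals."""
    rows, cols = image.shape
    return np.array([np.trace(image, offset=k)
                     for k in range(-rows + 1, cols)])

img = np.array([[1.0, 2.0],
                [3.0, 4.0]])
```

Every projection sums to the same total mass; recovering the image from many such projections (e.g., by filtered backprojection) is the inversion problem a CT scanner solves.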
Mianwei Zhou, UIUC
Talk Title: Data-oriented Content Query System: Searching
for Data into Text on the Web
With the ever growing richness of the Web, people
nowadays are no longer satisfied with finding interesting documents to
read. Instead, we are becoming increasingly interested in the various
fine granularity information units, e.g., movie release date, book
price, which appear within the content of Web documents. We are
witnessing several emerging Web-based search applications towards
exploiting such rich data on the Web, such as:
1. Web-based Information Extraction (WIE). With the
richness and redundancy of the Web, WIE tries to rely on simple phrase
patterns (e.g. "X is the capital of Y") to harvest numerous facts
online.
2. Typed-Entity Search (TES). Several efforts were
tried to search entities inside the Web pages (e.g. searching the phone
number of Amazon's customer service). Such techniques often rely on
extracting data types of interest and then matching the extracted
information based on proximity patterns.
With so many ad hoc efforts exploiting Web content, there is a
pressing need to distill their essential capabilities; thus we propose
the concept of the Data-oriented Content Query System (DoCQS).
DoCQS aims at generally supporting "content querying"
for finding data over the Web. In DoCQS, we utilize the relational
model for modeling Web data, and propose the corresponding SQL-style
language CQL for content querying. Based on DoCQS, Web-based
applications can easily access the Web data as if writing SQL for
querying a database, avoiding a lot of repetitious work. For efficient
processing, we design novel index structures and query processing
algorithms. We evaluate our proposal over two concrete domains of
realistic Web corpora, demonstrating that our query language is rather
flexible and expressive, and our query processing is efficient with
reasonable index overhead.
Document last modified on March 5, 2010.