Title: Ranking Genes, Ranking Documents, Ranking Drug Candidates: A Unified Machine Learning Approach
Prioritizing information is a ubiquitous need in our daily lives. With the growth of digital data sources, computational methods for prioritizing information are increasingly important in the 21st century.
In this talk, I will describe the application of ranking methods in machine learning -- an emerging technology and currently an active area of research -- to prioritizing information in three different fields: computational biology, information retrieval, and drug discovery. I will show how ranking methods in machine learning give state-of-the-art performance in information retrieval, outperform existing methods in drug discovery, and have led to the identification of new genes related to leukemia and colon cancer.
Title: Using Diverse Types of Statistical Evidence
Analysts may find many types of evidence that bear on a particular question of interest. Besides direct evidence, which itself may come from diverse data sources, there may be prior information or indirect evidence, a term used by Brad Efron (Statistical Science, forthcoming). I will contrast these different types of evidence and discuss their sometimes fuzzy boundaries. A useful principle is that analysts should use all relevant evidence to address a problem. Bayesian methods offer a framework for doing so, but not without pitfalls. I will describe some of these pitfalls and suggest some possible ways to mitigate them.
Title: The Bayesian Approach to Combination of Information
Bayesian analysis is particularly geared towards combining information from diverse sources. It easily allows for combining information from data and expert opinion; learning from related experiments; combining information from deterministic models and data; and adapting to data sources of different accuracy. This will be illustrated (time permitting) through examples drawn from engineering, vaccine trials, geosciences, and transportation.
Title: Anonymization and Uncertainty
Data anonymization techniques have been the subject of much study in recent years, for many kinds of structured data, including tabular, graph and item set data. Anonymization is best viewed through the lens of uncertainty. Essentially, anonymized data describes a distribution over possible worlds, one of which corresponds to the original data. I'll introduce some of the key anonymization ideas within this framework, and discuss how models of uncertainty can help in reasoning about privacy under various priors and in using uncertain data to answer aggregate queries.
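As a rough illustration of the possible-worlds view (not taken from the talk), the following Python sketch answers a count query in expectation over the worlds consistent with a generalized table, assuming each anonymized record is equally likely to be any value it could hide; the records and query are hypothetical.

    # A minimal sketch of the "possible worlds" view of anonymized data: each
    # generalized record is treated as a uniform distribution over the values
    # it could hide, and an aggregate count query is answered in expectation
    # over those possible worlds.

    def expected_count(generalized_records, predicate):
        """Expected number of records satisfying `predicate`,
        assuming each record's true value is uniform over its candidate set."""
        total = 0.0
        for candidates in generalized_records:
            # probability that this record satisfies the predicate
            total += sum(predicate(v) for v in candidates) / len(candidates)
        return total

    # Example: ages generalized to ranges; query = "age >= 30".
    records = [range(20, 30), range(30, 40), range(25, 35)]
    print(expected_count(records, lambda age: age >= 30))  # 1.5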
Title: Bayesian Combining of Information from Sensors: Applications to Nuclear Detection and Multi-Sensor Fusion
I will describe two applications involving the combining of information for a given sensor across time and across different sources of data using Bayesian decision theory. Both applications involve the detection of nuclear material entering seaports and airports. I will propose algorithms to combine the information and examine their performance analytically, computationally, and by simulation. I will also describe some of the challenges related to multi-sensor fusion.
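For orientation, here is a minimal Python sketch of Bayesian fusion of sensor readings across time and sensors, assuming readings are conditionally independent given the true state, with an illustrative cost-based decision rule; the likelihoods, prior, and costs are hypothetical, not the speaker's algorithms.

    # A minimal sketch of Bayesian fusion of sensor readings, assuming
    # conditional independence of readings given the true state.

    import math

    def posterior_threat(prior, readings, p_given_threat, p_given_benign):
        """Posterior P(threat | readings) from likelihoods of each reading."""
        log_odds = math.log(prior / (1.0 - prior))
        for r in readings:
            log_odds += math.log(p_given_threat(r) / p_given_benign(r))
        return 1.0 / (1.0 + math.exp(-log_odds))

    # Toy likelihood models for a binary alarm reading (1 = alarm).
    p_t = lambda r: 0.8 if r == 1 else 0.2   # P(reading | nuclear material present)
    p_b = lambda r: 0.1 if r == 1 else 0.9   # P(reading | benign cargo)

    post = posterior_threat(prior=1e-4, readings=[1, 0, 1],
                            p_given_threat=p_t, p_given_benign=p_b)
    # Decision-theoretic rule: flag for secondary inspection if the expected
    # loss of passing exceeds that of inspecting (costs are illustrative).
    cost_miss, cost_inspect = 1e6, 1e2
    print(post, post * cost_miss > cost_inspect)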
Title: Collective Graph Identification
Within the machine learning and data mining communities, there has been growing interest in learning structured models from input data that is itself structured or semi-structured. Graph identification refers to methods that transform observational data, described as a noisy input graph, into an inferred "clean" information graph. Examples include inferring organizational hierarchies from communication data, identifying gene regulatory networks from protein-protein interactions, and understanding visual scenes based on inferred relationships among identified objects. The key processes in graph identification are entity resolution, link prediction, and collective node classification. I will overview algorithms for these tasks and discuss the need to integrate their results in order to solve the overall problem collectively.
Title: Product Formulas for Positive Functions and Applications to Network Data
Any nonnegative function on the interval [0,1] can be uniquely written as a (potentially infinite) product of simple factors of the form 1 + ah(I,x), where I is a dyadic interval and the function h is +1 on the left (resp. -1 on the right) half of I. The constant a (depending on I) lies in [-1,+1]. The same is true for Borel probability measures, and a version of this also holds on any interval or cube in Euclidean space. This provides an alternative method of analyzing signals that are positive, for example the volume of traffic over a network. One advantage of this approach is that one can easily normalize signals from several channels. If one uses instead a standard wavelet decomposition, the normalization problem is much harder.
Another advantage is that it becomes very easy to synthesize "bursty" signals. We give some examples of network data that has been analyzed by these methods. One of our approaches is to use the coefficients in the product formula to (re)represent the signal for study by the methods of Diffusion Geometry. We also give an exposition of where product formulas of this type appear naturally in various areas of analysis, geometry, and mathematical physics.
This is joint work with Devasis Bassu, Linda Ness, and Vladimir Rokhlin.
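To make the decomposition concrete, here is a minimal Python sketch (my own discretization, not the authors' code) that computes the product-formula coefficients a_I for a nonnegative signal sampled on a dyadic grid and then reconstructs the signal from them.

    # Product formula on a dyadic grid: a nonnegative signal is encoded by
    # coefficients a_I in [-1, 1], one per dyadic interval, with
    #   mass(left half of I) = mass(I) * (1 + a_I) / 2.

    def product_coeffs(signal):
        """Return {(level, index): a_I} for a signal of length 2**n."""
        coeffs, intervals = {}, [(0, 0, list(signal))]
        while intervals:
            level, idx, vals = intervals.pop()
            if len(vals) == 1:
                continue
            left, right = vals[:len(vals)//2], vals[len(vals)//2:]
            total = sum(vals)
            coeffs[(level, idx)] = (sum(left) - sum(right)) / total if total else 0.0
            intervals += [(level + 1, 2*idx, left), (level + 1, 2*idx + 1, right)]
        return coeffs

    def reconstruct(coeffs, total, n):
        """Invert the decomposition for a signal of length 2**n with given total mass."""
        out = []
        for cell in range(2**n):
            mass = total
            for level in range(n):
                idx = cell >> (n - level)                    # ancestor interval at this level
                in_left = ((cell >> (n - level - 1)) & 1) == 0
                a = coeffs[(level, idx)]
                mass *= (1 + a) / 2 if in_left else (1 - a) / 2
            out.append(mass)
        return out

    sig = [3.0, 1.0, 0.0, 4.0]
    c = product_coeffs(sig)
    print(reconstruct(c, sum(sig), 2))  # recovers [3.0, 1.0, 0.0, 4.0]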
Title: Architectures for Inferences from Multiple Sources
Paul B. Kantor, Rutgers University
When information must be assembled from diverse sources to support inference and reasoning, we face the problem that, in many scientific and other applications, there is too much data: it is not always possible to save all of the information from which the inference is drawn. In addition, intermediate processing algorithms may produce derived assertions which are supported by the data but which, like all inferences, carry some degree of uncertainty. To cope with this problem we present a general framework for incorporating both the assertion and the uncertainty information into a "confidence-value" array. This array is designed to support many forms of uncertain inference. Specific technical challenges arise in (a) converting confidence assertions of various kinds into a "lingua franca" for compilation, and (b) interpreting confidence assertions about variables with complex structure, such as the nodes of an ontology.
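As one concrete (and hypothetical) reading of the framework, the following Python sketch stores assertions with their uncertainty in a confidence-value array after converting source-specific confidence reports to a common scale; the class, scales, and example entries are my own illustration, not the author's design.

    # Hypothetical confidence-value array: assertions from different sources
    # are stored with their uncertainty after being converted to a common
    # ("lingua franca") confidence scale in [0, 1].

    import math
    from dataclasses import dataclass

    @dataclass
    class Assertion:
        item: str          # e.g. an entity or ontology node the assertion is about
        claim: str         # the derived assertion itself
        source: str
        confidence: float  # on the common scale in [0, 1]

    def to_common_scale(value, scale):
        """Convert a source-specific confidence report to [0, 1] (illustrative only)."""
        if scale == "probability":
            return value
        if scale == "log_odds":
            return 1.0 / (1.0 + math.exp(-value))
        if scale == "five_star":
            return value / 5.0
        raise ValueError(f"unknown scale: {scale}")

    array = [
        Assertion("item_17", "matches watch-list entry", "sensor_A",
                  to_common_scale(2.2, "log_odds")),
        Assertion("item_17", "matches watch-list entry", "analyst_B",
                  to_common_scale(4, "five_star")),
    ]
    print(array)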
Title: A General Framework for Combining Information and an Application to Incorporating Expert Opinions
Incorporating external information, such as prior information and expert opinions, can play an important role in the design, analysis and interpretation of clinical trials. Seeking effective schemes for incorporating prior information with the primary outcomes of interest has drawn increasing attention in pharmaceutical applications in recent years. Most methods currently used for combining prior information with clinical trial data are Bayesian. We demonstrate that the Bayesian approach may encounter problems in the analysis of binary outcomes, especially when the informative prior distribution is skewed.
In this talk, we present a frequentist framework for combining information using confidence distributions (CDs), and illustrate it through an application incorporating expert opinions with information from clinical trial data. A confidence distribution (CD), which uses a distribution function to estimate a parameter of interest, contains a wealth of information for inference, much more than a point estimator or a confidence interval ("interval estimator"). We present a formal definition of CDs and develop a general framework for combining information based on CDs. This CD-combining framework not only unifies most existing meta-analysis approaches, but also leads to the development of new approaches. In particular, we develop a frequentist approach to combining surveys of expert opinions with binomial clinical trial data, and illustrate it using data from collaborative research with Johnson & Johnson Pharmaceuticals. We compare results from the frequentist approach with those from Bayesian approaches, and show that the frequentist approach has distinct advantages.
This is joint work with Minge Xie (Rutgers University), C.V. Damaraju (Johnson & Johnson, Inc.), and William Olsen (Johnson & Johnson, Inc.).
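To make the combining recipe concrete, here is a minimal Python sketch of CD combination for normal-based CDs, using the inverse-normal combination with inverse-standard-error weights (which recovers fixed-effect meta-analysis); the numbers are hypothetical, and this is only the general framework, not the talk's specific method for expert opinions with binomial data.

    # Each study i contributes a CD H_i(theta); here the normal-based CD
    # H_i(theta) = Phi((theta - est_i)/se_i), and the combined CD is
    # H_c(theta) = Phi( sum_i w_i Phi^{-1}(H_i(theta)) / sqrt(sum_i w_i^2) ).

    import numpy as np
    from scipy.stats import norm

    def combined_cd(theta, estimates, ses, weights=None):
        estimates, ses = np.asarray(estimates), np.asarray(ses)
        w = np.asarray(weights) if weights is not None else 1.0 / ses
        z = (theta - estimates) / ses        # = Phi^{-1}(H_i(theta)) for normal CDs
        return norm.cdf(np.dot(w, z) / np.sqrt(np.sum(w**2)))

    # Example: an expert-opinion "study" and a trial estimate of a treatment effect.
    est, se = [0.10, 0.25], [0.20, 0.08]
    grid = np.linspace(-0.5, 1.0, 3001)
    H = np.array([combined_cd(t, est, se) for t in grid])
    median = grid[np.argmin(np.abs(H - 0.5))]                        # point estimate from the CD
    lo95, hi95 = grid[np.searchsorted(H, 0.025)], grid[np.searchsorted(H, 0.975)]
    print(median, (lo95, hi95))

With weights 1/se_i, the median of the combined CD is the familiar inverse-variance weighted estimate, which is what "unifies most existing meta-analysis approaches" refers to in the simplest case.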
Title: Bayesian Uncertainty Quantification for Inverse Problems Using a Multiscale Hierarchical Model
We present Bayesian methods for uncertainty quantification (UQ) in inverse problems. We develop a Bayesian hierarchical model that quantifies uncertainty by integrating data from coarse as well as fine scales. We introduce this multiscale data-integration method together with an upscaling technique for spatial modeling of a random field, with subsurface inversion as the motivating application. Numerical results are presented for both simulated and real data.
Title: Combining Multiple Data Sources for Comparative Effectiveness
A fundamental statistical problem is making inferences about the impact of a medical technology after market release, in the absence of randomization. Because the evidence for approval is often based on the results of small controlled clinical trials, the patient population and the provider population in the real world can differ dramatically from the trial populations. Consequently, effectiveness and adverse events are difficult to measure. Understanding the diffusion of new technologies and their health effects is an important policy question that relies upon observational data. In this talk, I review approaches to making inference on the basis of multiple and diverse sources of data, including both observational and randomized designs. An example involving the effectiveness of total hip replacement systems illustrates the methods.
Acknowledgements: This is joint work with Danica Marinac-Dabic from the Center for Devices and Radiological Health and Art Sedrakyan from Cornell University.
Title: Integrating Diverse Biological Data Sources for Improved Gene Function Prediction
Due to the development of advanced technologies, multiple heterogeneous sources of information about the same objects are being obtained in many scientific and biomedical domains, and significant research efforts are being devoted to obtaining more robust predictions from multiple sources. For the problem of class prediction, integrating different databases at an early stage, without the use of class information, provides results that are inferior to those obtained by combining decisions from multiple sources. In this work, we have systematically developed a new methodology to combine the decisions of classifiers trained on the individual datasets. We combined information from several biological datasets that provide rich sources of information, such as comparative genome sequences, gene expression, protein structure, protein interactions, and metabolic, signaling, and regulatory pathways. The combination of these datasets yields robust results, compared with choosing the best classifier from the pool for any individual dataset, which often does not guarantee good performance. Our study shows that different classifiers work better for different datasets, and the proposed boosting-based approach for combining the individual decisions into a final combined decision proved more effective than the methods available in the literature. Joint work with Mohammad S. Aziz.
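One simple way to realize a boosting-based combination of decisions (a sketch under my own assumptions, not necessarily the authors' algorithm) is to treat each data source's classifier as a fixed base decision on a common validation set and learn AdaBoost-style weights for the sources:

    # Combine per-dataset classifier votes with AdaBoost-style weights.

    import numpy as np

    def combine_decisions(decisions, labels, n_rounds=10):
        """decisions: (n_sources, n_genes) array of +/-1 votes; labels: +/-1."""
        n_sources, n_genes = decisions.shape
        sample_w = np.full(n_genes, 1.0 / n_genes)
        alphas = np.zeros(n_sources)
        for _ in range(n_rounds):
            errs = [(sample_w * (decisions[s] != labels)).sum() for s in range(n_sources)]
            s = int(np.argmin(errs))
            err = min(max(errs[s], 1e-10), 1 - 1e-10)      # clamp for numerical safety
            alpha = 0.5 * np.log((1 - err) / err)
            alphas[s] += alpha
            sample_w *= np.exp(-alpha * labels * decisions[s])   # up-weight mistakes
            sample_w /= sample_w.sum()
        return alphas

    def predict(alphas, decisions):
        return np.sign(alphas @ decisions)                 # weighted vote of the sources

    # Toy example: 3 sources (e.g. expression, PPI, sequence) voting on 6 genes.
    votes = np.array([[ 1,  1, -1, -1,  1, -1],
                      [ 1, -1, -1,  1,  1, -1],
                      [-1,  1,  1, -1, -1,  1]])
    truth = np.array([1, 1, -1, -1, 1, -1])
    a = combine_decisions(votes, truth)
    print(a, predict(a, votes))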
Title: Mitigating Manhole Events in New York City Using Machine Learning
There are a few hundred manhole events (fires, explosions and smoking manholes) in New York City every year, often stemming from problems in the low voltage secondary electrical distribution network that provides power to residential and commercial customers. I will describe work on the Columbia/Con Edison Manhole Events project, the goal of which is to predict manhole events in order to assist Con Edison (NYC's power utility) with its pre-emptive maintenance and repair programs. The success of this project relied heavily on an understanding of the current state of Manhattan's grid, which has been built incrementally over the last century. Several different sources of Con Edison data are used for the project, the most important of which is the ECS (Emergency Control Systems) database consisting of trouble tickets from past events that are mainly recorded in free text by Con Edison dispatchers.
In this talk, I will discuss the data mining process by which we transformed extremely raw historical Con Edison data into a ranking model that predicts manhole vulnerability. A key aspect in this process is a machine learning method for ranking, called the "P-Norm Push." Our ranked lists are currently being used to prioritize future inspections and repairs in Manhattan, Brooklyn, and the Bronx.
This is joint work with Becky Passonneau, Axinia Radeva, and Haimonti Dutta at the Center for Computational Learning Systems at Columbia University, and Delfina Isaac and Steve Ierome at Con Edison.
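As a pointer to what the ranking method optimizes, the following Python sketch implements the P-Norm Push objective for a linear scoring function, minimized by plain gradient descent on synthetic data; it illustrates the published objective only, and the features, parameters, and data are made up rather than the project's production pipeline.

    # P-Norm Push for a linear scoring function f(x) = w.x: minimize over w
    #   sum over negatives k of ( sum over positives i of exp(-(f(x_i) - f(x_k))) )**p.
    # Larger p concentrates the penalty on negatives ranked near the top of the list.

    import numpy as np

    def pnorm_push_loss_grad(w, X_pos, X_neg, p):
        s_pos, s_neg = X_pos @ w, X_neg @ w
        E = np.exp(s_neg[:, None] - s_pos[None, :])   # E[k, i] = exp(-(f(x_i^+) - f(x_k^-)))
        inner = E.sum(axis=1)                         # one term per negative k
        loss = np.sum(inner ** p)
        coef = p * inner ** (p - 1)
        grad = (coef * inner) @ X_neg - (coef[:, None] * E).sum(axis=0) @ X_pos
        return loss, grad

    def fit(X_pos, X_neg, p=4, lr=1e-3, steps=2000):
        w = np.zeros(X_pos.shape[1])
        for _ in range(steps):
            _, grad = pnorm_push_loss_grad(w, X_pos, X_neg, p)
            w -= lr * grad / (np.linalg.norm(grad) + 1e-12)   # normalized gradient step
        return w

    rng = np.random.default_rng(0)
    X_pos = rng.normal(1.0, 1.0, size=(30, 5))    # e.g. manholes with past events
    X_neg = rng.normal(0.0, 1.0, size=(200, 5))
    w = fit(X_pos, X_neg)
    print((X_pos @ w).mean() > (X_neg @ w).mean())   # positives ranked higher on average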
Title: Two Newly Developed Quantitative Methods for Meta Analysis
We will discuss two newly developed statistical methods in meta-analysis. The first gives exact inferences about the parameter of interest under a fixed-effects model. This proposal is quite useful, especially for cases in which the event (or incidence) rates are rather low across all studies in the meta-analysis; we use a data set examining the cardiovascular safety of the diabetes drug Avandia for illustration. The second method estimates the random-effects distribution nonparametrically. We will show the advantage of this procedure compared with the conventional methods for estimating the mean of the random-effects distribution, using a data set examining the safety issues of "epo" (a red cell stimulator) for illustration. We will also discuss some open problems in meta-analysis and how to improve the meta-analysis culture. This research is joint with Lu Tian of the Stanford University School of Medicine, Tianxi Cai of the Department of Biostatistics in the Harvard School of Public Health, and Rui Wang of the Biostatistics Center at Massachusetts General Hospital and Harvard Medical School.
Title: How to Find a Correlation When None Exists
Paleoclimatologists use statistical models built from data collected from multiple sources to reconstruct global temperature records. These models assume that pairs of time series, including tree-ring chronologies, measurements of temperature, ice cores, and lake sediments, are connected. Evidence of connection is established by computing the correlation of the partial sums of the time series, rather than the correlation of the time series themselves. It is known that this procedure induces spurious correlation, but we provide the first quantification of this error by computing the distribution of the empirical correlation of two uncorrelated Brownian motions. We conclude by relating this result to the uncertainty of temperature reconstructions over the last 1000 years.
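The effect is easy to reproduce by simulation; the following Python sketch compares the empirical correlation of two independent white-noise series with the correlation of their partial sums (discrete Brownian motions). The sample sizes and seed are arbitrary.

    # Independent series are nearly uncorrelated, but the correlation of their
    # partial sums has a broad, non-degenerate distribution, so large values
    # arise purely by chance.

    import numpy as np

    rng = np.random.default_rng(1)
    cors_raw, cors_cum = [], []
    for _ in range(2000):
        x, y = rng.normal(size=1000), rng.normal(size=1000)
        cors_raw.append(np.corrcoef(x, y)[0, 1])
        cors_cum.append(np.corrcoef(np.cumsum(x), np.cumsum(y))[0, 1])

    print(np.std(cors_raw), np.std(cors_cum))           # raw: tight around 0; partial sums: wide
    print(np.mean(np.abs(np.array(cors_cum)) > 0.5))    # sizable fraction of spuriously "strong" correlations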
Title: Evaluating the Repeatability of Two Studies of a Large Number of Objects: Modified Kendall Rank-order Association Test
Assessing the reproducibility of research studies can be difficult, especially when the number of objects involved is large and only a small subset of those objects is truly relevant to the scientific question. For example, in microarray analysis, although data sets contain expression levels for tens of thousands of genes, only a small fraction of these genes is expected to be regulated by the treatment in a single experiment. In such cases, the reproducibility of two studies is high only for objects with real signals. One way to assess reproducibility is to measure the association between the two sets of data; traditional association methods, however, lack adequate power to detect the real signals. In this talk we present a modified Kendall rank-order test of association based on truncated ranks. Simulation results show that the proposed procedure considerably increases the power to detect real signals. Applications to gene expression analysis and genetic epidemiology will be discussed.
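One plausible form of such a truncated-rank statistic (my reading, not necessarily the authors' exact test) collapses all ranks beyond the top K to K + 1 and computes Kendall's tau on the truncated ranks, with significance assessed by permutation:

    # Truncated-rank Kendall association between two studies of the same objects.

    import numpy as np
    from scipy.stats import kendalltau, rankdata

    def truncated_kendall(x, y, K, n_perm=2000, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        rx, ry = rankdata(-x), rankdata(-y)              # rank 1 = strongest signal
        tx, ty = np.minimum(rx, K + 1), np.minimum(ry, K + 1)
        stat = kendalltau(tx, ty)[0]
        perm = [kendalltau(tx, np.minimum(rankdata(-rng.permutation(y)), K + 1))[0]
                for _ in range(n_perm)]
        pval = (1 + np.sum(np.array(perm) >= stat)) / (1 + n_perm)
        return stat, pval

    # Toy example: 1000 genes, 50 with a shared signal across the two studies.
    rng = np.random.default_rng(2)
    signal = np.concatenate([np.full(50, 3.0), np.zeros(950)])
    study1 = signal + rng.normal(size=1000)
    study2 = signal + rng.normal(size=1000)
    print(truncated_kendall(study1, study2, K=100))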
Title: Google Ads Quality - A Real-Life Statistical Problem for Multi-Layered Systems
In this talk, we will look at a real-life Google ads quality problem from a statistics perspective. We will discuss the issues encountered when the overall system includes multiple layers while the intra-layer interactions are unknown. The layers in this specific case are: 1) all the ads provided by advertisers; 2) automated filtering of bad ads; 3) manual filtering of ads deemed to be of uncertain quality by the automated layer. The overall result is the set of ads shown to Google users. We will present the challenges of measuring each layer of the system and of using such measurements to design the best process around the inner layers. We also need to deal with further obstacles imposed by limited experiment and measurement sizes, along with high noise levels in the measurements. Think you have a solution? Then please come to the talk.