The useR! 2015 poster session is organised for Wednesday evening. The list of posters will be extended right up until the conference. The confirmed posters are listed below, sorted by poster title.
Fabián Santos, University of Bonn:
Since the distribution policy for Landsat data enabled free download of the whole archive, processing and analysing large collections of Landsat images has become a challenging task that demands efficient processing chains, which are not yet available in open source software. For this reason, we developed a sequence of user-friendly R scripts to organise the management, processing and extraction of land cover change patterns from a time series archive of the Landsat sensors TM, ETM+ and OLI-TIRS. To improve computing time, we use parallel computing and chose the best libraries and algorithms available from different open source geographic information systems to provide standard geometric and radiometric correction, as well as cloud, shadow and water masking, change detection and accuracy assessment. Our first prototype gave us a map of forest restoration from 1984 to 2014 for a set of study areas distributed along an altitude gradient in the Amazon region of Ecuador. This processing chain constitutes a useful tool for ecosystem monitoring, evaluation of potential REDD+ projects, deforestation mapping, land grabbing studies, and other studies derived from Landsat time series analysis.
Thierry Onkelinx, Research Institute for Nature and Forest:
Markup languages like Markdown, HTML and LaTeX separate content and style. This distinction makes it fairly easy to apply a different style to a document. The knitr package facilitates the creation of reproducible documents by combining R code with these markup languages. The more recent rmarkdown package converts R Markdown documents into a variety of formats including HTML, MS Word and PDF. We used these tools to create a package that applies the corporate identity to reproducible documents. The source code of the documents can be either LaTeX or Markdown. Dummy documents with various style items are added as vignettes to check consistency with the corporate identity. The main component of the package is a local texmf tree which contains the corporate identity for several types of documents (report, slides, poster). For Sweave files, only this part of the package is necessary. For R Markdown files, two additional components are necessary: Pandoc templates and R functions. The Pandoc templates select the appropriate LaTeX style and put the content of variables into the document. The R functions translate information in the YAML block of the Markdown file to the correct Pandoc template and required variables (title, author, cover image, language, ...).
Jakub Kuzilek, Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom:
Our research aims at identifying students at risk of failing their course at the Open University, UK. For this analysis we developed a system that uses R as the platform for at-risk student identification. The available data contain several demographic attributes such as gender, previous education, age, etc., and unique data about student interactions in the Virtual Learning Environment. These data are then processed using k-Nearest Neighbours (package FNN), a CART decision tree (package rpart) and a naïve Bayes classifier (package e1071). Each classifier predicts whether the student will pass or fail to submit the next assignment; these predictions are then combined using majority voting to reach the final decision. We are delivering predictions for more than 25000 students every week. To evaluate classification quality we use precision and recall. Both measures vary over the duration of the course, but the overall values are around 70%.
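The majority-voting step and the two quality measures named in this abstract can be illustrated with a minimal sketch. It is written in Python rather than R, it is not the authors' system, and the function names and the four-student toy data are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for one student."""
    return Counter(predictions).most_common(1)[0][0]


def precision_recall(y_true, y_pred, positive="fail"):
    """Precision and recall for the class treated as positive (here: at-risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical per-student predictions from three classifiers.
knn = ["fail", "pass", "fail", "pass"]
tree = ["fail", "fail", "pass", "pass"]
bayes = ["pass", "fail", "fail", "pass"]

# With three classifiers and two labels a tie is impossible.
combined = [majority_vote(votes) for votes in zip(knn, tree, bayes)]

truth = ["fail", "pass", "fail", "pass"]  # made-up outcomes
precision, recall = precision_recall(truth, combined)
```

Using an odd number of classifiers, as the abstract does, means a binary vote always has a clear winner.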
Luzia Burger-Ringer, Department of Statistics, University of Technology, Graz, Austria:
In the region of Graz the threshold value (50 μg/m³) for the average daily concentration of PM10 is exceeded on more than 100 days of the year. These exceedances occur mainly in the six months from October to March. We therefore investigated the influence of meteorological as well as anthropogenic factors based on data from the winter seasons 2002/03 to 2014/15. Exploratory data analysis shows that wind and/or precipitation lead to lower values of PM10, whereas temperature inversion (lower temperature on the ground than above the ground) yields rather high values of PM10. This meteorological phenomenon can be observed on up to 60% of the days in winter seasons and may be one explanation for the extraordinarily high PM10 values in and around Graz. However, the anthropogenic impact cannot be neglected either. We will illustrate some scenarios which point out the influence of traffic and combustion processes.
Anders L. Madsen, Hugin Expert A/S, and Antonio Salmeron, University of Almería:
Today, omnipresent sensors are continuously providing streaming data on the environments in which they operate. Sources of streaming data with even a modest updating frequency can produce extremely large volumes of data, thereby making efficient and accurate data analysis and prediction difficult. Probabilistic graphical models (PGMs) provide a well-founded and principled approach for performing inference and belief updating in complex domains endowed with uncertainty. The on-going EU-FP7 research project AMIDST (Analysis of MassIve Data STreams, http://www.amidst.eu) is aimed at producing scalable methods able to handle massive data streams based on Bayesian networks technology. All of the developed methods are available through the AMIDST toolbox, implemented in Java 8. We show how the functionality of the AMIDST toolbox can be accessed from R. Available AMIDST objects include variables, distributions and Bayesian networks, as well as those devoted to inference and learning. The interaction between both platforms relies on the rJava package.
Maxim Nazarov, Open Analytics, Antwerp, Belgium:
Toxicology testing is an indispensable part of the drug development process. Genetic toxicity studies play an important role in the safety assessment of a compound during the preclinical stage. They are aimed at detecting whether a compound induces DNA damage that can cause cancer or heritable defects. We present a suite of R packages that facilitate statistical analysis and reporting for commonly used genetic toxicity assays: in vivo and in vitro micronucleus tests and the comet assay. The experiments usually follow a common hierarchical set-up, which allows most of the analysis and reporting to be automated. The statistical analyses implemented in the R packages follow the recent recommendations from the OECD guidelines for toxicity testing, and include fitting generalized linear models or mixed-effects models with appropriate hypothesis tests. A range of different plots can be created for data exploration. Additionally, functionality to generate customized reports (including Word) is provided using R Markdown and pandoc with custom filters.
Agnes Salanki, Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary:
According to the classical definition by Hawkins, outliers are observations deviating so much from the bulk of the data that it is suspicious that they were generated by other mechanisms. Thus in several domains, like security (network intrusion detection) or finance (fraud detection), outlier identification and characterization is not only preparation for further statistical model building but an individual step of data analysis in its own right. Automatic outlier detection is supported in R; several implementations are available in packages like depth, fields, robustX, DMwR, etc. The poster presents two use cases for the above-mentioned R functions. First, outlier detection is used for finding performance anomalies in the behavior of the educational cloud that has been running at our university since 2012. The cloud serves 300 students and is designed as a high-availability system (>99% availability); thus, identifying anomalous behavior that causes further performance problems is vital for us. Second, conclusions of outlier detection performed on the PISA survey results are presented, including possible interpretations and open questions about the school performance of students from individual countries. Interpretation of the results is supported with visualizations tailored to outlier detection.
Amy Large, Office for National Statistics, UK:
In April 2013, the UK Government Digital Service (GDS) released the Government Service Design Manual. This selection of documents includes guides on picking the right technological tools for jobs we want to do. Of open source software, the manual states that the Government has a level playing field between proprietary and open source software, and it should be actively considered when looking at software solutions. At the Office for National Statistics (ONS), the tools currently being used for statistical processing are predominantly traditional licence-based tools such as SAS and SPSS. With the freedom to use open source software, we are now in a position to make better use of R. But how do we bridge that gap between what we know and are comfortable with, and the new possibilities afforded to us with open source tools, both as an Office and as individuals? This paper will focus on my steps to ensure that I am embracing R and all it has to offer. I will also discuss how R is being used within ONS, and what steps are being taken to encourage analysts and developers to consider this as an alternative to the more traditionally used tools like SAS.
Veit Zoche-Golob, Department of Bioprocess Engineering – Microbiology, Faculty 2, University of Applied Science and Arts, Hannover, Germany:
The background of the analysis was an investigation of the association between the milk fat-protein ratio and the incidence of clinical mastitis in dairy cows. A mixed Poisson regression model for time-to-event data including repeated events and time-varying explanatory variables was fitted using the R package lme4. Because the recording of clinical mastitis might have been imperfect, a probabilistic analysis should be conducted to assess the direction and the magnitude of the misclassification bias on the conventional estimates. It was not feasible to refit the model several times due to its complexity. Therefore, data sets were simulated using the fitted model. A matrix adjustment method was used to simultaneously model the misclassification of the outcome (different across the number of previous events and the levels of the fat-protein ratio) and the misclassification of the number of previous events which depended on the misclassification of the outcome. Given the assumptions we made about the bias parameters and the methods we used, the conventional parameter estimates for fat-protein deviations were biased toward the null by 10-20% by the misclassification of clinical mastitis.
W.H. Moolman, Department of Statistics, Walter Sisulu University, Mthatha, South Africa:
Mann and Whitney (1947) gave a recursive formula to calculate probabilities of the distribution of their test statistic. This formula is rather slow and not very useful from a computational point of view. Better computational algorithms that are based on the probability generating function of the distribution were suggested by a number of authors including Wilcoxon, Katti and Wilcox (1973) and Harding (1984). R implementations of the three aforementioned algorithms will be shown. For sufficiently large sample sizes the normal approximation can be used. An R program for determining the sample sizes for which the normal approximation is accurate will also be presented.
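For illustration, the counting form of the Mann and Whitney recursion can be sketched as follows. This is a Python sketch, not the R implementation shown on the poster; the function names are hypothetical, and the memoisation via `lru_cache` is an addition that avoids recomputing subproblems. Here `count_u(u, m, n)` is the number of orderings of m observations from one sample and n from the other (no ties) that give the statistic the value u.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_u(u, m, n):
    """Number of orderings of m x's and n y's with Mann-Whitney U equal to u."""
    if u < 0:
        return 0
    if m == 0 or n == 0:
        # With an empty sample there are no pairs, so U can only be 0.
        return 1 if u == 0 else 0
    # The largest observation is either an x (it exceeds all n y's and adds n
    # to U) or a y (it adds nothing to U).
    return count_u(u - n, m - 1, n) + count_u(u, m, n - 1)


def pmf_u(u, m, n):
    """P(U = u) under the null hypothesis, where all orderings are equally likely."""
    return count_u(u, m, n) / comb(m + n, m)


def cdf_u(u, m, n):
    """P(U <= u) under the null hypothesis."""
    return sum(count_u(k, m, n) for k in range(u + 1)) / comb(m + n, m)
```

For example, `pmf_u(0, 2, 2)` is 1/6, because only one of the six orderings of two x's and two y's places both x's below both y's.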
Zuguang Gu, Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany:
Circular layout is an efficient way to visualize multi-dimensional data. The pioneering software Circos [1] has already had great success with circular visualization in many areas, especially for understanding huge amounts of genomic data. Here we present the circlize [2] package, which implements circular visualization in R and enhances the available software. The package is based on the implementation of basic low-level graphics functions (e.g. drawing points and lines). It is therefore flexible enough to customize new types of graphics. In addition, with the seamless connection between data analysis and visualization in R, automatic procedures for generating circular designs can be easily achieved. With its generality and simplicity, circlize provides a basis on which high-level packages focusing on specific interests can be built. We will demonstrate how to control the circular layout closely, how to use low-level graphics functions to build a complex circular plot, and specific applications, e.g. in phylogenetics and genomics. Finally, we will demonstrate the customization of the Chord Diagram, which is useful for revealing complex relations in the data. References: [1] Krzywinski M, et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19:1639-1645. [2] Gu Z, et al. circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:19
Barbara Jenko, Institute of biochemistry, Faculty of Medicine, University of Ljubljana, Slovenia:
Objectives: Variability in genes involved in methotrexate (MTX) transport and target pathways was associated with MTX treatment outcome in rheumatoid arthritis (RA) patients. We investigated the effect of a large number of clinical and genetic factors on MTX discontinuation due to inefficacy and adverse events (AE). We developed and evaluated a prognostic index that could facilitate the translation of our results into the clinic. Methods: In total 333 RA patients were genotyped for polymorphisms in the folate and adenosine pathways and MTX transporters. Multivariable Cox models with LASSO penalisation were used to estimate prognostic factors and to construct the prognostic index. Its predictive capacity was evaluated with the cross-validated area under the time-dependent receiver operating characteristic curve (tAUC). Results: MTX dose, ABCG2 and ADORA2A were associated with discontinuation due to inefficacy, while RF or ACPA seropositivity, MTX dose, MTX monotherapy, SLC19A1, ABCG2, ADORA3 and TYMS were associated with discontinuation due to AE. The clinical-pharmacogenetic model of MTX discontinuation due to AE had better predictive ability than the non-genetic model; however, the prediction was mostly worthless during the 2-year treatment period. Conclusions: Application of these clinical-pharmacogenetic predictive models may support the future development of personalized MTX treatment in clinical practice.
Andrew Bray, Department of Mathematics, Reed College, Portland, Oregon:
The useR! 2014 conference in Los Angeles featured an invited talk on OpenIntro, a project that develops free and open-source educational resources. The focus of the talk was on the development of the OpenIntro Statistics textbook as a collaborative and open-source enterprise and its parallels with the R project. At the end of the talk, by far the most common question from useR! participants was: is the textbook on GitHub? In June 2014, the answer was no, but one year later, the answer is yes. The full textbook is now in a public GitHub repository, as are the OpenIntro Labs, which teach statistics by analyzing real data in R. In this poster we would like to update conference participants on how educators at all levels can access, remix, and contribute content related to statistics education. We would also like to showcase a model for successful forks of educational materials, namely, R labs that have been translated into the R mosaic idiom.
Mark van der Loo, Department of Methodology, Statistics Netherlands, The Hague:
My recently published "settings" package aims to make the management of option settings in R more convenient. In particular, with this package one can:
* Define one's own option settings manager (with default settings) in a single call.
* Alter or request options as with "options()", but also reset all option values with a single call.
* Merge or alter option settings either globally or locally with ease (e.g. when writing functions with the "..." argument).
Besides that, the package offers a convenient way to reset "options()" or "par()" to their "factory settings" in a single call. For example, calling reset_par() resets almost every graphical parameter to its default; exceptions are a few settings that typically have device-dependent defaults such as "mai" (margin size in inches) or "pin" (current plot dimensions in inches). A call to reset_options() resets all of R's options to their defaults.
Robert Tell, Abbott Diagnostics, Abbott Laboratories:
Within the medical device industry, there is great reliance on the SAS programming language deriving from decades of SAS-based clinical trial data analysis. This proficiency has proliferated throughout industry over the years, creating both a wealth of programming experience and large libraries of code. This contributes to significant institutional inertia to use only SAS. R is a functionally equivalent language that offers to expand the toolset available for statistical programming. R allows for great flexibility due to its rich selection of open-source libraries, scalability, and ease of deployment. However, the adoption of R by a corporation within a regulated industry presented a series of challenges. These surrounded compliance both with the FDA and Abbott’s own quality system. As part of these processes, there were the steps of vendor quality assurance, software quality assurance, and software code validation. Finally, there was the effort required to assure stakeholders that R could be used successfully for these tasks while maintaining regulatory and quality compliance. All of these proved significant challenges in extending analytical applications to R within the research and development organization at Abbott Diagnostics. Herein, the process will be outlined and a roadmap for others might be furnished in developing compliant R applications.
Matti Lassila, Jyväskylä University Library, Finland:
The integrated library system (ILS) has traditionally been the backbone of all library operations, including acquisition of resources, cataloguing and collection management. Therefore, a wealth of information is stored in the ILS and is potentially available for analysis. Unfortunately, the built-in reporting capabilities of the ILS are usually very limited. Sometimes, these limitations can be circumvented using an external database query tool, if the ILS vendor permits direct SQL access to the relational database powering the ILS. In our case we were using MS Access as the primary reporting interface to the Oracle 11g database of the ILS. At the request of non-technical staff members, the systems librarian created MS Access queries on an ad hoc basis. This workflow was time-consuming and, because of the manual nature of the process, it was impossible to utilize real-time information such as book hold statuses or transaction logs. Using existing SQL queries as a starting point, we created a Shiny web app to automate and greatly improve our reporting process. In addition to Shiny, the key building blocks have been ROracle and scheduleR, which we are using as a lightweight Extract-Transform-Load (ETL) tool.
Anita Höland, Institute for Medical Informatics, Justus-Liebig-University, Giessen, Germany:
Microarray analysis has been developed to identify the expression levels of a large number of genes simultaneously. R and the Bioconductor platform are widely used to conduct the necessary data preparation and analysis steps to identify emerging patterns of gene expression. Packages like Limma were developed in the early days of R, and the Bioconductor project supplies a large variety of R packages for data analysis for different kinds of microarrays. The package Codelink, which we use, was developed to preprocess and analyze Codelink Bioarrays (GE Healthcare). To use these packages and to prepare the data for analysis, a certain amount of knowledge about the required operations and the usage of R is needed. To ease and speed up this process for users without knowledge of R, we are developing a workflow in R which only requires the user to input certain parameters to produce the output in graphs and tables. The workflow includes functions from existing packages and newly implemented functions developed for this purpose. To broaden the range of applicable studies, the workflow can be adapted for supervised (case-control studies) and unsupervised (cluster analysis) study designs.
Yasuto Nakano, School of Sociology, Kwansei Gakuin University, Nishinomiya, Japan:
The purpose of this presentation is to propose an environment for social research data and its analysis. An R package, DDIR, and an IDE, dlcm, which utilize social research information in DDI format, offer an integrated environment for social research data. DDI (Data Documentation Initiative) is an XML protocol for describing information related to social research, including questionnaires, research data, metadata and summaries of results. There are several international research projects which use this protocol as a standard format. ICPSR (Interuniversity Consortium for Political and Social Research), one of the biggest data archives for social research data, encourages data depositors to generate documentation that conforms with DDI. In the R environment, there is no standard data format for social research data. In many cases, we have to prepare numerical data and label or factor information separately. If we use a DDI file as a data file with DDIR in R, only one DDI file needs to be prepared. The DDI file could be a standard data format for social research data in the R environment, just as the 'sav' file is in SPSS. DDIR provides an integrated social research analysis environment with R, and supports reproducible research.
Gokmen Zararsiz, Hizir Yakup Akyildiz and Ahmet Ozturk, Department of Biostatistics, Faculty of Medicine, Erciyes University, Kayseri, Turkey, and Dincer Goksuluk, Selcuk Korkmaz, Sevilay Karahan and Eda Karaismailoglu, Dept. of Biostatistics, Hacettepe University, Ankara, Turkey:
A quick evaluation is essential for patients with acute abdominal pain. It is crucial to differentiate between surgical and nonsurgical pathology to prevent mortality and morbidity. Practical and accurate tests are important in this differentiation. Recently, the D-dimer level has been found to be an important marker in this diagnosis, clearly outperforming leukocyte count, which is widely used for the diagnosis of certain cases. Here, we built DDNAA, a user-friendly shiny application, to assist physicians in their decisions when diagnosing patients with acute abdomen. An experimental study was conducted and 28 statistical learning approaches were assessed for this purpose, combining leukocyte count and D-dimer levels in order to increase diagnostic accuracy. The DDNAA web tool includes the best-performing algorithms (naïve Bayes, robust quadratic discriminant analysis, k-nearest neighbors, bagged k-nearest neighbors and bagged support vector machines), which provided increases in diagnostic accuracy of up to 8.93% and 17.86% compared to D-dimer level and leukocyte count, respectively. The DDNAA shiny application is available at http://www.biosoft.hacettepe.edu.tr/DDNAA/.
Jason Waddell, Open Analytics:
Traditional color legends present a missed opportunity for gaining added insight into the color variable. The densityLegend() function introduces functionality that combines the legend with a color-partitioned density trace, for visualizing the distribution of the color variable. In addition to legends, color-partitioned density traces introduce a range of unique visualizations. We present a package for easy integration of density legends into base R plots, with a planned ggplot2 adaptation.
Pavel Bocek, Department of Stochastic Informatics, UTIA AVCR, the Czech Republic:
The presentation introduces an R package for performing two recent directional multiple-output quantile regression methods generalizing Koenker's quantile regression to the case of multivariate responses. It starts with a necessary but brief theoretical introduction, continues with a brief description of the R package and its functionality, and concludes with a carefully designed set of practical illustrative examples of how the package can be used to solve the parametric optimization problems behind both of the directional multiple-output quantile regression approaches, to evaluate the resulting regression quantile contours or their cuts, and to compute various meaningful inferential statistics. The applications include locally constant multiple-output quantile regression and the computation of halfspace depth contours in two- to six-dimensional spaces, among others. In summary, the R package finally makes the two promising multiple-output quantile regression methods freely available to the statistical public.
Andreas Karlsson, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden:
Discrete event simulation (DES) is a powerful technique for simulating complex systems. Surprisingly, there are few options for DES in R. Microsimulation is a modelling technique that operates at the level of individuals. We required a flexible DES framework for microsimulation of prostate cancer screening. For the microsimulation package (https://github.com/mclements/microsimulation), we provide a pedagogic DES implementation with R5 classes. However, the R5 implementation and a C++ process-oriented DES scale poorly for 10^6 individuals. For speed and flexible simulation specification, we use a C++ event-oriented DES library as the simulation core. A natural workflow uses R for pre- and post-processing, with complex data structures passed between R and C++ using Rcpp. We have developed several tools to support the microsimulation. For variance reduction, we use common random numbers, with stream manipulation at the C++ level. We also use C++ reports to substantially reduce the post-processing burden in R. We demonstrate the package by predicting the cost-effectiveness of prostate cancer screening for different screening scenarios. The combination of Rcpp and C++ allows for a fast DES framework with easy data management and analysis in R.
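To make the event-oriented approach and the use of common random numbers concrete, here is a minimal sketch. It is written in Python rather than C++ or R, it is not the microsimulation package's API, and every name and number in it (the onset rate, the range of death ages, the screening ages, the seeding scheme) is a made-up assumption for illustration only.

```python
import heapq
import random


def simulate_individual(person_id, screening, seed=2015):
    """Event-oriented simulation of one hypothetical individual."""
    # Common random numbers: the stream depends only on the individual, so
    # every screening scenario sees exactly the same random draws.
    rng = random.Random(seed * 1_000_003 + person_id)
    onset_age = rng.expovariate(1 / 80.0)  # assumed mean onset age of 80
    death_age = rng.uniform(60.0, 100.0)   # assumed age at other-cause death

    # The event queue is a heap ordered by the age at which an event occurs.
    events = [(onset_age, "onset"), (death_age, "death")]
    if screening:
        events += [(age, "screen") for age in (50.0, 60.0, 70.0)]
    heapq.heapify(events)

    has_cancer = False
    detected_age = None
    while events:
        age, kind = heapq.heappop(events)  # always process the earliest event
        if kind == "death":
            return {"death_age": age, "detected_age": detected_age}
        if kind == "onset":
            has_cancer = True
        elif kind == "screen" and has_cancer and detected_age is None:
            detected_age = age


# The same individuals simulated under two scenarios.
with_screening = [simulate_individual(i, True) for i in range(1000)]
without_screening = [simulate_individual(i, False) for i in range(1000)]
```

Because both scenarios share the same draws per individual, differences between them come only from the screening events, which is the variance-reduction idea behind common random numbers.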
Tom De Smedt, KULeuven:
Disease mapping is one of the most widely used techniques in epidemiology. In disease mapping we divide the area of interest in different subareas, and we display the disease rate for each of these subareas, potentially allowing us to identify anomalous subareas. In order to be able to correctly compare subareas when the disease under investigation is statistically rare, modern disease mapping methods use different kinds of spatial smoothing methods. In ecological regression we also have an exposure variable for each subarea, which allows us to look at the relationship between the disease and exposure on the subareal level. We have developed a Shiny application, where the user can upload the disease data (and the exposure data) for each subarea. The application can then correctly map the data, where the user can choose between different smoothing methods (no smoothing, Bayesian hierarchical smoothing methods, spline smoothing). For ecological regression, both exposure and disease data are uploaded, and the user can choose between different methods to investigate the relationship. Given the large rise in available datasets, this application allows epidemiologists to quickly evaluate novel data using an easy-to-use and lightweight platform.
Sonya Abbas, E-Government unit, NUIG, GALWAY, IRELAND:
The pressure to evaluate and improve governments' action plans is driven by a combination of factors. These include the difficulty of discovering similarities between the action plans of different countries, which helps in learning from similar experience, and the need to evaluate the action plans based on the challenges and commitments they address. Document clustering has been used as an approach to solve this problem. However, this was not straightforward, as we faced challenges related to data preparation, such as dividing the action plans into challenges and commitments and filtering the features, and to visualization. The usual data science steps were followed: data collection, cleaning, preparation, analysis, visualization and interpretation. We use R to implement them. K-means and hierarchical clustering have been applied, and different visualizations are presented using packages like ade4, ggplot2, ellipse, HSAUR and flexclust. For evaluation, we suggest internal and external quality measures like entropy, F-measure and overall similarity, and implement them using R packages.
Konstantin Chizhov, Laboratory of radiation-health studies, Burnasyan Federal Medical Biophysical Center, Moscow, Russia:
We present a methodology for evaluating the reliability of dose reconstruction that takes into account the uncertainty introduced by the human factor. In the absence of radiation monitoring, dose analysis is based on questionnaires: accounts of people's activities recorded on paper. This situation is typical of accidents in which there is not enough dosimetry equipment. Thus, there is a chance that a questionnaire will contain information that does not correspond to reality, and that some information will be absent or invented. In our study we checked the questionnaires of 28 emergency workers. We used two questionnaires for each worker: one filled in during the first year after the accident, and a second filled in after 20 years. For comparison we transformed the questionnaires into a matrix of elemental fragments with a set of parameters: dose rate, coordinates, residence time and protective factor. The dose rate at every location was calculated using the RADRUE method by interpolation of measured values. Through this analysis, we calculated the contribution of the human factor to the uncertainty of doses and identified the places where workers reported incorrect information.
Yoni Schamroth, Perion Networks:
Controlled experimentation has been universally adopted by the online world as an essential tool to aid the decision-making process. It has been widely recognized as a successful scientific method for establishing causality. Frequently referred to as A/B testing or multivariate testing, controlled experiments provide a relatively straightforward method for quickly discovering the expected impacts of new features or strategies. One of the main challenges involved in setting up an experiment is deciding upon the OEC, or overall evaluation criterion. In this paper, we demonstrate the importance of choosing a metric that captures long-term effects. Such metrics include measures such as life-span or lifetime value. We present motivating examples where failure to focus on the long-term effect may result in an incorrect conclusion. Finally, we present an innovative methodology for early detection of lifetime differences between test groups.
Ana Costa e Silva, Tibco Software Inc.:
TIBCO Spotfire® - Hadoop integration points can be grouped into two categories: TIBCO Spotfire native data connectors and TERR (TIBCO's enterprise platform for the R language)-Hadoop integration. Both provide an extensive set of analytic features and security options. Spotfire Hadoop connections can be quickly configured into analytic workflows, dashboards, or reports, which can then be shared, reused, and consumed across organizations. KPIs based on Hadoop data can be pushed to virtually any user device via HTML. Extensive geo analytic support within Spotfire makes it easy to generate insights from geographical data. In this session, we will explain and demo the powerful combination of Spotfire, TERR, and Hadoop and how it enables deeper, more valuable analysis of Hadoop data. We will demonstrate how with it the business user can have an easy to use front-end from which to: a) visualise big data interactively with surprising performance, b) with just a few clicks, deploy map-reduce jobs, which run R code in the TERR engines installed in the Hadoop data nodes, c) with again just a few clicks, launch H2O jobs when running calculations on all data-nodes at once, d) consume the result of calculations, e.g. predictive models, and deploy them in real-time.
Dincer Goksuluk, Selcuk Korkmaz, Sevilay Karahan and A. Ergun Karaagaoglu, Dept. of Biostatistics, Hacettepe University, Ankara, Turkey, and Gokmen Zararsiz, Department of Biostatistics, Faculty of Medicine, Erciyes University, Kayseri, Turkey:
ROC analysis is a fundamental tool for evaluating the performance of a marker in a number of research areas, e.g., biomedicine, bioinformatics and engineering, and is frequently used for discriminating cases from controls. A number of analysis tools guide researchers through their analysis. Some of these tools are commercial and provide basic methods for ROC analysis, while others offer advanced analysis techniques through a command-based user interface, such as R. R includes comprehensive tools for ROC analysis; however, a command-based interface can be challenging and time consuming when a quick and reliable evaluation is desired, especially for non-R users such as physicians. Hence, there is demand for a quick, comprehensive, free and easy-to-use analysis tool. For this purpose, we developed easyROC, a user-friendly web tool based on the R language. The tool provides ROC statistics, graphical tools, optimal cut point calculation and comparison of several markers to support researchers in their decisions without writing R code. easyROC can be used from any device with an internet connection, independent of device configuration and operating system. The web interface of easyROC is built with the R package shiny. The tool is freely available through www.biosoft.hacettepe.edu.tr/tools.html.
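As an illustration of what an optimal cut point calculation involves, the following minimal Python sketch picks the threshold that maximises Youden's index (sensitivity + specificity - 1). Youden's index is only one common criterion and is an assumption here; the abstract does not say which criteria the web tool offers. The function names and the toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A case is called positive when its score is >= the threshold.
    labels: 1 = case, 0 = control.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        points.append((t, tp / pos, tn / neg))
    return points


def youden_cutoff(scores, labels):
    """Pick the ROC point with the largest Youden index J = sens + spec - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


# Toy marker values (hypothetical): higher scores suggest a case.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
cut, sens, spec = youden_cutoff(scores, labels)
print(cut, sens, spec)
```

With these toy values the threshold 0.6 captures every case while misclassifying one control.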
Sebastian Kreutzer, IRAMAT-CRP2A, Université Bordeaux Montaigne, France:
Earth surface processes decisively shape our planet. To decipher the timing and rates of Earth surface processes throughout the last 250,000 years, one numerical dating method has reached paramount importance: luminescence dating. This method provides robust numerical data on environmental changes by measuring the luminescence signal of minerals, which is reset by daylight exposure or heating, and has the advantage of using nearly ubiquitously available mineral grains of quartz or feldspar. During the last decades more and more luminescence-based ages have been requested and the method has been considerably enhanced. However, the increasing methodological complexity demands a flexible and scalable software solution for data analysis. The presented R package Luminescence is designed as a toolbox intended to provide customised solutions for a variety of requirements, e.g. measurement data import, statistical analysis and graphical output. The algorithms and statistical treatments used are always transparent, and the user remains in control of combining and adjusting algorithms by taking advantage of the wide range of functions available in R. Our contribution summarises the concept of the R package Luminescence and focuses on some conceptual aspects and selected practical examples.
Tan Teck Kiang, Institute for Adult Learning, Workforce Development Agency, Singapore:
The quasi-symmetry (QS) model is one of the doubly classified non-standard log-linear models commonly used in the social sciences to examine the relationship between cells within a square table. The main characteristic of QS is that it exhibits symmetry in the odds ratios for off-diagonal cells. A newly proposed model, the quasi-symmetry model with n-degree symmetry, QS(n), relaxes this assumption of symmetric odds ratios to a general QS with a varying degree of symmetry. At the lowest degree of symmetry, n=1, the QS(1) model, only those cells closest to the diagonal are in symmetry and those further away are freely estimated. The number of cells in symmetry goes up as the degree of symmetry n increases, thus forming a series of QS models with symmetry degree n. The QS(n) models are fitted as generalized linear models, using the R function gnm. Using survey data collected to examine the association between literacy skills and problem solving skills, the results show that QS(1) models fit better than the QS model in explaining the association between the two skills, indicating the incremental information content and usefulness of the QS(n) model.
Malene Juul, Department of Molecular Medicine (MOMA), Aarhus University Hospital, Denmark:
Malene Juul, Johanna Bertl, Qianyun Guo, Asger Hobolth, Jakob Skou Pedersen. In the age of big data, efficient data handling and analysis are challenging tasks. R gives access to a comprehensive and well maintained set of tools valuable for doing statistics and data analysis. However, these advanced functionalities often come at the price of calculation speed. In cancer genomics, data sets with billions of data points are routinely produced. In this work we analyze 2,500 whole genome DNA sequences, each consisting of approximately three billion data points. In this setting the bottleneck is the availability of efficient and accurate analysis methods. Here, we are interested in determining the distribution of the sum of independent discrete stochastic variables using a dynamic programming approach for mathematical convolution. The chosen granularity of the discretization is a trade-off between calculation accuracy and speed. I will cover a variety of the code optimization strategies applied in the R implementation, e.g. the relatively simple changes needed to move from R's primary data structure, the data.frame, to its enhanced version, the data.table.
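The convolution idea can be sketched in a few lines of Python: the distribution of a sum of independent discrete variables is built up one variable at a time, each step convolving the running result with the next probability mass function. This is a generic illustration on integer-valued toy variables, not the authors' R implementation; the function names are hypothetical.

```python
def convolve(p, q):
    """Convolve two pmfs given as lists, where index k holds P(value = k)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def sum_distribution(pmfs):
    """Distribution of the sum of independent variables, one convolution per variable."""
    total = [1.0]  # pmf of the constant 0, the empty sum
    for pmf in pmfs:
        total = convolve(total, pmf)
    return total


# Three fair coin flips: the sum follows a Binomial(3, 0.5) distribution.
coin = [0.5, 0.5]
dist = sum_distribution([coin, coin, coin])
print(dist)  # [0.125, 0.375, 0.375, 0.125]
```

A coarser discretization shortens the lists and speeds up every convolution at the cost of accuracy, which is the trade-off the abstract refers to.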
Meryam Krit, Open Analytics:
Although the Weibull distribution is widely used in many areas such as engineering, biomedical sciences, economics and reliability, its relevance for a given data set is not always checked, or is checked only by elementary techniques such as Weibull plots. There exist more sophisticated techniques which aim to determine whether a given model is suited to a given data set: the goodness-of-fit (GOF) tests. Many GOF tests for the Weibull distribution have been developed over the years, but there is no consensus on the most efficient ones. The aim of the talk is to present the R package EWGoF, which gives an overview of up-to-date GOF tests for the two-parameter Weibull and the Exponential distributions. It contains a large number of GOF tests for the Exponential and Weibull distributions, classified into families: tests based on the empirical distribution function, tests based on the probability plot, tests based on the normalised spacings, tests based on the Laplace transform, likelihood-based tests, and others. An illustrative application of the GOF tests to real data sets is carried out at the end of the talk.
Pavel Kulmon, Jana Noskova, David Mraz, Department of Mathematics, Faculty of Civil Engineering, Czech Technical University in Prague, Czech Republic:
The Hough transform is a feature extraction technique whose purpose is to find imperfect instances of objects within a certain class of shapes. The transformation has been successfully used in several areas such as computer vision, image analysis and, last but not least, photogrammetry and remote sensing. This work builds on the book "Introduction to image processing using R: learning by examples" by A.C. Frery and T. Perciano, which outlines options for working with digital images in R. Our main aim is to develop the new R package Hough. Algorithms for non-user image evaluation will be implemented in this package. At present the package contains the Hough transformation for line detection using the accumulator. We also want to incorporate other Hough techniques, such as the recognition of more advanced analytic curves. The preparation of the non-user image evaluation consists of various methods such as image grayscaling, thresholding and histogram estimation. The conversion from the grayscale image to a binary image is realised by approximating the derivatives, which are computed with the Sobel operator. The new R package Hough will be a collection of all the techniques mentioned.
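To make the accumulator idea concrete, here is a minimal Python sketch of Hough line detection on a list of edge points. Each point votes for every line (theta, rho) passing through it, using the normal form rho = x cos(theta) + y sin(theta), and the accumulator cell with the most votes marks the dominant line. The function names, bin sizes and toy points are assumptions for illustration and do not reflect the Hough package's interface.

```python
import math


def hough_accumulator(points, n_theta=180, rho_step=1.0):
    """Vote in a (theta bin, rho bin) accumulator stored as a dict."""
    acc = {}
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (k, round(rho / rho_step))
            acc[key] = acc.get(key, 0) + 1
    return acc


def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (theta in degrees, rho, votes) of the fullest accumulator cell."""
    acc = hough_accumulator(points, n_theta, rho_step)
    (k, r), votes = max(acc.items(), key=lambda kv: kv[1])
    return 180.0 * k / n_theta, r * rho_step, votes


# Eleven edge points on the horizontal line y = 5, plus two stray points.
points = [(x, 5) for x in range(0, 101, 10)] + [(3, 40), (70, 20)]
angle, rho, votes = strongest_line(points)
print(angle, rho, votes)
```

The horizontal line y = 5 has a vertical normal, so it shows up at theta = 90 degrees and rho = 5, with one vote per point on the line.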
Lukas W. Lehnert, Department of Geography, Philipps-University Marburg, Germany:
An R software package is introduced which focuses on the processing, analysis and simulation of hyperspectral (remote sensing) data. The package provides a new class (Speclib) to handle large hyperspectral datasets and the respective functions to create Speclibs from various types of datasets, such as raster data or point measurements taken with a field spectrometer. Additionally, the package includes functions for pre-processing of hyperspectral datasets and gives access to the vegetation reflectance simulation models PROSPECT and PROSAIL. The functionality of the package for analyzing hyperspectral datasets encompasses a wide range of common methods in remote sensing, such as the transformation of reflectance spectra using continuum removal, linear spectral unmixing, the calculation of normalized ratio indices and over 90 different hyperspectral vegetation indices. Additionally, direct access to multivariate analysis tools such as generalized linear models and machine learning algorithms is provided via the caret package. The contribution shows a subset of the available methods, demonstrated by the analysis of 3D hyperspectral data taken to investigate effects of CO2 enrichment on grassland vegetation.
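A normalized ratio index compares the reflectance in two bands as (R_i - R_j) / (R_i + R_j). The short Python sketch below computes it for one spectrum and for all band pairs; the function names and the reflectance values are hypothetical and are not taken from the package.

```python
def normalized_ratio_index(reflectance, i, j):
    """(R_i - R_j) / (R_i + R_j) for two band positions of one spectrum."""
    r_i, r_j = reflectance[i], reflectance[j]
    return (r_i - r_j) / (r_i + r_j)


def nri_matrix(reflectance):
    """Index for every ordered pair of bands; the result is antisymmetric."""
    n = len(reflectance)
    return [[normalized_ratio_index(reflectance, i, j) for j in range(n)]
            for i in range(n)]


# Hypothetical reflectances for three bands of a vegetation spectrum.
spectrum = [0.05, 0.08, 0.45]
matrix = nri_matrix(spectrum)
print(normalized_ratio_index(spectrum, 2, 0))
```

Because the index is a ratio of a difference to a sum, it always lies between -1 and 1 for positive reflectances, which makes values comparable across spectra with different overall brightness.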
Jouni Helske, Department of Mathematics and Statistics, University of Jyväskylä, Finland:
State space modelling is an efficient and flexible method for statistical inference for a broad class of time series and other data. Structural time series, ARIMA models, and generalized linear mixed models are just some examples of models which can be written as state space models. Standard methods are often restricted to Gaussian observations due to their analytical tractability. I introduce the R package KFAS (Kalman Filtering And Smoothing), which can be used for state space modelling with observations from the exponential family, namely Gaussian, Poisson, binomial, negative binomial and gamma distributions. After introducing the basic theory behind state space models and the main features of KFAS, an illustrative example of forecasting alcohol-related deaths in Finland is presented.
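For readers new to the topic, the simplest Gaussian state space model is the local level model, y_t = mu_t + eps_t with mu_{t+1} = mu_t + eta_t. The Python sketch below runs the Kalman filter for it. It is a textbook illustration only: it covers just the Gaussian case, uses a large initial variance as a stand-in for a diffuse prior, and its names and parameters are assumptions unrelated to the KFAS interface.

```python
def local_level_filter(y, sigma2_eps, sigma2_eta, a1=0.0, p1=1e7):
    """Kalman filter for the local level model.

    y          : observations
    sigma2_eps : observation noise variance
    sigma2_eta : level (state) noise variance
    a1, p1     : prior mean and variance of the initial level
    Returns the filtered level estimates E[mu_t | y_1..y_t].
    """
    a, p = a1, p1
    filtered = []
    for obs in y:
        # Update step: combine the prediction with the new observation.
        f = p + sigma2_eps        # prediction error variance
        k = p / f                 # Kalman gain
        a = a + k * (obs - a)
        p = p * (1.0 - k)
        filtered.append(a)
        # Prediction step: the level follows a random walk.
        p = p + sigma2_eta
    return filtered


# With no level noise the filtered level tends to the running mean.
filtered = local_level_filter([1.0, 2.0, 3.0], sigma2_eps=1.0, sigma2_eta=0.0)
print(filtered)
```

Setting the level noise to zero turns the model into a constant mean observed with noise, so the last filtered value is close to the sample mean of 2.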
Johanna Bertl, Department of Molecular Medicine, Aarhus University, Denmark:
Understanding the mutational process in cancer cells is crucial to distinguish driver mutations, responsible for the initiation and progress of cancer, from passenger mutations. The heterogeneity of the process on various levels makes this a challenging question: whole-genome analyses have shown that the mutation pattern differs fundamentally between different cancer types, but also between patients and along the genome. Here, we analyse whole-genome DNA sequences of tumor and healthy tissue of 505 patients with 14 different cancer types (Fredriksson et al., Nature Genetics, 2014). We model the probabilities of different types of mutations at each position on the genome by multinomial regression. Explanatory variables capture local genomic characteristics like the local base composition, the functional relevance of the region and epigenetic factors. The enormous dataset creates two different computational challenges. First, with the 3 billion base pairs of the human genome, $n$ is very large. This requires storing and analysing the data in a compact format, even at the cost of losing information. Second, including interactions between the patient ID and genomic variables considerably increases the number of parameters to estimate and thereby creates convergence problems. We approach these challenges using existing R packages and our own developments.
Marco Chiarandini, Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark:
Recent research in the field of heuristic algorithms for nonlinear optimization has focused on methods to fine-tune the main heuristic decisions that are embedded in these algorithms. These decisions can be represented as categorical and numerical parameters. We designed and built a method to find the best setting of parameters based on graphical models and Bayesian learning. Each parameter is modeled by a node of the network, and dependencies assumed a priori are modeled by arcs. Each node has an associated local probability distribution, which for continuous parameters is given by Gaussian linear regression. Learning is achieved by a combination of importance sampling and Bayesian calculus. We implemented the method in R, building on the package deal. Every data point corresponds to a run of the algorithm being tuned, which may be computationally expensive; we therefore used Rmpi to execute the algorithms in parallel in a distributed environment. The results on two test cases, the traveling salesman problem and a nonlinear continuous optimization case derived from least median of squares, show that the method achieves competitive results with respect to state-of-the-art automatic tuning systems. More extensive testing is needed.
Davor Cubranic and Jenny Bryan, Department of Statistics, University of British Columbia, Canada:
_Logr_ implements a logging framework that uses R's existing messaging functionality and builds upon it an API that is simple to use. Because it uses the same underpinnings, _logr_ can capture output generated by `message`s and `warning`s, making it easy to adopt even in a mature codebase. Most logging packages today seem to copy the Java log4j API. It's important to remember that log4j originated as the logging code for the Apache web server. As such, it was designed for use in large, long-running, and complex applications that contain many subsystems and potentially produce many output events per second. _Logr_ instead targets a lighter-weight usage scenario -- arguably more likely in an R codebase -- of a relatively short, often interactive script or command-line utility. To this end, _logr_ provides a minimal API with sensible defaults that requires little effort to use in code. Still, the API is powerful enough to allow multiple logging destinations, each with its own level of detail. This makes it simple, for instance, for a script to provide informational progress messages to the user while recording detailed output in a log file.
Earvin Balderama, Department of Mathematics & Statistics, Loyola University Chicago, Chicago, Illinois, USA:
The spatial distribution and relative abundance of marine birds along the US Northeast and Mid-Atlantic coastlines are of special interest to ocean planners. However, marine bird count data often exhibit excessive zero-inflation and extreme over-dispersion. Our modelling effort incorporates a spatial-temporal double-hurdle model specifically tailored to look at extreme abundances, which is especially important for assessing potential risks of offshore activities to seaducks and other highly aggregative species. We discuss several distributional forms for each component of the model, including negative binomial, log-normal, and a generalized Pareto distribution to handle the extreme right tails. Spatial heterogeneity is modelled using a conditional auto-regressive (CAR) prior, and a Fourier basis is used for seasonal variation. Model parameters are estimated in a Bayesian hierarchical framework, using an MCMC algorithm with auto-tuned parameters, all written and run in R. We demonstrate our model by creating monthly predictive maps (using ggmap) that show areas of high probability of aggregation and persistence for each of over fifty "high-priority" species as listed by the Marine-life Data Analysis Team (MDAT) in collaboration with the Northeast Regional Ocean Council (NROC). A Shiny (RStudio) app is currently being developed for quick reference to a desired space-time-species map.
Jonatan A. González, Department of Mathematics, University Jaume I, Castellón de la Plana, Spain:
Point processes are random collections of points falling in some space. This concept is used to describe a huge set of natural phenomena in a wide variety of applications. Our interest concerns spatial point processes, where each point represents the location of some object or event, such as a tree or a sighting of a species. The classical model for a point process is the Poisson process, where the numbers of points in any disjoint sets are independent random variables. The Poisson process is a natural null model in the absence of clustering or inhibition. In order to differentiate between individuals based on their patterns, it is necessary to define a distance between two point patterns. The purpose of this work is to outline several types of distances (and non-metric measures of dissimilarity) between two point patterns. We aim to implement dissimilarity measures based on functional or scalar descriptors of point processes, including estimators of first and second moments of the processes, or classical test statistics based on these moments. Finally, we perform a simulation study and a real data analysis using functions and packages in R.
Gokmen Zararsiz and Ahmet Ozturk, Gokmen Zararsiz, Department of Biostatistics, Faculty of Medicine, Erciyes University, Kayseri, Turkey, Dincer Goksuluk and Selcuk Korkmaz, Department of Biostatistics, Faculty of Medicine, Hacettepe University, Ankara, Turkey, Vahap Eldem, Department of Biology, Faculty of Science, Istanbul University, Istanbul, Turkey, Izzet Parug Duru, Department of Physics, Faculty of Science, Marmara University, Istanbul, Turkey, and Turgay Unver, Department of Biology, Faculty of Science, Cankiri University, Cankiri, Turkey:
With recent developments in molecular biology, it has become feasible to measure the expression levels of thousands of genes simultaneously. Using this information, one major task is gene-expression based classification. Numerous classification algorithms have been developed and adapted for this type of classification using microarray data. RNA-Seq is a more recent technology, which uses the capabilities of next-generation sequencing technologies. It has some major advantages over microarrays, such as providing less noisy data and detecting novel transcripts and isoforms. These advantages can also affect the performance of classification algorithms. Working with less noisy data can improve the predictive performance of classification algorithms. Further, novel transcripts may serve as biomarkers for a related disease or phenotype. The MLSeq package includes several classification and feature selection algorithms, as well as normalization and transformation approaches for RNA-Seq classification. MLSeq is available at http://www.bioconductor.org/packages/release/bioc/html/MLSeq.html
Florent Baty, Department of Pulmonary Medicine, Cantonal Hospital St. Gallen, Switzerland:
Six-minute walk tests (6MWT) are common examinations performed on lung disease patients. Oxygen uptake (VO2) kinetics during 6MWT typically follow 3 phases that can be modelled by nonlinear regression. Simultaneous modelling of multiple kinetics requires nonlinear mixed models which, to our knowledge, have not yet been fitted in practice. The aim is to describe the functionality of the R package medrc, which extends the nlme package framework for fitting nonlinear mixed models with a user-friendly interface, and to demonstrate its usefulness in pulmonary medicine. 6MWT VO2 kinetics were measured on 61 patients with chronic obstructive pulmonary disease classified into 3 severity stages. A 6-parameter nonlinear regression model was defined and fitted to the set of kinetics using the function medrm(), allowing for automated fitting of a single joint nonlinear mixed model on multiple curves by extending the functionality of nlme(). All kinetics phases were incorporated within a single mixed model including fixed factors stratified into 3 clusters (disease stages), together with patient-specific random effects. Significant between-stage differences were found regarding maximum VO2 during exercise testing, inflection point and oxygen level at recovery. medrc provides a comprehensive framework for the parametrisation of and inference for hierarchical nonlinear mixed-effects regression models in various biomedical applications.
Joe Suzuki, Osaka University:
When analysing data with R, computing mutual information (MI) is very often needed. For two discrete variables, it is easy to compute the MI (one can easily write a function in R). If the variables are Gaussian, it is easy as well, because only the correlation coefficient needs to be estimated. In this work, we consider the most general case, in which the density function may not exist. The estimator is the difference of the BIC values under dependence and under independence. The same principle works for the discrete and continuous cases. The estimated MI is strongly consistent, and is almost surely negative if and only if the two variables are independent. The implementation is based on Ryabko's measure (2009). Given samples of size n, the estimation completes in O(n log n). In the presentation, we show the source program and several experimental results using the package. Finally, we show that the same principle is useful in learning a graphical model structure from examples if we extend the estimator of the MI of X,Y to that of the conditional MI of X,Y w.r.t. Z, which can detect conditional independence of X,Y given Z.
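For the easy discrete case mentioned above, the MI can be estimated by plugging empirical frequencies into the definition sum p(x,y) log[p(x,y) / (p(x) p(y))]. The Python sketch below shows this plain plug-in estimator only; it is not the BIC-based estimator proposed in the abstract, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of the mutual information (in nats) of two discrete samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # c/n is the empirical joint probability; c*n/(cx*cy) is the ratio
    # of the joint probability to the product of the marginals.
    return sum((c / n) * math.log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())


# Identical variables share all their information: MI = log 2 for a fair bit.
mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# A balanced 2x2 design has empirically independent variables: MI = 0.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(mi_dependent, mi_independent)
```

The plug-in estimate is never negative, so on its own it cannot signal independence by its sign, unlike the estimator described in the abstract.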
Kirill Müller, IVT, ETH Zurich:
For reproducible research, it is crucial to be able to generate all results from original raw data. By automating the process, it is possible to easily verify reproducibility at any stage during the analysis. Automation also allows easy recreation of the entire analysis based on modified inputs or model assumptions. However, rerunning the entire analysis starting from raw data soon becomes too time-consuming for interactive use. Caching intermediate results alleviates this problem but requires a robust mechanism for cache invalidation. R packages are a suitable container for statistical analyses: They can store data, code, and documentation. Recent efforts have considerably simplified the packaging process. This poster presents an approach to conduct a statistical analysis by creating a "package web" -- interdependent packages where each serves a dedicated purpose (e.g., holding raw data, munging data, input validation, modeling, analysis, reporting, ...). Package dependencies define the data flow for the entire analysis. The "rpkgweb" companion package tracks which downstream packages need to be rebuilt if a package changes, and builds independent packages in parallel. Reproducibility can be monitored continuously with minimal effort, yet the modular structure permits interactive work.
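The dependency tracking can be pictured as a graph problem: when one package changes, every package that depends on it directly or indirectly must be rebuilt, and in an order that respects the dependencies. The Python sketch below illustrates that idea on a hypothetical five-package web; the package names and the function are invented for illustration and say nothing about how rpkgweb is implemented.

```python
from collections import deque


def rebuild_order(deps, changed):
    """List the packages to rebuild after `changed` is modified.

    deps maps each package to the packages it depends on.
    The result starts with `changed` and lists every downstream package
    after all of its affected dependencies.
    """
    # Invert the graph: who depends on whom.
    dependents = {}
    for pkg, reqs in deps.items():
        dependents.setdefault(pkg, [])
        for req in reqs:
            dependents.setdefault(req, []).append(pkg)

    # Breadth-first search collects everything downstream of the change.
    affected = {changed}
    queue = deque([changed])
    while queue:
        cur = queue.popleft()
        for dep in dependents.get(cur, []):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)

    # Kahn's algorithm gives a valid build order within the affected set.
    indegree = {p: sum(1 for r in deps.get(p, []) if r in affected)
                for p in affected}
    ready = sorted(p for p in affected if indegree[p] == 0)
    order = []
    while ready:
        cur = ready.pop(0)
        order.append(cur)
        for dep in sorted(dependents.get(cur, [])):
            if dep in affected:
                indegree[dep] -= 1
                if indegree[dep] == 0:
                    ready.append(dep)
    return order


# Hypothetical package web for one analysis.
deps = {
    "raw": [],
    "munge": ["raw"],
    "model": ["munge"],
    "report": ["model", "munge"],
    "docs": [],
}
print(rebuild_order(deps, "munge"))  # ['munge', 'model', 'report']
```

Packages that become ready at the same time do not depend on each other, which is what makes building them in parallel possible.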
Ana F. Militino, Universidad Pública de Navarra:
PASWR2 is the second version of PASWR, an acronym for Probability and Statistics with R. The package contains data sets, functions and scripts created for solving exercises, problems and theoretical questions from the book of the same name. Its goal is to teach statistics at an intermediate level using both classical mathematical tools and R. Traditional scripts following theoretical formulae, together with programmed commands that illustrate them, are presented.
Gergely Daróczi, Easystats Ltd, United Kingdom:
The poster shows an annotated but mainly visual map of the world, which highlights the activity of R users from various points of view -- similar to what I presented at the previous useR! conference. The plots and the infographics were created in R, inspired by some recent blog posts on cartograms and the xkcd package, but using a wide variety of data sources collected, cleaned, merged and aggregated by the author. These include the number of visitors to R-bloggers.com, the attendees of all previous useR! and some other R-related conferences, the members and supporters of the R Foundation, the number of users on GitHub with R repositories, package download statistics from CRAN mirrors, the number of R User Groups and the number of attendees at their events. Besides these raw data, the poster will also present a population-weighted scale of R activity for all countries of the world.
Michael Dietze, Helmholtz Centre Potsdam - German Research Centre for Geosciences:
Geomorphology, i.e. the investigation of processes that shape our planet at scales of milliseconds to millions of years, of nanometers to continents, is a long-established, vibrant scientific field, which is confronted with trans-disciplinary methodological demands. Two key innovations in the last 30 years have profoundly affected this discipline: i) the ability to efficiently quantify the shape and rate of change of landforms and ii) the ability to determine the ages of landforms. This has renewed the relevance of Earth surface process research to many disciplines, such as geology, biology, geography, engineering and social sciences. The integrative, scale-bridging nature of geomorphology demands matching data handling methods. This contribution will give insight into emerging Earth surface process research fields and how R feeds into comprehensive and effective data processing. It shows how existing package functionalities contribute to novel, task-specific packages. The contribution highlights how more and more packages cover methodological gaps and how CRAN policy allows these packages to be joined for integrated spatial data handling, time series analysis, numerical modelling and signal processing, as well as statistical analysis and modelling.
Florian Detsch, Environmental Informatics, Philipps-Universität Marburg, Germany:
'remote' implements a collection of functions to facilitate empirical orthogonal teleconnection analysis. Empirical Orthogonal Teleconnections (EOTs) denote a regression-based approach to decompose spatio-temporal fields into a set of independent orthogonal patterns. They are quite similar to Empirical Orthogonal Functions, with EOTs producing less abstract results that are orthogonal in either space or time. In this paper we present the R implementation of the original algorithm by Huug van den Dool in the 'remote' package. In particular, the use of Rcpp for the intensive regression calculations ensures acceptable computation times and memory usage for this 'brute force' spatial data mining algorithm. This is a very important aspect of 'remote', as the number of data points in spatio-temporal geoscientific fields is generally extremely large and can easily require millions, or even billions, of calculations. To highlight its usefulness we provide some examples of potential use-case scenarios for the method, including the replication of one of the examples from van den Dool's original paper, as well as statistical downscaling from coarse to fine spatial grids (using NDVI fields).
Michael Rustler, Christoph Sprenger, Nicolas Caradot and Hauke Sonnenberg, Kompetenz Wasser, Berlin:
In environmental sciences numeric models play an important role in supporting decision making. Usually, the modelling procedure (parameterisation, calibration, validation, scenario analysis) includes manual steps. For example, models are often calibrated by changing parameter values manually using a trial-and-error method (e.g. Anibas et al. 2009). This makes it challenging to document how the modelling software was used, which is a prerequisite for making the applied methodology transparent and thus the whole modelling process reproducible. Automation by means of programming can improve the modelling process. Once a methodology is implemented in the form of program code it is inherently documented. The code can be run repeatedly and will always produce the same results given the same inputs. We used R as the programming language to automate the modelling workflow. Different models have been ‘wrapped’ by means of R packages: VS2DI (groundwater flow, heat and solute transport), WTAQ-2 (well drawdown), EPANET (pressurised pipe networks) and Gompitz (sewer ageing). These models can now be configured and run from within the R environment. This allows the use of R’s excellent functions for retrieving and preparing input data (e.g. monitoring, geographical data) as well as for analysing and plotting simulation results and generating reports. Modelling is described in the form of version-controlled R scripts, so that its methodology becomes transparent and modifications (e.g. error fixing) become trackable. This leads to reproducible results, which should be the basis for smart decision making.
Kennedy Mwai, Data Manager, KEMRI-Wellcome Trust Research Programme:
Reproducibility being laudable and frequently called for, we should be instilling this practice in students before they set out to do research. The maturity and extensive reproducibility features of Git-, R- and RStudio-based materials make them an excellent choice for professional statistical skills training. The most commonly used statistical software packages, namely SAS, SPSS and Stata, can cost over US$500 for a single license and over US$5,000 for discounted twenty-user access, and it can be challenging to create reproducible courses with them for universities and training institutions. Currently, the trend is changing and the popularity of R is rising. This talk will focus on the success of converting a two-week course on statistical methodology for the design and analysis of epidemiological studies (SMDAES) from a Stata-based course to a reproducible R-based course using RMarkdown or LaTeX, Git and GitHub, R and RStudio Server. We will present the importance of using Git, R and RStudio as tools for statistical workshops and institutional training over commercial tools. The talk will also show the success of teamwork on Git while creating the course materials.
Attilio Mattiocco, Economics and Statistics Department, Bank of Italy, Rome, Italy:
SDMX (Statistical Data and Metadata Exchange) is a standard for the exchange of statistical data. It is widely adopted by international institutions (e.g. ECB, OECD, Eurostat, IMF and others) for data dissemination. The RJSDMX package has been built as a connector between SDMX data providers and the R environment. The package provides functions for connecting to SDMX web services and downloading data as zoo time series. The package also provides text and graphical functions for exploring the metadata contents of the providers, helping the user identify the data of interest, build and validate specific queries. The package is part of the Web Technologies Task View and it has been inserted in the rOpenSci project. The RJSDMX package is part of a wider framework, the 'SDMX Connectors for Statistical Software', an Open Source project that aims to provide the same SDMX data exploration and access functions in the most popular data processing tools (R, STATA, SAS, Excel, MATLAB).
Sebastian Warnholz, Statistische Beratungseinheit:
The demand for reliable regional estimates from sample surveys has grown substantially over the last decades. Small area estimation provides statistical methods to produce reliable predictions when the sample sizes in specific regions are too small to apply direct estimators. Model- and design-based simulations are used to gain insights into the quality of the methods utilized. We present a framework which may help to support the reproducibility of simulation studies in articles and during research. The R package saeSim is designed to provide a simulation environment for the special case of small area estimation. The package may allow the prospective researcher to produce simulation studies with minimal coding effort during the research process. It provides a consistent naming convention and emphasises a literate programming philosophy.
Jesse H. Krijthe, Pattern Recognition and Bioinformatics, Delft University of Technology, The Netherlands:
Semi-supervised learning considers a particular kind of missing data problem, one in which the dependent variable (label) is missing. The goal of semi-supervised learning is to construct models that improve over supervised models that disregard the unlabeled data. These models are used in cases where unlabeled data is easy to obtain or labeling is relatively expensive. Example applications are document and image classification and protein function prediction, where additional objects are often inexpensive to obtain, but labeling them is tedious or expensive. For my research into robust models for semi-supervised classification I have implemented several new and existing semi-supervised learners that have been combined in the RSSL package. This package also contains several functions to set up benchmarking and simulation studies, comparing several semi-supervised algorithms. This includes the generation of different kinds of learning curves and cross-validation results for semi-supervised and transductive learning. The goal of this work is to make reproducible research into semi-supervised methods easier for researchers and to offer simple consistent interfaces to semi-supervised models for practitioners. The package is still under development and I would like to discuss how to improve the interfaces of the models to interact more easily with other packages.
Ann Liu-Ferrara, Statistical Tools Department, BD:
We demonstrate a Shiny application that uses multiple new technologies to increase efficiency and ease of use. The tool was created to combine material testing results from a PDF and up to five data files and process the results for upload into a database. The tool automates tedious manual work that had taken users hours to do, and reduces the chance of human error. The tool is for BD internal use, but the techniques have wide applicability. Behind the user interface, the reactive upload fields are created within an sapply loop, which is efficient and reduces duplicated code. The data in each file are displayed as the data are loaded using a ggvis plot. Each plot contains either one or five curves showing the testing results, one curve per sample. The user can see data details by hovering the mouse over a data point and can select a curve by clicking. The selected curve is cleaned and highlighted instantaneously. The user can download the raw and cleaned data and multiple graphs for upload to the database. The result is an easy-to-use, fast interface that produces a dramatic time saving and a very satisfied client.
Rytis Bagdziunas, CORE, Université catholique de Louvain, Louvain-la-Neuve, Belgium:
Macroeconomic data has lately become accessible in computer-friendly formats, e.g. the SDMX REST APIs used by Eurostat, ECB or OECD. Such data availability allows analysts and researchers to keep their datasets up to date at no cost and with little effort. Easy access to large datasets continues to spur growth of data-driven techniques for short-term forecasting and construction of diffusion indices. While these techniques are still being actively researched, common dimension reduction methods, such as principal component or factor analysis, are already known to be consistent and have desirable properties under fairly reasonable assumptions, even for serially correlated economic data. The `dynfactoR` package (in development) aims to reimplement these economic models in the `R` language in a concise and self-explanatory manner. This includes dynamic factor estimation based on the EM algorithm and Kalman filtering, support for missing data, linking quarterly and monthly observations as well as data segmentation. My poster will illustrate how, in a few minutes, `dynfactoR` along with the `rsdmx` package can be used to construct from scratch easily maintainable short-term economic forecasting models loosely comparable to those used nowadays in central banks.
Maxim Dorofiyenko, AdRoll:
Shiny is an amazing tool for building powerful web applications with R, but the deployment process can be a real headache. Normally developers need to interact with a command line (ssh/scp) to work on their apps. Sparkle is a framework for deploying Shiny apps that promotes version control and attempts to eliminate barriers to Shiny development. We do this using a combination of Jenkins CI and the R packages brew and knitr. With the Sparkle workflow, deploying an app becomes as easy as pushing to GitHub. From there the developer can easily iterate on their app without running into some of the common problems associated with hosting a Shiny app on a remote machine. The goal is to make this available as open source in the near future.
Bruce Moore, Moore Software Services LLC:
Differences in loan interest rates for different racial and ethnic groups in the United States have been a topic of research for decades and are currently used as the basis for regulatory enforcement actions under Fair Lending laws. There is significant public policy debate over the statistical methods used by regulatory agencies and the method that is used to estimate the race and ethnicity of borrowers, as lenders cannot legally require race and ethnicity information on a loan application. A simulation model shows that the lower accuracy of race and ethnicity estimation for African Americans makes it unlikely that race-based discrimination would be detected. It also shows that it is likely that a small number of enforcement actions would occur when perfect race and ethnicity identification would detect no race-based discrimination. A second simulation model shows that a very small number of preferential loans to white non-Hispanics can result in an enforcement action when the total number of loans to white non-Hispanics is small relative to other racial and ethnic groups.
Patrick Bolbrinker, no affiliation (private):
Objective: Present a computational workflow in R for text exploration and give an overview of possibilities and problems pertaining to quantitative text analysis. Background: Natural language processing (NLP) started with word length studies in the late 19th century. Today, over 100 years later, corpora analysis is, besides the development of efficient machine translation systems, one of the main objectives in natural language processing. Context processing and analysis of texts derived from large sources, e.g. all tweets in 2014, can be extremely difficult or even impossible due to limited computational resources. Therefore, it is more reasonable to use a representative subset. However, how do we measure “appropriate” for sample selection, and deal with language ambiguity? Methods: With the novice in mind I present an efficient workflow for preprocessing, sampling and statistical analysis of English language texts from Twitter and Project Gutenberg. The main focus lies on statistics for randomized sample selection and quantitative corpora exploration (discourse analysis, text-to-text word “marker” distributions etc.).
Carlos Cinelli, Central Bank of Brazil and University of Brasilia:
It is common to have many potential explanatory variables to choose from in situations in which the theory can be ambiguous about which ones to include in the model. One way to tackle this problem is using Bayesian Model Averaging, and the R ecosystem has (among others) two good packages for that, BMA and BMS. This presentation will introduce the sValues package (soon to be on CRAN), which provides an implementation of the S-value statistic, a measure of sturdiness of regression coefficients proposed by Leamer (2014a) and discussed in Leamer (2014b) to assess model ambiguity, an alternative approach to the methods mentioned above. The sValues package has a main function (with formula, data.frame and matrix methods) which does all the analysis and calculations for the user, and it also provides methods for summary, coefficients, plots and printing to let the user explore and export the results. To illustrate the use of the package, we use the “Growth Regressions” example, showing how one can easily replicate Leamer (2014a) using the sValues package and also compare its results with those of Fernández et al. (2001), Sala-i-Martin et al. (2004) and Ley and Steel (2009) using the BMS package.
Emilio López Cano, Sharing Knowledge and Intelligence Towards Economic Success (SKITES):
Because we are transitioning from the global economy to the collaborative economy, new business models are needed across economic sectors. In a data-driven economy where _data is the new oil_, the Analytics business is definitely in a good position to take advantage of this new paradigm. Moreover, the specific environment in the analytics sector, such as the **talent crunch**, expected growth in business and investments, shortage of experts, etc., calls for new paths to explore. On the other hand, the academic and business worlds traditionally have different objectives, and it is not easy to take talent to the business, and business to the talent. Through crowdsourcing initiatives like the one presented in this poster, a bridge is built to link both worlds. Talentyon is a global network of experts on Analytics, built on a crowdsourcing platform, that provides the much needed access to expertise and innovation in the field. The platform has been started by a team of academicians, data scientists and business consultants. The role of the R community is expected to be an important one, as Talentyon is an R natural-born initiative in several aspects presented in the poster.
Charlotte Rennuit and Sasha D. Hafner, Department of Chemical Engineering, Biotechnology and Environmental Technology, University of Southern Denmark, Odense, Denmark:
Anaerobic digestion is a biological-based process for producing biogas (a mixture of methane and carbon dioxide) from organic material. It is an important source of renewable energy. For example, Denmark has > 20 centralized biogas plants converting organic waste into heat and electricity. Tens of millions of small digesters produce cooking and lighting fuel from household waste in China. Biogas production is an active research area, and laboratory experiments are used to quantify production from particular substrates or systems. Collected data must be processed in a set of steps which may be implemented in different ways. These steps are rarely fully described, complicating comparisons between experiments. We developed an R package, “biogas” (available from CRAN), to simplify data analysis and increase reproducibility. Low-level functions include stdVol for standardizing gas volumes and interp for interpolating biogas composition or production. The cumBg function can be used to calculate cumulative gas production and rate, combining interpolation, volume standardization, and summation. Biochemical methane potential (BMP) can be directly obtained using the flexible summBg function. And biogas production and composition can be predicted using predBg. We hope that the biogas package simplifies analysis of biogas data and facilitates standardization in data processing.
Ashley Noel Hinton and Paul Murrell, Department of Statistics, The University of Auckland, Auckland, New Zealand:
The 'conduit' package for R is intended to support greater use of open data sets by encouraging the creation, reuse, and recombination of small scripts that perform simple tasks. The package provides a "glue system" for running "pipelines" of R scripts. Each script is embedded in an XML "module" wrapper, which defines the inputs required by the script and the outputs that the script produces. An R script can be made into a module even when the original script author has no knowledge of 'conduit'. This means that it is easy for any script to be reused in pipelines via the 'conduit' package. Embedding scripts in modules also helps to organise code and makes it simple to reuse scripts in other pipelines. Pipelines specify connections ("pipes") from the outputs of one module to the inputs of another module. The 'conduit' package orchestrates the execution of the module scripts and passes their results to subsequent modules. As modules are defined in XML, they are not specific to R scripts, and it is possible to wrap scripts written in other languages, such as Python.
Tal Galili, Tel Aviv University:
A dendrogram is a tree diagram which is often used to visualize a hierarchical clustering of items. Dendrograms are used in many disciplines, ranging from Phylogenetic Trees in computational biology to Lexomic Trees in text analysis. Hierarchical clustering in R is commonly performed using the hclust function. When a more sophisticated visualization is desired, the hclust object is often coerced into a dendrogram object, which in turn is modified and plotted. The dendextend R package extends the palette of base R functions for the dendrogram class, offering easier manipulation of a dendrogram's shape, color and content through functions such as rotate, prune, color_labels, color_branches, cutree, and more. These can be plotted in base R and ggplot2. dendextend also provides the tools for comparing the similarity of two dendrograms to one another: either graphically (using a tanglegram plot, or Bk plots), or statistically (with Cophenetic correlation, Baker's Gamma, etc.), while enabling bootstrap and permutation tests for comparing the trees. The dendextendRcpp package provides faster C++ implementations for some of the more computationally intensive functions.
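The base R workflow that this abstract builds on can be sketched as follows. This is a minimal illustration using only base R functions (`hclust`, `as.dendrogram`, `cutree`, `cophenetic`) and the built-in `USArrests` data; it does not call dendextend itself, and the choice of data set, linkage methods and `k = 4` are assumptions made for the example.

```r
# Distance matrix on standardized variables (USArrests ships with base R).
d <- dist(scale(USArrests))

# Two hierarchical clusterings of the same items with different linkages.
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Coerce to a dendrogram object, the class that can then be modified and plotted.
dend <- as.dendrogram(hc_complete)
plot(dend)

# Cut the tree into 4 groups (k = 4 is an arbitrary choice for illustration).
groups <- cutree(hc_complete, k = 4)

# Cophenetic correlation between the two trees: correlation of the
# tree-implied pairwise distances, one way to compare trees statistically.
coph_cor <- cor(as.vector(cophenetic(hc_complete)),
                as.vector(cophenetic(hc_average)))
coph_cor
```

A value of `coph_cor` close to 1 indicates that the two linkages imply similar pairwise merge heights.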
Jefferson Davis, Research Analytics, Indiana University, Bloomington, Indiana, USA:
The factors that encourage violence and civil war are of clear interest. Work in the last decade has suggested that the most important are poverty, political instability, rough terrain, and large populations. This downplays ethnic divisions as well as both the role democracy plays in providing a non-violent outlet for dissidence and the role that authoritarianism plays in tamping down dissidence. Using R to access national datasets can clarify the source of the data and allow easy updating. R scripts also automate the standardization of names for nations, ethnic groups, and religious groups within nations. A scripting approach also clarifies judgment calls in political science: should a state and a successor state count as separate nations? At what point does a breakaway region establish independence? What level of conflict qualifies as a war? All these questions allow multiple sensible answers. It is best to make the choices clear and easy to change if warranted. In our analysis both stable democracy and authoritarianism decrease the risk of civil war, refining the role of political instability as a factor. Also, with current detailed geographical data the rough terrain effect disappears. The use of R makes this analysis more transparent and more widely replicable.
Satu Helske, Department of Mathematics and Statistics, University of Jyväskylä, Finland:
In the social sciences, sequence analysis is being more and more widely used for the analysis of longitudinal data such as life courses. Life courses are described as sequences, categorical time series, which consist of one or multiple parallel life domains. Sequence analysis is used for computing the (dis)similarities of sequences and often the goal is to find patterns in histories using cluster analysis. However, describing, visualizing, and comparing large sequence data with multiple life domains is complex. Hidden Markov models (HMMs) can be used to compress and visualize information by detecting underlying life stages and finding clusters. The seqHMM package is designed for the HMM analysis of life sequences and other categorical time series. The package supports models for one or multiple sequences with one or multiple channels (dimensions/life domains), as well as functions for model evaluation and comparison. Sequence data can be clustered during the model fitting and external covariates can be added to explain cluster membership. Visualization of data and models has been made as convenient as possible. The user can easily plot multichannel sequence data, convert multichannel sequence data to single channel representation, and visualize hidden Markov models.
Haichuan Wang, Computer Science Department, University of Illinois at Urbana-Champaign:
In this presentation, we will introduce the VALOR package. Its purpose is to improve the performance of programs written in terms of Apply. The implementation of Apply in the R interpreter incurs significant overhead resulting from the iterative application of the input function to each element of the input data. For example, in lapply(L,f), the function f will be interpreted once for each element of L. Our approach performs data transformation and function vectorization to convert the looping-over-data execution into vector operations. The package transforms the input data to Apply operations into vector form and vectorizes the function invoked. In this way, it converts the conventional, iterative implementation of Apply into a single function invocation applied to a vector parameter containing all the elements of the input list. With the built-in support of vector data types and vector operations, this new form has much less interpretation overhead since the vectorized function is only interpreted once and vector operations in it are supported by native implementations. We used a suite of data analysis algorithm benchmarks to evaluate the package. The results show that the transformed code can achieve on average 15x speedup for iterative algorithms and 5x for direct (single-pass) algorithms.
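The transformation described here can be illustrated by hand in base R. The sketch below is not the VALOR API (which the abstract does not show); it only contrasts the element-by-element `lapply` form with a manually vectorized equivalent, using a hypothetical function `f` and list `L`.

```r
# A list of scalars and a function that happens to be written with
# vector-capable operations (both are made up for this illustration).
L <- as.list(1:1000)
f <- function(x) x^2 + 1

# Iterative form: f is interpreted once per element of L.
iterative <- unlist(lapply(L, f))

# Hand-vectorized form: the list is flattened to a vector and f is
# invoked a single time on all elements at once.
vectorised <- f(unlist(L))

# Both forms produce the same values.
identical(iterative, vectorised)
```

The manual rewrite only works because `f` already uses vector operations; automating the data transformation and the vectorization of `f` is the part the abstract attributes to the package.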
Markus Loecher, Dept. of Economics, Berlin School of Economics and Law, Germany:
The name "multi-armed bandit" describes a hypothetical experiment where one has to choose between several games of chance ("one-armed bandits") with potentially different expected payouts. This is a classic "exploit-explore" dilemma, as one needs to find the game with the best payout rate while at the same time maximizing one's winnings. It was recently claimed that in the context of ad campaign optimization sequentially updating Bayesian posterior probabilities in combination with Thompson sampling would enable decision making at drastically reduced sample sizes compared to a classical hypothesis test. We have implemented these ideas in the R package bandit. We further derive a normal approximation to the quantile of the value-remaining distribution which is fast to compute. The agreement is within a few (relative) percent. We further show empirical results which indicate that the stopping rule based on the value-remaining distribution is overly optimistic. The true rate of falsely declaring the wrong arm to be the winner is substantially higher than the set level would suggest. We speculate that this "significance level" needs to be adjusted due to multiple testing.
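For readers unfamiliar with Thompson sampling, a minimal base R sketch for Bernoulli arms is given below. It does not use the bandit package; the success counts, the uniform Beta(1, 1) priors and the number of posterior draws are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical observed data: conversions and impressions per arm.
successes <- c(20, 30, 40)
trials    <- c(100, 100, 100)

# Beta posterior parameters under a uniform Beta(1, 1) prior (an assumption).
shape_a <- 1 + successes
shape_b <- 1 + trials - successes

# Thompson sampling step: draw one value from each arm's posterior
# and play the arm with the largest draw.
next_arm <- which.max(rbeta(length(shape_a), shape_a, shape_b))

# Monte Carlo estimate of the posterior probability that each arm is best.
# rbeta recycles the shape vectors, so filling by row gives one column per arm.
n_draws <- 10000
draws <- matrix(rbeta(n_draws * length(shape_a), shape_a, shape_b),
                ncol = length(shape_a), byrow = TRUE)
prob_best <- tabulate(apply(draws, 1, which.max),
                      nbins = length(shape_a)) / n_draws
prob_best
```

Repeating the sampling step after each new batch of data shifts traffic towards arms with higher posterior probability of being best while still occasionally exploring the others.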
Brooke Anderson, Department of Environmental & Radiological Health Sciences, Colorado State University, Fort Collins, Colorado, USA:
Certain rare heat waves can have devastating effects on a community’s public health and well-being. Here, we used R machine learning tools to build models that classify a heat wave as “very dangerous” to human health or “less dangerous” based on heat wave characteristics (e.g., absolute temperature, temperature relative to its community’s temperature distribution, length, timing in season). Very dangerous heat waves are very rare, and so we considered methods to account for this class imbalance in model building. To build and test these models, we used data from 82 large US communities, 1987–2005. We built and evaluated five different types of models (two types of classification trees, bagging, boosting, and random forests ensemble models) using four different approaches for class imbalance (nothing, over-sampling from the rare class, over-/under-sampling, and Random Over-Sampling Examples [ROSE]), for a total of twenty models. We evaluated the models with Monte Carlo cross-validation, identifying three acceptable models. Using these, we predicted the frequency of very dangerous heat waves in these 82 communities in 2061–2080 under two scenarios of climate change (RCP4.5, RCP8.5), two scenarios of population change (SSP3, SSP5), and three scenarios of community adaptation to heat (none, lagged, on-pace).
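Of the class-imbalance approaches listed, simple over-sampling from the rare class is the easiest to sketch. The base R example below uses simulated data with made-up column names (`temp`, `dangerous`) and a made-up 10-in-200 imbalance; it is not the authors' data or code.

```r
set.seed(42)

# Simulated heat waves: 10 "very dangerous" events among 200 (hypothetical).
heat <- data.frame(
  temp      = rnorm(200, mean = 35, sd = 3),
  dangerous = c(rep(TRUE, 10), rep(FALSE, 190))
)

rare   <- which(heat$dangerous)
common <- which(!heat$dangerous)

# Resample rare-class rows with replacement until both classes are the same size.
extra    <- sample(rare, length(common) - length(rare), replace = TRUE)
balanced <- heat[c(common, rare, extra), ]

table(balanced$dangerous)
```

Over-sampling should be applied to the training folds only, so that duplicated rare events do not leak into the data used for evaluation.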
Sophie Birot, Statistics and Data Analysis Section, DTU Compute, Danish Technical University, Denmark:
Food allergies are a public health concern, as the high prevalence and severity of reactions can lead to harmful consequences. Risk management must therefore be conducted; avoidance diets are the most common way of managing allergens. However, a risk might remain due to allergen contamination of food products, leading to unintended consumption of the allergen. To estimate the risk following unintended allergen consumption, the recommended approach is probabilistic risk assessment. It is currently being reviewed and improved within the iFAAM project (Integrated Approaches to Food Allergen and Allergy Risk Management). This method takes into account three different sources of information: the amount of unintended allergen in the food (product contamination), the consumption of the contaminated product and the allergen threshold distribution (allergen dose which triggers an allergic reaction). All three distributions are modelled using the R software. Risk simulations are performed in two different ways in R, Monte Carlo simulations and Bayesian networks, to assess the number of allergic reactions. Both methods propagate variability and uncertainty from the input variables to the outcome, hence confidence and credibility intervals are provided for the allergic risk.
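The Monte Carlo variant of this assessment can be sketched in base R by sampling the three inputs and counting how often the consumed dose exceeds the individual threshold. Every distribution and parameter below is a hypothetical placeholder (lognormal shapes, medians and spreads chosen only for illustration), not an iFAAM value.

```r
set.seed(2015)
n <- 100000

# Hypothetical inputs; all three distributions and parameters are assumptions.
contamination <- rlnorm(n, meanlog = log(0.01), sdlog = 1)    # mg allergen per g product
consumption   <- rlnorm(n, meanlog = log(50),   sdlog = 0.5)  # g of product eaten
threshold     <- rlnorm(n, meanlog = log(5),    sdlog = 1.5)  # mg dose triggering a reaction

# A simulated eating occasion leads to a reaction when dose exceeds threshold.
dose <- contamination * consumption
risk <- mean(dose > threshold)

# Monte Carlo standard error of the estimated proportion. This reflects
# simulation noise only, not uncertainty about the input parameters.
se <- sqrt(risk * (1 - risk) / n)
c(risk = risk, lower = risk - 1.96 * se, upper = risk + 1.96 * se)
```

Propagating parameter uncertainty, as the abstract describes, would require an outer loop that also resamples the distribution parameters; that step is omitted here.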
Stanislaw Swierc, Institute of Computer Science, Silesian University of Technology, Poland:
R is a very popular language of choice among statisticians and data scientists who share their work with the Open Source Software community. They do it by making their code available either by publishing it for immediate download or by hosting it in public repositories. GitHub is one of the biggest project hosting services, where many R packages such as devtools are being developed. This platform is open and it has been used as a source of data for research on OSS development. In our research we mine GitHub for projects that use R. There are over 75k public repositories that list it as their primary programming language. They make heavy use of the self-contained packages available from The Comprehensive R Archive Network. By looking at which packages are referenced together in existing code we can extract frequent itemsets and association rules to discover interesting relationships and usage patterns. The end-to-end research, which includes data collection, analysis and visualization, is performed using many tools available to R users.
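The idea of mining packages that are referenced together can be shown on a toy example in base R. The four "repositories" below are invented, and the sketch only computes pairwise support and confidence from a co-occurrence matrix; it is not the authors' pipeline and does not use a dedicated association-rule package.

```r
# Hypothetical repositories, each listed with the packages it references.
repos <- list(
  c("ggplot2", "dplyr"),
  c("ggplot2", "dplyr", "tidyr"),
  c("devtools", "testthat"),
  c("ggplot2", "tidyr")
)

# Binary incidence matrix: one row per repository, one column per package.
pkgs <- sort(unique(unlist(repos)))
incidence <- t(sapply(repos, function(p) as.integer(pkgs %in% p)))
colnames(incidence) <- pkgs

# Co-occurrence counts: entry [i, j] is the number of repositories using both.
co <- crossprod(incidence)

# Support of each pair, and confidence of the rule "row package => column package".
support    <- co / length(repos)
confidence <- co / diag(co)   # row i is divided by the count of package i

support["ggplot2", "dplyr"]
confidence["dplyr", "ggplot2"]
```

In this toy data every repository that uses dplyr also uses ggplot2, so the rule dplyr => ggplot2 has confidence 1, while the reverse rule has confidence 2/3.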
Francisco Javier Rodríguez Cortés, Department of Mathematics, Jaume I University, Castellón, Spain:
Modelling real problems through spatial point processes has become essential in many scientific fields. Spatial cluster analysis is a key aspect of the practical analysis of spatial point patterns. The idea of considering individual contributions of a global estimator as a measure of clustering was introduced by Anselin (1995) with the name of Local Indicators of Spatial Association (LISA), and it has been used as an exploratory data analytic tool to examine individual points in a point pattern in terms of how they relate to their neighbouring points. The local versions of the second-order product density provide a powerful tool to address the problem of classifying interesting subpatterns that often form spatial clusters. LISA functions can then be grouped into bundles of similar functions using multivariate hierarchical clustering techniques according to a particular statistical distance (Cressie and Collins, 2001). We introduce a new coherence measure for the classification of bundles of LISA functions to classify points according to a certain clustering degree in the pattern. The performance of this technique is outlined through multivariate hierarchical clustering methods and multidimensional scaling using R. We apply this methodology to the classification of an earthquake catalog in a seismically active area.
Luke Fostvedt, Pfizer Inc.:
Clinical trials have many moving parts and protocol compliance is important to gain the necessary knowledge for a comprehensive submission package to regulatory agencies. The collection of both pharmacokinetic (PK) and pharmacodynamic (PD) samples is the basis for characterizing the safety and efficacy of any new compound. In some cases, there can be a misunderstanding between the pharmacologists and the clinicians as to why the timing of specific collections is important. We have seen that graphics connecting the sampling schedule with PK time/concentration curves improve the understanding of why such collections are necessary. R provides a very flexible platform to share this information with clinicians. Using R and shiny, clinicians are able to understand the expected behavior of the drug absorption and disposition. The physicians are also able to explore the behavior under many different assumptions (amount dosed, frequency of administration, linear vs. nonlinear elimination, two-compartment vs. multiple-compartment models, etc.) and patient characteristics (weight, height, age, body-surface area, renal and hepatic impairment, etc.) that could be important factors to consider for dose adjustments. R graphics along with a shiny app will be presented to illustrate how compliance can be improved.
Tomasz Żółtak, Educational Research Institute, Warsaw, Poland:
Educational value-added measures attempt to evaluate school and/or teacher quality. They are used in various forms and on a large scale mostly in the US (e.g., TVAAS/EVAAS) and the UK (as an element of School League Tables). A few years ago they were implemented also in Poland, providing indicators for about six thousand lower-secondary schools and five thousand upper-secondary schools a year (see http://ewd.edu.pl/en/). We would like to present how R has been integrated into a heterogeneous system computing value-added indicators for Polish schools and making the indicators available to the public. Computing value-added indicators is a complex process which involves scaling examination scores with IRT models, estimating mixed-effects regression models for large data sets and sophisticated data manipulation. We use R in three ways. First, it is our primary statistical tool (especially the packages mirt and lme4). Second, it operates external software (Mplus). Third, it integrates the analytical process with a huge SQL database, i.e. it retrieves data from the database, conducts statistical analyses and stores their results in the database. In sum, we would like to share our experiences in applying R to complex solutions in public information systems.
Tomokazu Fujino, Department of Environmental Science, Fukuoka Women's University, Japan:
The vdmR package generates web-based visual data mining tools by adding interactive functions to ggplot2 graphics. Brushing and linking between multiple plots is one of the main features of this package. These functions are well known as “multiple linked view”. Currently, scatter plots, histograms, parallel coordinate plots, and choropleth maps are supported in the vdmR package. In addition, identification on the plot is supported by linking between the plot and the data table. In this talk, we will introduce the basic usage of this package and give some demonstrations of implementing this package as a Web application.
Roy Smith, College of Science & Health Professions, Northeastern State University, Tahlequah, Oklahoma, United States:
As both consumer and personal websites expand in complexity, the structure of these sites can become convoluted. As such, human-computer interaction becomes extremely important for usability. This project uses a combination of technologies to implement a Web structure mining procedure to generate raw data for a given website. An analyzer, written in R, has been created to assimilate and generate a 3D visual “tree” of the site. This tree will allow for users to have an easily understandable sitemap. An additional feature allows these maps to be exported to a format supported by a 3D printer to enable users to create a physical model. Using additional raw data, links can be converted into occurrence listings and used to generate plots that allow simple comparisons of each link to every other link from the given site. It is the intention that these sitemaps and occurrence plots can be used in comparison with maps from other sites for designers to easily determine how to restructure a site to have a more efficient layout for users. With further time and research, this system may also be used to find patterns across the internet to determine the separation of important links.
András Tajti, Department of Statistics, Eötvös Loránd University, Hungary:
Gokul Bhandari, Management Science, University of Windsor, Ontario, Canada:
The purpose of this presentation is to demonstrate how R can be successfully introduced in business curriculums by implementing various strategies. A Shiny application developed for analyzing Assurance of Learning (AOL) will be demonstrated and discussed.