Papers by Jan-Philipp Kolb
methods, data, analyses, 2011

arXiv (Cornell University), Sep 29, 2019
Nonresponse in panel studies can lead to a substantial loss in data quality due to its potential to introduce bias and distort survey estimates. Recent work investigates the use of machine learning to predict nonresponse in advance, such that predicted nonresponse propensities can be used to inform the data collection process. However, predicting nonresponse in panel studies requires accounting for the longitudinal data structure in terms of model building, tuning, and evaluation. This study proposes a longitudinal framework for predicting nonresponse with machine learning and multiple panel waves and illustrates its application. With respect to model building, this approach utilizes information from multiple waves by introducing features that aggregate previous (non)response patterns. Concerning model tuning and evaluation, temporal cross-validation is employed by iterating through pairs of panel waves such that the training and test sets move forward in time. Implementing this approach with data from a German probability-based mixed-mode panel shows that aggregating information over multiple panel waves can be used to build prediction models with competitive and robust performance over all test waves. The authors want to thank Rayid Ghani and the team of the Center for Data Science and Public Policy at the University of Chicago for their support for this project.
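To make the temporal cross-validation idea concrete, the following R sketch iterates through pairs of panel waves so that the training and test sets move forward in time; the toy panel, the aggregated response-history feature, and the logistic model are illustrative assumptions, not the authors' implementation.

```r
# Illustrative sketch of temporal cross-validation over panel waves
# (toy data and model, not the authors' code).
set.seed(1)
n_ids <- 500; waves <- 5
panel <- expand.grid(id = seq_len(n_ids), wave = seq_len(waves))
panel$responded <- rbinom(nrow(panel), 1, 0.8)

# Feature aggregating previous (non)response patterns:
# share of responses in all earlier waves, NA in wave 1
panel <- panel[order(panel$id, panel$wave), ]
panel$past_rate <- ave(panel$responded, panel$id, FUN = function(r)
  c(NA, cumsum(r)[-length(r)] / seq_along(r)[-length(r)]))

# Train on wave t, evaluate on wave t + 1, moving forward in time
for (t in 2:(waves - 1)) {
  train <- subset(panel, wave == t)
  test  <- subset(panel, wave == t + 1)
  fit   <- glm(responded ~ past_rate, family = binomial, data = train)
  pred  <- predict(fit, newdata = test, type = "response")
  acc   <- mean((pred > 0.5) == test$responded)
  cat(sprintf("train wave %d -> test wave %d: accuracy %.3f\n", t, t + 1, acc))
}
```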

methods, data, analyses, Dec 1, 2014
In PIAAC (Programme for the International Assessment of Adult Competencies), inclusion probabilities have to be known for every respondent at each sampling stage in all participating countries. However, in some cases it is not possible to calculate inclusion probabilities for a sample survey analytically, although the underlying design is probabilistic. In such cases, simulation studies can help to estimate inclusion probabilities and thus ensure that the necessary basis for the calculation of design weights is available. In this section, we present a Monte Carlo simulation using the German sample data. During the selection process for PIAAC Germany, an error occurred that made it impossible to determine the inclusion probabilities analytically. Therefore, a simulation study with 10,000 runs of the erroneous selection process was set up. As a result, it was possible to compute the inclusion probabilities for the sample of PIAAC Germany.
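A minimal R sketch of the general approach: repeat the (possibly flawed) selection process many times and estimate each unit's inclusion probability as its selection frequency across runs. The frame size, sample size, and size-biased selection mechanism below are made-up placeholders, not the actual PIAAC design.

```r
# Illustrative Monte Carlo estimation of inclusion probabilities
# (placeholder selection mechanism, not the actual PIAAC design).
set.seed(42)
N    <- 1000    # units in the sampling frame
n    <- 100     # sample size per draw
runs <- 10000   # simulation runs, as in the study
size <- runif(N, 1, 10)   # size measure driving unequal selection
counts <- integer(N)

for (r in seq_len(runs)) {
  s <- sample.int(N, n, prob = size)  # one realisation of the selection process
  counts[s] <- counts[s] + 1
}

# Estimated inclusion probability: selection frequency across runs,
# usable afterwards as the basis for design weights 1 / pi_hat
pi_hat <- counts / runs
summary(pi_hat)
```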
Statistics in Transition New Series, Mar 1, 2016
In 2011, Germany conducted its first census after reunification. In contrast to a classical census, a register-assisted census was implemented using population register data and an additional sample. This paper provides an overview of how the sampling design recommendations were set up in order to fulfil legal requirements and to guarantee an optimal but still flexible source of information. The aim was to develop a design that fosters an accurate estimation of the main objective of the census, the total population counts. Further, the design should also adequately support the application of small area estimation methods. Some empirical results are given to provide an assessment of selected methods.

AStA Wirtschafts- und Sozialstatistisches Archiv, Dec 1, 2017
Open and reproducible research receives more and more attention in the research community. Whereas empirical research may benefit from research data centres or scientific use files that allow data to be used in a safe environment or with remote access, methodological research suffers from a lack of adequate data sources. In the economic and social sciences, an additional drawback results from the presence of complex survey designs in the data-generating process, which have to be considered when developing and applying estimators. In the present paper, we present a synthetic but realistic dataset based on social science data that supports the evaluation and development of estimators in the social sciences. The focus is on supporting comparable and reproducible research in a realistic framework providing individual and household data. The outcome is provided as an open research data resource.

Tabellenauswertungen im Zensus unter Berücksichtigung fehlender Werte (Tabular analyses in the census taking missing values into account)
AStA Wirtschafts- und Sozialstatistisches Archiv, Dec 1, 2015
The European Statistics Code of Practice defines standards for the production of statistics, covering data quality aspects. As important items within the quality framework, sampling and non-sampling errors are covered, including measuring the accuracy of statistics in the presence of missing values. In practice, missing values are often treated using imputation methods, and two aspects should be considered. First, the plausibility of imputed values plays an important role in official statistics applications; this can be examined with editing methods. Second, measuring the accuracy, e.g. via variance estimation, must correctly incorporate the randomness of the imputation process. Since all relevant methods are computer-intensive, a comparative study of the methodology must include their applicability to large surveys. The German register-assisted census 2011 was conducted using a large sample. Accuracy goals for the determination of the population size were given in the census law; in that context, imputation does not play any role. The same holds for other variables as long as participation is mandatory. However, should future censuses base some variables on voluntary participation, imputation has to be considered in the context of accuracy measurement as well. This paper presents the results of a feasibility study of variance and MSE estimation under imputation in large-scale surveys, focusing on the register-assisted census. The main aim is to compare selected single and multiple imputation methods considering the plausibility of imputed values.
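As a generic illustration of the multiple-imputation side of such a study, the following R sketch combines point and variance estimates from M imputed datasets via Rubin's rules; the numbers are hypothetical and the function is not taken from the feasibility study.

```r
# Generic Rubin's rules for combining estimates from M imputed datasets
# (hypothetical numbers; not the feasibility study's code).
rubin_combine <- function(est, var_within) {
  M     <- length(est)
  q_bar <- mean(est)              # pooled point estimate
  W     <- mean(var_within)       # average within-imputation variance
  B     <- var(est)               # between-imputation variance
  T_tot <- W + (1 + 1 / M) * B    # total variance of the pooled estimate
  c(estimate = q_bar, variance = T_tot)
}

# Example with M = 5 imputations of some population parameter
rubin_combine(est        = c(10.2, 9.8, 10.5, 10.1, 9.9),
              var_within = c(0.40, 0.38, 0.42, 0.41, 0.39))
```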
The AMELI Simulation Study. Research Project Report WP6, D6.1, FP7-SSH-2007-217322 AMELI
Best practice recommendations on variance estimation and small area estimation in business surveys

The Reliability of Replications: A Study in Computational Reproductions
This paper reports findings from a crowdsourced replication. Eighty-five independent teams attempted a computational replication of results reported in an original study of policy preferences and immigration by fitting the same statistical models to the same data. The replication involved an experimental condition: random assignment put participating teams into either the transparent group, which received the original study and code, or the opaque group, which received only a methods section, a rough description of the results, and no code. The transparent group mostly verified the numerical results of the original study with the same sign and p-value threshold (95.7%), while the opaque group had less success (89.3%). Exact numerical reproductions to the second decimal place were far less common (76.9% and 48.1%), and the share of teams who verified at least 95% of all effects in all models they ran was 79.5% and 65.2%, respectively. Therefore, the reliability we quantify depends on how reliability is defined.
Report on the Simulation Results. Research Project Report WP7, D7.1, FP7-SSH-2007-217322 AMELI
Synthetic data generation of SILC data

R Journal, 2019
Through collaborative mapping, a massive amount of data is accessible. Many individuals contribute information each day, and the growing amount of geodata is gathered by volunteers or obtained via crowd-sourcing. One outstanding example of this is the OpenStreetMap (OSM) project, which provides access to big data in geography. Another online mapping service that enables the integration of geodata into analyses is Google Maps. The expanding content and the availability of geographic information radically change the perspective on geodata (Chilton 2009). Recently, many application programming interfaces (APIs) have been built on OSM and Google Maps. As a result, it is now possible to access sections of geographic information without a complex database solution, especially if one only requires a small data section for a visualization. Tools for spatial analysis were included in the R language early on, and this development will continue to accelerate, underpinning a continual change. Notably, in recent years many tools have been developed to enable the usage of R as a geographic information system (GIS). With a GIS it is possible to process spatial data. QuantumGIS (QGIS) is a free software solution for these tasks, and a user interface is available for this purpose. R is, therefore, an alternative to geographic information systems like QGIS (QGIS Development Team 2009). In addition, add-ins for QGIS and R packages (RQGIS) are available that enable the combination of R and QGIS (Muenchow and Schratz 2017). The aim of this article is to present some of the most important R functionalities for downloading and processing geodata from OSM and the Google Maps API. The focus of this paper is on functions that enable straightforward usage of these APIs.
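As one illustration of the kind of API access discussed here, the following R snippet queries OSM through the Overpass API using the osmdata package; the place name and feature tag are arbitrary examples, and an internet connection is required.

```r
# Querying OSM through the Overpass API with the osmdata package
# (arbitrary place and tag; requires an internet connection).
library(osmdata)
library(sf)

q <- opq(bbox = "Mannheim, Germany")                     # build an Overpass query
q <- add_osm_feature(q, key = "amenity", value = "cafe") # restrict to cafes
cafes <- osmdata_sf(q)                                   # fetch results as sf objects

# The returned object bundles points, lines and polygons; plot the points
plot(st_geometry(cafes$osm_points))
```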

This contribution deals with the fundamentals of weighting and with the different types of weights. Terms such as design weighting and adjustment weighting are explained, and the Horvitz-Thompson estimator and the GREG estimator are presented.
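A short R sketch of the two estimators mentioned above, with toy data: the Horvitz-Thompson estimator weights each observation by the inverse of its inclusion probability, and a GREG-type estimator can be obtained by calibrating the design weights to a known auxiliary total, here via the survey package; all numbers are hypothetical.

```r
# Horvitz-Thompson and GREG estimation of a total with toy data
# (all numbers hypothetical; GREG via calibration in the 'survey' package).
library(survey)

d <- data.frame(y  = c(3, 5, 2, 8),          # study variable
                x  = c(10, 20, 5, 30),       # auxiliary variable
                pi = c(0.1, 0.2, 0.1, 0.2))  # inclusion probabilities

# Horvitz-Thompson estimator of the population total of y
sum(d$y / d$pi)

# GREG-type estimator: calibrate design weights to known population
# totals, here an assumed population size N = 40 and auxiliary total 700
des <- svydesign(ids = ~1, probs = ~pi, data = d)
cal <- calibrate(des, formula = ~x,
                 population = c(`(Intercept)` = 40, x = 700))
svytotal(~y, cal)
```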

Survey Research Methods, 2017
Geographic information science (GIScience) offers survey researchers a plethora of rapidly evolving research strategies and tools for data acquisition and analysis. However, the potential of incorporating geographic information systems (GIS) tools into traditional survey research has not yet been fully appreciated by survey researchers. In this article, we provide a comprehensive overview of recent advances and challenges in leveraging this potential. First, we present state-of-the-art applications of GIS tools in traditional survey research, drawing mainly on examples from psychological survey research (e.g., socioecological psychology). We also discuss innovative GIS tools (e.g., wearables) and GIScience methods (e.g., citizen sensing) that expand the scope of traditional surveys. Second, we highlight a number of challenges and problems (e.g., choice of spatial scale, statistical issues, privacy concerns) and, where possible, suggest remedies.
Austrian Journal of Statistics, Feb 29, 2016
The processing of information related to a geographic location has long been difficult due to the lack of (pertinent) data sources and computational power. However, recent developments of web-based technologies like OpenStreetMap (OSM) and Google Maps change this fundamentally. With R it is possible to process large amounts of data and produce appropriate visualisations. The challenge is to find the necessary spatial information, such as appropriate polygons and data corresponding to these polygons. In this paper, ways are presented to access this information via the internet and to combine and visualise it.
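As one possible way of retrieving such polygons in R, the sketch below uses the osmdata package to fetch an administrative boundary and plot it; the place name is an arbitrary example, and the exact return type can vary with the match, hence the defensive check.

```r
# Fetching an administrative boundary polygon from OSM and plotting it
# (arbitrary place name; requires an internet connection).
library(osmdata)
library(sf)

city <- getbb("Mannheim, Germany", format_out = "sf_polygon")
# Depending on the match, getbb() may return an sf object or a list of them
if (!inherits(city, "sf")) city <- city[[1]]
plot(st_geometry(city))
```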
Synthetic Data Generation of SILC Data. Research Project Report WP6, D6.2, FP7-SSH-2007-217322 AMELI