Papers by Nikolaos Laoutaris

arXiv (Cornell University), Jan 24, 2017
Online advertising is progressively moving towards a programmatic model in which ads are matched to actual interests of individuals collected as they browse the web. Leaving the huge debate around privacy aside, a very important question in this area, for which little is known, is: How much do advertisers pay to reach an individual? In this study, we develop a first-of-its-kind methodology for computing exactly that (the price paid for a web user by the ad ecosystem), and we do that in real time. Our approach is based on tapping into the Real Time Bidding (RTB) protocol to collect cleartext and encrypted prices for winning bids paid by advertisers in order to place targeted ads. Our main technical contribution is a method for tallying winning bids even when they are encrypted. We achieve this by training a model using as ground truth prices obtained by running our own "probe" ad-campaigns. We implement our methodology through a browser extension and a back-end server that provides it with fresh models for encrypted bids. We validate our methodology using a one-year-long trace of 1600 mobile users and demonstrate that it can estimate a user's advertising worth with more than 82% accuracy.

arXiv (Cornell University), Jul 22, 2009
A substantial amount of work has recently gone into localizing BitTorrent traffic within an ISP in order to avoid excessive and oftentimes unnecessary transit costs. Several architectures and systems have been proposed and the initial results from specific ISPs and a few torrents have been encouraging. In this work we attempt to deepen and scale our understanding of locality and its potential. First, looking at specific ISPs, we consider tens of thousands of concurrent torrents, and thus capture ISP-wide implications that cannot be appreciated by looking at only a handful of torrents. Second, we go beyond individual case studies and present results for the top 100 ISPs in terms of number of users represented in our dataset of up to 40K torrents, involving more than 3.9M concurrent peers and more than 20M in the course of a day, spread across 11K ASes. We develop scalable methodologies that permit us to process this huge dataset and answer questions such as: "what is the minimum and the maximum transit traffic reduction across hundreds of ISPs?", "what are the win-win boundaries for ISPs and their users?", "what is the maximum amount of transit traffic that can be localized without requiring fine-grained control of inter-AS overlay connections?", "what is the impact on transit traffic from upgrades of residential broadband speeds?".

arXiv (Cornell University), Nov 19, 2014
Online Behavioural targeted Advertising (OBA) has risen in prominence as a method to increase the effectiveness of online advertising. OBA operates by associating tags or labels to users based on their online activity and then using these labels to target them. This rise has been accompanied by privacy concerns from researchers, regulators and the press. In this paper, we present a novel methodology for measuring and understanding OBA in the online advertising market. We rely on training artificial online personas representing behavioural traits like 'cooking', 'movies', 'motor sports', etc. and build a measurement system that is automated, scalable and supports testing of multiple configurations. We observe that OBA is a frequent practice and notice that categories valued more by advertisers are more intensely targeted. In addition, we provide evidence showing that the advertising market targets sensitive topics (e.g., religion or health) despite the existence of regulation that bans such practices. We also compare the volume of OBA advertising for our personas in two different geographical locations (US and Spain) and see little geographic bias in terms of intensity of OBA targeting. Finally, we check for targeting with do-not-track (DNT) enabled and find that DNT is not yet enforced on the web.

arXiv (Cornell University), Sep 17, 2015
We have collected and analysed prices for more than 1.4 million flight tickets involving 63 destinations and 125 airlines and have found that common sense violations, i.e., discrepancies between what consumers would expect and what truly holds for those prices, are far more frequent than one would think. For example, oftentimes the price of a single-leg flight is higher than two-leg flights that include it under similar terms of travel (class, luggage allowance, etc.). This happened for up to 24.5% of available fares on a specific route in our dataset, invalidating the common expectation that "further is more expensive". Likewise, we found several two-leg fares where buying each leg independently leads to lower overall cost than buying them together as a single ticket. This happened for up to 37% of available fares on a specific route, invalidating the common expectation that "bundling saves money". Last, several single-stop tickets in which the two legs were separated by 1-5 days (called multicity fares) were oftentimes found to cost more than corresponding back-to-back fares with a small transit time. This occurred for up to 7.5% of fares on a specific route, invalidating the expectation that "a short transit is better than a longer one".
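
The first two expectations described in this abstract can be phrased as simple predicates over fares. The following is only an illustrative sketch, not the authors' measurement pipeline: the `Fare` type, the function names, the airport codes and the prices are all hypothetical, and the "similar terms of travel" condition (class, luggage allowance, etc.) is assumed to have been checked beforehand.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fare:
    """Hypothetical fare record: an ordered tuple of (origin, destination) legs and a price."""
    legs: Tuple[Tuple[str, str], ...]
    price: float


def violates_further_is_more_expensive(single_leg: Fare, two_leg: Fare) -> bool:
    """True if a single-leg fare costs more than a two-leg fare that includes that leg."""
    return single_leg.legs[0] in two_leg.legs and single_leg.price > two_leg.price


def violates_bundling_saves_money(two_leg: Fare, separate: Tuple[Fare, ...]) -> bool:
    """True if buying each leg on its own is cheaper than the bundled two-leg ticket."""
    same_route = tuple(f.legs[0] for f in separate) == two_leg.legs
    return same_route and sum(f.price for f in separate) < two_leg.price


# Made-up example data, for illustration only.
single = Fare(legs=(("MAD", "FRA"),), price=300.0)
bundle = Fare(legs=(("MAD", "FRA"), ("FRA", "JFK")), price=250.0)
leg_a = Fare(legs=(("MAD", "FRA"),), price=100.0)
leg_b = Fare(legs=(("FRA", "JFK"),), price=120.0)

further_violation = violates_further_is_more_expensive(single, bundle)
bundling_violation = violates_bundling_saves_money(bundle, (leg_a, leg_b))
```

Counting how often such predicates hold among the fares of a route gives a per-route violation rate of the kind the abstract reports.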

Proceedings 2023 Network and Distributed System Security Symposium
We present a Federated Learning (FL) based solution for building a distributed classifier capable of detecting URLs containing GDPR-sensitive content related to categories such as health, sexual preference, political beliefs, etc. Although such a classifier addresses the limitations of previous offline/centralised classifiers, it is still vulnerable to poisoning attacks from malicious users that may attempt to reduce the accuracy for benign users by disseminating faulty model updates. To guard against this, we develop a robust aggregation scheme based on subjective logic and residual-based attack detection. Employing a combination of theoretical analysis, trace-driven simulation, as well as experimental validation with a prototype and real users, we show that our classifier can detect sensitive content with high accuracy, learn new labels fast, and remain robust in view of poisoning attacks from malicious users, as well as imperfect input from non-malicious ones.

Proceedings of the 1st International Workshop on Data Economy
A large number of Data Marketplaces (DMs) have appeared in the last few years to help owners monetise their data, and data buyers optimize their marketing campaigns, train their ML models, and facilitate other data-driven decision processes. In this paper, we present a first-of-its-kind measurement study of the growing DM ecosystem, shedding light on several totally unknown facts about it. We show that data products listed in commercial DMs may cost from a few to hundreds of thousands of US dollars. We analyse the prices of different categories of data and the challenges of comparing across DMs. We also analyse the pricing of specific sellers and products to identify features that apparently correlate with prices, and we point to the need for, and the challenges of, building a quotation tool for data products based on market data. CCS CONCEPTS • Information systems → Data extraction and integration; • General and reference → Measurement; • Applied computing → Electronic data interchange.

Proceedings of the 30th International Conference on Advances in Geographic Information Systems
Spatio-temporal information is used for driving a plethora of intelligent transportation, smart-city and crowd-sensing applications. Data is now a valuable production factor and data marketplaces have appeared to help individuals and enterprises bring it to market and meet the ever-growing demand. Such marketplaces are able to combine data from different sources to meet the requirements of different applications. In this paper we study the problem of estimating the relative value of spatio-temporal datasets combined in marketplaces for predicting transportation demand and travel time in metropolitan areas. Using large datasets of taxi rides from Chicago, Porto and New York we show that simplistic but popular approaches for estimating the relative value of data, such as splitting it equally among the data sources, as well as more complex ones based on volume or the "leave-one-out" heuristic, are inaccurate. Instead, more complex notions of value from economics and game theory, such as the Shapley value, need to be employed if one wishes to capture the complex effects of mixing different datasets on the accuracy of forecasting algorithms. This does not seem to be a coincidental observation related to a particular use case but rather a general trend across different use cases with different objective functions. CCS CONCEPTS • Information systems → Geographic information systems; • Human-centered computing → Ubiquitous and mobile computing.

Proceedings of the International AAAI Conference on Web and Social Media
As recent events have demonstrated, disinformation spread through social networks can have dire political, economic and social consequences. Detecting disinformation must inevitably rely on the structure of the network, on users' particularities and on event occurrence patterns. We present a graph data structure, which we denote as a meta-graph, that combines underlying users' relational event information, as well as semantic and topical modeling. We detail the construction of an example meta-graph using Twitter data covering the 2016 US election campaign and then compare the detection of disinformation at cascade level, using well-known graph neural network algorithms, to the same algorithms applied on the meta-graph nodes. The comparison shows a consistent 3-4% improvement in accuracy when using the meta-graph, over all considered algorithms, compared to basic cascade classification, and a further 1% increase when topic modeling and sentiment analysis are considered. We carry o...

arXiv (Cornell University), Aug 27, 2019
The idea of paying people for their data is increasingly seen as a promising direction for resolving privacy debates, improving the quality of online data, and even offering an alternative to labour-based compensation in a future dominated by automation and self-operating machines. In this paper we demonstrate how a Human-Centric Data Economy would compensate the users of an online streaming service. We borrow the notion of the Shapley value from cooperative game theory to define what a fair compensation for each user should be for movie scores offered to the recommender system of the service. Since determining the Shapley value exactly is computationally inefficient in the general case, we derive faster alternatives using clustering, dimensionality reduction, and partial information. We apply our algorithms to a movie recommendation data set and demonstrate that different users may have a vastly different value for the service. We also analyse the reasons that some movie ratings may be more valuable than others and discuss the consequences for compensating users fairly.
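
To make the Shapley value mentioned in this abstract concrete, the sketch below computes it exactly by averaging marginal contributions over all orderings of the players, which is what makes the exact computation expensive (n! orderings). It also shows a generic permutation-sampling estimate; this is a standard Monte Carlo approximation, not the clustering, dimensionality-reduction or partial-information alternatives the paper derives. The toy `RATINGS` data and the `coverage` value function are hypothetical.

```python
import itertools
import math
import random


def exact_shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in itertools.permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = math.factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def sampled_shapley(players, value, n_samples=2000, seed=0):
    """Monte Carlo estimate: average marginal contribution over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_samples):
        rng.shuffle(order)
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / n_samples for p, t in totals.items()}


# Hypothetical toy game: a coalition's value is the number of distinct movies it has rated.
RATINGS = {"alice": {"m1", "m2"}, "bob": {"m2"}, "carol": {"m3"}}


def coverage(coalition):
    movies = set()
    for user in coalition:
        movies |= RATINGS[user]
    return len(movies)


players = list(RATINGS)
exact = exact_shapley(players, coverage)     # alice 1.5, bob 0.5, carol 1.0
approx = sampled_shapley(players, coverage)  # close to the exact values
```

In this toy game each movie's unit of value is split equally among the users who rated it, so a user who contributes a unique rating is worth more than one who duplicates an existing rating.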

arXiv (Cornell University), Oct 25, 2021
A large number of Data Marketplaces (DMs) have appeared in the last few years to help owners monetise their data, and data buyers fuel their marketing process, train their ML models, and perform other data-driven decision processes. In this paper, we present a first-of-its-kind measurement study of the growing DM ecosystem and shed light on several totally unknown facts about it. For example, we show that the median price of live data products sold under a subscription model is around US$1,400 per month. For one-off purchases of static data, the median price is around US$2,200. We analyse the prices of different categories of data and show that products about telecommunications, manufacturing, automotive, and gaming command the highest prices. We also develop classifiers for comparing prices across different DMs as well as a regression analysis for revealing features that correlate with data product prices.

14th ACM Web Science Conference 2022
In recent years, governments worldwide have moved their services online to better serve their citizens. Benefits aside, this choice increases the danger of tracking via such sites. This is of great concern as governmental websites increasingly become the only interaction point with the government. In this paper, we investigate popular governmental websites across different countries and assess to what extent the visits to these sites are tracked by third parties. Our results show that, unfortunately, tracking is a serious concern, as in some countries up to 90% of these websites create cookies of third-party trackers without any consent from users. Non-session cookies, which are created by trackers and can last for days or months, are widely present even in countries with strict user privacy laws. We also show that the above is a problem for official websites of international organizations and popular websites that inform the public about the COVID-19 pandemic. CCS CONCEPTS • Information systems → World Wide Web; • Security and privacy → Human and societal aspects of security and privacy.

Proceedings of the ACM on Measurement and Analysis of Computing Systems
The Real Time Bidding (RTB) protocol is by now more than a decade old. During this time, a handful of measurement papers have looked at bidding strategies, personal information flow, and cost of display advertising through RTB. In this paper, we present YourAdvalue, a privacy-preserving tool for displaying to end-users in a simple and intuitive manner their advertising value as seen through RTB. Using YourAdvalue, we measure desktop RTB prices in the wild, and compare them with desktop and mobile RTB prices reported by past work. We present how it estimates ad prices that are encrypted, and how it preserves user privacy while reporting results back to a data-server for analysis. We deployed our system, disseminated its browser extension, and collected data from 200 users, including 12000 ad impressions over 11 months. By analyzing this dataset, we show that desktop RTB prices have grown 4.6x over desktop RTB prices measured in 2013, and 3.8x over mobile RTB prices measured in 2015. ...

ACM SIGCOMM Computer Communication Review, 2021
What if, instead of having to implement controversial user tracking techniques, Internet advertising & marketing companies asked explicitly to be granted access to user data by name and category, such as Alice→Mobility→05-11-2020? The technology for implementing this already exists, and is none other than Information Centric Networks (ICN), developed for over a decade in the framework of Next Generation Internet (NGI) initiatives. Beyond named access to personal data, ICN's in-network storage capability can be used as a substrate for retrieving aggregated, anonymized data, or even for executing complex analytics within the network, with no personal data leaking outside. In this opinion article we discuss how ICNs, together with trusted execution environments and digital watermarking, can be combined to build a personal data overlay inter-network in which users will be able to control who gets access to their personal data, know where each copy of said data is, negotiate paymen...

ArXiv, 2019
We turn our attention to the elephant in the room of data protection, which is none other than the simple and obvious question: "Who's tracking sensitive domains?". Despite a fast-growing amount of work on more complex facets of the interplay between privacy and the business models of the Web, the obvious question of who collects data on domains where most people would prefer not to be seen has received rather limited attention. First, we develop a methodology for automatically annotating websites that belong to a sensitive category, e.g. as defined by the General Data Protection Regulation (GDPR). Then, we extract the third party tracking services included directly, or via recursive inclusions, by the above mentioned sites. Having analyzed around 30k sensitive domains, we show that such domains are tracked, albeit less intensely than the mainstream ones. Looking in detail at the tracking services operating on them, we find well known names, as well as some less known on...

ArXiv, 2015
We have collected and analysed prices for more than 1.4 million flight tickets involving 63 destinations and 125 airlines and have found that common sense violations, i.e., discrepancies between what consumers would expect and what truly holds for those prices, are far more frequent than one would think. For example, oftentimes the price of a single-leg flight is higher than two-leg flights that include it under similar terms of travel (class, luggage allowance, etc.). This happened for up to 24.5% of available fares on a specific route in our dataset, invalidating the common expectation that "further is more expensive". Likewise, we found several two-leg fares where buying each leg independently leads to lower overall cost than buying them together as a single ticket. This happened for up to 37% of available fares on a specific route, invalidating the common expectation that "bundling saves money". Last, several single-stop tickets in which the two legs were separat...

Proceedings of the 2016 Internet Measurement Conference, 2016
Tracking users within and across websites is the basis for profiling their interests, demographic types, and other information that can be monetised through targeted advertising and big data analytics. The advent of HTTPS was supposed to make profiling harder for anyone beyond the communicating end-points. In this paper we examine to what extent the above is true. We first show that by knowing the domain that a user visits, either through the Server Name Indication of the TLS protocol or through DNS, an eavesdropper can already derive basic profiling information, especially for domains whose content is homogeneous. For domains carrying a variety of categories that depend on the particular page that a user visits, e.g., news portals, e-commerce sites, etc., the basic profiling technique fails. Still, accurate profiling remains possible through traffic fingerprinting that uses network traffic signatures to infer the exact page that a user is browsing, even under HTTPS. We demonstrate that transport-layer fingerprinting remains robust and scalable despite hurdles such as caching, dynamic content for different device types, etc. Overall our results indicate that although HTTPS makes profiling more difficult, it does not eradicate it by any means.

Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies, 2019
In the above publication, there was a typo in the contract number of the CONCORDIA EU funding project that partially supported this work. The following acknowledgement is the correct one: The research leading to these results has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreements No 653449 (project TYPES), No 830927 (project CONCORDIA), No 786741 (project SMOOTH) and Marie Sklodowska-Curie grant agreement No 690972 (project PROTASIS). The paper reflects only the authors' views and the Agency and the Commission are not responsible for any use that may be made of the information it contains.