Checking Websites' GDPR Consent Compliance For Marketing Emails
Karel Kubíček*, Jakob Merane*, Carlos Cotrini, Alexander Stremitzer, Stefan Bechtold, and David Basin
[Figure 1 (process diagram), recoverable labels: Alexa sample; 4k EN, 4k DE; registration pages for annotation; 1000 annotated websites; 666 websites; 1234 accounts; >5000 emails; 119 consent violations; 162 email violations.]
Fig. 1. Overview of the process involved in our study and the (intermediate) results.
“marketing” or “servicing.” Based on the legal properties and presence of marketing emails, we defined a decision procedure for detecting potential violations. We observed at least one potential violation in 148 (22.2%) of these websites.

Previous studies examined the privacy threats posed by specific kinds of marketing campaigns. Hamin and Mathur et al. [31, 54] studied US political campaign emails. They observed malicious practices with regard to both email content and the handling of personal information, which was often shared with third parties. Englehardt et al. [18] analyzed user tracking by marketing emails and observed that over 70% of emails contain trackers. Other studies [52, 60] reported how websites use dark patterns to trick users into consenting to cookie policies. In contrast, this is the first study to systematically analyze the extent to which companies sending marketing emails comply with the GDPR’s notion of consent.

Terminology
Throughout this study, we report our observations as potential violations for three reasons. First, as a matter of legal formality, only a legal proceeding can determine a violation. Second, while we were conservative in defining the types of potential violations, and our analysis is informed by the relevant statutes, judicial precedent, and articles by legal experts, there remains some legal uncertainty as to how courts will decide specific cases. Third, we faced factual uncertainties during our assessment. This is addressed in the appropriate sections. We remain confident that possible labeling disagreements are not of a magnitude or type that should affect our reported results.

Contributions
Legal taxonomy for marketing. We summarize the legal requirements for sending marketing emails based on the German implementation of the ePrivacy Directive. We further propose a decision procedure for detecting potential violations of the ePrivacy Directive’s opt-in requirement and the GDPR’s notion of consent in website registration forms.
Violation statistics. We observe at least one potential violation in 22% of websites. Namely, 17.3% of websites send marketing emails without obtaining proper consent and 17.7% of services send emails that are potential violations because content required by law is missing, passwords are sent in plaintext, or the service shares the email address with third parties.
Annotated datasets. We offer the privacy research community a dataset with annotations of the legal properties of registration forms for 1000 websites. We release these annotations, the registration page source code, and post-processed features of the registration form upon request. We also release a dataset of 5000 emails labeled with their purpose. Both datasets are suitable for other studies, such as email or registration form tracking analysis [8, 18], or marketing email content analysis.1
Feature analysis. We conduct a statistical analysis to identify which features are most influential when deciding potential violations. We illustrate how these features can simplify manual compliance analysis.

Organization
We review the legal requirements for email marketing in Section 2. We then describe the registration process and the content of the dataset of annotated websites in Section 3. In Section 4, we present legal requirements on email content and we report on potential violations of these requirements. Afterward, we undertake a legal analysis of both datasets in Section 5 and we present the first steps towards automating such analysis in Section 6. Finally, we consider related work, draw conclusions, and propose future steps.

1 The datasets, intermediate results, and other materials are available on request at https://forms.gle/dTGpfs5vKqdLz8sQ7. A page with an overview of this study is at https://karelkubicek.github.io/post/reg-pets.
Checking Website’s GDPR Consent Compliance for Marketing Emails 284
of the main service. Accordingly, the consent declaration for marketing emails should be unbundled from the main registration [33, par. 24].

Second, the declaration of consent must be specific. As early as 2008, well before the GDPR entered into force, the German Federal Court of Justice (BGH) held in its Payback judgment relating to § 7(2) No. 3 UWG that a separate declaration of consent, relating only to marketing emails, is required [39]. Although the BGH has recently ruled that a consent declaration can include several advertising communication channels (such as telephone, e-mail, and text messages), the requirement of a specific and separate declaration of consent is still established case law [37]. The mere acceptance of general terms and conditions or privacy policies is also deemed insufficient [22, par. 81].

Lastly, consent must be unambiguous. In the context of marketing emails, consent must be given through an affirmative act or declaration. According to Recital 32 of the GDPR, actively ticking an optional checkbox can constitute a clear affirmative act. Conversely, inferring consent from inactivity, presenting users with pre-checked boxes, or other opt-out solutions are considered ambiguous [35, 36]. It must be obvious that the user has consented. Nudging users to provide consent with visual features such as color tricks or hidden consent declarations is also not enough to fulfil this requirement.

2.2 Legal taxonomy
To operationalize the legal requirements of free, specific, and unambiguous consent, we have developed a legal taxonomy. We have tested the taxonomy in an exploratory pilot (see Appendix A.1.1 for more information about the pilot study), and refined the legal properties accordingly. In Section 5, we present a decision method that determines whether a website potentially violates the legal requirements based on an evaluation of these properties.

Let A = {ma, pp, tc} be a set of prefixes/suffixes for the marketing, privacy policy, and terms and conditions checkboxes, respectively. Also let a ∈ A denote a single checkbox type. We define the following legal properties.

Marketing consent (ma_consent): The website asks for consent from the user for marketing emails on the registration page.
Marketing purpose (ma_purpose): Registering with the website is only, or mainly, for receiving marketing emails.
Marketing checkbox (ma_checkbox): There is a checkbox that the user must tick to give consent for marketing emails.
Privacy policy checkbox (pp_checkbox): There is a checkbox for consent for the website’s privacy policy.
Terms and conditions checkbox (tc_checkbox): There is a checkbox for consent for the website’s terms and conditions.
Pre-checked checkbox (a_pre_checked): The corresponding checkbox is already ticked by default.
Forced checkbox (a_forced): It is required to tick the corresponding checkbox to successfully register. This is often indicated with asterisks on the registration forms.
#tying_b: There is only one checkbox asking for (tying) two or three consents together. Therefore, b ∈ {ma_pp, ma_tc, pp_tc, ma_pp_tc}.
#forced_c: The website does not ask for consent to the privacy policy and/or terms and conditions, but assumes it through the registration process. Hence c ∈ {pp, tc, pp_tc}.
#settings: Refusing consent requires more clicks, therefore the consent is assumed by default.
#age: The user’s age or date of birth is required for registration.
#colortrick: The colors on the website nudge the user to consent. For example, giving consent is highlighted with green, while refusing it is red.
#hidden: The declaration of consent can be easily missed by users.

All these legal properties are Boolean, i.e., either a website has the property or not. We call the properties with a hashtag sign hashtags, and the remaining ones checkboxes. Note that the last two properties are subjective. We have therefore provided the annotators with many examples, so that their annotations will be more in agreement. Annotators can also comment on annotations, which clarifies the annotation of the subjective properties.

3 Website dataset

We manually collected a training dataset of 1000 annotated websites. For each website, we retrieved its registration form and manually annotated it based on how it asks users for consent to marketing emails and for agreement to the website’s privacy policy and terms and
conditions. To the best of our knowledge, this is the first dataset on registration practices across the Internet.

In this section, we describe in detail the process we use for creating this dataset. We start with a short summary (see also Figure 1):

1. We collected a set of websites from Alexa’s ranking (Section 3.1).
2. We designed a website annotation procedure (Section 3.2).
3. We had a group of six legally-trained annotators execute this procedure on the set of websites (Section 3.3).
4. We had each website annotated a second time by a second annotator. This allowed us to measure the annotators’ consistency. Any conflicts were subsequently resolved by a third annotator (Section 3.4).

3.1 Website collection

Alexa (alexa.com) ranks websites according to page views and site users, and maintains a list of the most popular websites based on this ranking for the last three months. We used Alexa’s top 1 million websites worldwide from May 25th, 2020.

Our goal is to inspect websites with varying popularity, so we split this set into four groups: the top 1000, the next 9000, the next 90 000, and the rest. From each group, we randomly selected 1000 unique websites. This sampling ensures that we analyze many of the most popular websites, in contrast to an entirely random selection. We call this the EN set of websites, as it is the starting point for detecting websites in English.

Considering that the underlying legal analysis uses German law and court cases as an example of the implementation of the EU’s ePrivacy Directive, we focused on websites that allowed registration for people located in Germany. Therefore, we also created a separate set of 3694 websites, the DE set, by taking from Alexa’s top 1 million those websites with the domain “.de.” Since the notion of consent in German law is interpreted according to the GDPR, our dataset is still likely representative of how websites across Europe ask users for consent.

Based on the study by Chatzimpyrros et al. [8], who observed that only one third of websites have login or registration forms, we did not expect to find more websites with available registration in our selected languages. To reduce the number of annotations where registration was not possible, we pre-filtered both the EN and DE sets of websites using a crawler. This crawler filtered websites that are not available in English or German, malfunctioning websites, and websites without a registration. Table 1 shows the website selection process.

Table 1. Website selection process.

Processing step                    Size EN   Size DE
Sampled                               4000      3694
Pre-filtering crawl                    662       436
Randomly sampled for annotators        607       393
Registered successfully                343       325

We analyzed 100 filtered websites to inspect whether the filtering causes a bias in our study. From 50 randomly selected DE and 50 randomly selected EN websites that were filtered out, it was possible to register for thirteen of them and subscribe to one of them (seven EN and seven DE websites). These websites were mostly rejected due to advanced bot detection (seven websites),2 which can cause under-representation of more complex websites. However, these websites were uniformly distributed in the Alexa rank. The authors manually registered with all fourteen filtered websites and found no statistical deviation from any presented observations in this study. The bachelor thesis of Kast [43], which worked with the crawler used for the pre-filtering, provides a similar analysis of the filtered websites. Its results are aligned with ours.

2 Confirmed by the Wayback Machine, which was also unable to visit these websites.

3.2 Annotation procedure

Every website was manually annotated with the legal properties described in Section 2.2. To determine these, a human annotator would register for the website, using fictitious personal information like name, address, or phone number. Only the email address provided is real, as we use its inbox to detect unsolicited marketing emails. In addition to the properties, annotators marked the registration as either successful or unsuccessful, depending on whether they successfully registered with the website. When unsuccessful, they provided the reason for not completing the registration, for example, by stating that there was no registration form on the website, or that the registration required a payment.

We developed a support tool to facilitate the manual process of registration and annotation. Our tool features
a graphical interface for recording the legal properties, according to the legal taxonomy defined in Section 2.2. Our tool uses Firefox, automated with Selenium, to also help annotators by automatically filling in registration form fields with the generated credentials. We describe this tool in Appendix A.1.2.

For each website, our support tool retrieved the HTML source of the entire page and the registration form’s HTML subtree. If the webpage contains multiple forms, such as a login and a registration form next to each other, we detect the form with which the annotator interacted and collect only its HTML subtree. All Internet traffic was routed via a German VPN endpoint, so our requests appeared to originate from Germany.

3.3 Annotators

Six scientific research assistants, all with a law degree, annotated the 1000 websites. The annotators were compensated fairly, according to the hourly wage for teaching assistants. To avoid biasing them, we did not inform them about our research objectives.

The annotators were randomly assigned the websites from the EN and DE datasets. The amount of work each annotator performed depended on their availability, and ranged from 95 to 453 annotated websites per annotator.

The website annotation process was manual, but it was precisely defined by instructions we provided. These included legal and technical guidelines and examples of 22 annotated websites with justifications for the annotations. We had previously tested the instructions in an independent pilot study.

3.4 Resolving disagreements

Following empirical social science standards, every website was validated by a second independent annotator [19, p. 114]. The second annotator was randomly chosen for every website and was different from the first annotator, but from the same group of six annotators. We observed only a single website that changed the registration form by the time the second annotator annotated the website, so website modifications were not a significant source of inter-annotator disagreement.

In case of inconsistencies between the annotations, we provided a third annotator with screenshots of the registration forms seen by the first two annotators and their annotations. He would then choose one of the two annotations and, if necessary, he could modify the selected annotation. The third annotator was not part of the original set of annotators and also had a law degree.

We measured the agreement between annotators with Cohen’s κ [9]. Like a correlation, it takes values between −1 and 1, where κ = 0 indicates agreement no better than chance, κ = 1 indicates perfect agreement, and κ = −1 indicates perfect disagreement. For legal properties that were satisfied by at least 10% of the websites, the average κ in our sample was 0.74. All the individual κ’s are given in Appendix A.1.3.

Our annotation procedure was more rigorous than those procedures used in most other related studies. For example, in Zimmeck et al. [68], 350 policies were labeled by two law students. Only 35 of them were doubly annotated and their Krippendorff’s α was 0.78 (text labeling requires this metric for inter-annotator agreement, but it has the same range and a similar interpretation as Cohen’s κ). In Bannihatti et al. [44], a law student labeled 2692 opt-out statements from privacy policies. Only a subsample (50) was labeled independently by two additional annotators. The inter-annotator agreement was measured with Fleiss’ κ, and its value was 0.7 (in this context, Fleiss’ and Cohen’s κ are identical). To the best of our knowledge, the only other study with an annotation procedure as rigorous as ours is Wilson et al. [67], who used two law students to annotate 115 privacy policies with an average Krippendorff’s α of 0.71, and had a third law student resolve any inconsistencies.

3.5 Resolved annotations

For 666 of the 1000 websites, the annotators agreed on successful registration. The most common reasons for unsuccessful registration were that there was no registration form (9%), the registration required a membership (7%), or the registration required payment (5%). We report the reasons for other failed registrations in Fig. 11 in the Appendix. Figure 2 depicts the resolved annotations for websites with successful registration. Each bar represents the percentage of websites satisfying that property. Note that more than half of the websites do not mention marketing emails in the registration form. Only 6.6% (44) of websites provide for marketing email subscription (mark_purpose), which indicates the number of websites we can expect to send us marketing emails with properly granted consent.
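
The agreement measure from Section 3.4 can be illustrated with a short computation. The following is a minimal sketch, assuming two annotators labeled the same websites for one Boolean legal property; the function name `cohens_kappa` and the toy labels are illustrative assumptions and are not taken from the study's code or data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[bool], labels_b: Sequence[bool]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labeled at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of one property (e.g. ma_checkbox) for ten websites.
first = [True, True, True, False, False, False, True, False, True, False]
second = [True, True, False, False, False, False, True, True, True, False]
kappa = cohens_kappa(first, second)
```

In this toy example the annotators agree on 8 of 10 websites, chance agreement is 0.5, and κ is therefore (0.8 − 0.5) / (1 − 0.5) = 0.6.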
[Figure 2 (bar chart): resolved annotations for websites with successful registration; each bar gives the percentage of websites satisfying a legal property. Axis label: Basic property. The individual bar labels were lost in extraction.]

that fewer than 60% of websites follow this best practice.
tration confirmations, invoices, and updates on changed terms and conditions. As our only interaction with the website is the registration and its confirmation, the number of servicing emails is limited.

We annotated the dataset of over 5000 emails with these email types, and we present their distribution in Figure 3. The annotation was done by one of the authors and one research assistant using the email’s subject and body and information from the annotator’s website registration.

[Figure 3 (chart), types: Marketing 77.8%, Servicing 22.2%. Subclass labels: Newsletter 73.1%, Double opt-in 11.3%, Confirmation 10.5%, Notification 4.4%, Legal updates 0.4%, Survey 0.3%.]
Fig. 3. Email classification of the 5030 annotated emails, where we zoom into the marketing and servicing subclasses.

4.2 Double opt-in

In case of legal disputes, the company that sends marketing emails must be able to demonstrate that the recipient knowingly consented [33, par. 6]. For this purpose, the double opt-in has been established as a best practice procedure that is not legally obligatory, but highly recommended by legal scholars and the marketing industry [45]. Alternative procedures, such as requiring users to send the service an email before registration, are not widely used. Such procedures can only be partially automated by mailto links, which would harm the usability of the registration procedure.

Double opt-in emails require an additional user action after registration to activate the account. This action serves as the user’s proof of ownership of the email address and can be implemented in various ways. The email contains either unique information (an activation link or a one-time password or code), or requires the user to ask for account activation by sending an email, which is used by fewer than 0.5% of the websites where we registered. Marketing emails can only be sent after consent is obtained using the previous actions. In contrast to a single opt-in, this procedure prevents users from registering, accidentally or maliciously, with an email address for an account not under their control. The company offering registration must ensure that the email addresses belong to the registered users and must keep clear records of consent.

For the purpose of this study, we conservatively classify services that only provide single opt-in as GDPR compliant, even though they fail to follow best practices. In contrast, we classify services that directly send marketing emails without any confirmation email as potential GDPR violations. However, there is increasing case law requiring a proper double opt-in as a legal obligation. In a recent Austrian case [4], a minor was registered for a dating website by others. This registration caused the website to send him targeted marketing emails without confirming the email address beforehand. The Austrian Data Protection Authority decided that such a sign-up procedure did not satisfy the requirements under Art. 32 GDPR.

For double opt-in registrations, we developed a script that classifies confirmation emails and automatically completes the registration. The script classifies the email by keyword search in the subject, body, and other email headers (e.g., Reply-To or X-Headers). A manual inspection of 1000 emails shows that the classification works correctly in 96.8% of the cases (see Appendix A.2.2). The script also extracts the link or confirmation code by pattern matching. The extracted link or code is then used to complete the registration. We inspected all registrations, and those that the confirmation script could not finish were completed manually.

Figure 4a presents the first email sent by each service. Only 59% of websites that sent us at least one email first sent us a double opt-in email. Moreover, 5.5% of services sent us an unsolicited marketing email without any confirmation or double opt-in email.

4.3 Sending passwords in plaintext

After registration, some servicing emails contain either a user-provided password, a generated password, or a password reset link. Sending users the user-provided password by email risks exposure of the user’s potentially reused password to anyone capable of reading the
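occurrence of this phenomenon underscores the importance of using password managers to prevent the leakage of reused passwords.

[Figure (bar chart; caption not recoverable): y-axis “% of services” with ticks 20, 40, 60; recoverable category labels: Marketing, Confirmation, Double opt-in, unsubscribe; recoverable bar values: 12.5%, 2.8%, 0.7%.]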
ing us any emails as the baseline, whereas we check the design requirements only in marketing emails.

Finally, the email dataset can reveal additional information about marketing emails, such as insight into the marketing trends, which are out of the scope of this work. We present examples in Appendix A.2.4.

4.5 Third party email sharing

The collection of email addresses and the sharing of these addresses with third parties for marketing purposes is subject to the same legal restrictions mentioned in Section 2.1 [40]. Hence, third parties that send marketing emails must be able to demonstrate that prior consent was obtained. This requires that a user must be specifically informed about whom their email address is shared with and for which marketing purposes [3]. Third parties must therefore be specifically named.

We check if all emails come from addresses whose first- and second-level domains match those of the visited domain. We consider combined top-level domains such as co.uk as first-level domains. However, checking only the domain difference (which is how the term “third party” is used in CS) is insufficient for our violation decision procedure, since the domain name does not reflect the legal entity. Many websites have a dedicated domain name for sending emails; for example, facebook.com sends all its emails from facebookmail.com.

We distinguish three scenarios based on the senders’ domains: (i) all the senders’ domains match exactly, (ii) only their second-level domains match, and (iii) their domains differ completely.3 The first bar in Figure 5 reports how often we encountered each scenario in the dataset. In the second bar in Figure 5, we focus on the senders whose domains are entirely different. We inspect how the website discloses how third parties can use the user’s email address for sending marketing emails. In particular, we manually check the registration form content, the website’s privacy policy, and the terms and conditions. If none of these inform the user about third parties, then we check if all sender domains are operated by the same group of companies based on publicly available sources such as corporate annual reports, Crunchbase, or the WHOIS database.

We conclude that services share email addresses of their users mostly within the same corporation, although very few of them disclose the practice of sharing email addresses with subsidiaries openly in their registration forms. Most disclose this only in their terms and conditions, which is legally insufficient. Furthermore, it is well known that such documents are rarely read by the users [6, 29, 56]. During the fourteen months of our study, we observed that one of our email addresses received emails from nine different domains. Some of these domains were not stated in the registration form or in the terms and conditions. From another service, we received fraudulent emails without being notified about potential data breaches by the service.
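
The three sender-domain scenarios of Section 4.5 can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not the study's tooling: the function names are hypothetical, and the hard-coded set `COMBINED_SUFFIXES` is only a stand-in for a full list of combined top-level domains (a real implementation would rely on the Public Suffix List). As the text notes, a "different domain" result is only a starting point, since the legal entity behind a domain must still be checked manually.

```python
from email.utils import parseaddr
from typing import Iterable

# Assumption: a tiny illustrative sample of combined top-level domains.
COMBINED_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}

# Scenarios ordered from most to least closely matching.
SCENARIOS = ["exact match", "second-level match", "different domain"]


def registrable_domain(host: str) -> str:
    """Return the second-level domain plus its (possibly combined) TLD."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in COMBINED_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def sender_scenario(visited_host: str, sender_header: str) -> str:
    """Compare one email's sender domain with the visited website's domain."""
    address = parseaddr(sender_header)[1]
    sender_host = address.rpartition("@")[2].lower()
    visited = visited_host.lower()
    if visited.startswith("www."):
        visited = visited[len("www."):]
    if sender_host == visited:
        return "exact match"
    if registrable_domain(sender_host) == registrable_domain(visited):
        return "second-level match"
    return "different domain"


def website_scenario(visited_host: str, sender_headers: Iterable[str]) -> str:
    """Scenario for a website: the least closely matching of all its senders."""
    results = [sender_scenario(visited_host, h) for h in sender_headers]
    if not results:
        raise ValueError("no emails received from this website")
    return max(results, key=SCENARIOS.index)
```

For instance, `sender_scenario("facebook.com", "Facebook <notification@facebookmail.com>")` falls into the "different domain" scenario even though both domains belong to the same company, which is why the manual check of the operating legal entity remains necessary.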
5.1 Opt-in violations of ePD

Under the ePrivacy Directive, marketers must obtain an individual’s consent (opt-in) before they can send marketing emails. Figure 6 reports the adherence to the opt-in requirement. The leaf “No marketing” shows that 80% of websites never sent us a marketing email in the first place, and hence our violation decision procedure is not relevant for them. Of the remaining websites in our analysis, 52.3% sent marketing emails despite their registration forms not mentioning marketing emails (“Email despite no opt-in” in Fig. 6). This constitutes a potential violation of Art. 13(1) of the ePD.

[Figure 6 (decision tree):
Marketing email?
  No (534): No marketing
  Yes (132): Marketing consent?
    No (69): Email despite no opt-in
    Yes (63): Marketing is purpose?
      No (46): GDPR consent evaluation required
      Yes (17): Proper newsletter]
Fig. 6. Decision procedure about opt-in validity based on legal properties.

For 12.9% of websites that sent marketing emails, a newsletter subscription was the main purpose of the registration (“Proper newsletter” in Fig. 6). The remaining 34.8% had to be further assessed for consent requirements under the GDPR, as we explain next.

5.2 GDPR consent violations

As mentioned in Section 2.1, consent must be freely given, unambiguous, and specific. Based on this, we present selected potential violations of the GDPR’s consent requirements. We describe the combination of legal properties that leads to a potential violation in Fig. 7. First, we classify obtaining consent without providing a specific marketing email checkbox as unspecific. Also, in line with case law, we classify the bundling of the marketing email consent with other purposes such as terms and conditions as not freely given. In addition, we classify the practices of pre-checked marketing checkboxes and the nudging with visual features as ambiguous consent (see Section 2). Nudging is a typical example of a dark pattern; we summarize the similarities of potential violations from our study to dark patterns in Appendix A.1.4.

At least 43.5% of websites that sent marketing emails did not meet one of these requirements on consent (“Email after invalid consent” in Fig. 7). Surprisingly, we received marketing emails even from websites that did not violate any of our selected consent requirements. As we instructed annotators not to provide consent during the registration, such marketing emails most likely lack valid consent (“Email despite user did not consent” in Fig. 7).

[Figure 7 (decision procedure), starting from “GDPR consent evaluation required” (46):
Marketing checkbox present? No (4): unspecific; Yes (42)
#tying1_? Yes (2): unfree; No (40)
Marketing checkbox pre-checked? Yes (12): ambiguous; No (30)
#colortrick or #hidden? Yes (6): nudging; No (40)
unspecific or unfree or ambiguous or nudging?
  No (26): Email despite user did not consent
  Yes (20): Email after invalid consent]
Fig. 7. Decision procedure about consent validity based on legal properties.

Our decision procedure detects a selection of potential violations. Note that when our procedure identifies no potential violations, a website may still fail to comply with consent requirements. For example, our procedure does not analyze the specific wording of consent declarations. Nevertheless, our procedure detects a substantial number of potential violations. Indeed, it finds that 17.3% of websites have at least one potential violation.
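
The decision procedures of Fig. 6 and Fig. 7 can be sketched in code as follows. This is one reading of the two figures rather than the study's implementation: the class `Registration`, the field `ma_tied` (standing for any #tying property that bundles marketing consent with another consent), and both function names are hypothetical. The sketch also assumes, as in the annotation instructions described above, that the user did not opt in during registration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Registration:
    """Boolean legal properties of one website plus the observed email behavior."""
    sent_marketing_email: bool
    ma_consent: bool = False      # form asks for marketing consent
    ma_purpose: bool = False      # registration is mainly a newsletter sign-up
    ma_checkbox: bool = False     # dedicated marketing checkbox exists
    ma_tied: bool = False         # assumption: marketing tied to pp and/or tc
    ma_pre_checked: bool = False  # marketing checkbox ticked by default
    colortrick: bool = False      # #colortrick
    hidden: bool = False          # #hidden


def consent_defects(r: Registration) -> List[str]:
    """Fig. 7: collect every reason why the obtained consent may be invalid."""
    defects = []
    if not r.ma_checkbox:
        defects.append("unspecific")
    else:
        # Tying and pre-checking only apply when a marketing checkbox exists.
        if r.ma_tied:
            defects.append("unfree")
        if r.ma_pre_checked:
            defects.append("ambiguous")
    if r.colortrick or r.hidden:
        defects.append("nudging")
    return defects


def classify(r: Registration) -> str:
    """Fig. 6 followed by Fig. 7, returning the name of the reached leaf."""
    if not r.sent_marketing_email:
        return "No marketing"
    if not r.ma_consent:
        return "Email despite no opt-in"
    if r.ma_purpose:
        return "Proper newsletter"
    if consent_defects(r):
        return "Email after invalid consent"
    return "Email despite user did not consent"


# Hypothetical website: asks for consent with a pre-checked marketing checkbox.
example = Registration(True, ma_consent=True, ma_checkbox=True, ma_pre_checked=True)
outcome = classify(example)
```

Every leaf except "No marketing" and "Proper newsletter" corresponds to a potential violation in the sense of Section 5.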
[Figure 8 (histogram): x-axis “Number of violations” (0 to 5), y-axis “Number of websites”; recoverable bar labels: 78.1%, 10.5%, 7.1%, 3.8%, 0.3%, 0.1%.]
Fig. 8. Histogram showing number of websites with the given number of potential violations. We report the potential violations from 676 websites, i.e., the union of websites where the annotation resulted in successful registration and websites that sent us an email. Note that we are conservative in determining potential violations, so the reported number does not imply that 78.1% of websites are fully compliant.

[Figure 9 (chart): recoverable series labels: “Email after invalid consent”, “Email despite user did not consent”; recoverable values: 3.0%, 5.4%, 4.9%, 0.0%, 3.1%, 3.9%, 4.1%, 8.8%, 4.5%, 2.2%. Base is 666 websites.]
Fig. 9. Summary of all potential violations of this study and the split into popularity groups by rank.

In Fig. 9 we summarize the presence of all potential violations discussed in this study. In addition, we split the graph into groups by the websites’ Alexa rank. Note that more popular websites are not more compliant than lower ranked websites. Moreover, for the potential violation “Email despite no opt-in,” the websites with high rank show more potential violations than those with low rank (the p-value of the two-proportion Z-test of rank < 1k against all other ranks is 0.156 after adjustment for multiple measurements by the Holm–Bonferroni method). The number of websites of rank 1k-10k not sending legal notices is far larger than that of websites of other ranks (including high-rank websites). This observation has a p-value of 0.054.

6 Potential for automation

We found that 22% of websites have potential violations. Given this lack of compliance, regulators should take note and might wish to step in. However, our procedure still relies on many manual steps. Regulators would therefore benefit from an automated tool that scales up the detection of violations from sending marketing emails. In this section, we offer a statistical analysis that speaks to the feasibility of such automation.

We study the statistical properties of our annotated datasets. We define features that we extract from raw HTML, and afterwards, we train logistic models and compute which of these features are the most influential for deciding if a website satisfies a legal property. We also show how these features facilitate detecting potential violations.
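
The rank comparison reported for Fig. 9 uses a two-proportion Z-test whose p-values are adjusted by the Holm–Bonferroni method. The sketch below shows both steps in plain Python. It is an illustration under stated assumptions: the function names are hypothetical, the test is the standard pooled two-sided variant, and the counts in the usage lines are made up, since the per-rank counts are not given in this section.

```python
import math
from typing import List, Sequence, Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-sided Z-test for H0: both groups share the same proportion.

    x1 of n1 websites in group 1 and x2 of n2 in group 2 show the violation.
    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if std_err == 0:
        raise ValueError("proportions are degenerate (all 0 or all 1)")
    z = (p1 - p2) / std_err
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def holm_bonferroni(p_values: Sequence[float]) -> List[float]:
    """Holm–Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1) ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # ... and adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical counts: 30 of 100 top-ranked vs. 10 of 100 other websites.
z, p = two_proportion_z_test(30, 100, 10, 100)
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

An adjusted p-value is compared against the usual significance level directly, so an adjusted value of 0.156, as reported above, does not indicate a significant difference at the 5% level.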
6.1 Registration form and email features

During the website annotation process, we collect from each successfully registered website the registration form HTML subtree. A classification of this entire form is not possible, so we instead extract first-level features, described in Table 2.

Table 2. Features that we extract from each input, select, and button tag of the registration form.

Feature type       Individual features
HTML tags          tag name, accompanying label text
HTML attributes    class, type, attribute, placeholder, value
Is required?       HTML required attribute, asterisk in the label
Text processing    tag purpose extracted by keyword matching

Table 3. Results of logistic regression for legal properties and for whether an email is marketing. The last column represents the percentage of positive samples (ps). The confidence intervals are based on five-fold cross-validation. The results are from EN dataset; the DE dataset is reported in Table 7 in the Appendix.

Property          Precision         Recall            F1                ps
ma_consent        82.5% ± 3.4%      70.8% ± 6.6%      76.0% ± 3.7%      38%
ma_purpose        19.6% ± 5.3%      51.7% ± 26.0%     27.4% ± 8.3%      7%
ma_checkbox       79.8% ± 11.6%     73.3% ± 7.3%      75.5% ± 3.0%      31%
ma_pre_checked    25.1% ± 13.8%     58.3% ± 33.3%     34.7% ± 18.7%     7%
ma_forced         1.8% ± 3.6%       10.0% ± 20.0%     3.1% ± 6.2%       2%
pp_checkbox       62.8% ± 6.7%      77.6% ± 3.9%      69.3% ± 5.2%      21%
pp_forced         61.3% ± 20.4%     71.8% ± 14.9%     64.6% ± 14.6%     20%
tc_checkbox       89.4% ± 7.6%      84.9% ± 7.0%      87.0% ± 6.4%      28%
tc_forced         77.5% ± 6.8%      76.3% ± 10.8%     76.2% ± 6.3%      26%
#hidden           18.5% ± 2.7%      73.3% ± 17.0%     29.5% ± 4.7%      12%
#forced           52.4% ± 4.4%      68.9% ± 8.3%      59.0% ± 2.2%      37%
#tying1           0.0% ± 0.0%       0.0% ± 0.0%       0.0% ± 0.0%       1%
Marketing email   97.4% ± 1.2%      98.0% ± 0.5%      97.7% ± 0.7%      80.6%
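To make the first-level features of Table 2 concrete, the sketch below walks a registration form with Python's standard html.parser and records, for each input, select, and button tag, its attributes, its label text, whether it looks required, and a keyword-matched purpose. It is an illustration under assumptions rather than the extraction code used in the study: the class name FormFeatureExtractor, the function extract_features, and the keyword lists in PURPOSE_KEYWORDS are all hypothetical, and labels are matched only through their "for" attribute.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists for the "tag purpose" feature.
PURPOSE_KEYWORDS = {
    "email": ("email", "e-mail"),
    "password": ("password", "passwort"),
    "newsletter": ("newsletter", "subscribe"),
}


class FormFeatureExtractor(HTMLParser):
    FIELD_TAGS = {"input", "select", "button"}

    def __init__(self):
        super().__init__()
        self.fields = []         # one dict per input/select/button tag
        self.labels = {}         # label text keyed by the label's "for" attribute
        self._open_label = None  # "for" value of the label currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and attrs.get("for"):
            self._open_label = attrs["for"]
            self.labels[self._open_label] = ""
        elif tag in self.FIELD_TAGS:
            self.fields.append({
                "tag": tag,
                "id": attrs.get("id"),
                "type": attrs.get("type"),
                "class": attrs.get("class"),
                "placeholder": attrs.get("placeholder"),
                "value": attrs.get("value"),
                "required": "required" in attrs,  # HTML required attribute
            })

    def handle_endtag(self, tag):
        if tag == "label":
            self._open_label = None

    def handle_data(self, data):
        if self._open_label is not None:
            self.labels[self._open_label] += data


def extract_features(form_html):
    """Return one feature dict per input/select/button tag in form_html."""
    parser = FormFeatureExtractor()
    parser.feed(form_html)
    parser.close()
    for field in parser.fields:
        label = parser.labels.get(field["id"], "").strip()
        field["label"] = label
        # Required if the attribute is set or the label carries an asterisk.
        field["required"] = field["required"] or "*" in label
        text = " ".join(
            v for v in (label, field["placeholder"], field["class"], field["type"]) if v
        ).lower()
        # First purpose whose keywords occur in the field's text, else None.
        field["purpose"] = next(
            (name for name, words in PURPOSE_KEYWORDS.items()
             if any(w in text for w in words)),
            None,
        )
    return parser.fields
```

Labels are resolved only after the whole form has been parsed, so a label that follows its checkbox (a common layout for consent boxes) is still attached to the right field. The resulting dictionaries would still need to be turned into numeric vectors before a logistic model can be trained on them.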
input field such that it resembles a password field, which might fool our model. The textual features used by the models are based on a bag-of-words model, which is unable to represent word relations. The word selection and placement of invisible text might lead to both false positives and false negatives.
There are multiple ways of preventing these adversarial modifications. The feature extraction can use CSS, visual representation, and more advanced text models such as BERT [13]. We can add artificial adversarial samples to our training dataset and force the model to use more reliable features. Lastly, Goodfellow et al. [30] and Javanmard et al. [34] proposed defense mechanisms against adversarial manipulation for logistic regression.

7 Related work

Newsletter analysis
Studies analyzing the content of emails either depend on a publicly available email dataset or their authors must collect an email dataset by signing up for services, similar to our approach. The research closest to our study comprises the following three publications. Englehardt et al. [18] subscribed to 902 newsletters by crawling 15 700 shopping and news websites. They analyze how loading the email or following links in it causes information leakage, and they show that 30% of emails leak the recipient’s email address to a third party. They also study the tracking protection of email servers and clients and propose new privacy measures. Our study focuses on the legal aspects of sending marketing emails, mostly from websites where the registration serves other purposes than only subscription to newsletters. Englehardt et al. subscribed to emails exclusively at those websites that we annotate as ma_purpose.
In the second study, Hamin [31] analyzes the content of election campaigns. She crawled 4487 campaign websites, and successfully subscribed to 1778 newsletters. A follow-up study of 2020 US elections by Mathur et al. [54] observed that 348 out of 2800 email campaigns shared the email address with a third party, while only 25% of those campaigns disclosed their email sharing practice. Both of these studies also analyze the email content, but their focus is on manipulative tactics and political implications. As in the previous paragraph, these two studies target a narrow group of email senders who send emails to a) subscribed users, b) who are interested in elections, and c) located in the US. Our study is generic, with a subscription to marketing emails (ma_purpose) corresponding to only 10% of the registrations. Even from these subscriptions, we did not observe as much email sharing as Mathur et al. The difference could be a result of EU privacy regulations protecting users more than those of the US, or due to the political campaigns sending emails more aggressively than websites that mainly advertise their products, which are present in our study.

Consent compliance analysis
We study consent to marketing emails, but websites need to obtain consent for other processing purposes. Oh et al. [61] state four conditions on consent according to the GDPR. They inspect these conditions both manually on 500 websites and by crawling 10 000 websites. They show that their crawler is 96% aligned with the human decision. Their study partially overlaps with our inspection of GDPR consent violations in Section 5.2. However, our study is focused on marketing emails and goes legally more in-depth, while their study is related to privacy policies and is more generic. Our decision procedure requires observing the data misuse (receiving unsolicited marketing email), so we must complete a registration, which is challenging to automate. In contrast, their crawler detects violations solely by observing the registration form without any interaction and before the act of data misuse.
Other researchers focus on consent for cookie usage. A user study by Machuletz et al. [48] inspects how misleading cookie consent dialogs are. They confirm that users are nudged into less favorable choices by making these choices more accessible. Our work quantifies nudging with legal properties #hidden, #settings, and #colortrick, showing it is not as common in registration forms as in cookie popups. Matte et al. [55] detected cookie banner privacy violations on 53% of websites. Santos et al. [63] define 22 legal requirements on cookie banners and describe how to verify them. A similar study by Trevisan et al. [65] summarizes EU requirements on cookie consents that they can check automatically. Most notably, they report that 49% of websites activate cookies before the user gives consent. Nouwens et al. [60] study dark patterns of consent pop-ups, finding that only 11.8% of websites comply with the GDPR. These studies are complementary to our work as we do not analyze cookie consents.
Website compliance analysis
Numerous studies have analyzed website compliance with privacy regulations that are complementary to our analysis.
Linden et al. [47] and Degeling et al. [10] analyze how privacy policies changed with the GDPR coming into legal force. They observe an increase in the length and the number of policies and an improvement in GDPR compliance. Amos et al. [2] observed similar results in their longitudinal study of privacy policies. Liepina et al. [46] present Claudette, a scanner for GDPR violations in privacy policies. Harkous et al. [32] propose Polisis, a privacy policy scanner that summarizes policies’ content. We used Polisis to analyze whether websites disclose sharing email addresses with third parties. A semantic text analysis of policies by Bui et al. [7] can further improve the automation by extracting the names of the third parties defined in the privacy policy. However, this work was published after we finished our privacy policy analysis using only Polisis.

8 Conclusions and future directions

We manually registered on 666 out of 1000 websites and annotated the registration procedures and emails that these websites sent. We proposed a decision procedure that, based on the annotated legal properties, detects potential violations of opt-in and consent for sending marketing emails. We then evaluated the emails that we received, finding services that send marketing emails without valid consent in 17.3% of the cases. Furthermore, 17.7% of the services sent us an email that potentially violated the legal requirements on email content. In total, 21.9% of the websites committed at least one potential violation.
The results of our study indicate that a substantial number of websites may be violating European privacy and unfair competition rules as far as marketing emails are concerned. The non-compliance with such rules is not too surprising, given that it is cumbersome to detect violations and enforce these rules.
Our study can inform the policy and regulatory debate about privacy and unfair competition law on the Internet in several ways. First, it provides policymakers and regulators with an estimate of the prevalence of non-compliance. Second, it shows a path of how to increase compliance: A next step is to automate the procedure outlined in this study, helping overloaded and underfunded regulatory agencies to police the Internet more efficiently and increase compliance with legal requirements.
As future work, we plan to train machine learning models from the annotated dataset to automatically detect potential violations. Combining this tool with an automated registration procedure, we could detect potential violations in the wild, without any of the time-consuming manual work done by annotators for the present study. This could open up novel and cost-effective ways for ensuring compliance of websites with legal rules that are aimed at protecting millions of consumers on the Internet.

Acknowledgment
The authors would like to thank Martin Monsch, Nils Heinemann, Ozan Yildrim, and Patrick Krebs for excellent legal research and Joachim Posegga for access to a web proxy at the University of Passau. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References
[1] F. Al Maqbali and C. J. Mitchell. “Web Password Recovery: A Necessary Evil?” In: Proceedings of the Future Technologies Conference. Springer. 2018, pp. 324–341.
[2] R. Amos, G. Acar, E. Lucherini, M. Kshirsagar, A. Narayanan, and J. Mayer. “Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset.” In: Proceedings of The Web Conference 2021. WWW ’21. Association for Computing Machinery, Apr. 19, 2021, p. 22. doi: 10.1145/3442381.3450048.
[3] Art. 29 Data Protection Working Party. Opinion 5/2004 on unsolicited communications for marketing purposes under Article 13 of Directive 2002/58/EC. Feb. 2004.
[4] Austrian Data Protection Authority (Datenschutzbehörde). DSB-D130.073/0008-DSB/2019. https://gdprhub.eu/index.php?title=DSB_-_DSB-D130.073/0008-DSB/2019. 2019.
[5] Baden-Württemberg Data Protection Authority (LfDI Baden-Württemberg). LfDI - O 1018/115. https://gdprhub.eu/index.php?title=LfDI_-_O_1018/115. 2018.
[6] Y. Bakos, F. Marotta-Wurgler, and D. R. Trossen. “Does anyone read the fine print? Consumer attention to standard-form contracts.” In: The Journal of Legal Studies 43.1 (2014), pp. 1–35.
[7] D. Bui, K. G. Shin, J.-M. Choi, and J. Shin. “Automated Extraction and Presentation of Data Practices in Privacy
Policies.” In: Proceedings on Privacy Enhancing Technologies 2021.2 (2021), pp. 88–110.
[8] M. Chatzimpyrros, K. Solomos, and S. Ioannidis. “You Shall Not Register! Detecting Privacy Leaks Across Registration Forms.” In: Computer Security. Springer, 2019, pp. 91–104.
[9] J. Cohen. “A coefficient of agreement for nominal scales.” In: Educational and psychological measurement 20.1 (1960), pp. 37–46.
[10] M. Degeling, C. Utz, C. Lentzsch, H. Hosseini, F. Schaub, and T. Holz. “We Value Your Privacy... Now Take Some Cookies: Measuring the GDPR’s Impact on Web Privacy.” In: Network and Distributed Systems Security (NDSS) Symposium. 2019.
[11] Deutsche Bundestag. German Act against Unfair Competition (Gesetz gegen den unlauteren Wettbewerb) in the version published on 3 March 2010 (Federal Law Gazette I p. 254), as last amended by Article 1 of the Act of 10 August 2021 (Federal Law Gazette I, p. 3504). 2021.
[12] Deutsche Bundestag. German Telemedia Act (Telemediengesetz) in the version published on 26 February 2007 (Federal Law Gazette I p. 179, 251), as last amended by Article 3 of the Act of 12 August 2021 (Federal Law Gazette I, p. 3544). 2021.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding.” In: arXiv preprint arXiv:1810.04805 (2018).
[14] Directorate-General for the Information Society and Media (European Commission). ePrivacy Directive, assessment of transposition, effectiveness and compatibility with the proposed data protection regulation. doi:10.2759/419180. 2015.
[15] K. Drakonakis, S. Ioannidis, and J. Polakis. “The Cookie Hunter: Automated Black-box Auditing for Web Authentication and Authorization Flaws.” In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 1953–1970.
[16] L. Edwards. The New Legal Framework for E-Commerce in Europe. ISBN 978-1-847-31261-7, Hart Publishing, 2005.
[17] V. Emmerich and K. W. Lange. Unfair competition (Unlauterer Wettbewerb). ISBN 978-3-406-72639-2, C.H. Beck, 2019.
[18] S. Englehardt, J. Han, and A. Narayanan. “I never signed up for this! Privacy implications of email tracking.” In: Proceedings on Privacy Enhancing Technologies 2018.1 (2018), pp. 109–126.
[19] L. Epstein and A. D. Martin. An introduction to empirical legal research. Oxford University Press, 2014.
[20] European Commission. Guidance on the implementation/application of Directive 2005/29/EC on Unfair Commercial Practices. May 25, 2016.
[21] European Data Protection Board. Opinion 5/2019 on the interplay between the ePrivacy Directive and the GDPR, in particular regarding the competence, tasks and powers of data protection authorities. Mar. 2019.
[22] European Data Protection Board. Guidelines 05/2020 on consent under Regulation 2016/679 (GDPR). May 2020.
[23] European Parliament, Council of the European Union. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. 1995.
[24] European Parliament, Council of the European Union. Directive 2000/31/EC of the European Parliament and of the Council of 8 June 2000 on certain legal aspects of information society services, in particular electronic commerce, in the Internal Market (’Directive on electronic commerce’). June 8, 2000.
[25] European Parliament, Council of the European Union. Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002 concerning the processing of personal data and the protection of privacy in the electronic communications sector (Directive on privacy and electronic communications). 2002.
[26] European Parliament, Council of the European Union. Directive 2005/29/EC of the European Parliament and of the Council of 11 May 2005 concerning unfair business-to-consumer commercial practices in the Internal Market and amending Council Directive 84/450/EEC, Directives 97/7/EC, 98/27/EC and 2002/65/EC of the European Parliament and of the Council and Regulation (EC) No 2006/2004 of the European Parliament and of the Council (‘Unfair Commercial Practices Directive’). May 11, 2005.
[27] European Parliament, Council of the European Union. Directive 2006/114/EC of the European Parliament and of the Council of 12 December 2006 concerning misleading and comparative advertising. Dec. 12, 2006.
[28] N. Gelernter, S. Kalma, B. Magnezi, and H. Porcilan. “The password reset MitM attack.” In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE. 2017, pp. 251–267.
[29] J. Gluck, F. Schaub, A. Friedman, H. Habib, N. Sadeh, L. F. Cranor, and Y. Agarwal. “How short is too short? Implications of length and framing on the effectiveness of privacy notices.” In: Twelfth Symposium on Usable Privacy and Security (SOUPS 2016). 2016, pp. 321–340.
[30] I. J. Goodfellow, J. Shlens, and C. Szegedy. “Explaining and harnessing adversarial examples.” In: arXiv preprint arXiv:1412.6572 (2014).
[31] M. Hamin. "don’t ignore this:" Automating the Collection and Analysis of Campaign Emails. Tech. rep. Princeton University, 2018.
[32] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer. “Polisis: Automated analysis and presentation of privacy policies using deep learning.” In: 27th USENIX Security Symposium (USENIX Security 18). 2018, pp. 531–548.
[33] D. Jahnel. Legal commentary on the General Data Protection Regulation (GDPR) (Kommentar zur Datenschutz-Grundverordnung (DSGVO)), Art. 7 Conditions for consent (Bedingungen für die Einwilligung). ISBN 978-3-709-70178-2, Jan Sramek Verlag, 2021.
[34] A. Javanmard and M. Soltanolkotabi. “Precise statistical analysis of classification accuracies for adversarial training.” In: arXiv preprint arXiv:2010.11213 (2020).
[35] Judgement of the Court of Justice of the European Union from November 11, 2020. C-61/19, EU:C:2020:901. 2020.
[36] Judgement of the Court of Justice of the European Union from October 1, 2019. C-673/17, EU:C:2019:801. 2019.
[37] Judgement of the Federal Court of Justice (BGH) from February 1, 2018. III ZR 196/17. 2018.
[38] Judgement of the Federal Court of Justice (BGH) from July 10, 2018. VI ZR 225/17. 2018.
[39] Judgement of the Federal Court of Justice (BGH) from July 16, 2008. VIII ZR 348/06. 2008.
[40] Judgement of the Federal Court of Justice (BGH) from March 14, 2017. VI ZR 721/15. 2017.
[41] Judgement of the Federal Court of Justice (BGH) from May 28, 2020. I ZR 7/16. 2020.
[42] Judgement of the Higher Regional Court of Munich (OLG München) from February 15, 2018. 29 U 2799/17. 2018.
[43] P. Kast. Automating website registration for GDPR compliance analysis, Bachelor’s thesis, ETH Zurich. Bachelor’s Thesis. 2021.
[44] V. B. Kumar, R. Iyengar, N. Nisal, Y. Feng, H. Habib, P. Story, S. Cherivirala, M. Hagan, L. Cranor, S. Wilson, et al. “Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text.” In: Proceedings of The Web Conference 2020. 2020.
[45] Legal team of the Certified Senders Alliance. DOI: if not now, then when?! https://certified-senders.org/blog/doi-if-not-now-then-when/. 2017. (Visited on 08/25/2021).
[46] R. Liepina, G. Contissa, K. Drazewski, F. Lagioia, M. Lippi, H.-W. Micklitz, P. Palka, G. Sartor, and P. Torroni. “GDPR privacy policies in CLAUDETTE: Challenges of omission, context and multilingualism.” In: 3rd Workshop on Automated Semantic Analysis of Information in Legal Texts, ASAIL 2019. Vol. 2385. CEUR-WS. 2019.
[47] T. Linden, R. Khandelwal, H. Harkous, and K. Fawaz. “The privacy policy landscape after the GDPR.” In: Proceedings on Privacy Enhancing Technologies 2020.1 (2020), pp. 47–64.
[48] D. Machuletz and R. Böhme. “Multiple purposes, multiple problems: A user study of consent dialogs after GDPR.” In: Proceedings on Privacy Enhancing Technologies 2020.2 (2020), pp. 481–498.
[49] P. Mankowski. Legal commentary on the German Act against Unfair Competition (Kommentar zum Gesetz gegen den unlauteren Wettbewerb (UWG)), § 7 UWG Unacceptable nuisance (Unzumutbare Belästigungen), Par. 238, in K. Fezer, W. Büscher and E. Obergfell. Unfair competition law (Lauterkeitsrecht). 2016.
[50] Is email marketing dead? https://optinmonster.com/is-email-marketing-dead-heres-what-the-statistics-show/.
[51] Marketing email tracker 2019. https://dma.org.uk/uploads/misc/marketers-email-tracker-2019.pdf.
[52] A. Mathur, G. Acar, M. J. Friedman, E. Lucherini, J. Mayer, M. Chetty, and A. Narayanan. “Dark patterns at scale: Findings from a crawl of 11K shopping websites.” In: Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019), pp. 1–32.
[53] A. Mathur, M. Kshirsagar, and J. Mayer. “What makes a dark pattern... dark? Design attributes, normative considerations, and measurement methods.” In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 2021, pp. 1–18.
[54] A. Mathur, A. Wang, C. Schwemmer, M. Hamin, B. M. Stewart, and A. Narayanan. Manipulative tactics are the norm in political emails: Evidence from 100K emails from the 2020 U.S. election cycle. https://electionemails2020.org. 2020.
[55] C. Matte, N. Bielova, and C. Santos. “Do Cookie Banners Respect my Choice? Measuring Legal Compliance of Banners from IAB Europe’s Transparency and Consent Framework.” In: 2020 IEEE Symposium on Security and Privacy (SP). IEEE. 2020, pp. 791–809.
[56] A. M. McDonald and L. F. Cranor. “The cost of reading privacy policies.” In: ISJLP 4 (2008), p. 543.
[57] D. Mederle. The regulation of spam and unsolicited commercial emails (Die Regulierung von Spam und unerbetenen kommerziellen E-Mails). Heymanns, 2010. isbn: 3452272680.
[58] H. Micklitz and M. Schirmbacher. Legal commentary on the German Act against Unfair Competition (Kommentar zum Gesetz gegen den unlauteren Wettbewerb (UWG)), § 7 UWG Unacceptable nuisance (Unzumutbare Belästigungen), Par. 203 in G. Spindler and F. Schuster, Electronic Media Law, 4th edition 2019, (Recht der elektronischen Medien, 4. Aufl. 2019). 2019.
[59] H. Micklitz and M. Schirmbacher. Legal commentary on the German Telemedia Act (Kommentar zum Telemediengesetz (TMG)), § 4-6 TMG, in G. Spindler and F. Schuster, Electronic Media Law, 4th edition 2019, (Recht der elektronischen Medien, 4. Aufl. 2019). 2019.
[60] M. Nouwens, I. Liccardi, M. Veale, D. Karger, and L. Kagal. “Dark Patterns after the GDPR: Scraping Consent Pop-ups and Demonstrating their Influence.” In: Proceedings of the 2020 CHI conference on human factors in computing systems. 2020, pp. 1–13.
[61] J. Oh, J. Hong, C. Lee, J. J. Lee, S. S. Woo, and K. Lee. “Will EU’s GDPR Act as an Effective Enforcer to Gain Consent?” In: IEEE Access (2021).
[62] C. Routh, B. DeCrescenzo, and S. Roy. “Attacks and vulnerability analysis of e-mail as a password reset point.” In: 2018 Fourth International Conference on Mobile and Secure Services (MobiSecServ). IEEE. 2018, pp. 1–5.
[63] C. Santos, N. Bielova, and C. Matte. “Are cookie banners indeed compliant with the law? Deciphering EU legal requirements on consent and technical means to verify compliance of cookie banners.” In: Technology and Regulation (2020). 2019, pp. 91–135.
[64] J. Sim and C. C. Wright. “The kappa statistic in reliability studies: use, interpretation, and sample size requirements.” In: Physical therapy 85.3 (2005), pp. 257–268.
[65] M. Trevisan, S. Traverso, E. Bassi, and M. Mellia. “4 years of EU cookie law: Results and lessons learned.” In: Proceedings on Privacy Enhancing Technologies 2019.2 (2019), pp. 126–145.
[66] J. Weiser. “The possibility of using a partnership exchange can be “selling a service” in the sense of the UWG (Nutzungsmöglichkeit einer Partnerschaftsbörse kann “Verkauf einer Dienstleistung” im Sinne des UWG sein).” In: GRUR-Prax, (Gewerblicher Rechtsschutz und Urheberrecht, Praxis im Immaterialgüter- und Wettbewerbsrecht) 2018.10 (2018), p. 291.
[67] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, et al. “The creation and analysis of a website privacy policy corpus.” In: Proceedings of the 54th
A Appendix
A.1 Annotation process
Table 4. The individual Cohen’s κs of legal properties. Note that κ = 1 implies full agreement, while κ = −1 is full disagreement.

Checkbox        κ       Hashtag     κ
mark_consent    0.77    #tying12    0.12
mark_purpose    0.61    #tying13    1.00
ma_checkbox     0.77    #tying23    0.74

Table 6. Contingency tables of hashtag values. Rows represent the first annotation, the second annotation is depicted by the column.

(a) #tying12     True    False
True                1        7
False               7      985

(b) #tying13     True    False
True                0        0
False               0     1000

(c) #tying23     True    False
True               86       26
False              25      863

(d) #tying123    True    False
True                0        4
False               1      995
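As a worked check of how the κ values in Table 4 follow from the contingency tables in Table 6, the sketch below computes Cohen’s κ from a square table of counts. The function name is illustrative. The handling of the degenerate case, in which both annotations use a single label throughout and κ is formally 0/0, is an assumed convention chosen because it agrees with the 1.00 listed for #tying13.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table.

    Rows are the first annotation, columns the second, as in Table 6.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of items on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    if expected == 1.0:
        # Assumed convention: both annotations are constant, report full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


tying12 = [[1, 7], [7, 985]]      # Table 6 (a)
tying13 = [[0, 0], [0, 1000]]     # Table 6 (b)
tying23 = [[86, 26], [25, 863]]   # Table 6 (c)
print(round(cohens_kappa(tying12), 2))  # 0.12
print(round(cohens_kappa(tying23), 2))  # 0.74
```

The #tying12 case shows why κ can be low despite 98.6% raw agreement: almost all agreement falls on the dominant False/False cell, which chance alone would already produce.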
resolving annotation. Moreover, we would not have subscribed to many of the 701 websites.

Table 7. Results of logistic regression for legal properties based on DE dataset, with the percentage of positive samples (ps) in the last column. The confidence intervals are based on five-fold cross-validation.

A.2 Datasets content

We now elaborate on our analyses in Section 6, showing insights that help to understand the content of the datasets and to illustrate other potential applications of the dataset.
Note that following ethical principles, we had to redact our datasets. We removed all the URLs and credentials within both the email and website datasets. The redacted datasets suit the goals of automated potential violation detection as well as the full dataset.

Table 8. The figure shows the five most important features based on the model that decides ma_purpose property, i.e., if the form serves as an email subscription. We identify these features by the highest absolute values of the coefficients of a logistic regression model. A coefficient value interprets similarly as a correlation. A positive coefficient means the feature needs to be true for an email subscription form, while a negative coefficient signalizes a negative correlation between the feature and decision. The model correctly identified that forms without passwords serve more likely only as an email subscription. Such forms also more often contained multiple checkboxes.

Feature    Coefficient
[Figure 11: chart titled “Registration state”; the axis labels are not recoverable from the extraction.]
Fig. 11. Registration state of the resolved annotations. For agreement, Cohen’s κ for distinction between successful and failed registrations is 0.64.

marketing. In addition to these misclassifications, the information was incomplete in an additional 30 cases:
– missed a confirmation code in 3 emails,
– collected a wrong code in 1 email,
– found a wrong confirmation link in 16 emails, and
– another method of activation was specified in 10

[Figure 12: heatmap with a color scale from 0.0 to 0.6. The cell values are listed below; the columns follow the same order as the rows.]

mark_consent     1.0 .17 .84 .14 .08 .36 .01 .34 .30 .01 .29 .03 .00 .18 .01 .11 .02 .22 .06 .14
mark_purpose     1.0 1.0 .33 .04 .17 .29 .00 .29 .04 .00 .04 .02 .00 .04 .00 .17 .00 .08 .02 .04
ma_checkbox      1.0 .07 1.0 .17 .09 .38 .01 .36 .35 .01 .33 .03 .00 .21 .01 .11 .03 .23 .06 .14
ma_pre_checked   1.0 .05 1.0 1.0 .03 .20 .03 .17 .33 .03 .30 .00 .00 .20 .00 .12 .05 .33 .10 .17
ma_forced        1.0 .38 1.0 .05 1.0 .67 .00 .67 .38 .00 .33 .24 .00 .19 .10 .14 .00 .00 .00 .10
pp_checkbox      .45 .06 .40 .04 .06 1.0 .01 .97 .60 .01 .59 .04 .00 .50 .01 .01 .00 .01 .01 .17
pp_pre_checked   .67 .00 .67 .33 .00 1.0 1.0 .67 1.0 1.0 .67 .00 .00 1.0 .00 .00 .00 .00 .00 .33
pp_forced        .44 .07 .39 .03 .07 1.0 .01 1.0 .60 .01 .60 .02 .00 .50 .01 .01 .00 .01 .01 .17
tc_checkbox      .44 .01 .43 .07 .04 .70 .02 .68 1.0 .03 .97 .00 .01 .57 .01 .04 .00 .00 .02 .16
tc_pre_checked   .50 .00 .50 .17 .00 .50 .50 .33 1.0 1.0 .50 .00 .00 .50 .00 .00 .00 .00 .00 .17
tc_forced        .44 .01 .42 .07 .04 .71 .01 .71 1.0 .02 1.0 .00 .01 .58 .01 .04 .00 .00 .02 .16
#tying12         1.0 .12 1.0 .00 .62 1.0 .00 .62 .00 .00 .00 1.0 .00 .00 .00 .00 .00 .12 .00 .12
#tying13         1.0 .00 .00 .00 .00 .00 .00 .00 1.0 .00 1.0 .00 1.0 .00 .00 1.0 .00 .00 1.0 1.0
#tying23         .45 .02 .44 .07 .04 .99 .03 .97 .98 .03 .96 .00 .00 1.0 .00 .00 .00 .01 .03 .13
#tying123        1.0 .00 .67 .00 .67 .67 .00 .67 .67 .00 .67 .00 .00 .00 1.0 .00 .00 .33 .00 .00
#forcedpp        .53 .14 .44 .09 .05 .05 .00 .04 .12 .00 .12 .00 .02 .00 .00 1.0 .00 .00 .07 .18
#forcedtc        .17 .00 .17 .06 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 1.0 .00 .29 .09
#forcedpptc      .49 .03 .42 .10 .00 .02 .00 .02 .00 .00 .00 .01 .00 .01 .01 .00 .00 1.0 .29 .20
#hidden          .31 .02 .26 .07 .00 .06 .00 .06 .07 .00 .07 .00 .02 .06 .00 .07 .19 .67 1.0 .19
#age             .36 .02 .31 .07 .02 .36 .01 .35 .30 .01 .29 .01 .01 .13 .00 .10 .03 .24 .10 1.0

Fig. 12. Interdependence of legal properties as a ratio of annotations with the property of the row that has also the property of the column. A cell in the first row, second column, marks how many websites with marketing consent (row label) have the marketing purpose (column label).
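The ratio described in the Fig. 12 caption is a conditional share: among the annotations that have the row property, the fraction that also have the column property. A minimal sketch of that computation follows, assuming each annotation is represented as a set of property names; this representation and the function name are assumptions for illustration, not the released dataset format.

```python
def interdependence(annotations, properties):
    """Ratio of annotations with the row property that also have the column property.

    annotations: list of sets of property names (assumed representation).
    Returns a nested dict ratios[row][col]; rows with no annotations map to None.
    """
    ratios = {}
    for row in properties:
        with_row = [a for a in annotations if row in a]
        ratios[row] = {
            col: (sum(col in a for a in with_row) / len(with_row)) if with_row else None
            for col in properties
        }
    return ratios
```

The resulting matrix is not symmetric, because each row is normalized by its own count. This is visible in the figure's values, where the mark_consent row has .17 under mark_purpose while the mark_purpose row has 1.0 under mark_consent.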
[Figure: number of emails (y-axis “# of emails”, ticks from 500 to 4000) by email type. Legend: Marketing (all); Servicing (all); Marketing: newsletter; Marketing: notification; Marketing: survey; Servicing: opt-in; Servicing: confirmation; Servicing: updates and password reset mails. Plot annotations: 1st annotation round; 2nd round; Last annotations.]