On purpose and by necessity: compliance under the GDPR

On purpose and by necessity: compliance under the GDPR Basin et al., FC’18

A year ago it seemed like hardly anyone in a technical role had heard of GDPR. Now it seems to be front of mind for everyone! Not surprising perhaps, as it comes into force on the 25th May this year. In today’s paper choice, Basin et al. consider how an organisation might go about establishing compliance with the law.

[The GDPR] requires not only that data is only collected after obtaining consent from the user, but also that data is collected and used only for specific purposes, and must be deleted when those purposes are no longer applicable. The GDPR spells out these requirements in its notions of purpose limitation and data minimisation, its treatment of consent, and the right to be forgotten.

Personal data shall be collected for specified, explicit and legitimate purposes (and not further processed in a manner incompatible with them) — Article 5, §1
Personal data shall be adequate, relevant, and limited to what is necessary in relation to the purposes… — Article 5, §1
Consent should be given by a clear affirmative act establishing a freely given, specific, informed and unambiguous indication of the data subject’s agreement to the processing of personal data relating to him or her … when the processing has multiple purposes, consent should be given for all of them — Recital 32
In short, data must:

be collected for a purpose,
to which the user has consented, and
be necessary to achieve that purpose;
moreover, the collected data must be deleted when it is no longer necessary for any purpose.
As with any regulation that applies to computer systems, we are faced with two key questions: (1) How do we build a computer system in a manner guaranteeing compliance? (2) How do we analyse or audit a computer system for compliance?

In simple access control schemes, access rights are independent of context. In the world of GDPR, access control becomes relative to a purpose. I.e., it’s not just ‘can I access this data?,’ but ‘am I allowed to access this data for this purpose?’ If you equate purpose with role, then that looks a lot like RBAC. But the flows of data through an organisation can quickly become more complex.
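To make the contrast with RBAC concrete, here is a minimal sketch of purpose-relative access control. All of the names (`ConsentRecord`, `grant`, `permits`, the data classes and purposes) are hypothetical illustrations, not anything from the paper:

```python
# Hypothetical sketch: access decisions keyed on (data class, purpose)
# rather than on role alone.
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    # data_class -> set of purposes the data subject has consented to
    consents: dict = field(default_factory=dict)

    def grant(self, data_class: str, purpose: str) -> None:
        self.consents.setdefault(data_class, set()).add(purpose)

    def permits(self, data_class: str, purpose: str) -> bool:
        # Not just "can I access this data?" but "for this purpose?"
        return purpose in self.consents.get(data_class, set())


alice = ConsentRecord()
alice.grant("email", "purchase")
alice.grant("email", "mass_marketing")

print(alice.permits("email", "purchase"))           # consented purpose
print(alice.permits("email", "targeted_marketing"))  # no consent given
```

The point of the sketch is that the same accessor, touching the same data, gets different answers depending on the purpose they claim, which is exactly where plain role-based checks fall short.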

The central idea in the paper is that business process modelling can be a powerful tool for managing purpose and compliance. I’m not sure how practical that is when your architecture looks like a death star and you’re on a journey towards 10,000 deploys a day, but you’re certainly going to need some way of tracking and managing data, consent, and purpose.

We propose exploiting the formal notion of a business process model as a bridge between a system implementation and the GDPR. In doing so, we exploit that a business process model by its very nature embodies a particular purpose, while at the same time it specifies at what points data is collected and used.

Online retailer example
Consider an online retailer where customers order online, pay with their credit cards, and receive orders by post. The retailer may also engage in marketing — targeted or otherwise — using a variety of channels. Sufficiently zoomed out, the retailer has four core processes:

Registering customers: customers sign up and provide email, mailing address and credit card information. (Best not to store credit card information unless you really have to of course, and typically the sign-up would be deferred until the customer’s first checkout, but let’s roll with it… )
Purchase: a registered customer pays using their recorded credit card number, and the retailer sends the product and invoice.
Mass marketing: customers’ email and/or physical addresses are used to send otherwise untargeted advertising.
Targeted marketing: customers’ email and/or physical addresses are used to send individually targeted advertisements based on past purchase history.
You can lay these out in a high-level BPMN diagram, with one pool for each process (containing a mixture of human and automated process steps).

Here the assumption is that a process is associated with a purpose: the purchase process (purpose); the mass marketing process (purpose); the customer satisfaction evaluation process (purpose); and so on.

Interacting processes
In practice a company may collect data about customers in one process and use that data in another… This disconnect mirrors a challenge faced by many companies: whereas the individual processes within a company are usually well-understood by the staff undertaking them, including the interfaces to other processes, the global picture of all processes in the company is rarely well understood. But the GDPR requires such a global understanding: data collected in one process may migrate to other processes, and end-user consent is required for all involved processes.

I take a somewhat different viewpoint here: for many organisations, I contend that an approach requiring a complete, accurate, up-to-date, global understanding of all processes and data flows is probably doomed to failure, as this is an impossible task! And yet we can’t escape the challenges the authors (and the regulations) place before us. My personal sense is that we need to track the provenance of data as it flows through an organisation, including proof of consent and the purposes for which consent was given, and then match the purpose of a process against that consent record. (Having the provenance flow with the data allows for local decisions, rather than requiring a centralised master plan).

Regardless of how you do it, you are at some point going to end up with a collection of business processes and a set of data classes. You’ll need to know which data is collected by which processes, and which data is used by which processes (i.e., for which purposes). The data production (collection) and usage relations can be used to derive the user consents that are needed.

Instead of the full-blown BPMN model, you can represent the essential information in a high level inter-process diagram which hides the internal details of the processes:

Establishing GDPR compliance
Given such a top level set of business processes/purposes, then demonstrating compliance involves:

showing that the processes as implemented collect and use data as specified by the process definitions and privacy policy
showing that the processes follow the GDPR, for example, by deleting data as appropriate
showing that the privacy policy conforms to the GDPR. For example, it does not make vague statements about the purposes for which data will be used.
(Aside: maybe this helps a little bit, but I’m not sure it really helps a whole lot. It’s not much deeper than simply saying, “you’re compliant if you have a compliant policy and you follow it correctly.”)

Given the usage relation between data and processes (purposes), then the consent statement can take the form “we collect d_1, d_2 for purpose p_1 and we collect d_1, d_3, and d_4 for purpose p_2” (and so on).

Furthermore the model can show us if there is personal data which is collected but not used by any process: “such data is clearly unnecessary, violating data minimisation.”
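Assuming the collection and usage relations are available as simple sets (the purpose and data-class names below are illustrative), both deriving the consent statement and flagging unnecessary data are a few lines:

```python
# Illustrative sketch: derive required consents and flag unused data from
# the collection and usage relations. All names here are hypothetical.

# Which data classes each process (purpose) collects / uses.
collects = {"p1": {"d1", "d2"}, "p2": {"d1", "d3", "d4"}, "p3": {"d5"}}
uses     = {"p1": {"d1", "d2"}, "p2": {"d1", "d3", "d4"}}

# Consent statement: for each purpose, consent is needed for the data it uses,
# e.g. "we collect d1, d2 for purpose p1".
required_consents = {p: sorted(ds) for p, ds in uses.items()}

# Data minimisation: data collected by some process but used by none.
collected = set().union(*collects.values())
used = set().union(*uses.values())
unnecessary = collected - used  # such data violates data minimisation

print(required_consents)
print(unnecessary)
```

Here `d5` is collected by `p3` but used nowhere, so the model surfaces it as a minimisation violation.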

Third parties
An important area for future work concerns data transfers to third parties. The GDPR has precise rules about who may transfer data to other parties, when these transfers can occur, and under what circumstances other parties can or must delete, produce, or store data. Naturally this opens up questions about audits and compliance similar to the ones addressed in this paper.

Discussion
As I indicated earlier in this post, I’m personally not convinced that centralised business process mapping is going to be the answer here. Nevertheless, the paper sets out nicely the concerns relating to processes, data, consent, and purpose and provides a framework for thinking about them. For more GDPR related material on The Morning Paper, see ‘European Union regulations on algorithmic decision making and a right to explanation’ and ‘Ground: A data context service.’


Linux Performance Analysis in 60,000 Milliseconds

Netflix Technology Blog, Netflix TechBlog, Nov 30, 2015
You log in to a Linux server with a performance issue: what do you check in the first minute?

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to log in to an instance and run some standard Linux performance tools.

First 60 Seconds: Summary
In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available. In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
Some of these commands require the sysstat package to be installed. The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks. This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, etc.). Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow-on investigation.
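The ten commands can be wrapped in a small script for convenience. This is just a sketch: it bounds each sampling command to five samples so the script terminates (the post runs them open-ended until Ctrl-C), runs top in batch mode, and skips tools that aren't installed:

```python
# Sketch: run the 60-second checklist non-interactively. Sampling commands
# are bounded to 5 samples (an adaptation; the post runs them open-ended).
import shutil
import subprocess

CHECKLIST = [
    "uptime",
    "dmesg | tail",
    "vmstat 1 5",
    "mpstat -P ALL 1 5",
    "pidstat 1 5",
    "iostat -xz 1 5",
    "free -m",
    "sar -n DEV 1 5",
    "sar -n TCP,ETCP 1 5",
    "top -b -n 1",        # batch mode, one iteration, for capture
]


def run_checklist() -> None:
    for cmd in CHECKLIST:
        tool = cmd.split()[0]
        if shutil.which(tool) is None:
            # mpstat, pidstat, iostat and sar come from the sysstat package
            print(f"# {tool} not found; is sysstat installed?")
            continue
        print(f"\n=== {cmd} ===")
        subprocess.run(cmd, shell=True)
```

Calling `run_checklist()` on a server gives you a copy-and-pasteable record of the whole first minute, which is handy to attach to an incident ticket.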

The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages.

1. uptime
$ uptime
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools. Worth a quick look only.

The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant. The three numbers give us some idea of how load is changing over time. For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue.

In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value. That the numbers are this large means a lot of something: probably CPU demand; use vmstat or mpstat (commands 3 and 4 in this sequence) to confirm.

2. dmesg | tail
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[…]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.
This shows the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request.

Don’t miss this step! dmesg is always worth checking.

3. vmstat 1
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0
^C
Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line.

vmstat was run with an argument of 1, to print one second summaries. The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second. For now, skip the first line, unless you want to learn and remember which column is which.

Columns to check:
r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.
free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command, included as command 7, better explains the state of free memory.
si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest’s own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.

System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.

In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead. The CPUs are also well over 90% utilized on average. This isn’t necessarily a problem; check for the degree of saturation using the “r” column.
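The checks above can be applied mechanically to one of the sample lines. A small sketch (the column positions are those of the vmstat output shown earlier; the CPU count of 32 comes from the mpstat header later in the post):

```python
# Sketch: pull the CPU columns out of a vmstat data line and apply the
# checks described above. The line is from the sample output earlier.
line = "32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0"
fields = line.split()

r = int(fields[0])                               # runnable processes
us, sy, idle, wa, st = (int(x) for x in fields[12:17])

busy = us + sy            # user + system time: are the CPUs busy?
cpu_count = 32            # from the mpstat header: (32 CPU)
saturated = r > cpu_count  # "r" above the CPU count means CPU saturation

print(f"busy={busy}% wait_io={wa}% saturated={saturated}")
```

For this line the CPUs are 99% busy, almost entirely in user time, and r equals the CPU count, so the system is fully utilized but not (by this rule of thumb) saturated.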

4. mpstat -P ALL 1
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[…]
This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.

5. pidstat 1
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat

07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C
Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

The above example identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that that java process is consuming almost 16 CPUs.

6. iostat -xz 1
$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[…]
^C
This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for:

r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
%util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.

Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).
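The rules of thumb from this section can be applied directly to a device line from the sample output. A sketch (column positions follow the iostat -xz header shown above; the thresholds are heuristics, not hard limits):

```python
# Sketch: apply the iostat interpretation rules to one device line from
# the sample output above. Thresholds are rules of thumb only.
line = ("xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 "
        "0.00 0.43 1.78 0.28 0.25 0.25")
f = line.split()

# Columns: Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
#          avgqu-sz await r_await w_await svctm %util
dev = f[0]
r_s, w_s = float(f[3]), float(f[4])
rkB_s, wkB_s = float(f[5]), float(f[6])
await_ms, util = float(f[9]), float(f[13])

print(f"{dev}: {r_s + w_s:.1f} IOPS, {rkB_s + wkB_s:.1f} kB/s, "
      f"await {await_ms} ms, util {util}%")
if util > 60:
    print("possible poor performance; check await")
```

Here xvdb is doing about 10 IOPS at sub-millisecond await and well under 1% utilization, so by these rules it is nowhere near saturation.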

7. free -m
$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0
The right two columns show:

buffers: For the buffer cache, used for block device I/O.
cached: For the page cache, used by file systems.
We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.

The “-/+ buffers/cache” line provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does. There’s even a website, linuxatemyram, about this confusion.
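The arithmetic behind that line is simply moving the reclaimable buffer and page cache from “used” to “free”. Using the figures from the table above (free -m rounds to whole megabytes, so the results match the table to within a megabyte):

```python
# The "-/+ buffers/cache" arithmetic: reclaimable buffer and page cache
# is moved from "used" to "free". Figures in MB, from the free -m output.
total, used, free = 245998, 24545, 221453
buffers, cached = 59, 541

used_real = used - buffers - cached  # memory applications actually hold
free_real = free + buffers + cached  # memory effectively available

print(used_real, free_real)  # matches the -/+ line to within MB rounding
```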

It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.

8. sar -n DEV 1
$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C
Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached. In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).
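The conversion behind that comparison: sar reports kilobytes per second, while interface limits are quoted in bits per second (using decimal units, as is conventional for network capacity):

```python
# Unit conversion from the example above: sar reports kB/s, interface
# limits are quoted in (decimal) bits per second.
rx_kB_s = 21999.10            # eth0 rxkB/s from the sample output
mbit_s = rx_kB_s * 8 / 1000   # kB/s -> Mbit/s (decimal units)
gbit_limit_mbit = 1000        # a 1 Gbit/s interface

print(f"{mbit_s:.0f} Mbit/s, {100 * mbit_s / gbit_limit_mbit:.0f}% "
      f"of a 1 Gbit/s limit")
```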

This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. As with nicstat, this is hard to get right, and it seems not to be working in this example (0.00).

9. sar -n TCP,ETCP 1
$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00

12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00

12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00

12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C
This is a summarized view of some key TCP metrics. These include:

active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).
passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).
retrans/s: Number of TCP retransmits per second.
The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).

Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example above shows just one new TCP connection per second.

10. top
$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
The top command includes many of the metrics we checked earlier. It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.

A downside to top is that it is harder to see patterns over time, which may be more clear in tools like vmstat and pidstat, which provide rolling output. Evidence of intermittent issues can also be lost if you don’t pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue), and the screen clears.

Follow-on Analysis
There are many more commands and methodologies you can apply to drill deeper. See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at web scale is one of our passions. If you would like to join us in tackling these kinds of challenges we are hiring!

See Also:
Introducing Atlas: Netflix’s Primary Telemetry Platform
Introducing Vector: Netflix’s On-Host Performance Monitoring Tool
Netflix at Velocity 2015: Linux Performance Tools
Originally published at techblog.netflix.com on November 30, 2015.


Hessische Zentrale für Datenverarbeitung

The HZD is a state enterprise (Landesbetrieb) under § 26 of the Hessian State Budget Code. It is subject to the administrative and technical supervision of the Hessian Ministry of Finance. Insofar as it performs tasks for the administration, the courts, or the public prosecutors’ offices that do not fall within the remit of the Hessian Ministry of Finance, it is subject to the technical supervision of the competent supreme state authority. Compliance of the HZD’s activities with all provisions safeguarding the IT security of the Hessian judiciary’s data is monitored by the IT office of the Hessian judiciary.

The computing power actually delivered in the data centre only becomes apparent once you enter the “hallowed” halls of the HZD. Under the highest security standards, with electronic, video-monitored access control systems, staff members are admitted individually through security airlocks, checked by security-vetted personnel using chip cards with stored photographs.

With an oxygen content of only about 17%, fire protection is exceptionally high; at this level a match, for example, can no longer be lit. If necessary, the oxygen content can be lowered to almost 13% by flooding the room with nitrogen. Fully redundant power feeds with complete failover protection and a highly refined cooling procedure, flanked by two in-house combined heat and power plants and emergency diesel generators, multiply the resilience and data security of the Wiesbaden data centre.

That operational processes run as smoothly as possible is owed to the ITIL standards that apply to the HZD as a body of rules and definitions.

Administering roughly 3,000 servers with about 500 terabytes (TB) of central storage, 4,500 network components, and 150,000 network connections, with around 25,000 changes per year across the 230 products and procedures in its care, demands the competence of the staff and demonstrates the effective computing power of the HZD.

On purpose and by necessity: compliance under the GDPR

On purpose and by necessity: compliance under the GDPR

The morning paper
an interesting/influential/important paper from the world of CS every weekday morning, as selected by Adrian Colyer
On purpose and by necessity: compliance under the GDPR
On purpose and by necessity: compliance under the GDPR Basin et al., FC’18

A year ago it seemed like hardly anyone in a technical role had heard of GDPR. Now it seems to be front of mind for everyone! Not surprising perhaps, as it comes into force on the 25th May this year. In today’s paper choice, Basin et al. consider how an organisation might go about establishing compliance with the law.

[The GDPR] requires not only that data is only collected after obtaining consent from the user, but also that data is collected and used only for specific purposes, and must be deleted when those purposes are no longer applicable. The GDPR spells out these requirements in its notions of purpose limitation and data minimisation, its treatment of consent, and the right to be forgotten.

Personal data shall be collected for specified, explicit and legitimate purposes (and not further processed in a manner incompatible with them)— Article 5, §1
Personal data shall be adequate, relevant, and limited to what is necessary in relation to the purposes…— Article 5, §1
Consent should be given by a clear affirmative act establishing a freely given, specific, informed and unambiguous indication of the data subject’s agreement to the processing of personal data relating to him or her … when the processing has multiple purposes, consent should be given for all of them — Recital 32
In short, data must:

be collected for a purpose,
to which the user has consented, and
be necessary to achieve that purpose;
moreover the collected data must be deleted when it is no longer necessary for any purpose
As with any regulation that applies to computer systems, we are faced with two key questions: (1) How do we build a computer system in a manner guaranteeing compliance? (2) How do we analyse or audit a computer system for compliance?

In simple access control schemes, access rights are independent of context. In the world of GDPR, access control becomes relative to a purpose. I.e., it’s not just ‘can I access this data?,’ but ‘am I allowed to access this data for this purpose?’ If you equate purpose with role, then that looks a lot like RBAC. But the flows of data through an organisation can quickly become more complex.

The central idea in the paper is that business process modelling can be a powerful tool for managing purpose and compliance. I’m not sure how practical that is when your architecture looks like a death star and you’re on a journey towards 10,000 deploys a day, but you’re certainly going to need some way of tracking and managing data, consent, and purpose.

We propose exploiting the formal notion of a business process model as a bridge between a system implementation and the GDPR. In doing so, we exploit that a business process model by its very nature embodies a particular purpose, while at the same time it specifies at what points data is collected and used.

Online retailer example
Consider an online retailer where customers order online, pay with their credit cards, and receive orders by post. The retailer may also engage in marketing — targeted or otherwise — using a variety of channels. Sufficiently zoomed out, the retailer has four core processes:

Registering customers: customers sign up and provide email, mailing address and credit card information. (Best not to store credit card information unless you really have to of course, and typically the sign-up would be deferred until the customer’s first checkout, but let’s roll with it… )
Purchase: a registered customer pays using their recorded credit card number, and the retailer sends the product and invoice.
Mass marketing: customers’ email and/or physical addresses are used to send otherwise untargeted advertising.
Targeted marketing: customers’ email and/or physical addresses are used to send individually targeted advertisements based on past purchase history
You can lay these out in a high-level BPMN diagram, with one pool for each process (containing a mixture of human and automated process steps).
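A toy encoding of those four processes as collection and usage relations (the paper uses BPMN pools; the dictionary representation and names below are my own simplification):

```python
# The four retailer processes as collection/usage relations
# (a toy encoding of the example above; names are hypothetical).
collects = {
    "register": {"email", "address", "credit_card"},
    "purchase": {"purchase_history"},
}
uses = {
    "purchase":           {"email", "address", "credit_card"},
    "mass_marketing":     {"email", "address"},
    "targeted_marketing": {"email", "address", "purchase_history"},
}

# Which processes (purposes) touch the credit card number?
card_users = {p for p, used in uses.items() if "credit_card" in used}
print(card_users)  # {'purchase'}
```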

Here the assumption is that a process is associated with a purpose: the purchase process (purpose); the mass marketing process (purpose); the targeted marketing process (purpose); and so on.

Interacting processes
In practice a company may collect data about customers in one process and use that data in another… This disconnect mirrors a challenge faced by many companies: whereas the individual processes within a company are usually well-understood by the staff undertaking them, including the interfaces to other processes, the global picture of all processes in the company is rarely well understood. But the GDPR requires such a global understanding: data collected in one process may migrate to other processes, and end-user consent is required for all involved processes.

I take a somewhat different viewpoint here: for many organisations, I contend that an approach requiring a complete, accurate, up-to-date, global understanding of all processes and data flows is probably doomed to failure, as this is an impossible task! And yet we can’t escape the challenges the authors (and the regulations) place before us. My personal sense is that we need to track the provenance of data as it flows through an organisation, including proof of consent and the purposes for which consent was given, and then match the purpose of a process against that consent record. (Having the provenance flow with the data allows for local decisions, rather than requiring a centralised master plan).
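As a sketch of what I mean (this is my speculation, not the paper’s proposal, and all names are hypothetical): each datum could carry the purposes for which consent was given, so any process can make a local decision without consulting a global process map:

```python
from dataclasses import dataclass

# Provenance travelling with the data (my speculation, not the paper's
# proposal): each value carries the purposes it was consented for, so a
# process can decide locally whether it may use the value.
@dataclass(frozen=True)
class Tagged:
    value: str
    consented_purposes: frozenset

def use(datum: Tagged, purpose: str) -> str:
    if purpose not in datum.consented_purposes:
        raise PermissionError(f"no consent for purpose {purpose!r}")
    return datum.value

email = Tagged("alice@example.com", frozenset({"purchase", "mass_marketing"}))
use(email, "purchase")               # allowed
# use(email, "targeted_marketing")   # would raise PermissionError
```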

Regardless of how you do it, you are at some point going to end up with a collection of business processes and a set of data classes. You’ll need to know which data is collected by which processes, and which data is used by which processes (i.e., for which purposes). The data production (collection) and usage relations can be used to derive the user consents that are needed.

Instead of the full-blown BPMN model, you can represent the essential information in a high-level inter-process diagram which hides the internal details of the processes:

Establishing GDPR compliance
Given such a top-level set of business processes/purposes, then demonstrating compliance involves:

showing that the processes as implemented collect and use data as specified by the process definitions and privacy policy;
showing that the processes follow the GDPR, for example, by deleting data as appropriate;
showing that the privacy policy conforms to the GDPR, for example, that it does not make vague statements about the purposes for which data will be used.
(Aside: maybe this helps a little bit, but I’m not sure it really helps a whole lot. It’s not much deeper than simply saying, “you’re compliant if you have a compliant policy and you follow it correctly.”)

Given the usage relation between data and processes (purposes), then the consent statement can take the form “we collect d_1, d_2 for purpose p_1 and we collect d_1, d_3, and d_4 for purpose p_2” (and so on).
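Such statements can be generated mechanically from the usage relation. A sketch (toy data; the data classes and purposes are hypothetical):

```python
# Deriving consent statements from the usage relation between data
# classes and processes/purposes (toy data; names are hypothetical).
uses = {
    "purchase":           {"email", "address", "credit_card"},
    "mass_marketing":     {"email", "address"},
    "targeted_marketing": {"email", "address", "purchase_history"},
}

statements = [
    f"we collect {', '.join(sorted(data))} for purpose {purpose}"
    for purpose, data in sorted(uses.items())
]
for s in statements:
    print(s)
```

Keeping the consent text derived from (rather than drifting away from) the actual usage relation is one way to keep policy and implementation in step.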

Furthermore the model can show us if there is personal data which is collected but not used by any process: “such data is clearly unnecessary, violating data minimisation.”
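That check amounts to a simple set difference over the same relations (again a toy sketch with hypothetical names):

```python
# Data minimisation check: flag personal data that is collected but
# used by no process (toy relations; names are hypothetical).
collected = {"email", "address", "credit_card", "phone_number"}
uses = {
    "purchase":       {"email", "address", "credit_card"},
    "mass_marketing": {"email", "address"},
}

used = set().union(*uses.values())
unnecessary = collected - used
print(unnecessary)  # {'phone_number'}: collected but serving no purpose
```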

Third parties
An important area for future work concerns data transfers to third parties. The GDPR has precise rules about who may transfer data to other parties, when these transfers can occur, and under what circumstances other parties can or must delete, produce, or store data. Naturally this opens up questions about audits and compliance similar to the ones addressed in this paper.

Discussion
As I indicated earlier in this post, I’m personally not convinced that centralised business process mapping is going to be the answer here. Nevertheless, the paper sets out nicely the concerns relating to processes, data, consent, and purpose and provides a framework for thinking about them. For more GDPR related material on The Morning Paper, see ‘European Union regulations on algorithmic decision making and a right to explanation’ and ‘Ground: A data context service.’