Data science is an interdisciplinary field focused on extracting information and knowledge from structured or unstructured data. By combining aspects of mathematics, statistics, computer science, behavioural science and visualization, data science generates new insights from data collected by companies and other entities. Discovering patterns or other useful information can increase a company’s competitive advantage, improve its efficiency or reveal market opportunities.
Generally, the knowledge produced by data science represents a new type of company asset. This data, however, needs to be collected and protected according to the new EU GDPR.
The data science process includes five steps: (1) defining what matters to the data scientist, (2) collecting the data, (3) exploring the data, (4) modelling the data, and (5) communicating and visualizing the result. GDPR rules pose the biggest challenges in the first, second and fifth steps of this process, so these are the ones this article covers.
Purpose of data processing activities in data science
The first step of the data science process is identifying a goal (e.g. a company wants to identify or refine its target audience). This goal defines the purpose of the data processing activities. The company should also state what data will be collected and in what form. Data such as key performance indicators, consumer behaviour data or raw data (directly from a computer, e.g. server logs or sensor data) can help a company gain new insights. This raises a challenge from a privacy compliance perspective: GDPR states that if personal data will be processed, the data subject should be informed about the data processing activities and, where consent is the legal basis for processing, give consent for each processing purpose.
The right to be informed. Data science techniques can help draw conclusions or build profiles based on various data, such as an individual’s location and offline/online shopping behaviour. In some cases, individuals voluntarily give away their personal data, but in most cases they are observed without knowing it. The GDPR emphasizes that individuals have the right to be informed, prior to data collection, about how their data is to be used and, most importantly, why it is being collected. This requirement can be met by communicating with individuals (face-to-face or online) or by preparing a procedure in which items like the purpose of collection, the categories of personal data and the contact details of the controller are presented. It is also very important to define what new information about the individual the company expects to create in the data science process.
Consent. To comply with the GDPR principles, data scientists should identify what personal data will be collected, in what manner and for what purpose. They should then inform the data subject and ask for consent. The request for consent must be made in an intelligible and easily accessible form, with the purpose of the data processing attached to that consent. It should also be presented in a manner which is clearly distinguishable from other matters, using clear and plain language, and must list all the applications of the collected data. Given that the main goal of any data scientist is to find new uses for data, this introduces a huge change: creating new applications for existing data will not be possible without consent. For possible exceptions, see the article Is Consent needed? Six Legal Basis to Process Data According to GDPR.
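The idea that consent is tied to specific purposes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ConsentRecord` class, the `may_process` helper and the purpose names are hypothetical and are not prescribed by GDPR or by any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    """Hypothetical record of the purposes one data subject has agreed to."""
    subject_id: str
    purposes: set[str] = field(default_factory=set)


def may_process(record: ConsentRecord, purpose: str) -> bool:
    """Return True only if this exact purpose is listed in the consent record."""
    return purpose in record.purposes


# Illustrative purpose names; a real system would define its own vocabulary.
record = ConsentRecord("subject-001", {"order_fulfilment", "product_survey"})

print(may_process(record, "product_survey"))  # purpose was consented to
print(may_process(record, "ad_targeting"))    # new application, no consent
```

A check like this would sit in front of any new analysis of existing data, so that a new application is blocked until consent (or another legal basis) has been established for it.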
GDPR principles for collecting activities in data science
The second step of the data science process is collecting the data. When it comes to collecting personal data, the main GDPR principles that should be observed are:
Data minimization. As stated before, new applications of collected data without consent are not permitted. GDPR requires that only data relevant to the purpose be collected. For data scientists, this means defining in advance which data, and how much of it, will be collected. When performing scientific research, some exceptions are allowed, but if data is collected for a purpose such as marketing, data minimization should be observed. For example, if a data controller is interested in a customer’s preference regarding a particular product, data such as marital status, salary, e-mail address or telephone number is not necessary to find out the customer’s perception of the product.
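The product preference example above could be enforced at collection time with a simple allow-list. The following is a minimal sketch; the field names and the `minimize` helper are assumptions made for illustration only.

```python
# Hypothetical allow-list: the only fields assumed necessary for a
# product-preference analysis.
REQUIRED_FIELDS = {"customer_id", "product_id", "rating"}


def minimize(record: dict) -> dict:
    """Keep only the fields on the allow-list and drop everything else."""
    return {key: value for key, value in record.items() if key in REQUIRED_FIELDS}


# Illustrative raw record containing more personal data than the purpose needs.
raw = {
    "customer_id": "c-17",
    "product_id": "p-3",
    "rating": 4,
    "marital_status": "married",
    "salary": 52000,
    "email": "user@example.com",
    "phone": "000-000-0000",
}

clean = minimize(raw)
print(clean)
```

Using an allow-list rather than a block-list means that any field added to the source later is excluded by default until someone decides it is needed for the stated purpose.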
Storage limitation. Because they are responsible for the protection of collected data (storage, all backup files, all versions and copies, etc.), data scientists should be aware of the implications of the storage limitation principle. Collected data should be accurate and kept up to date. It should also be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data is processed. GDPR does not demand anonymization in every case, but once identification is no longer necessary the data should be deleted or anonymized, and pseudonymization is encouraged as a safeguard in the meantime. Data scientists who rely on anonymization must be prepared to demonstrate that re-identification is not reasonably possible. They should also understand how the data is stored and evaluate whether it is exposed to privacy violations. Their main concerns should be to identify the type of data (structured, unstructured, generated by data science techniques) and to define a road map for the data that covers steps such as entrance, storage and especially termination (deletion) of the data.
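As a rough sketch of how storage limitation might look in practice, the Python below flags records that have outlived a retention period and replaces a direct identifier with a keyed hash. The 365-day period, the function names and the key are assumptions for illustration; GDPR does not set a fixed retention period. Note also that a keyed hash is pseudonymization, not anonymization: whoever holds the key can still link records to a person.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Assumed retention period for this example only.
RETENTION = timedelta(days=365)


def is_expired(collected_on: date, today: date) -> bool:
    """True if the record has been kept longer than the retention period."""
    return today - collected_on > RETENTION


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 hash (HMAC), as hex."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative key; a real key would be generated and stored separately
# from the data set.
key = b"example-secret-key"

print(is_expired(date(2023, 1, 1), today=date(2024, 6, 1)))
print(is_expired(date(2024, 5, 1), today=date(2024, 6, 1)))
print(pseudonymize("user@example.com", key))
```

Running the expiry check over backups and derived copies as well as the primary store matters, since the principle applies to every copy of the data.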
GDPR principle for automated processing and profiling
The fifth step of the data science process is communication and visualization of the result, followed by implementation of the results. According to GDPR, the data subject has “the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.” This requirement is directly connected to customer profiling and the use of profiling algorithms on collected data. Individuals must be informed if customer profiling will be implemented, and what possible material impact it could have on them. If a customer is requesting credit, they cannot opt out of the bank’s profiling process, but they can ask not to be profiled for automated decisions regarding new bank services. An example of such an automated decision: when a customer has a specific amount of cash deposited in a bank, the bank automatically decides to offer them credit.
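The bank example can be sketched as a decision function that checks the customer’s opt-out before any automated rule runs. Everything here is hypothetical: the `Customer` class, the threshold value and the returned labels are assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical deposit threshold above which a credit offer is generated.
OFFER_THRESHOLD = 10_000.0


@dataclass
class Customer:
    customer_id: str
    deposit: float
    opted_out_of_automated_offers: bool = False


def credit_offer_decision(customer: Customer) -> str:
    """Apply the automated rule only to customers who have not opted out."""
    if customer.opted_out_of_automated_offers:
        # No automated decision is taken for this customer.
        return "no_automated_decision"
    if customer.deposit >= OFFER_THRESHOLD:
        return "offer_credit"
    return "no_offer"


print(credit_offer_decision(Customer("c-1", 25_000.0)))
print(credit_offer_decision(Customer("c-2", 25_000.0, opted_out_of_automated_offers=True)))
print(credit_offer_decision(Customer("c-3", 500.0)))
```

Placing the opt-out check first ensures the profiling rule is never evaluated for customers who have objected, rather than evaluating it and discarding the result afterwards.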
Start preparing
A business’s interest in data analysis usually lies in finding correlations across multiple data sources, predicting customer behaviour or predicting product sales. Businesses depend more and more on data, and there is a trend towards using data science to handle enormous amounts of it. GDPR introduces measures to secure and protect data and data subjects. Following these measures also helps a business avoid the embarrassment of a data leak.