4  Open Data

Open data can be published in many ways: in the buffet of data-sharing possibilities, there’s bound to be something you like. In general, as with other open science practices, it’s good to keep the “as open as possible, as closed as necessary” principle in mind.

4.1 Privacy

From this principle, it follows that there are circumstances where you cannot publish data at all, namely when anonymization is impossible. The General Data Protection Regulation (GDPR) defines personal data as data about an identified or identifiable person. Anonymous data, then, are data that are not about an identified or identifiable person.

A person is identified when it is clear who the person is. A person is identifiable when it is possible, with reasonable effort, to find out who they are. For example, if I tell you that a person’s age is 62, that they identify as Buddhist, and that they live in a rural location, that doesn’t tell you who they are. When I add that they identify as nonbinary, that they lost a leg in a war, and the name of the village of 500 inhabitants where they live, they become identifiable.

For research, this means that whether data are personal data is the consequence of three things:

  1. how much data you collected;
  2. in which ‘resolution’ you collected the data; and
  3. your sampling frame.

The more data you collect, the easier identification becomes. For example, video data is so rich (‘much’) that it’s often considered personal data. Answers to questions in an online questionnaire, however, lend themselves much less to identification.

If you register age in years, the resolution of your data is higher than if you use decade-wide categories. In the latter situation, age is unlikely to contribute to identifiability of your participants.
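To make the idea of lowering the resolution concrete, here is a minimal sketch in Python that recodes age in years into decade-wide categories. The function name `age_to_decade` and the example ages are made up for illustration; they do not come from a real data set.

```python
def age_to_decade(age: int) -> str:
    """Map an age in years to a decade-wide category such as '60-69'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // 10) * 10  # lower bound of the decade the age falls in
    return f"{lower}-{lower + 9}"


# Hypothetical ages in years (illustrative values only)
ages = [62, 67, 23, 29, 45]

# The same ages at a lower resolution
decades = [age_to_decade(a) for a in ages]
print(decades)
```

After recoding, the participants aged 62 and 67 can no longer be told apart on the basis of age alone, and the same holds for the participants aged 23 and 29.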

Finally, if you sample from all residents in a country with millions of residents, the risk of identification is much lower than when you sample from, for example, all first-year psychology students.

There’s also the k-anonymity approach to anonymity, which is explained in a bit more detail here.
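As a rough illustration of the idea: a data set is usually called k-anonymous when every combination of potentially identifying values (the 'quasi-identifiers') is shared by at least k records. The sketch below computes that k for a small, made-up data set. The function name `k_anonymity`, the column names, and the records are all assumptions for the sake of the example, and which columns count as quasi-identifiers is a judgment call that depends on your study.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all quasi-identifiers."""
    if not records:
        raise ValueError("records must not be empty")
    # Count how many records share each combination of values
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical records (illustrative values only)
records = [
    {"age": "60-69", "religion": "Buddhist", "location": "rural"},
    {"age": "60-69", "religion": "Buddhist", "location": "rural"},
    {"age": "20-29", "religion": "none", "location": "urban"},
    {"age": "20-29", "religion": "none", "location": "urban"},
    {"age": "20-29", "religion": "none", "location": "urban"},
]

k = k_anonymity(records, ["age", "religion", "location"])
print(k)
```

In this made-up example the smallest group has two records, so the data are 2-anonymous with respect to these three columns. A value of 1 would mean that at least one participant has a unique combination of values.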

Ideally, then, you sample as widely as possible, and register as little as possible. In cases where you’re not certain, make sure you consult your organization’s privacy officer.

4.2 Being FAIR

As anybody who’s ever tried to work with published data can attest, publishing data isn’t just a matter of uploading some files. The files have to be intelligible to others (and maybe more importantly, to future you), with as little effort as possible.

To cover your bases in this respect, the FAIR acronym is helpful. It captures the mandate to publish data that are Findable, Accessible, Interoperable, and Reusable.

4.2.1 Reusable

Of these, the R (reuse) is easiest to realize, so we’ll start there. Because you cannot own data you collect, this means that in most cases everybody can reuse your data. Still, it helps to be explicit about this by attaching a license. The Open Database License (ODbL) is a good choice if you don’t want to dive into this.

4.2.2 Interoperable

The I (interoperable) is hardest for psychology. It has a very minimal interpretation, which just means that the data has to be stored in files in open formats (see the Accessible section); but true interoperability requires that the data set can be understood by machines. This requires unique identifiers for columns that can be widely understood by machines (e.g. linked back to constructs). Because this is not yet feasible for most psychological data, in practice you can ignore this for now, which makes it easy, too 😬 If you would like to work on this, see the link to the blog post in the Findable section.

4.2.3 Accessible

To be accessible, the data has to be stored in a file format that is, ideally, an open format (also see Chapter 2).

4.2.4 Findable

To make data findable, it has to be deposited in a well-known repository. For example, the Open Science Framework (https://osf.io) can serve this purpose.

Ideally, data are also deposited along with relevant metadata, and in a repository that uses these metadata for indexing the data.
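What such metadata could look like depends on the repository. As a minimal sketch, and not following any particular metadata standard, the Python code below assembles a small description of a data set and serializes it as JSON, an open format. The function name `build_metadata`, the field names, and the example values are hypothetical.

```python
import json


def build_metadata(title, license_name, variables):
    """Assemble a simple description of a data set and its variables."""
    return {
        "title": title,
        "license": license_name,
        "variables": [
            {"name": name, "description": description}
            for name, description in variables.items()
        ],
    }


# Hypothetical data set description (illustrative values only)
metadata = build_metadata(
    title="Example questionnaire data",
    license_name="ODbL",
    variables={
        "age_category": "Age in decade-wide categories, e.g. '60-69'",
        "location": "Self-reported type of residence: 'rural' or 'urban'",
    },
)

# Serialize to JSON so the description can be stored next to the data file
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Saving this text in a file next to the data file gives both human readers and a repository's indexing something to work with, even if the description is not yet semantic.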

Finally, to make data really findable, these metadata should be semantic, but that is as yet out of reach for much of social science research (although, see this blog post for recent developments in this direction).