Data classification

Lucas Eustache
Paris Dauphine University – PSL

Classification by size

One of the most popular classifications is by size. This classification has existed since the concept of Big Data came to the fore, popularized by the use of algorithms and machine learning.

  • Big data: Big data refers to large and complex datasets that exceed the capabilities of traditional data processing methods and require advanced techniques and technologies for storage, management, and analysis. It is characterized by the three V's: volume (the massive amount of data generated), velocity (the speed at which it is produced), and variety (the diverse formats and sources it comes from). Big data holds the potential to uncover valuable insights and drive decision-making in various domains, including business, science, and healthcare (Gandomi & Haider, 2015).

  • Small data: Small data refers to datasets of limited volume, i.e. a small number of observations described by a restricted number of variables and sharing a unified structure. These datasets are manageable and can easily be analyzed using traditional statistical methods (Kitchin & Lauriault, 2015).

Classification by license

License classification is based on two criteria: the ownership regime governing the data, and the conditions under which it can be reused. This classification is gaining in importance in the age of machine learning, because public data is targeted as the preferred resource for training algorithms.

  • Open data: Open data refers to data produced by public and private players and distributed in a structured manner under an open license guaranteeing free access and reuse by all, without technical, legal or financial restrictions (Data.gouv.fr).

  • Private data: Private data is defined in opposition to open data. It covers a continuous spectrum, from data that is freely and publicly accessible but subject to reuse restrictions, to data that is neither accessible nor reusable.

Classification by the subject

This classification is the only one fully recognized by the legislator. It was introduced by the GDPR, and forms the basis of the major distinction in terms of data processing.

  • Personal data: Personal data means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person (gdpr-info.eu).

  • Non-personal data: Non-personal data refers to two types of data: data that never concerned an identified or identifiable natural person, and data that was initially personal but has subsequently been rendered anonymous (eur-lex.europa.eu).

Classification by the structure

This classification is more technical in nature, stemming from the creation of SQL (Structured Query Language). The distinction is used to classify data according to its accessibility and processing flexibility.

  • Structured data: Structured data refers to organized, easily searchable data. A simple example is a table in which observations are rows and their characteristics (variables) are columns. Relational databases are another example of structured data.

  • Semi-structured data: Semi-structured data refers to data that has some form of structure (hierarchy) but does not follow a fixed pattern. It cannot be represented in matrix form. What distinguishes it from unstructured data is the presence of tags or metadata that describe and organize it in a very general way.

  • Unstructured data: Unstructured data refers to information that has no predefined pattern or organization. This data, which is often qualitative, cannot be represented in matrix form and is difficult to interpret using traditional tools. Unannotated text, for example, is unstructured data (Education, 2021).
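The three categories above can be illustrated with a minimal Python sketch. The sample records and the helper names `has_fixed_schema` and `describe_structure` are hypothetical and introduced only for illustration. The labelling rule is a rough heuristic under the definitions given here, not a standard test.

```python
import json

# Structured: every observation (row) has the same variables (columns).
structured_rows = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 41},
]

# Semi-structured: keys act as tags that describe the content, but
# records need not share the same fields or the same nesting depth.
semi_structured = json.loads("""
[
  {"id": 1, "name": "Alice", "contacts": {"email": "alice@example.org"}},
  {"id": 2, "name": "Bob", "tags": ["sensor", "beta"]}
]
""")

# Unstructured: free text with no tags and no schema.
unstructured = "Alice wrote to Bob about the sensor readings last week."


def has_fixed_schema(records):
    """Return True if every record is a flat dict with identical keys."""
    if not records:
        return False
    keys = set(records[0])
    return all(
        isinstance(r, dict)
        and set(r) == keys
        and not any(isinstance(v, (dict, list)) for v in r.values())
        for r in records
    )


def describe_structure(data):
    """Rough label (assumed heuristic): structured, semi-structured or unstructured."""
    if isinstance(data, list) and all(isinstance(r, dict) for r in data):
        return "structured" if has_fixed_schema(data) else "semi-structured"
    if isinstance(data, dict):
        return "semi-structured"
    return "unstructured"


print(describe_structure(structured_rows))
print(describe_structure(semi_structured))
print(describe_structure(unstructured))
```

Only the first sample fits a rows-and-columns matrix. The second carries descriptive keys but no fixed pattern, and the third has neither.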

Classification by the object

This is the only classification that does not cover all data. This thematic classification is useful in that it provides a basis for certain regulatory exemptions. These exemptions are based either on the use of a certain type of data, or on the very nature of the data collection.

  • Research data: Research data is defined by usage rather than by data type. It corresponds to data used for research purposes. When it includes personal data, it can benefit from certain exemptions under the GDPR.

  • IoT data: Internet of Things data corresponds to all the data collected by and for connected objects (connected watches, on-board sensors, etc.). This data makes up a significant part of big data, and is regulated mainly by the GDPR and the Data Act.

  • Public data: Public data refers to data collected and published by legal entities governed by public law, or by those delegated to carry out public service missions. This data is increasingly linked to government open data. It is regulated mainly by the PSI (public sector information) directives.

© 2025 GovRegPedia. All rights reserved.