Data management for AI systems

Using Saidot, you will be able to:

Create a catalogue of the datasets used to train, test, validate or operate the AI systems.
Focus the dataset documentation on the information that is relevant for AI governance.
Link data-related documentation from other data catalogues or compliance tools to the dataset.
Identify data-related risks and determine appropriate mitigations.
Comply with AI standards and regulations regarding AI system data management.

Data management is an important step in AI Governance and Lifecycle Management. In the context of AI systems, data management refers to managing the accuracy, relevance, representativeness and completeness of the data used to train, test, validate or operate AI systems. In addition, the lawfulness and ethics of the data used for AI should be considered.

The management of data used in AI systems requires understanding of:

How the data has been collected and processed, and the context and cost of its use
What kinds of risks have been identified related to data security, privacy, misuse, bias or loss of business
What kind of legal requirements, contractual obligations or commercial interests need to be considered
What kind of data management processes and risk mitigations have been implemented

Dataset Card

Purposeful and transparent dataset documentation is one way to follow AI Governance best practices. Datasets can be documented and linked to AI systems as Dataset Cards, a concept established and explored in a wide variety of research. A clear and thorough understanding of a dataset’s origins, development, intent, ethical considerations and evolution is a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains.

Dataset Cards are recommended to be:

Accessible by a variety of users with different data proficiency levels
Comparable to one another
Easy to create and keep updated as the data lifecycle evolves
Assigned clearly to the most suitable individuals
Written with clear descriptions and justifications

The basic information about a dataset should include the dataset owner, modality and subject. Other technical details to document are the data source, version and license.

Dataset documentation	Description
Owner	Owner determines who owns the dataset. A dataset can be owned by the company, a stakeholder in the value chain such as a partner or a customer, or a third-party data or AI system provider.
Dataset owner	Dataset owner determines the owner of the dataset, who can be a data steward or the owner of the system providing the data
Modality	Dataset modality determines the format of the data, such as language, vision, audio, video, tabular, documents, time series, graph or geospatial data
Subject	Dataset subject determines the subject of the data such as people, natural phenomena, places or objects, synthetic data, systems or products
Data source	Data source determines the system that is providing the data for the AI system
License	License determines the rights and restrictions to use the data
Version	Version determines the data lifecycle phase and the version used for the AI system
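The documentation fields in the table above can be thought of as a simple record. The following is a minimal sketch, not the platform's data model: the `DatasetCard` class, the `missing_fields` helper and the example values are all hypothetical and exist only to illustrate how incomplete documentation could be spotted.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetCard:
    """Hypothetical record mirroring the documentation fields in the table above."""
    name: str
    owner: Optional[str] = None
    modality: Optional[str] = None
    subject: Optional[str] = None
    data_source: Optional[str] = None
    license: Optional[str] = None
    version: Optional[str] = None


def missing_fields(card: DatasetCard) -> list[str]:
    """Return the names of documentation fields that are still empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


# Example values are made up for illustration.
card = DatasetCard(
    name="support-tickets-2023",
    owner="Customer Care (internal)",
    modality="language",
    subject="people",
    license="internal use only",
)
print(missing_fields(card))  # ['data_source', 'version']
```

A check like this can help keep cards comparable to one another, since every card is measured against the same set of fields.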

A dataset can be linked to one or several AI systems, which makes it possible to identify and manage datasets that are used by several AI systems. Datasets can have different roles in AI systems. Understanding the role of the data helps to identify AI system related risks, restrictions and limitations. For example, third-party data owners may enforce their rights to restrict the use of data for AI based on confidentiality, copyright law and other data-related regulation. In another case, data may be used to test for potential bias.

Dataset role	Description
Train	Datasets used to train the AI systems and models. Training data is a set of examples used to fit the parameters of the model.
Test	Data used to test the model and compare to other models.
Validate	Validation data is a specific dataset used to provide an unbiased evaluation of a model fit on the training dataset while tuning model parameters
Operate	Data that is used to operate the AI system, such as data used to prompt or fine-tune the model to provide a more specific output
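The linking of one dataset to several AI systems, each with its own role, can be sketched as a list of (dataset, system, role) links. This is an illustrative sketch only: the `DatasetRole` enum follows the role table above, while the helper functions and all dataset and system names are hypothetical.

```python
from collections import defaultdict
from enum import Enum


class DatasetRole(Enum):
    TRAIN = "train"
    TEST = "test"
    VALIDATE = "validate"
    OPERATE = "operate"


# (dataset, AI system, role) links; all names are made up for illustration.
links = [
    ("support-tickets-2023", "ticket-router", DatasetRole.TRAIN),
    ("support-tickets-2023", "reply-assistant", DatasetRole.OPERATE),
    ("bias-probe-set", "ticket-router", DatasetRole.TEST),
]


def systems_by_dataset(links):
    """Group the linked AI systems (and the role played) under each dataset."""
    grouped = defaultdict(list)
    for dataset, system, role in links:
        grouped[dataset].append((system, role))
    return dict(grouped)


def shared_datasets(links):
    """Return the datasets that are linked to more than one distinct AI system."""
    grouped = systems_by_dataset(links)
    return sorted(
        name
        for name, uses in grouped.items()
        if len({system for system, _ in uses}) > 1
    )


print(shared_datasets(links))  # ['support-tickets-2023']
```

Keeping the role on the link rather than on the dataset reflects that the same dataset can train one system and operate another.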

The Dataset Card on our platform should also include information on what kind of personal data is used, if any, and the categories of personal data. The special categories of personal data are race, gender, ethnicity, culture, health, genetics, biometrics, socio-economic status, location, language, sex life or sexual orientation, religion, philosophical beliefs, politics, age, disability and trade union membership. The special category can also be unknown or not applicable.

Source: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). (2016). OJ L, 119. http://data.europa.eu/eli/reg/2016/679/2016-05-04/eng.

Personal data values	Description
Identifiable personal data	A combination of personal data that can be used together to identify a particular person
Pseudonymised personal data	Data that has been encrypted, de-identified or pseudonymised but can be used to re-identify a person
Anonymised data	Personal data that has been rendered anonymous in such a way that the individual is no longer identifiable and anonymisation is irreversible
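The three personal data values above differ in whether a person can still be identified. As a rough sketch, assuming a hypothetical `PersonalDataValue` enum and made-up dataset names, datasets whose data can still be traced back to a person could be flagged like this:

```python
from enum import Enum


class PersonalDataValue(Enum):
    IDENTIFIABLE = "Identifiable personal data"
    PSEUDONYMISED = "Pseudonymised personal data"
    ANONYMISED = "Anonymised data"


def can_reidentify(value: PersonalDataValue) -> bool:
    """True when, per the table above, a person can still be identified or re-identified."""
    return value in (PersonalDataValue.IDENTIFIABLE, PersonalDataValue.PSEUDONYMISED)


# Hypothetical dataset names mapped to their documented personal data value.
datasets = {
    "support-tickets-2023": PersonalDataValue.PSEUDONYMISED,
    "sensor-readings": PersonalDataValue.ANONYMISED,
    "customer-profiles": PersonalDataValue.IDENTIFIABLE,
}

flagged = sorted(name for name, value in datasets.items() if can_reidentify(value))
print(flagged)  # ['customer-profiles', 'support-tickets-2023']
```

Pseudonymised data is grouped with identifiable data here because, as the table notes, it can be used to re-identify a person.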

The Dataset Card should provide information to the AI system owner who is analysing and identifying the AI system related risks. The description should include information on what the dataset contains and the reason for collecting the data. We also recommend providing details of data collection and data processing methods and processes, what is included and excluded from the data and how the dataset has been filtered.

In addition, we suggest linking any further information about the dataset that is stored in a separate data catalogue, data quality testing tool, master data management tool or other compliance management tool. This information can be added as a link or provided as an attached document on our platform.

The Dataset Card documentation can be used to identify risks and limitations related to data quality, quantity and suitability. These risks can be mitigated with corrective actions in data collection, processing and quality management. Datasets may contain incomplete, inappropriate, biased or unrepresentative data. Datasets can also be the source of business risks, privacy risks or information security risks. Risks identified as part of the AI lifecycle management and AI Governance process in the Saidot Governance Platform can have data as their source and can be linked to the specific datasets. Read more about risk management here.

Source: Based on ISO/IEC 38507:2022, Information technology — Governance of IT — Governance implications of the use of artificial intelligence by organizations; ISO/IEC 42001:2023, Information technology — Artificial intelligence — Management system.
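Linking risks to the datasets that are their source can be illustrated with a small sketch. The `DataRisk` class, the `unmitigated` helper and the example risks below are hypothetical and are not taken from the Saidot Governance Platform. They only show how risks without a mitigation could be listed per dataset.

```python
from dataclasses import dataclass, field


@dataclass
class DataRisk:
    """Hypothetical risk record linked to the dataset that is its source."""
    title: str
    dataset: str
    category: str  # e.g. "data quality", "privacy", "information security", "business"
    mitigations: list[str] = field(default_factory=list)


# Example risks are made up for illustration.
risks = [
    DataRisk(
        "Under-represented customer segments",
        "support-tickets-2023",
        "data quality",
        ["Collect additional samples"],
    ),
    DataRisk(
        "Free-text fields may contain names",
        "support-tickets-2023",
        "privacy",
    ),
]


def unmitigated(risks: list[DataRisk], dataset: str) -> list[str]:
    """Return titles of risks for the given dataset that have no mitigation yet."""
    return [r.title for r in risks if r.dataset == dataset and not r.mitigations]


print(unmitigated(risks, "support-tickets-2023"))  # ['Free-text fields may contain names']
```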