Data management for AI systems

Using Saidot, you will be able to:

  • Create a catalogue of the datasets used to train, test, validate or operate the AI systems.

  • Focus the dataset documentation on the information that is relevant for AI governance.

  • Link data-related documentation from other data catalogs or compliance tools to the dataset.

  • Identify data-related risks and determine appropriate mitigations.

  • Comply with AI standards and regulations regarding AI system data management.

Data management is an important step in AI Governance and Lifecycle Management. In the context of AI systems, data management means managing the accuracy, relevance, representativeness and completeness of the data used to train, test, validate or operate AI systems. The lawfulness and ethics of the data used for AI should also be considered.

The management of data used in AI systems requires understanding of:

  • How the data has been collected and processed, and the context and cost of its use

  • What kind of risks are identified related to data security, privacy, misuse, bias or loss of business

  • What kind of legal requirements, contractual obligations or commercial interests need to be considered

  • What kind of data management processes and risk mitigations have been implemented

Dataset Card

Purposeful and transparent dataset documentation is one way to follow AI Governance best practices. Datasets can be documented and linked to AI systems as Dataset Cards, a concept established and explored in a wide variety of research. A clear and thorough understanding of a dataset’s origins, development, intent, ethical considerations and evolution is a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains.

Dataset Cards are recommended to be:

  • Accessible by a variety of users with different data proficiency levels

  • Comparable to one another

  • Easy to create and keep updated as the data lifecycle evolves

  • Assigned clearly to the most suitable individuals

  • Supported by clear descriptions and justifications

The basic information of a dataset should include the dataset owner, dataset modality and dataset subject. Other technical details to be documented are the data source, version and license.

| Dataset documentation | Description |
| --- | --- |
| Owner | Identifies who owns the dataset. A dataset can be owned by the company, by a stakeholder in the value chain such as a partner or a customer, or by a third-party data or AI system provider. |
| Dataset owner | Identifies the owner of the dataset, who can be a data steward or the owner of the system providing the data. |
| Modality | Determines the format of the data, such as language, vision, audio, video, tabular, documents, time series, graph or geospatial data. |
| Subject | Determines the subject of the data, such as people, natural phenomena, places or objects, synthetic data, systems or products. |
| Data source | Determines the system that provides the data for the AI system. |
| License | Determines the rights and restrictions on the use of the data. |
| Version | Determines the data lifecycle phase and the version used for the AI system. |
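
As an illustration only, the documentation fields above could be modelled as a small record. The sketch below is not the Saidot data model or API; the class name `DatasetCard`, the `Modality` enum and the `missing_fields` helper are hypothetical names chosen for this example, and the sample values are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    # Modalities taken from the table above.
    LANGUAGE = "language"
    VISION = "vision"
    AUDIO = "audio"
    VIDEO = "video"
    TABULAR = "tabular"
    DOCUMENTS = "documents"
    TIME_SERIES = "time series"
    GRAPH = "graph"
    GEOSPATIAL = "geospatial"


@dataclass
class DatasetCard:
    """Hypothetical record holding the basic Dataset Card fields."""
    name: str
    owner: str
    modality: Modality
    subject: str
    data_source: str = ""
    license: str = ""
    version: str = ""

    def missing_fields(self):
        """Return the names of documentation fields that are still empty."""
        return [key for key, value in vars(self).items() if value == ""]


# Invented example values, for illustration only.
card = DatasetCard(
    name="Customer support tickets",
    owner="Example Corp",
    modality=Modality.LANGUAGE,
    subject="people",
    data_source="CRM export",
)
print(card.missing_fields())  # ['license', 'version']
```

A check like `missing_fields` is one simple way to keep cards comparable and up to date, since it shows which technical details still need to be documented.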

A dataset can be linked to one or several systems, which makes it possible to identify and manage datasets that are used by several AI systems. Datasets can have different roles in AI systems. Understanding the role of the data helps to identify risks, restrictions and limitations related to the AI system. For example, third-party data owners may enforce their rights to restrict the use of data for AI based on confidentiality, copyright law or other data-related regulation. Data can also be used to test for potential bias.

| Dataset role | Description |
| --- | --- |
| Train | Datasets used to train the AI systems and models. Training data is a set of examples used to fit the parameters of the model. |
| Test | Data used to test the model and compare it to other models. |
| Validate | Validation data is a specific dataset used to provide an unbiased evaluation of a model fit on the training dataset while tuning model parameters. |
| Operate | Data that is used to operate the AI system, for example to prompt or fine-tune the model so that it provides a more specific output. |
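
To make the idea of linking one dataset to several AI systems in different roles concrete, here is a minimal sketch. It assumes a simple list of (dataset, system, role) links; the dataset and system names are invented, and `systems_by_dataset` and `shared_datasets` are hypothetical helpers, not platform functions.

```python
from collections import defaultdict
from enum import Enum


class DatasetRole(Enum):
    # Roles taken from the table above.
    TRAIN = "train"
    TEST = "test"
    VALIDATE = "validate"
    OPERATE = "operate"


# Hypothetical links: (dataset, AI system, role).
links = [
    ("support-tickets-v2", "ticket-classifier", DatasetRole.TRAIN),
    ("support-tickets-v2", "reply-assistant", DatasetRole.OPERATE),
    ("holdout-tickets", "ticket-classifier", DatasetRole.TEST),
]


def systems_by_dataset(links):
    """Group the AI systems (and roles) that each dataset is linked to."""
    usage = defaultdict(list)
    for dataset, system, role in links:
        usage[dataset].append((system, role))
    return dict(usage)


def shared_datasets(links):
    """Return datasets linked to more than one distinct AI system."""
    usage = systems_by_dataset(links)
    return sorted(
        name
        for name, uses in usage.items()
        if len({system for system, _ in uses}) > 1
    )


print(shared_datasets(links))  # ['support-tickets-v2']
```

Datasets that appear in this list deserve extra attention, because a restriction or quality issue in them affects every linked system.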

The Dataset Card on our platform should also include information on what kind of personal data is used, if any, and the categories of personal data. The special categories of personal data are race, gender, ethnicity, culture, health, genetics, biometrics, socio-economic status, location, language, sex life or sexual orientation, religion, philosophical beliefs, politics, age, disability and trade union membership. The special category can also be unknown or not applicable.

Source: EU General Data Protection Regulation.

| Personal data values | Description |
| --- | --- |
| Identifiable personal data | A combination of personal data that can be used together to identify a particular person |
| Pseudonymised personal data | Data that has been encrypted, de-identified or pseudonymised but can be used to re-identify a person |
| Anonymised data | Personal data that has been rendered anonymous in such a way that the individual is no longer identifiable and anonymisation is irreversible |
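
The distinction between the three values can be expressed as a small check. This is a sketch under the definitions in the table above, not a legal test; `PersonalData` and `can_reidentify` are hypothetical names used only for this example.

```python
from enum import Enum


class PersonalData(Enum):
    # Values taken from the table above.
    IDENTIFIABLE = "identifiable personal data"
    PSEUDONYMISED = "pseudonymised personal data"
    ANONYMISED = "anonymised data"


def can_reidentify(value):
    """True when, per the table, the data could still be linked to a person.

    Pseudonymised data counts because it can be used to re-identify a
    person; anonymised data does not because anonymisation is irreversible.
    """
    return value in (PersonalData.IDENTIFIABLE, PersonalData.PSEUDONYMISED)


for value in PersonalData:
    print(value.value, "->", can_reidentify(value))
```

The point of the check is that pseudonymised data is grouped with identifiable data, not with anonymised data, when assessing privacy risk.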

The Dataset Card should provide information to the AI system owner who is analysing and identifying the AI system related risks. The description should include information on what the dataset contains and the reason for collecting the data. We also recommend providing details of data collection and data processing methods and processes, what is included and excluded from the data and how the dataset has been filtered.

In addition, we suggest linking any further information about the dataset that is stored in a separate data catalog, data quality testing tool, master data management tool or other compliance management tool. This information can be added as a link or attached as a document on our platform.

The Dataset Card documentation can be used to identify risks and limitations related to data quality, quantity and suitability. These risks can be mitigated with corrective actions in data collection, processing and quality management. A dataset may contain incomplete, inappropriate, biased or unrepresentative data. Datasets can also be the source of business risks, privacy risks or information security risks. Risks identified as part of the AI lifecycle management and AI Governance process in the Saidot Governance Platform can have data as their source and can be linked to the specific datasets. Read more about risk management here.
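
As a rough sketch of how data-related risks might be linked to specific datasets and tracked against mitigations, consider the following. The `DataRisk` record, the `unmitigated` helper and all sample values are assumptions made for illustration and do not describe how the Saidot Governance Platform stores risks.

```python
from dataclasses import dataclass


@dataclass
class DataRisk:
    """Hypothetical risk record whose source is a specific dataset."""
    dataset: str
    issue: str            # e.g. incomplete, inappropriate, biased
    mitigation: str = ""  # empty until a corrective action is defined


# Invented example risks, for illustration only.
risks = [
    DataRisk("support-tickets-v2", "biased", "rebalance sampling by region"),
    DataRisk("support-tickets-v2", "incomplete"),
    DataRisk("holdout-tickets", "not representative"),
]


def unmitigated(risks):
    """Return (dataset, issue) pairs that have no mitigation recorded."""
    return [(r.dataset, r.issue) for r in risks if not r.mitigation]


print(unmitigated(risks))
# [('support-tickets-v2', 'incomplete'), ('holdout-tickets', 'not representative')]
```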

Based on ISO/IEC 38507 and ISO/IEC 42001.
