Key Takeaways
- GDPR Articles 13 and 14 require privacy notices to specify AI training as a distinct purpose. Generic 'service improvement' language is not sufficient and creates enforcement exposure.
- Legitimate interest as a lawful basis for AI training requires a documented balancing test. Controllers must demonstrate that AI development interests are not overridden by data subject rights.
- Privacy notices for AI training data must specify retention periods tied to the training lifecycle, not just general data retention policy. Voice and biometric data trigger additional Article 9 obligations.
- Data subjects must be informed of their right to erasure and how it applies to training datasets. Controllers must have a technical procedure for acting on erasure requests before collection begins.
- Vague purpose descriptions such as 'to improve our products' have been cited in supervisory authority enforcement decisions. Specificity about AI training purposes is both a legal and a risk management requirement.
Most privacy notices were not written with AI training in mind. When regulators audit an AI provider’s data collection practices, the first document they examine is the privacy notice that was in force at the point of collection. What they find there, or fail to find, determines whether the entire training dataset carries a legal basis problem.
Understanding what GDPR privacy notices for AI use cases must contain, under Articles 13 and 14, is not a legal formality. It is the foundation of a defensible AI training data pipeline.
This guidance is designed to help compliance officers and data protection officers understand the requirements. It does not constitute legal advice. Consult your DPO and, where appropriate, your supervisory authority before finalising your privacy notice approach.
What GDPR Articles 13 and 14 actually require
Article 13 applies when personal data is collected directly from the data subject. Article 14 applies when data is obtained from a third party rather than from the individual directly. Both articles establish information obligations. The difference is timing: Article 13 requires disclosure at the time of collection, while Article 14 requires it within one month of obtaining the data (or at the point of first contact with the data subject, if contact occurs within that window).
For AI training data, the core obligations under both articles are the same. The controller must:
- identify itself and provide contact details
- name the data protection officer, if one has been appointed
- specify the purposes of processing and the lawful basis for each purpose
- disclose recipients or categories of recipients
- specify retention periods
- inform data subjects of their rights, including access, rectification, erasure, restriction, and portability
- where legitimate interest is the lawful basis, disclose the specific legitimate interest being pursued
None of these requirements are new. What changes when AI training enters the picture is the level of specificity required to satisfy each of them.
Purpose specification: where most privacy notices fail for AI training data
The most common failure point in privacy notices for AI systems is the purpose description. Controllers routinely describe AI training under general headings such as “to improve our services”, “to develop new features”, or “to conduct research and development”. Supervisory authorities, including the Irish Data Protection Commission and the French CNIL, have found that these descriptions do not satisfy the specificity requirement of Article 13(1)(c).
A compliant privacy notice for AI training purposes must state, clearly and plainly, that personal data will be used to train AI models. The description should identify the type of AI system being trained, such as a speech recognition model or a natural language processing system. Where the trained models will be used in products or licensed to third parties, the notice should say so.
Practical purpose descriptions look like this: “We collect voice recordings to train automatic speech recognition models that are used in our voice AI products. The models learn from the acoustic patterns and linguistic content of your recordings. Trained models may be incorporated into products made available to enterprise customers.”
That level of specificity may feel uncomfortable from a commercial perspective. However, vague purpose descriptions create a different kind of risk: they expose the controller to challenge on whether any valid lawful basis existed at the time of collection. Enforcement actions are significantly harder to defend when the original notice did not name AI training as a purpose.
Lawful basis: consent versus legitimate interest for AI training
Two lawful bases are commonly relied upon for AI training data collection: consent under Article 6(1)(a) and legitimate interest under Article 6(1)(f). Each carries different obligations and different risks.
Consent for AI training
Consent must be freely given, specific, informed, and unambiguous. For AI training purposes, this means the consent request must name AI training explicitly and must not be bundled with other service terms. Pre-ticked boxes and blanket agreement to terms of service do not constitute valid consent.
Consent-based collection gives data subjects clear control, simplifies the legal basis documentation, and provides a strong foundation for claims of GDPR compliance. The cost is that consent can be withdrawn, and withdrawal must trigger erasure of the relevant data from the training pipeline. Controllers must have a technical architecture that supports this before offering consent as the mechanism.
Legitimate interest for AI training
Legitimate interest requires a documented legitimate interest assessment covering three steps: identifying the specific interest, assessing whether processing is necessary to pursue it, and conducting a balancing test between the controller’s interest and the data subject’s rights.
The European Data Protection Board’s guidance on legitimate interest indicates that commercial interests, including AI development, can in principle constitute a legitimate interest. What the assessment must demonstrate is that data subjects would reasonably expect their data to be used for AI training in the context in which it was collected, and that the processing does not override their fundamental rights.
Legitimate interest is harder to establish for novel AI training purposes where data subjects would not reasonably anticipate that use. Controllers relying on legitimate interest for AI training should document the assessment carefully and have it reviewed by a qualified DPO before collection begins.
Retention periods: the overlooked requirement
Article 13(2)(a) requires controllers to specify the period for which personal data will be stored, or the criteria used to determine that period. For AI training data, controllers frequently cite a general data retention policy rather than a retention period specific to the training purpose.
A compliant privacy notice for AI training data should specify:
- How long the raw data will be retained before deletion or anonymisation
- How long derived models or embeddings trained on the data will be retained
- Whether the data will be deleted after training or retained for retraining purposes
- What triggers deletion, whether a fixed schedule or project completion
These are distinct questions. Raw training data and a model trained on that data are different assets with different retention implications. A controller that deletes the raw audio but retains an embedding containing identifiable vocal characteristics may still be processing personal data. The privacy notice should be explicit about this distinction.
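One way to make this distinction operational is to attach a separate retention rule to each asset type, with its own clock and trigger, instead of a single policy line. The sketch below illustrates the idea; the class and field names are hypothetical, not taken from any particular framework:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class RetentionRule:
    """One retention rule per asset type, as the notice should disclose."""
    asset_type: str   # e.g. "raw_audio", "embedding", "trained_model"
    retain_days: int  # fixed retention window in days
    trigger: str      # what starts the clock: "collection" or "training_complete"

    def deletion_due(self, start: date) -> date:
        # Deletion date is computed from the trigger event, not a blanket policy.
        return start + timedelta(days=self.retain_days)

# Raw recordings and derived artefacts carry separate rules with separate triggers.
schedule = [
    RetentionRule("raw_audio", 365, "collection"),
    RetentionRule("embedding", 730, "training_complete"),
]
due = schedule[0].deletion_due(date(2025, 1, 1))
```

A schedule like this maps directly onto the Article 13(2)(a) requirement to state either the period or the criteria used to determine it: each entry names the asset, the window, and the triggering event.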
Data subject rights in AI training contexts
Privacy notices must inform data subjects of their rights. For AI training data, three rights require particular attention.
The right of access under Article 15 means data subjects can request confirmation that their data is being processed and obtain a copy. Controllers with large training datasets must have a search and retrieval capability to respond to access requests within the one-month deadline set by Article 12.
The right to erasure under Article 17 is the most operationally demanding right for AI controllers. Data subjects can request deletion of their data when the data is no longer necessary for the original purpose, when consent is withdrawn, or when the processing was unlawful. Controllers must be able to identify and remove individual contributions from training datasets. Controllers who cannot demonstrate this capability before collection begins may find that their chosen lawful basis is not defensible.
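The capability to identify and remove individual contributions usually comes down to indexing every training example by a pseudonymous contributor identifier at ingestion time. A minimal sketch of that pattern, using illustrative names rather than any real library:

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    example_id: str
    contributor_id: str  # pseudonymous data subject identifier captured at collection
    payload_path: str    # location of the underlying recording or text

@dataclass
class ErasableDataset:
    """Training dataset indexed by contributor, so an Article 17 request
    maps directly onto a deletable set of examples."""
    _by_contributor: dict = field(default_factory=dict)

    def add(self, ex: Example) -> None:
        self._by_contributor.setdefault(ex.contributor_id, []).append(ex)

    def erase(self, contributor_id: str) -> list:
        """Remove and return all examples from one data subject. The caller
        must also delete the referenced payload files and log the erasure."""
        return self._by_contributor.pop(contributor_id, [])

ds = ErasableDataset()
ds.add(Example("ex-1", "subj-42", "/data/rec1.wav"))
ds.add(Example("ex-2", "subj-42", "/data/rec2.wav"))
ds.add(Example("ex-3", "subj-07", "/data/rec3.wav"))
removed = ds.erase("subj-42")
```

The design point is that the index exists from the first collection event onward: retrofitting contributor-level traceability onto an unindexed corpus is far harder than building it in, which is why the text above ties this capability to the defensibility of the lawful basis.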
The right to object under Article 21 applies where legitimate interest is the lawful basis. Data subjects can object to processing on grounds relating to their particular situation. Controllers must cease processing the objecting individual’s data unless the controller can demonstrate compelling legitimate grounds that override the individual’s interests.
The privacy notice must describe how data subjects can exercise each of these rights and the timeframe for controller response.
What a compliant GDPR privacy notice structure looks like
A privacy notice for AI training data collection should follow a clear structure. The following elements are required:
Controller identity and contact details. Full legal name, registered address, and email or phone for privacy queries.
DPO contact. If a DPO has been appointed, their contact details are mandatory. Controllers who are required to appoint a DPO but have not done so face a compliance gap separate from the notice content itself.
Processing purposes and lawful basis, stated per purpose. Each distinct purpose should be listed with its associated lawful basis. AI training should not be grouped with analytics or product development under a single entry.
Recipients and processors. Any organisation that will receive the data, including cloud infrastructure providers, annotation vendors, and sub-processors in the training pipeline. The notice can list categories of recipients rather than named organisations, but categories must be specific enough to be meaningful.
International transfers. If data will be processed outside the EEA, the transfer mechanism must be named. Standard Contractual Clauses, adequacy decisions, and Binding Corporate Rules each have different documentation requirements.
Retention periods. Specific to each processing purpose, including the distinction between raw data retention and model or embedding retention.
Data subject rights. Each applicable right listed with the mechanism and timeframe for exercising it.
Right to lodge a complaint. Data subjects must be informed of their right to complain to a supervisory authority. The notice should name the lead supervisory authority for the controller.
Automated decision-making. If training data feeds a system that makes automated decisions with significant effects, Article 22 obligations must be addressed.
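The per-purpose discipline above, one entry per purpose, each with its own lawful basis and retention criteria, can be enforced internally with a simple registry that rejects malformed entries. A hedged sketch; the function and field names are illustrative:

```python
# The six lawful bases of GDPR Article 6(1).
LAWFUL_BASES = {
    "consent", "contract", "legal_obligation",
    "vital_interests", "public_task", "legitimate_interest",
}

purpose_register: list[dict] = []

def register_purpose(purpose: str, lawful_basis: str, retention: str) -> None:
    """Each processing purpose gets its own entry with its own basis and
    retention criteria. AI training is never folded into 'analytics'."""
    if lawful_basis not in LAWFUL_BASES:
        raise ValueError(f"unknown lawful basis: {lawful_basis}")
    purpose_register.append(
        {"purpose": purpose, "lawful_basis": lawful_basis, "retention": retention}
    )

register_purpose(
    "Training automatic speech recognition models used in voice AI products",
    "consent",
    "Raw recordings deleted 12 months after collection; see retention schedule",
)
```

A register in this shape also doubles as the source of truth when drafting the notice itself: if a purpose cannot be written down with a single lawful basis and a concrete retention criterion, it is not yet specific enough to publish.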
Common mistakes that create enforcement exposure
Four patterns appear repeatedly in privacy notices that have attracted regulatory scrutiny or have created legal challenges for AI controllers.
Vague purpose descriptions that bundle AI training under general improvement language. This has been the basis for enforcement action in multiple European jurisdictions.
Failure to name AI training as a purpose at the time of collection, followed by a later attempt to claim the existing data can be used for a new AI purpose. Repurposing requires a compatibility assessment under Article 6(4) and, in practice, usually requires fresh consent or a new lawful basis.
Retention periods that are copied from a general data retention policy without considering the specific dynamics of training pipelines. A general “we retain data for 3 years” statement does not address the question of when trained models are deleted or what happens to embeddings.
Missing or inadequate erasure procedures. Controllers that collect data for AI training without first building a technical capability to act on erasure requests are exposing themselves to enforcement action from the first collection event.
YPAI’s approach to GDPR-compliant data collection
YPAI’s speech data collection uses consent-first collection for all contributors. Contributors are informed of the specific AI training use cases their recordings will be applied to before any recording takes place. Consent is granular and use-case specific: a contributor consenting to automatic speech recognition training is not consenting to voice biometric identification.
YPAI maintains right-to-erasure-ready data architecture, meaning individual contributor recordings can be traced and removed from delivered datasets on request. No synthetic data is mixed into corpora, which means lineage from original consent to delivered data is clean and auditable. Collection is EEA-only, with data residency maintained in the EEA throughout the collection, processing, and delivery pipeline.
For organisations building AI systems that require EU speech training data, this architecture is designed to be compatible with the Article 13/14 obligations described in this guide.
For more detail on how GDPR applies to speech data collection specifically, see our GDPR-compliant speech data collection guide for Europe. For the interaction with EU AI Act obligations on high-risk AI training data, see EU AI Act high-risk AI training data requirements and EU AI Act Article 10 requirements for speech data vendors.
Getting started
If your current privacy notice uses generic improvement language to cover AI training, the first step is a purpose audit: list every AI system being trained and confirm that each one has an explicit, named purpose in the active privacy notice.
If your organisation is building a new AI training data collection pipeline, the privacy notice should be drafted and reviewed before the first collection event, not after. Retroactive notice amendment does not cure a lawful basis problem at the point of original collection.
Consult your DPO to assess whether your current notices satisfy the specificity requirements described above, and to design an erasure procedure that is technically implementable before collection begins. If you are procuring training data from a third party, review the data provider’s privacy notices and collection documentation to verify that AI training was a named purpose at the point of original collection.
To discuss how YPAI’s consent-first collection and erasure-ready data architecture can support your compliance requirements, contact our data team.
Sources:
- GDPR Article 13 - Information to be provided where personal data are collected from the data subject (EUR-Lex)
- GDPR Article 14 - Information to be provided where personal data have not been obtained from the data subject (GDPR-info.eu)
- GDPR Article 17 - Right to erasure (GDPR-info.eu)
- EDPB Guidelines 1/2024 on processing of personal data based on Article 6(1)(f) GDPR (European Data Protection Board)
- CNIL enforcement action on AI training transparency (Commission Nationale de l’Informatique et des Libertés)
- EU AI Act Article 10 - Data and data governance (artificialintelligenceact.eu)