DATA MINIMIZATION IN THE AGE OF BIG AI MODELS
‘Are you a thief?’ The message flashed across Tunde’s phone screen before he could even blink. He had stolen nothing from anyone, except perhaps from himself: the freedom to decide what remained private. A month earlier, Tunde had downloaded a loan app to obtain ₦5,000 in minutes. What he did not know was that behind the simple app was a large AI system excessively harvesting the data on his phone, including his contacts. By clicking ‘allow’, Tunde had unwittingly shared the names and phone numbers of his friends with an algorithm they had never met or consented to.
Humanity has gone through many ages of development with shifting notions of what is valuable. The Stone Age blessed those with tools, strength, and survival skills; the Agricultural Age favoured those who wielded land and labour; the Industrial Age rewarded those with energy and machinery; and the Information Age empowered those with information. In the era of Big AI, data is the new gold. It is an age of daily surveillance, where every action generates data that organizations race to collate in pursuit of ‘better accuracy’ and future profits.
In the early, formative stages of artificial intelligence, the question was ‘Can machines think?’ With the advancement of artificial intelligence, the question has shifted to ‘Should machines think efficiently at the expense of individual privacy and ethics?’
This essay navigates the principle of data minimization in an era driven by data-hungry Big AI models, where bigger data seems to mean better results.
The Era of Big AI
To understand the term ‘Big AI models’, we should first consider what AI truly means.
Artificial Intelligence (AI) is the science and engineering of making intelligent machines, especially intelligent programs. AI is created to mimic human intelligence in terms of learning, reasoning, problem solving, decision making, perception, creativity, understanding language and autonomy. To achieve this, AI needs training data. The generative and generalist intelligence capabilities of AI require training on large amounts of information. This is where the term ‘Big or Large AI models’ is introduced.
Big AI models are artificial intelligence systems that have been pre-trained on extremely large amounts of data and use deep neural networks with vast numbers of internal parameters to learn patterns and perform tasks with minimal additional training. Examples include large language models (GPT-4 by OpenAI, Claude 3 by Anthropic, Mistral, and PaLM 2 and Gemini by Google), large vision models (DINOv2 by Meta), multimodal models (GPT-4o by OpenAI, Gemini 1.5 Pro by Google, Claude 3 Opus) and domain-specific models trained for one field such as biology, law, robotics or finance (AlphaFold for biology, Med-PaLM 2 for medicine, BloombergGPT for finance, PrimsolGPT for law). By design, these models need and depend on large datasets, because more data helps them learn to generalize, predict and generate accurate outputs.
Big AI models have many benefits such as enhanced decision making and enhanced efficiency but do these benefits outweigh the right to privacy? This gives rise to the need for data minimization.
Historical and Legal Roots of Data Minimization
The search for the roots of data minimization can begin in no better place than one of the inalienable rights of every human being: the right to privacy, in which the data minimization principle is rooted. The right of every human to non-interference with their privacy, to control access to their personal life, information and activities, and to be free from intrusion or observation, is recognized and protected under various legal instruments, including the Universal Declaration of Human Rights, the African Union Convention on Cyber Security and Personal Data Protection, and the Constitution of the Federal Republic of Nigeria.
In the digital economy, there is a popular fear that when the product (an application or software) is free, the user is the product. Data is usually collected under the notion of improving user experience, showing relevant adverts or generally improving app performance and services. However, it is arguable that the majority of the data collected under this aegis is not core to app functionality and may be shared or sold to third parties. It is not just the collection of excessive data from users that breaches the principle of data minimization, but also its indefinite retention and future repurposing for surveillance or advertising.
With the digital shift of the late 20th century, a new type of concern raised the need for privacy protection online. Under the banner of data protection, individuals could also expect their sensitive data to be kept free from corruption, damage, loss and unauthorized access. Data minimization arose as a principle of data protection, aimed at reducing the data collected and enhancing user privacy.
One of the governing principles of data processing in Nigeria is that personal data may only be collected and processed for the specific, legitimate and lawful purpose consented to by the data subject; such data must be stored only for the period reasonably needed and, in most cases, cannot be transferred to any other person.
Data Minimization is simply making sure that data collected is only what is necessary for as long as is necessary. This means an AI model should be allowed to collect, use, and keep data only if such data is reasonably necessary for the specific purpose.
Various pieces of legislation enshrine this principle of data minimization, including the European Union’s General Data Protection Regulation (GDPR), the UK Data Protection Act, the Nigerian Data Protection Act, the Nigerian Data Protection Regulation (NDPR) and the NDPR Implementation Framework.
Data minimization promotes accountability from those collecting data and trust from users whose data is collected.
An average user like Tunde, who would like to use the services of a loan app, should not have to worry about supplying, or granting access to, data that is not directly needed for the loan transaction in the name of helping a Big AI model make a more accurate prediction of his creditworthiness. Whatever data is collected must be justifiably necessary for the purpose.
A loan app, whose main purpose should be assessing creditworthiness and issuing and recovering loans, does not need full contact list access, SMS and call logs, location tracking, or microphone or camera access to fulfil that purpose. If Tunde’s loan app had followed the principle of data minimization, there would have been no need for access to his contact list.
Implementing Data Minimization in Big AI Models
‘Bigger is better’ may not be the right approach to training large AI models. The quality of data matters more than the quantity obtained, and more value can often be extracted from smaller, well-curated datasets. Implementing the data minimization principle in Big AI models can therefore result in improved AI performance.
While the NDPR does not specifically use the term ‘data minimization’, we can see from the above that the purpose limitation and retention limitation stated in Article 2 imply that the data collected and stored should be minimized to only what is adequate for the purpose, and kept only for the period it remains relevant to that purpose.
Organizations can thus implement data minimization by defining the parameters for information collection. Data should only be collected after a particular purpose has been chosen for its collection, and that purpose must be stated so that users consent to providing only the data essential to it. Even after purpose and relevance are established, it remains crucial to audit the data processed by the Big AI model: there should be continuous evaluation of whether the collected data is still necessary, and regular tests can be conducted to determine which data actually improves the model’s performance. For instance, the loan app might initially have thought it required users’ contact lists to predict creditworthiness, but after a data minimization audit, it could remove contact list access. The model’s accuracy can be maintained despite this data reduction, since what was removed was not crucial to its operation or learning.
Data minimization can also be implemented by deciding on a retention period for each category of data collected; this process can then be automated. Organizations that use Big AI models should create internal policies, terms of service and privacy policies that state the retention period of all data collected. This is in line with the NDPR and the NDPR Implementation Framework.
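Automated retention enforcement might look like the following sketch. The category names, record structure and scheduling are assumptions made for illustration; the retention periods loosely echo the terminal situations discussed below, but an organization would substitute the periods from its own published policy.

```python
# Illustrative sketch of automated retention enforcement: each category of
# collected data carries its own retention period, and records past that
# period are purged on a scheduled run. Names and periods are hypothetical.
from datetime import datetime, timedelta

RETENTION = {
    "transaction_record": timedelta(days=6 * 365),  # contractual data
    "usage_log":          timedelta(days=3 * 365),  # platform activity
    "marketing_profile":  timedelta(days=90),       # short-lived by design
}

def purge_expired(records, now=None):
    """Keep only records still within their category's retention period."""
    now = now or datetime.now()
    kept = []
    for rec in records:
        limit = RETENTION.get(rec["category"])
        if limit is not None and now - rec["collected_at"] <= limit:
            kept.append(rec)
        # else: the record is deleted, or its category is unknown and
        # should never have been collected in the first place.
    return kept

records = [
    {"category": "usage_log", "collected_at": datetime(2020, 1, 1)},
    {"category": "usage_log", "collected_at": datetime(2025, 1, 1)},
]
remaining = purge_expired(records, now=datetime(2025, 6, 1))
# Only the 2025 record survives; the 2020 log exceeds its 3-year period.
```

Running such a job on a schedule turns the written retention policy into an enforced one, rather than a promise that depends on manual housekeeping.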
Where organizations that use Big AI models (data controllers) do not state a retention period, the NDPR Implementation Framework provides terminal situations at which the data must be destroyed: three years after the last active use of a digital platform; six years after the last transaction in a contractual agreement; or upon the request of a deceased data subject’s relative (on presentation of evidence of death) or of the data subject’s legal guardian.
The use of Privacy by Design may also help implement data minimization in Big AI models. Data minimization is improved when models do not track user behaviour and do not allow third parties to track it.
Data minimization can also be applied through federated learning, which ensures that sensitive data is not centralized but remains on the devices where it is generated; only model updates leave the device.
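The idea can be illustrated with a deliberately tiny model. In this sketch, each “device” holds its own private values, performs one gradient step locally, and sends back only an updated weight; the server averages the weights it receives and never sees the raw data. The toy model (a single parameter estimating a mean) is an assumption for clarity, not how production federated systems are built.

```python
# Illustrative federated-averaging round for a toy one-parameter model.
# Raw values never leave each "device"; only updated model weights are
# sent to the server and averaged.

def local_step(w, data, lr=0.5):
    # Gradient of the mean squared error (w - x)^2 over this device's data.
    grad = sum(2 * (w - x) for x in data) / len(data)
    return w - lr * grad

def federated_round(w, devices, lr=0.5):
    # Each device trains locally; the server only ever sees weights.
    updated = [local_step(w, data, lr) for data in devices]
    return sum(updated) / len(updated)

devices = [[1.0, 2.0], [3.0], [2.0, 4.0, 3.0]]  # private, on-device data
w = 0.0
for _ in range(20):
    w = federated_round(w, devices)
# w converges toward 2.5 without any raw value being centralized.
```

The same pattern scales up to deep networks: the quantity exchanged is a model update, so the sensitive training data itself is minimized out of the central pipeline.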
Synthetic data generation and differential privacy can also promote data minimization in Big AI models. Synthetic data generation involves creating artificial datasets that maintain the statistical properties of real data while avoiding the use of real personal information, generating entirely new profiles that still preserve insights. Differential privacy involves adding calibrated statistical noise to data or query results, protecting individual identities while preserving aggregate insights.
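A minimal differential-privacy sketch follows. The epsilon value and mock dataset are illustrative assumptions, and a real deployment would rely on a vetted library rather than hand-rolled noise; the point is only to show the shape of the technique: an aggregate is released with Laplace noise so that no single individual’s presence can be confidently inferred.

```python
# Illustrative Laplace mechanism: release an aggregate count with noise
# calibrated to a privacy budget epsilon. Not production-grade.
import random

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Add Laplace(0, sensitivity/epsilon) noise to a count query."""
    scale = sensitivity / epsilon
    # The difference of two iid exponential draws is Laplace-distributed.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# Example: how many users in a (mock) dataset defaulted on a loan?
defaults = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
released = noisy_count(sum(defaults), epsilon=1.0)
# 'released' is close to the true count of 4, but the noise means an
# observer cannot tell whether any one individual is in the data.
```

Smaller epsilon values add more noise and stronger privacy at the cost of accuracy, which is exactly the quality-versus-quantity trade-off the data minimization principle asks organizations to make deliberately.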
Data minimization reduces cloud storage costs: data that is not essential is never collected, and data that has fulfilled the purpose of its collection can be deleted. With only necessary data collected, it also becomes easier to improve the quality of responses generated by Big AI models.
Management costs are also reduced when data minimization is implemented. Keeping large amounts of data requires more tools to protect it and prevent breaches; reducing the data held thus reduces security and management costs.
Following data minimization principles also saves organizations that build or use Big AI models from compliance penalties and a negative business image. Users, clients and customers are more likely to trust Big AI models if they are sure their privacy is not being breached.
Future Outlook: Privacy in the Age of Big AI
Conflict of laws in data privacy may not be a serious issue in the future, since most data protection laws are currently moving in the same direction: protecting user privacy through the data minimization principle. In a digital age where the internet is one vast playing field offering access to a wide range of datasets, it is important that countries continue to collaborate to ensure that Big AI models are monitored in how they collect, retain and use data found online.
Implementing the data minimization principle in Big AI models cannot be achieved by IT specialists alone. It must be a collaboration between software engineers and the legal and compliance teams, to ensure that data is minimized at the beginning, in the middle and at the terminal date of each data collection process.
Machines have seemingly been granted the ability to think, but with the implementation of data minimization principles, it need not be at the expense of trust, privacy and ethics. Tunde can trust that when he signs up for a loan app that utilizes Big AI models, only what is reasonably necessary will be obtained and kept.
BIBLIOGRAPHY
Primary Sources
Constitution of the Federal Republic of Nigeria 1999 (as amended)
Nigerian Data Protection Act 2023
Nigerian Data Protection Regulation 2019 (issued by National Information Technology Development Agency, NITDA).
Nigerian Data Protection Regulation (NDPR) Implementation Framework 2020, (issued by National Information Technology Development Agency).
African Union Convention on Cybersecurity and Personal Data Protection (Adopted 27 June 2014, Malabo, Equatorial Guinea)
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation) [2016] OJL119/1.
Universal Declaration of Human Rights (adopted 10 December 1948, UNGA Res 217 A(III)).
Secondary Sources
Turing A M, ‘Computing Machinery and Intelligence’ (1950) 59(236) Mind 433-460.
Stryker C and Kavlakoglu E, ‘What is artificial intelligence (AI)?’ (IBM, last updated 23 October 2025) <https://www.ibm.com/think/topics/artificial-intelligence>.
Tu X, He Z, Huang Y, Zhang Z, Yang M and Zhao J, ‘An overview of large AI models and their applications’ (2024) 2 Visual Intelligence 34 <https://doi.org/10.1007/s44267-024-00065-8>.
Yadav S, ‘What is Data Minimization and Why It’s Important’ (BitRaser, last updated 16 January 2025) <https://www.bitraser.com/blog/what-is-data-minimization/?srsltid=AfmBOooLgSLxJccZd1Ps6ebu0Gd7h44W3d_djb7nvBYk_NwfsHeLcR8->.