January 1, 2026
At Samsung Electronics ("Samsung"), we are committed to advancing our ecosystem of products and services through the integration of generative AI technology. Below, we provide a high-level summary of the datasets used in the development and training of our generative AI systems. Please note that this summary does not cover generative AI systems or services that may be provided through third parties.
(1) Sources of the datasets
Samsung uses a mix of publicly available data, data licensed or purchased from third parties, and synthetic data to train its generative AI.
- Publicly Available Data: Open-source materials and publicly accessible information.
- Licensed/Purchased Data: Content acquired through agreements with third-party providers.
- Synthetic Data: Artificially generated data designed to enhance model performance.
(2) How the datasets further the intended purpose of the generative AI
The datasets are used to train and develop Samsung's generative AI models, which are integrated across our product ecosystem. These models support on-device functionalities such as:
- Language processing and translation.
- Image editing and enhancement.
- Text summarization and analysis.
(3) Number of data points
Samsung’s generative AI has been trained on a corpus of data comprising approximately 1 trillion data points.
(4) Types of data points
Samsung uses both labeled and unlabeled data. Labels include image descriptions, transcriptions, and language, among others. Unlabeled data may be annotated by reviewers for enhanced training.
(5) Whether the datasets include data protected by copyright, trademark, or patent
The datasets may include data subject to copyright, trademark, or patent protection that is licensed to Samsung or made available under applicable law.
(6) Whether the datasets were purchased or licensed
The datasets include data licensed or purchased from third parties.
(7) Whether the datasets include personal information
Some of the datasets may include personal information. However, in accordance with law and our policies, and where applicable and/or appropriate, we utilize measures that help reduce or minimize personal information in the datasets.
(8) Whether the datasets include aggregated consumer information
The datasets may include aggregated consumer information.
(9) Whether there was any cleaning, processing, or modification to the datasets
To maintain the integrity and safety of our datasets, Samsung undertakes comprehensive cleaning, preprocessing, and modification processes. These include:
- Filtering inappropriate content.
- Aggregating or de-identifying data.
- Implementing robust safeguards.
(10) The time period during which the datasets were collected
Samsung has been collecting datasets since 2023 and continues to do so.
(11) The dates the datasets were first used
The datasets were first used in 2023 during the development phase of Samsung’s generative AI systems and services.
(12) Whether the generative AI uses synthetic data
Synthetic data is used to improve and enhance Samsung’s generative AI systems and services.