
May 7, 2026 · Last updated on June 1, 2026

The Importance of Ethically Sourced Data in AI Training

Søren Raagard Jensen

Søren Raagard Jensen outlines the importance of employing ethically sourced data to train AI for surveillance applications, explores how errors can emerge from poor data practices and explains what the industry can do to ensure high-quality datasets.

In our Artificial Intelligence (AI)-powered world, data is the new oil, but to be useful it needs to be curated and structured. When you use any kind of advanced AI feature on security cameras, including AI-based object or audio detection and behaviour analysis, you are employing a tool that has been trained on huge datasets.

AI models cannot correctly identify an individual, classify a vehicle or detect a particular colour without first being trained on thousands of images and hours of video footage. Even relatively simple classification algorithms require tens of thousands of images to begin accurately classifying basic imagery. In video, where movement, direction, lighting and human variation are in play, those requirements expand significantly.

Video data is distinctive because, alongside imagery, it contains temporal and spatial information: it shows how environments change over time and how objects move. This contextual data is extremely valuable, but it also introduces additional variables that AI systems must learn to process effectively.

With the right training data, AI systems using video can learn to understand context, recognise patterns and predict outcomes.

Responsible sourcing

Developers of video-based AI models require large volumes of data. To be useful, that data must be responsibly and accurately sourced, categorised, cleaned, annotated and anonymised. Handling it this way also supports compliance with legislation such as the EU’s General Data Protection Regulation (GDPR).
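As a rough illustration of what those requirements can look like in practice, the sketch below gates clips on a handful of metadata checks before they enter a training set. It is a minimal example under stated assumptions: the `ClipRecord` fields and the `release_blockers` helper are hypothetical names chosen for this illustration, not part of any particular product or dataset format.

```python
from dataclasses import dataclass


@dataclass
class ClipRecord:
    """Metadata for one video clip (all field names are hypothetical)."""
    clip_id: str
    source: str       # where the footage came from
    licence: str      # licence or agreement covering its use
    anonymised: bool  # identifying details have been obscured
    annotated: bool   # labels have been attached and reviewed


def release_blockers(record: ClipRecord) -> list[str]:
    """Return the reasons a clip should not yet enter a training set."""
    blockers = []
    if not record.source.strip():
        blockers.append("missing source")
    if not record.licence.strip():
        blockers.append("missing licence")
    if not record.anonymised:
        blockers.append("not anonymised")
    if not record.annotated:
        blockers.append("not annotated")
    return blockers


# Made-up example records for illustration only.
clips = [
    ClipRecord("clip-001", "city partner A", "licence-2026-01", True, True),
    ClipRecord("clip-002", "", "licence-2026-01", False, True),
]

# Only clips with no outstanding blockers are released for training.
ready = [clip.clip_id for clip in clips if not release_blockers(clip)]
print(ready)
print(release_blockers(clips[1]))
```

Returning the list of reasons, rather than a simple pass or fail, keeps a traceable record of why a clip was held back.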

However, sourcing such data is costly, time-consuming and difficult to achieve at scale. Three key challenges emerge:

Data quality issues caused by poor cameras, weather conditions or low resolution

Insufficient diversity in datasets, leading to biased or incomplete training

Difficulty capturing rare or sensitive scenarios (such as unusual movement patterns or dangerous situations)
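The diversity challenge in particular lends itself to a simple check. The sketch below counts how often each capture condition appears in a dataset and flags those that fall below a minimum share. The condition names, the sample counts and the 10% threshold are assumptions made for illustration, not recommended values.

```python
from collections import Counter


def underrepresented(values, min_share=0.10):
    """Return categories whose share of the dataset falls below min_share."""
    counts = Counter(values)
    total = sum(counts.values())
    return sorted(
        category for category, n in counts.items() if n / total < min_share
    )


# Hypothetical capture condition recorded for each of 100 clips.
conditions = ["day"] * 70 + ["night"] * 22 + ["rain"] * 5 + ["fog"] * 3
print(underrepresented(conditions))
```

A category with no clips at all never shows up in the counts, so a fuller check would also compare the result against a list of conditions the dataset is expected to cover.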

There are also significant legal considerations. Dataset creation may infringe on privacy rights under legislation such as GDPR or the UK Data Protection Act. Additional concerns include consent, copyright and confidentiality. Emerging regulations may also require full transparency around dataset origins.

Accurate labelling

Handling large video datasets is resource-intensive. Storage, processing and management costs can escalate quickly. For AI to learn effectively, every object, event and behaviour in a video must be labelled accurately.

Doing this manually is extremely time-consuming, labour-intensive and prone to human error. Automated labelling can help, but human oversight is still essential for accuracy.
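One common way to combine automated labelling with human oversight is to accept machine-generated labels only when the model is confident, and to queue the rest for a person to check. The sketch below shows that routing step. The `AutoLabel` structure, the `split_for_review` helper and the 0.9 threshold are hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass
class AutoLabel:
    """One machine-generated label (field names are hypothetical)."""
    frame_id: int
    label: str
    confidence: float  # model score between 0 and 1


def split_for_review(labels, threshold=0.9):
    """Separate auto-labels into accepted ones and ones needing a human check."""
    accepted = [item for item in labels if item.confidence >= threshold]
    needs_review = [item for item in labels if item.confidence < threshold]
    return accepted, needs_review


# Made-up labels for illustration only.
labels = [
    AutoLabel(1, "car", 0.98),
    AutoLabel(2, "bicycle", 0.62),
    AutoLabel(3, "person", 0.91),
]
accepted, needs_review = split_for_review(labels)
print([item.frame_id for item in accepted])
print([item.frame_id for item in needs_review])
```

A high confidence score does not guarantee a correct label, so in practice a random sample of the accepted labels would typically be spot-checked as well.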

Using publicly sourced data introduces further risks. These datasets may include sensitive information, misinformation or biased content, resulting in flawed AI outputs. Inaccurate training data leads to poor decision-making and reduced trust in AI systems.

Complex use cases

Responsibly sourced datasets provide a solution to these challenges. One example is Project Hafnia by Milestone Systems, which demonstrates the potential of curated training data.

Key Facts: Project Hafnia	Details
Total labelled data points	569+ million
Video dataset size	150,000+ hours
Geographic scope	Europe and the United States
Data characteristics	Fully anonymised, traceable and licensed

Datasets are pre-labelled and curated, allowing developers to focus on model development rather than preparation. The project leverages NVIDIA tools for data preparation and Vision Language Models (VLMs), enabling large-scale and efficient AI training.

High-value work

Data preparation can account for up to 80% of AI development time. Access to high-quality datasets significantly reduces this burden, allowing developers to focus on building, deploying and refining AI models.

When trained correctly, AI systems deliver substantial business benefits across multiple domains:

Improved situational awareness and faster response times

Operational insights such as occupancy and traffic flow

Enhanced decision-making in retail, transport and infrastructure

Faster investigations through searchable video attributes

For example, AI can alert operators to unattended objects, detect boundary breaches or analyse customer flow patterns. These insights support planning, staffing and customer experience improvements.

Case study: Genoa

The city of Genoa is using AI-driven video analytics to support traffic management, safety and emergency response. Through Project Hafnia, the city gained access to high-quality, compliant datasets aligned with GDPR and the EU AI Act.

The initiative supports Vision Language Models capable of summarising video events automatically, improving efficiency in monitoring and response operations.

“AI is achieving extraordinary results, unthinkable until recently, and the research is in constant development.” — Andrea Sinisi, City of Genoa

Strong foundations

AI cannot advance without data, but the quality and origin of that data are critical. Poor or unethical data limits the effectiveness of AI systems and undermines trust.

Building AI on ethically sourced, high-quality and up-to-date datasets is essential if surveillance systems are to reliably interpret and respond to real-world scenarios.

Author: Søren Raagard Jensen, Executive Product Lead for Project Hafnia at Milestone Systems
