Before onboarding your content into our system, we perform a quick but thorough validation to ensure everything meets our standards. This includes checking the BDDF against our predefined JSON schema to identify structural issues and confirm compatibility with our processing pipeline. Beyond schema validation, we also review the content for logical consistency, data quality, and correct rendering after processing. Below are the steps we typically take - depending on your content type we might not go through all of them, but in most cases these are the checks we perform:
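Much of the structural part can be reproduced on your side before submission. Here is a minimal sketch, assuming the jsonschema Python package, a local copy of the BDDF JSON schema, and that the schema targets a recent JSON Schema draft; the file names are placeholders.

import json
from jsonschema import Draft202012Validator

# Load the BDDF JSON schema and a sample document (placeholder file names).
with open("bddf_schema.json") as f:
    schema = json.load(f)
with open("sample_document.json") as f:
    document = json.load(f)

# Report every structural violation instead of stopping at the first one.
validator = Draft202012Validator(schema)
for error in validator.iter_errors(document):
    path = "/".join(str(p) for p in error.path)
    print(f"{path or '<root>'}: {error.message}")

Running a check like this against your samples before delivery catches most schema-level issues early.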

Initial Checks

Here’s a quick look at what we check upfront, often before any work on preparing the content files begins:
We want to ensure that your metadata is accurate, complete, and compatible with our systems.
What We Do
  • Check how your metadata lines up with ours - sometimes your fields and ours have the same name but mean different things in our respective systems (see the mapping sketch below)
  • Make sure all the required metadata is there and successfully ingested into our system
  • Look out for any metadata that didn’t get mapped:
    • See if it can be matched to something we already have
    • If not, and it’s useful, we can talk about adding a new field to the BDDF
What We May Ask From You
  • Information on how you handle updates, corrections, or deletions
  • Metadata documentation
  • Clarification for any unusual or custom fields
Outcome
  • Correct mapping of metadata fields
  • Valuable unmapped fields retained under custom
  • Identification of new field requirements
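To illustrate the mapping step, here is a hypothetical sketch; the provider field names and the mapping itself are assumptions here and are agreed per provider in practice.

# Hypothetical provider-to-BDDF field mapping; real mappings are agreed per provider.
FIELD_MAP = {
    "lang": "language",       # same concept, different name
    "headline": "title",
    "pub_date": "published",
}

def map_metadata(provider_metadata: dict) -> dict:
    mapped, custom = {}, {}
    for field, value in provider_metadata.items():
        if field in FIELD_MAP:
            mapped[FIELD_MAP[field]] = value
        else:
            custom[field] = value  # valuable unmapped fields are retained under custom
    if custom:
        mapped["custom"] = custom
    return mapped

print(map_metadata({"lang": "en", "headline": "Example", "region": "ES"}))
# {'language': 'en', 'title': 'Example', 'custom': {'region': 'ES'}}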
We aim to identify any special requirements or custom work needed to properly ingest your content. Don’t worry - these things don’t put us off; we appreciate the honesty upfront. Below are some examples from past projects to help guide you.
  • Expiring Links to Original Documents: The links you provide to the original documents may expire. We need to be aware of this because we want users to have continuous access to the original files within BigData. This requires custom work to download and store the files in our database, ensuring they remain accessible at all times (see the sketch after these examples).
  • Providing Associated Files (e.g., PDFs, Audio): If you can provide associated files such as original PDFs or audio files, we will need special infrastructure to handle and integrate these additional assets properly.
  • Timestamps in the Future: When using large language models (LLMs), hallucinations can occur, sometimes resulting in timestamps that are set in the future. While this might be acceptable from the provider’s perspective, accurate timestamps are critical for the BigData experience. It’s important for us to understand this upfront so we can develop appropriate solutions.
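As an illustration of the expiring-links case, here is a rough sketch of the download-and-store work, assuming the requests package; store_file is a stand-in for our actual database write.

import requests

def store_file(document_id: str, payload: bytes) -> None:
    # Stand-in for persistent storage - in practice this is our database,
    # so the original stays accessible within BigData after the link expires.
    with open(f"{document_id}.bin", "wb") as f:
        f.write(payload)

def archive_original(url: str, document_id: str) -> None:
    # Fetch the original file before the provider link expires.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    store_file(document_id, response.content)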
To ensure smooth ongoing operations, we need to understand how you handle non-functional aspects such as latency, scalability, and downtime. Specifically, we’d like to know:
  • How will we be notified if there are issues or downtime on your side?
  • Do you have an automated notification or monitoring system in place?
  • Will we receive regular status reports, or will notifications be sent via email?
  • What is your expected response time for resolving such issues?
Having clarity on these points will help us prepare and maintain a reliable BigData experience.
We’d also like to ensure your content supports meaningful real-world applications for our products and clients.
What We Do
  • Review potential use cases for your content
  • Define examples such as:
    • Alerts for portfolio company mentions in podcasts
    • Monitoring ESG-related discussions
    • Automated summaries for company mentions
What We May Ask From You
  • Examples of how your content is typically used
  • Insights into content value for end-users

Content Sample Checks

Once we receive a content sample from you, we won’t just check whether it fits our schema definition - we will also do a basic content evaluation:
We evaluate the actual content you provide and ensure it meets quality and usability expectations.
What We Do
  • Content Review: Check readability, formatting, and logical structure
  • Accuracy & Reliability: Verify that information is correct and consistent
  • Language & Localization: Ensure proper language, grammar, and style
What We May Ask From You
  • Additional content samples, if needed, to assess consistency
What It Means For You
  • Confirms that content aligns with our quality standards and expectations
  • Identifies potential improvements to maximize value for end-users

Real-time Data Checks

After real-time data starts flowing, we will perform the following checks:
We assess how your content performs in real-time ingestion and verify quality, completeness, and reliability.
What We Do
  • We review sample content against the criteria below (a few of them are sketched in code after the table):

Criteria                 | What We Check                             | Expectation
Volume                   | Matches expected ingestion volume         | Steady, statistically reliable coverage
Topic & Description      | Main theme is clear and relevant          | Clear and concise description
Readability              | Content is structured and easy to read    | Proper English, logical paragraphs
Formatting               | Matches original source                   | No missing sections, no ad/sponsor text
Duplication              | No repeated content                       | Unique stories
Language                 | Correct language tagging                  | English (or specified language)
Timeliness               | Publication dates match reality           | Accurate, no future dates (rare exceptions possible)
Completeness             | Essential fields included                 | Title, body, timestamp always present
Contextual Quality       | Content makes sense independently         | Full context provided (e.g., speaker name in interviews)
Entity & Event Detection | Entities/events detected and relevant     | Matches expected types/volumes
Content Enrichment       | Potential for improving content usability | Identify opportunities for enrichment, tagging, or additional structure
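To make a few of these criteria concrete, here is a minimal sketch of per-document checks for completeness, timeliness, and duplication, using the field layout from the examples later on this page; the helper is an assumption, not our actual pipeline code.

import hashlib
from datetime import datetime, timezone

seen_bodies: set[str] = set()

def check_document(doc: dict) -> list[str]:
    issues = []
    document = doc.get("document", {})
    content = doc.get("content", {})

    # Completeness: title, body, and timestamp must always be present
    if not content.get("title", {}).get("value"):
        issues.append("missing title")
    body = " ".join(b.get("value", "") for b in content.get("body", []))
    if not body.strip():
        issues.append("missing or empty body")

    # Timeliness: publication dates should not be in the future
    published = document.get("timestamps_utc", {}).get("published")
    if not published:
        issues.append("missing published timestamp")
    else:
        ts = datetime.strptime(published, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        if ts > datetime.now(timezone.utc):
            issues.append(f"published timestamp is in the future: {published}")

    # Duplication: flag repeated stories across the stream
    digest = hashlib.sha256(body.encode()).hexdigest()
    if digest in seen_bodies:
        issues.append("duplicate body content")
    seen_bodies.add(digest)
    return issues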
What We May Ask From You
  • Clarifications on content structure, format, or schedule
  • Additional samples to validate consistency

Historical Data Checks

Once you deliver your full content set, it’s time to dive deep into end-to-end analysis. Here is what we typically do:
We evaluate the actual content you provide and ensure it meets coverage and reliability expectations.
What We Do
  • Coverage Assessment: Confirm content meets expected volume, frequency, and scope
  • Duplication & Uniqueness: Detect repeated or redundant content (see the sketch at the end of this section)
  • Timeliness: Confirm content delivery aligns with expected schedules and timestamps
What We May Ask From You
  • Clarification on coverage or gaps
  • Documentation of content schedules or publishing frequency
What It Means For You
  • Provides transparency on coverage, timeliness, and formatting before full onboarding
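As an illustration, the coverage and duplication passes can look roughly like the sketch below, assuming one BDDF document per line in the delivery file; the file name is a placeholder.

import hashlib
import json
from collections import Counter

daily_volume = Counter()
body_hashes = Counter()

with open("historical_delivery.jsonl") as f:  # placeholder file name
    for line in f:
        doc = json.loads(line)
        published = doc["document"]["timestamps_utc"]["published"]
        daily_volume[published[:10]] += 1  # group by YYYY-MM-DD
        body = " ".join(b["value"] for b in doc["content"]["body"])
        body_hashes[hashlib.sha256(body.encode()).hexdigest()] += 1

duplicates = sum(n - 1 for n in body_hashes.values() if n > 1)
print(f"days covered: {len(daily_volume)}, duplicate documents: {duplicates}")
print("lowest-volume days:", daily_volume.most_common()[:-4:-1])  # three quietest days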

Examples of the Validation Rules

Let’s go through several real-life examples to bring theory closer to practice:
Do you correctly label the language of the content, or are Spanish articles sometimes incorrectly tagged as English? (A sketch of this check follows the example.)
Example that would fail:
"metadata": {
    "language": "en"
},
"content": {
    "title": {
        "content_type": "text/plain",
        "value": "Hoy en Marbella",
        "role": "HEADING"
    },
    "body": [
        {
        "content_type": "text/plain",
        "value": "Las últimas noticias en Marbella...",
        "role": "NORMAL"
        }
    ]
}
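A minimal sketch of this cross-check, assuming the langdetect package; statistical detectors can misfire on very short texts, so mismatches are flags for review rather than hard rejections.

from langdetect import detect

def check_language(fragment: dict) -> str | None:
    tagged = fragment["metadata"]["language"]
    body = " ".join(b["value"] for b in fragment["content"]["body"])
    detected = detect(body)  # e.g. "es" for Spanish text
    if detected != tagged:
        return f"tagged '{tagged}' but body looks like '{detected}'"
    return None

# The failing example above: a Spanish body tagged as English
print(check_language({
    "metadata": {"language": "en"},
    "content": {"body": [{"value": "Las últimas noticias en Marbella..."}]},
}))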
Are there any timestamps in the future? We don’t typically expect this - please let us know if it’s intentional. (A sketch of this check follows the example.)
Example that would fail:
"timestamps_utc": {
    "published": "2026-08-24 00:00:00",
    "created": "2025-01-01 00:00:00"
},
"metadata": {
    "language": "en",
},
"content": {
    "title": {
        "content_type": "text/plain",
        "value": "Example Title: January 1st 2025",
        "role": "HEADING"
    },
    "body": [
    {
        "content_type": "text/plain",
        "value": "Example body",
        "role": "NORMAL"
    }
    ]
}
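A focused sketch of this check; the 24-hour grace period is an assumption to accommodate the rare legitimate exceptions mentioned above.

from datetime import datetime, timedelta, timezone

TOLERANCE = timedelta(hours=24)  # assumed grace period

def future_timestamps(timestamps_utc: dict) -> list[str]:
    now = datetime.now(timezone.utc)
    issues = []
    for field, raw in timestamps_utc.items():
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        if ts > now + TOLERANCE:
            issues.append(f"{field} is in the future: {raw}")
    return issues

print(future_timestamps({"published": "2026-08-24 00:00:00",
                         "created": "2025-01-01 00:00:00"}))
# ['published is in the future: 2026-08-24 00:00:00'] at the time of writing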
If you update any documents - are you following our chain and sequence id definitions? (A sketch of this check follows the two versions.)
Example that would fail:
Version 1

{
    "schema": {
        "version": "1.3"
        },
    "document": {
        "id": "1000",
        "revision": {
            "chain_id": "1000",
            "sequence_id": "1"
        },
        "source": {
            "id": "ExampleSource",
            "name": "ExampleSource"
        },
        "timestamps_utc": {
            "published": "2040-08-24 00:00:00",
            "created": "2025-01-01 00:00:00"
        },
        "metadata": {
            "language": "en",
        }
    }
    "content": {
        "title": {
            "content_type": "text/plain",
            "value": "Title: January 1st 2025",
            "role": "HEADING"
        },
        "body": [
            {
            "content_type": "text/plain",
            "value": "Example body",
            "role": "NORMAL"
            }
        ]
    }
}
Version 2 - title updated but sequence_id stayed the same

{
    "schema": {
        "version": "1.3"
        },
    "document": {
        "id": "1000",
        "revision": {
            "chain_id": "1000",
            "sequence_id": "1"
        },
        "source": {
            "id": "ExampleSource",
            "name": "ExampleSource"
        },
        "timestamps_utc": {
            "published": "2040-08-24 00:00:00",
            "created": "2025-01-01 00:00:00"
        },
        "metadata": {
            "language": "en",
        }
    }
    "content": {
        "title": {
            "content_type": "text/plain",
            "value": "Updated Title: January 1st 2025",
            "role": "HEADING"
        },
        "body": [
            {
            "content_type": "text/plain",
            "value": "Example body",
            "role": "NORMAL"
            }
        ]
    }
}
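A minimal sketch of the chain/sequence check, assuming document versions arrive in delivery order: within a chain_id, any content change must come with a higher sequence_id.

import hashlib
import json

def check_revisions(versions: list[dict]) -> list[str]:
    issues = []
    last: dict[str, tuple[int, str]] = {}  # chain_id -> (sequence_id, content hash)
    for doc in versions:
        revision = doc["document"]["revision"]
        chain_id = revision["chain_id"]
        sequence_id = int(revision["sequence_id"])
        digest = hashlib.sha256(
            json.dumps(doc["content"], sort_keys=True).encode()
        ).hexdigest()
        if chain_id in last:
            prev_seq, prev_digest = last[chain_id]
            if digest != prev_digest and sequence_id <= prev_seq:
                issues.append(f"chain {chain_id}: content changed but "
                              f"sequence_id stayed at {sequence_id}")
        last[chain_id] = (sequence_id, digest)
    return issues

Run over the two versions above, this reports that the title changed while sequence_id stayed at 1.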
Older articles should not be incorrectly labeled with recent dates. It’s important that the creation date accurately reflects when the document was first made available, to prevent outdated content from appearing in recent search results.
Example that would fail:
"timestamps_utc": {
    "published": "2025-01-01 00:00:00",
    "created": "2025-01-01 00:00:00"
},
"metadata": {
    "language": "en",
},
"content": {
    "title": {
        "content_type": "text/plain",
        "value": "News on January 1st 2020",
        "role": "HEADING"
    }
}

Other things we look for:

  • Is there any junk or noise - like scraped links, HTML artifacts, or unnecessary metadata - that needs cleaning up? (A few of these spot checks are sketched after this list.)
  • Is the content cleanly split into readable paragraphs following our suggestions?
  • Are there any empty or headline-only documents?
  • Are our entity and event detections on the expected level for the given content type?
  • How does our chunking look? Are there any changes you could make that would really improve it?
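A rough sketch of a few of these spot checks, assuming plain-text bodies; the regular expressions are illustrative, not an exhaustive noise filter.

import re

HTML_ARTIFACT = re.compile(r"</?\w+[^>]*>|&[a-z]+;")
RAW_LINK = re.compile(r"https?://\S+")

def spot_check(content: dict) -> list[str]:
    issues = []
    body = " ".join(b.get("value", "") for b in content.get("body", []))
    if not body.strip():
        issues.append("empty or headline-only document")
    if HTML_ARTIFACT.search(body):
        issues.append("HTML artifacts in body")
    if RAW_LINK.search(body):
        issues.append("raw links in body - possible scraping noise")
    return issues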

Applying the Feedback

As we work through validation, we’ll likely share some feedback along the way. This is a collaborative process, and we count on providers to be responsive and proactive during this phase. The faster we can work through feedback together, the sooner your content can go live on Bigdata.com. We generally provide two types of feedback:
  • must-haves – essential fixes required before we can onboard your content
  • nice-to-haves – optional improvements that can boost content quality and potentially improve performance on Bigdata.com
Once you receive our feedback, you have a few options:
  • implement everything immediately - this increases your chances of success on Bigdata.com, and we’ll give you bonus points for it - a win-win!
  • focus on just the must-haves first to get your content live quickly, and come back to the nice-to-haves later
  • tackle the nice-to-haves later by:
    • applying them only to new real-time (RT) content going forward
    • reprocessing and resubmitting historical content if you’d like those improvements applied retroactively
We’ll go through this process together and agree on what’s best on a case-by-case basis, keeping in mind that our mutual goal is to make your data as successful on Bigdata.com as possible. By the end of this evaluation:
  • your metadata is mapped and validated,
  • real-world use cases for your content are clearly defined,
  • real-time ingestion and content quality are confirmed,
  • coverage, reliability, and usability of your content are assessed.