Before onboarding your content into our system, we perform a quick but thorough validation to ensure everything meets our standards. This includes checking the BDDF against our predefined JSON schema to identify structural issues and confirm compatibility with our processing pipeline. Beyond schema validation, we also review the content for logical consistency, data quality, and correct rendering after processing. Below are the steps we typically take - depending on your content type we might not go through all of them, but in most cases these are the checks we perform:
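Much of the structural part can be reproduced on your side before submission. Here is a minimal sketch, assuming the jsonschema Python package, a local copy of the BDDF JSON schema, and that the schema targets a recent JSON Schema draft; the file names are placeholders.

import json
from jsonschema import Draft202012Validator

# Load the BDDF JSON schema and a sample document (placeholder file names).
with open("bddf_schema.json") as f:
    schema = json.load(f)
with open("sample_document.json") as f:
    document = json.load(f)

# Report every structural violation instead of stopping at the first one.
validator = Draft202012Validator(schema)
for error in validator.iter_errors(document):
    path = "/".join(str(p) for p in error.path)
    print(f"{path or '<root>'}: {error.message}")

Running a check like this against your samples before delivery catches most schema-level issues early.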

Initial Checks

Here’s a quick look at what we check upfront, often before any work on preparing the content files begins:
We want to ensure that your metadata is accurate, complete, and compatible with our systems.
What We Do
  • Check how your metadata lines up with ours - sometimes your fields and ours have the same name but mean different things in our respective systems (see the mapping sketch below)
  • Make sure all the required metadata is there and successfully ingested into our system
  • Look out for any metadata that didn’t get mapped:
    • See if it can be matched to something we already have
    • If not, and it’s useful, we can talk about adding a new field to the BDDF
What We May Ask From You
  • Information on how you handle updates, corrections, or deletions
  • Metadata documentation
  • Clarification for any unusual or custom fields
Outcome
  • Correct mapping of metadata fields
  • Valuable unmapped fields retained under custom
  • Identification of new field requirements
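To illustrate the mapping step, here is a hypothetical sketch; the provider field names and the mapping itself are assumptions here and are agreed per provider in practice.

# Hypothetical provider-to-BDDF field mapping; real mappings are agreed per provider.
FIELD_MAP = {
    "lang": "language",       # same concept, different name
    "headline": "title",
    "pub_date": "published",
}

def map_metadata(provider_metadata: dict) -> dict:
    mapped, custom = {}, {}
    for field, value in provider_metadata.items():
        if field in FIELD_MAP:
            mapped[FIELD_MAP[field]] = value
        else:
            custom[field] = value  # valuable unmapped fields are retained under custom
    if custom:
        mapped["custom"] = custom
    return mapped

print(map_metadata({"lang": "en", "headline": "Example", "region": "ES"}))
# {'language': 'en', 'title': 'Example', 'custom': {'region': 'ES'}}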
We aim to identify any special requirements or custom work needed to properly ingest your content. Don’t worry - these things don’t put us off; we appreciate the honesty upfront. Below are some examples from past projects to help guide you.
  • Expiring Links to Original Documents: The links you provide to the original documents may expire. We need to be aware of this because we want users to have continuous access to the original files within BigData. This requires custom work to download and store the files in our database, ensuring they remain accessible at all times (see the sketch after these examples).
  • Providing Associated Files (e.g., PDFs, Audio): If you can provide associated files such as original PDFs or audio files, we will need special infrastructure to handle and integrate these additional assets properly.
  • Timestamps in the Future: When using large language models (LLMs), hallucinations can occur, sometimes resulting in timestamps that are set in the future. While this might be acceptable from the provider’s perspective, accurate timestamps are critical for the BigData experience. It’s important for us to understand this upfront so we can develop appropriate solutions.
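As an illustration of the expiring-links case, here is a rough sketch of the download-and-store work, assuming the requests package; store_file is a stand-in for our actual database write.

import requests

def store_file(document_id: str, payload: bytes) -> None:
    # Stand-in for persistent storage - in practice this is our database,
    # so the original stays accessible within BigData after the link expires.
    with open(f"{document_id}.bin", "wb") as f:
        f.write(payload)

def archive_original(url: str, document_id: str) -> None:
    # Fetch the original file before the provider link expires.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    store_file(document_id, response.content)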
To ensure smooth ongoing operations, we need to understand how you handle non-functional aspects such as latency, scalability, and downtime. Specifically, we’d like to know:
  • How will we be notified if there are issues or downtime on your side?
  • Do you have an automated notification or monitoring system in place?
  • Will we receive regular status reports, or will notifications be sent via email?
  • What is your expected response time for resolving such issues?
Having clarity on these points will help us prepare and maintain a reliable BigData experience.
We’d also like to ensure your content supports meaningful real-world applications for our products and clients.
What We Do
  • Review potential use cases for your content
  • Define examples such as:
    • Alerts for portfolio company mentions in podcasts
    • Monitoring ESG-related discussions
    • Automated summaries for company mentions
What We May Ask From You
  • Examples of how your content is typically used
  • Insights into content value for end-users

Content Sample Checks

Once we receive a content sample from you, we won’t just check whether it fits our schema definition - we will also do a basic content evaluation:
We evaluate the actual content you provide and ensure it meets quality and usability expectations.
What We Do
  • Content Review: Check readability, formatting, and logical structure
  • Accuracy & Reliability: Verify that information is correct and consistent
  • Language & Localization: Ensure proper language, grammar, and style
What We May Ask From You
  • Additional content samples, if needed, to assess consistency
What It Means For You
  • Confirms that content aligns with our quality standards and expectations
  • Identifies potential improvements to maximize value for end-users

Real-time Data Checks

After real-time data starts flowing, we will perform the following checks:
We assess how your content performs in real-time ingestion and verify quality, completeness, and reliability.
What We Do
  • We review sample content against the criteria below (a few of them are sketched in code after the table):

Criteria                 | What We Check                             | Expectation
Volume                   | Matches expected ingestion volume         | Steady, statistically reliable coverage
Topic & Description      | Main theme is clear and relevant          | Clear and concise description
Readability              | Content is structured and easy to read    | Proper English, logical paragraphs
Formatting               | Matches original source                   | No missing sections, no ad/sponsor text
Duplication              | No repeated content                       | Unique stories
Language                 | Correct language tagging                  | English (or specified language)
Timeliness               | Publication dates match reality           | Accurate, no future dates (rare exceptions possible)
Completeness             | Essential fields included                 | Title, body, timestamp always present
Contextual Quality       | Content makes sense independently         | Full context provided (e.g., speaker name in interviews)
Entity & Event Detection | Entities/events detected and relevant     | Matches expected types/volumes
Content Enrichment       | Potential for improving content usability | Identify opportunities for enrichment, tagging, or additional structure
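To make a few of these criteria concrete, here is a minimal sketch of per-document checks for completeness, timeliness, and duplication, using the field layout from the examples later on this page; the helper is an assumption, not our actual pipeline code.

import hashlib
from datetime import datetime, timezone

seen_bodies: set[str] = set()

def check_document(doc: dict) -> list[str]:
    issues = []
    document = doc.get("document", {})
    content = doc.get("content", {})

    # Completeness: title, body, and timestamp must always be present
    if not content.get("title", {}).get("value"):
        issues.append("missing title")
    body = " ".join(b.get("value", "") for b in content.get("body", []))
    if not body.strip():
        issues.append("missing or empty body")

    # Timeliness: publication dates should not be in the future
    published = document.get("timestamps_utc", {}).get("published")
    if not published:
        issues.append("missing published timestamp")
    else:
        ts = datetime.strptime(published, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        if ts > datetime.now(timezone.utc):
            issues.append(f"published timestamp is in the future: {published}")

    # Duplication: flag repeated stories across the stream
    digest = hashlib.sha256(body.encode()).hexdigest()
    if digest in seen_bodies:
        issues.append("duplicate body content")
    seen_bodies.add(digest)
    return issues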
What We May Ask From You
  • Clarifications on content structure, format, or schedule
  • Additional samples to validate consistency

Historical Data Checks

Once you deliver your full content set, it’s time to dive deep into end-to-end analysis. Here is what we typically do:
We evaluate the actual content you provide and ensure it meets coverage and reliability expectations.
What We Do
  • Coverage Assessment: Confirm content meets expected volume, frequency, and scope
  • Duplication & Uniqueness: Detect repeated or redundant content (see the sketch at the end of this section)
  • Timeliness: Confirm content delivery aligns with expected schedules and timestamps
What We May Ask From You
  • Clarification on coverage or gaps
  • Documentation of content schedules or publishing frequency
What It Means For You
  • Provides transparency on coverage, timeliness, and formatting before full onboarding
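As an illustration, the coverage and duplication passes can look roughly like the sketch below, assuming one BDDF document per line in the delivery file; the file name is a placeholder.

import hashlib
import json
from collections import Counter

daily_volume = Counter()
body_hashes = Counter()

with open("historical_delivery.jsonl") as f:  # placeholder file name
    for line in f:
        doc = json.loads(line)
        published = doc["document"]["timestamps_utc"]["published"]
        daily_volume[published[:10]] += 1  # group by YYYY-MM-DD
        body = " ".join(b["value"] for b in doc["content"]["body"])
        body_hashes[hashlib.sha256(body.encode()).hexdigest()] += 1

duplicates = sum(n - 1 for n in body_hashes.values() if n > 1)
print(f"days covered: {len(daily_volume)}, duplicate documents: {duplicates}")
print("lowest-volume days:", daily_volume.most_common()[:-4:-1])  # three quietest days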

Examples of the Validation Rules

Let’s go through several real-life examples to bring theory closer to practice:
Do you correctly label the language of the content, or are Spanish articles sometimes incorrectly tagged as English? (A sketch of this check follows the example.)
Example that would fail:
"metadata": {
    "language": "en"
},
"content": {
    "title": {
        "content_type": "text/plain",
        "value": "Hoy en Marbella",
        "role": "HEADING"
    },
    "body": [
        {
        "content_type": "text/plain",
        "value": "Las últimas noticias en Marbella...",
        "role": "NORMAL"
        }
    ]
}
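A minimal sketch of this cross-check, assuming the langdetect package; statistical detectors can misfire on very short texts, so mismatches are flags for review rather than hard rejections.

from langdetect import detect

def check_language(fragment: dict) -> str | None:
    tagged = fragment["metadata"]["language"]
    body = " ".join(b["value"] for b in fragment["content"]["body"])
    detected = detect(body)  # e.g. "es" for Spanish text
    if detected != tagged:
        return f"tagged '{tagged}' but body looks like '{detected}'"
    return None

# The failing example above: a Spanish body tagged as English
print(check_language({
    "metadata": {"language": "en"},
    "content": {"body": [{"value": "Las últimas noticias en Marbella..."}]},
}))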
Are there any timestamps in the future? We don’t typically expect this - please let us know if it’s intentional. (A sketch of this check follows the example.)
Example that would fail:
"timestamps_utc": {
    "published": "2026-08-24 00:00:00",
    "created": "2025-01-01 00:00:00"
},
"metadata": {
    "language": "en",
},
"content": {
    "title": {
        "content_type": "text/plain",
        "value": "Example Title: January 1st 2025",
        "role": "HEADING"
    },
    "body": [
    {
        "content_type": "text/plain",
        "value": "Example body",
        "role": "NORMAL"
    }
    ]
}
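A focused sketch of this check; the 24-hour grace period is an assumption to accommodate the rare legitimate exceptions mentioned above.

from datetime import datetime, timedelta, timezone

TOLERANCE = timedelta(hours=24)  # assumed grace period

def future_timestamps(timestamps_utc: dict) -> list[str]:
    now = datetime.now(timezone.utc)
    issues = []
    for field, raw in timestamps_utc.items():
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        if ts > now + TOLERANCE:
            issues.append(f"{field} is in the future: {raw}")
    return issues

print(future_timestamps({"published": "2026-08-24 00:00:00",
                         "created": "2025-01-01 00:00:00"}))
# ['published is in the future: 2026-08-24 00:00:00'] at the time of writing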
If you update any documents - are you following our chain and sequence id definitions? (A sketch of this check follows the two versions.)
Example that would fail:
Version 1

{
    "schema": {
        "version": "1.3"
        },
    "document": {
        "id": "1000",
        "revision": {
            "chain_id": "1000",
            "sequence_id": "1"
        },
        "source": {
            "id": "ExampleSource",
            "name": "ExampleSource"
        },
        "timestamps_utc": {
            "published": "2040-08-24 00:00:00",
            "created": "2025-01-01 00:00:00"
        },
        "metadata": {
            "language": "en",
        }
    }
    "content": {
        "title": {
            "content_type": "text/plain",
            "value": "Title: January 1st 2025",
            "role": "HEADING"
        },
        "body": [
            {
            "content_type": "text/plain",
            "value": "Example body",
            "role": "NORMAL"
            }
        ]
    }
}
Version 2 - title updated but sequence_id stayed the same

{
    "schema": {
        "version": "1.3"
        },
    "document": {
        "id": "1000",
        "revision": {
            "chain_id": "1000",
            "sequence_id": "1"
        },
        "source": {
            "id": "ExampleSource",
            "name": "ExampleSource"
        },
        "timestamps_utc": {
            "published": "2040-08-24 00:00:00",
            "created": "2025-01-01 00:00:00"
        },
        "metadata": {
            "language": "en",
        }
    }
    "content": {
        "title": {
            "content_type": "text/plain",
            "value": "Updated Title: January 1st 2025",
            "role": "HEADING"
        },
        "body": [
            {
            "content_type": "text/plain",
            "value": "Example body",
            "role": "NORMAL"
            }
        ]
    }
}
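A minimal sketch of the chain/sequence check, assuming document versions arrive in delivery order: within a chain_id, any content change must come with a higher sequence_id.

import hashlib
import json

def check_revisions(versions: list[dict]) -> list[str]:
    issues = []
    last: dict[str, tuple[int, str]] = {}  # chain_id -> (sequence_id, content hash)
    for doc in versions:
        revision = doc["document"]["revision"]
        chain_id = revision["chain_id"]
        sequence_id = int(revision["sequence_id"])
        digest = hashlib.sha256(
            json.dumps(doc["content"], sort_keys=True).encode()
        ).hexdigest()
        if chain_id in last:
            prev_seq, prev_digest = last[chain_id]
            if digest != prev_digest and sequence_id <= prev_seq:
                issues.append(f"chain {chain_id}: content changed but "
                              f"sequence_id stayed at {sequence_id}")
        last[chain_id] = (sequence_id, digest)
    return issues

Run over the two versions above, this reports that the title changed while sequence_id stayed at 1.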
Older articles should not be incorrectly labeled with recent dates. It’s important that the creation date accurately reflects when the document was first made available, to prevent outdated content from appearing in recent search results.
Example that would fail:
"timestamps_utc": {
    "published": "2025-01-01 00:00:00",
    "created": "2025-01-01 00:00:00"
},
"metadata": {
    "language": "en",
},
"content": {
    "title": {
        "content_type": "text/plain",
        "value": "News on January 1st 2020",
        "role": "HEADING"
    }
}

Other things we look for:

  • Is there any junk or noise - like scraped links, HTML artifacts, or unnecessary metadata - that needs cleaning up? (A few of these spot checks are sketched after this list.)
  • Is the content cleanly split into readable paragraphs following our suggestions?
  • Are there any empty or headline-only documents?
  • Are our entity and event detections on the expected level for the given content type?
  • How does our chunking look? Are there any changes you could make that would really improve it?
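A rough sketch of a few of these spot checks, assuming plain-text bodies; the regular expressions are illustrative, not an exhaustive noise filter.

import re

HTML_ARTIFACT = re.compile(r"</?\w+[^>]*>|&[a-z]+;")
RAW_LINK = re.compile(r"https?://\S+")

def spot_check(content: dict) -> list[str]:
    issues = []
    body = " ".join(b.get("value", "") for b in content.get("body", []))
    if not body.strip():
        issues.append("empty or headline-only document")
    if HTML_ARTIFACT.search(body):
        issues.append("HTML artifacts in body")
    if RAW_LINK.search(body):
        issues.append("raw links in body - possible scraping noise")
    return issues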

Applying the Feedback

As we work through validation, we’ll likely share some feedback along the way. This is a collaborative process, and we count on providers to be responsive and proactive during this phase. The faster we can work through feedback together, the sooner your content can go live on Bigdata.com. We generally provide two types of feedback:
  • must-haves – essential fixes required before we can onboard your content
  • nice-to-haves – optional improvements that can boost content quality and potentially improve performance on Bigdata.com
Once you receive our feedback, you have a few options:
  • implement everything immediately - this increases your chances of success on Bigdata.com, and we’ll give you bonus points for it - a win-win!
  • focus on just the must-haves first to get your content live quickly, and come back to the nice-to-haves later
  • tackle the nice-to-haves later by:
    • applying them only to new real-time (RT) content going forward
    • reprocessing and resubmitting historical content if you’d like those improvements applied retroactively
We’ll go through this process together and agree on what’s best on a case-by-case basis, keeping in mind that our mutual goal is to make your data as successful on Bigdata.com as possible. By the end of this evaluation:
  • your metadata is mapped and validated,
  • real-world use cases for your content are clearly defined,
  • real-time ingestion and content quality are confirmed,
  • coverage, reliability, and usability of your content are assessed.