Initial Checks
Here’s a quick look at what we check upfront, often even before any work on preparing the content in files begins:
Metadata Review
We want to ensure that your metadata is accurate, complete, and compatible with our systems.
What We Do
- Check how your metadata lines up with ours - sometimes your fields and ours share a name but mean different things in our respective systems (see the mapping sketch after this list)
- Make sure all the required metadata is there and successfully ingested into our system
- Look out for any metadata that didn’t get mapped:
  - See if it can be matched to something we already have
  - If not, and it’s useful, we can talk about adding a new field to the BDDF
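To make the mapping idea concrete, here is a minimal sketch - all field names on both sides are hypothetical, and the real mapping is agreed during onboarding:

```python
# Hypothetical mapping from a provider's metadata fields to BDDF fields.
# Field names on both sides are illustrative, not the actual BDDF schema.
FIELD_MAP = {
    "headline": "title",       # same concept, different name
    "published": "timestamp",  # provider's publish time -> our timestamp
    "lang": "language",
    "source_url": "url",
}

def map_metadata(provider_doc: dict) -> dict:
    """Translate a provider document's metadata into our field names."""
    mapped, unmapped = {}, {}
    for key, value in provider_doc.items():
        if key in FIELD_MAP:
            mapped[FIELD_MAP[key]] = value
        else:
            unmapped[key] = value  # candidates for matching or for a new BDDF field
    if unmapped:
        mapped["custom"] = unmapped  # valuable unmapped fields retained under custom
    return mapped
```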
What We May Ask From You
- Information on how you handle updates, corrections, or deletions
- Metadata documentation
- Clarification for any unusual or custom fields
Expected Outcome
- Correct mapping of metadata fields
- Valuable unmapped fields retained under custom
- Identification of new field requirements
Identify Special Requirements
We aim to identify any special requirements or custom work needed to properly ingest your content. Don’t worry - these things don’t put us off; we appreciate the honesty upfront. Below are some examples from past projects to help guide you.
- Expiring Links to Original Documents: The links you provide to the original documents may expire. We need to be aware of this because we want users to have continuous access to the original files within BigData. This requires custom work to download and store the files in our database, ensuring they remain accessible at all times (see the sketch after this list).
- Providing Associated Files (e.g., PDFs, Audio): If you can provide associated files such as original PDFs or audio files, we will need special infrastructure to handle and integrate these additional assets properly.
- Timestamps in the Future: When using large language models (LLMs), hallucinations can occur, sometimes resulting in timestamps that are set in the future. While this might be acceptable from the provider’s perspective, accurate timestamps are critical for the BigData experience. It’s important for us to understand this upfront so we can develop appropriate solutions.
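For the expiring-links case above, the custom work essentially means fetching and persisting each file while its link is still valid. A rough sketch of the idea, with a plain dict standing in for a real blob store:

```python
import hashlib
import requests  # assumes the requests package is available

def archive_original(url: str, storage: dict) -> str:
    """Download a document before its link expires and keep our own copy.

    `storage` stands in for a real blob store or database.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    key = hashlib.sha256(response.content).hexdigest()
    storage[key] = response.content  # the stored copy stays accessible after the link dies
    return key
```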
Understand Non-Functional Requirements
To ensure smooth ongoing operations, we need to understand how you handle non-functional aspects such as latency, scalability, and downtime. Specifically, we’d like to know:
- How will we be notified if there are issues or downtime on your side?
- Do you have an automated notification or monitoring system in place?
- Will we receive regular status reports, or will notifications be sent via email?
- What is your expected response time for resolving such issues?
Use Case Study
We’d also like to ensure your content supports meaningful real-world applications for our products and clients.
What We Do
- Review potential use cases for your content
- Define examples such as:
  - Alerts for portfolio company mentions in podcasts
  - Monitoring ESG-related discussions
  - Automated summaries for company mentions
What We May Ask From You
- Examples of how your content is typically used
- Insights into content value for end-users
Content Sample Checks
Once we receive a content sample from you, we won’t only check whether it fits our schema definition - we will also run a basic content evaluation:
Content Evaluation
We evaluate the actual content you provide and ensure it meets quality and usability expectations.
What We Do
- Content Review: Check readability, formatting, and logical structure
- Accuracy & Reliability: Verify that information is correct and consistent
- Language & Localization: Ensure proper language, grammar, and style
What We May Ask From You
- Additional content samples, if needed, to assess consistency
Expected Outcome
- Confirms that content aligns with our quality standards and expectations
- Identifies potential improvements to maximize value for end-users
Real-time Data Checks
After real-time data starts flowing, we will perform the following checks:
Real-Time Data Evaluation
We assess how your content performs in real-time ingestion and verify quality, completeness, and reliability.
What We Do
We review sample content against these criteria (two of the checks are sketched after the table):
| Criteria | What We Check | Expectation |
|---|---|---|
| Volume | Matches expected ingestion volume | Steady, statistically reliable coverage |
| Topic & Description | Main theme is clear and relevant | Clear and concise description |
| Readability | Content is structured and easy to read | Proper English, logical paragraphs |
| Formatting | Matches original source | No missing sections, no ad/sponsor text |
| Duplication | No repeated content | Unique stories |
| Language | Correct language tagging | English (or specified language) |
| Timeliness | Publication dates match reality | Accurate, no future dates (rare exceptions possible) |
| Completeness | Essential fields included | Title, body, timestamp always present |
| Contextual Quality | Content makes sense independently | Full context provided (e.g., speaker name in interviews) |
| Entity & Event Detection | Entities/events detected and relevant | Matches expected types/volumes |
| Content Enrichment | Potential for improving content usability | Identify opportunities for enrichment, tagging, or additional structure |
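As a rough illustration, here is how the Completeness and Duplication rows could be checked programmatically - a minimal sketch, assuming hypothetical field names rather than the actual BDDF schema:

```python
import hashlib

def check_document(doc: dict, seen_hashes: set) -> list:
    """Return the criteria this document fails (empty list = pass)."""
    failures = []
    # Completeness: title, body, timestamp must always be present
    for field in ("title", "body", "timestamp"):
        if not doc.get(field):
            failures.append(f"completeness: missing {field}")
    # Duplication: unique stories only
    digest = hashlib.sha256(doc.get("body", "").encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        failures.append("duplication: body seen before")
    seen_hashes.add(digest)
    return failures
```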
What We May Ask From You
- Clarifications on content structure, format, or schedule
- Additional samples to validate consistency
Historical Data Checks
Once you deliver your full content set, it’s time to dive deep into end-to-end analysis. Here is what we typically do:
Coverage Evaluation
We evaluate the actual content you provide and ensure it meets coverage and reliability expectations.
What We Do
- Coverage Assessment: Confirm content meets expected volume, frequency, and scope (a small sketch follows this list)
- Duplication & Uniqueness: Detect repeated or redundant content
- Timeliness: Confirm content delivery aligns with expected schedules and timestamps
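To picture the coverage assessment, one simple approach is to bucket documents by day and flag days that fall well below the expected volume - a sketch with made-up thresholds and hypothetical field names:

```python
from collections import Counter
from datetime import datetime

def find_coverage_gaps(docs, expected_per_day=100):
    """Flag days whose document count falls below half the expected volume."""
    per_day = Counter(
        datetime.fromisoformat(d["timestamp"].replace("Z", "+00:00")).date().isoformat()
        for d in docs
    )
    return [day for day, count in sorted(per_day.items())
            if count < expected_per_day / 2]
```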
What We May Ask From You
- Clarification on coverage or gaps
- Documentation of content schedules or publishing frequency
Expected Outcome
- Provides transparency on coverage, timeliness, and formatting before full onboarding
Examples of the Validation Rules
Let’s go through several real-life examples to bring theory closer to practice:
Language Checks
Do you correctly label the language of the content, or are Spanish articles sometimes incorrectly tagged as English?
Example that would fail:
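For illustration only - field names are hypothetical, and langdetect is just one possible detection tool:

```python
from langdetect import detect  # third-party language detector, one possible tool

doc = {
    "language": "en",  # tagged as English...
    "body": "La empresa anunció hoy un aumento del 15% en sus ingresos trimestrales.",
}

detected = detect(doc["body"])  # -> "es": the body is actually Spanish
print(f"tagged {doc['language']}, detected {detected}")  # this document fails the check
```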
Timestamps Validation
Timestamps Validation
Are there any timestamps in the future? We don’t typically expect this - please let us know if it’s intentional.
Example that would fail:
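For illustration, with a hypothetical field name:

```python
from datetime import datetime, timezone

doc = {"timestamp": "2031-01-15T09:30:00+00:00"}  # publication date years from now

ts = datetime.fromisoformat(doc["timestamp"])
if ts > datetime.now(timezone.utc):
    print("fails: publication timestamp is in the future")  # this document is rejected
```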
Document Updates Handling
Document Updates Handling
If you update any documents - are you following our chain and sequence id definitions?
Example that would fail:
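A failing case might look like this, assuming for illustration that updates to one document share a chain id and must carry strictly increasing sequence ids (see the BDDF documentation for the exact definitions):

```python
# Three versions of the same logical document (hypothetical fields).
updates = [
    {"chain_id": "doc-123", "sequence_id": 1, "body": "Original story."},
    {"chain_id": "doc-123", "sequence_id": 3, "body": "First correction."},
    {"chain_id": "doc-123", "sequence_id": 2, "body": "Second correction."},  # out of order
]

last_seen = {}
for update in updates:
    prev = last_seen.get(update["chain_id"], 0)
    if update["sequence_id"] <= prev:
        print(f"fails: sequence_id {update['sequence_id']} arrived after {prev}")
    last_seen[update["chain_id"]] = max(prev, update["sequence_id"])
```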
Article Novelty
Article Novelty
Older articles should not be incorrectly labeled with recent dates. It’s important that the creation date accurately reflects when the document was first made available, to prevent outdated content from appearing in recent search results.
Example that would fail:
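For illustration, with hypothetical field names and a made-up tolerance:

```python
from datetime import date

# The story first ran in 2016, but the creation date claims it is brand new.
doc = {
    "creation_date": "2025-03-01",
    "original_publication_date": "2016-07-12",
}

gap_days = (date.fromisoformat(doc["creation_date"])
            - date.fromisoformat(doc["original_publication_date"])).days
if gap_days > 30:  # made-up tolerance for illustration
    print(f"fails: creation date is {gap_days} days after first publication")
```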
Other things we look for (two of these are sketched after the list):
- Is there any junk or noise - like scraped links, HTML artifacts, or unnecessary metadata - that needs cleaning up?
- Is the content cleanly split into readable paragraphs following our suggestions?
- Are there any empty or headline-only documents?
- Are our entity and event detections on the expected level for the given content type?
- How does our chunking look? Are there any changes on your side that would meaningfully improve it?
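Two of these checks - junk detection and headline-only documents - are easy to sketch; the patterns and thresholds below are illustrative:

```python
import re

HTML_ARTIFACT = re.compile(r"</?\w+[^>]*>|&\w+;")  # leftover tags or entities

def quick_quality_flags(doc: dict) -> list:
    """Return simple quality flags for one document (hypothetical fields)."""
    flags = []
    body = doc.get("body", "")
    if HTML_ARTIFACT.search(body):
        flags.append("junk: HTML artifacts in body")
    if len(body.split()) < 20:  # made-up threshold
        flags.append("empty or headline-only document")
    return flags
```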
Applying the Feedback
As we work through validation, we’ll likely share some feedback along the way. This is a collaborative process, and we count on providers to be responsive and proactive during this phase. The faster we can work through feedback together, the sooner your content can go live on Bigdata.com. We generally provide two types of feedback:
- must-haves – essential fixes required before we can onboard your content
- nice-to-haves – optional improvements that can boost content quality and potentially improve performance on Bigdata.com
You can choose how to apply it:
- implement everything immediately - this increases your chances of success on Bigdata.com and we’ll give you bonus points for that - sounds like a win-win to us!
- focus on just the must-haves first to get your content live quickly, and come back to the nice-to-haves later
If you come back to the nice-to-haves later, you can do so by:
- applying them only to new real-time (RT) content going forward
- reprocessing and resubmitting historical content if you’d like those improvements applied retroactively
By the end of the validation process:
- your metadata is mapped and validated,
- real-world use cases for your content are clearly defined,
- real-time ingestion and content quality are confirmed,
- coverage, reliability, and usability of your content are assessed.
Quick Links
- Onboarding Overview - typical onboarding flow
- Quick Start Guide - step by step guide to your first BDDF file
- Bigdata Document Format - in-depth schema definition with examples

