Schema
We use standard software versioning conventions, following the major.minor format.
The major version goes up when there are big changes that could break backward compatibility - like overhauled APIs or removed features. We don’t expect this to happen often… but hey, never say never.
The minor version increases when we add new features in a way that doesn’t disrupt existing integrations. You can expect to see a lot of these - especially as we expand the format with fields tailored to different content types like transcripts, podcasts, newsletters, market research, filings, and more.
We’re working on an interactive editor that will let you select a version from a dropdown and view the corresponding schema. Until that’s ready, stick with 1.3, which is our latest version (and is fully detailed on these pages), and put that into your BDDF files.
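To make the major.minor rule concrete, here is a minimal sketch of how a consumer could compare versions. The function names `parse_version` and `can_read` are hypothetical and are not part of BDDF. The sketch assumes only what is stated above: a major bump may break compatibility, a minor bump does not.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Split a 'major.minor' string into two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def can_read(file_version: str, reader_version: str) -> bool:
    """Assumption: a reader handles files with the same major version
    and a minor version no newer than its own."""
    file_major, file_minor = parse_version(file_version)
    reader_major, reader_minor = parse_version(reader_version)
    return file_major == reader_major and file_minor <= reader_minor


latest = "1.3"  # the latest version named in the text above
```

Comparing the parsed integers rather than the raw strings matters once minor versions reach two digits: as strings, "1.10" would sort before "1.9".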
Document
This section is all about document metadata: the key details that help describe what the document is, what it’s about, and how it should be handled.
Chances are, you already assign a unique ID to each document in your system - this is where that ID goes, no matter the format. Including it makes it much easier for us to trace, debug, or reprocess your documents if needed. For reference, we use UUIDs on our end.
Correct examples ✅
Revision
chain_id is typically the ID of the first document in the update chain. For all future updated versions of the same document, please provide the same chain_id so we know it’s an update, not a completely new document.
sequence_id is the document sequence identifier within the chain. As you keep publishing updates, this field tells us which document version is newer: the higher the number, the newer the document. We’ve seen various ways to keep track of this, but the most common ones are a sequential auto-increment (sequence_id = 3, sequence_id = 4, sequence_id = 5…) and a Unix timestamp (sequence_id = 1720915200, sequence_id = 1721260800, sequence_id = 1721692800…). We don’t care what you use, as long as you follow the rule: the higher the number, the newer the document!
Correct examples ✅
- Ideally, each document has a unique ID, the chain_id stays the same, and the sequence_id increases with every update.
- However, we will also accept a revision that reuses the same document ID, as long as the chain_id stays the same and the sequence_id is higher.
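The two accepted patterns can be sketched as follows. This is an illustration under assumptions, not the actual ingestion logic: the flat dictionary layout, the `id` key, and the helper `latest_per_chain` are hypothetical. Only chain_id and sequence_id come from the text above.

```python
from typing import Dict, Iterable


def latest_per_chain(documents: Iterable[dict]) -> Dict[str, dict]:
    """Keep, for each chain_id, the document with the highest sequence_id."""
    latest: Dict[str, dict] = {}
    for doc in documents:
        current = latest.get(doc["chain_id"])
        if current is None or doc["sequence_id"] > current["sequence_id"]:
            latest[doc["chain_id"]] = doc
    return latest


# Pattern 1: unique document IDs, shared chain_id, auto-increment sequence.
first = {"id": "doc-001", "chain_id": "doc-001", "sequence_id": 1}
update = {"id": "doc-002", "chain_id": "doc-001", "sequence_id": 2}

# Pattern 2: same document ID reused, Unix timestamps as sequence_id.
original = {"id": "rep-9", "chain_id": "rep-9", "sequence_id": 1720915200}
revised = {"id": "rep-9", "chain_id": "rep-9", "sequence_id": 1721260800}

# Arrival order does not matter; only the sequence_id comparison does.
winners = latest_per_chain([update, first, revised, original])
```

Because the rule only compares numbers within one chain, auto-increment counters and Unix timestamps both work, provided a chain never mixes the two.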
Incorrect examples ❌
- Chain ID not kept consistent - if you send us document updates, please keep chain_id the same; otherwise we will not detect that it’s an update to an existing document and will process it as a separate document.
- Sequence ID not increasing - please treat this number as a high-water mark: the higher the number, the newer the document.
Source
Whether you’re a content aggregator, a web scraper, or a proprietary content creator, you likely have a system for organizing your content based on where it comes from and/or what it is - this is where your source taxonomy goes. Think site IDs, feed IDs, publication IDs, podcast show IDs - whatever identifiers you use to group content by source on your side.
If you don’t organize your content by where it comes from, you could consider categorizing it based on another dimension - such as document type or content theme. For example, if your data contains annual reports, interim reports, earnings call transcripts, sustainability reports… this is something you could use to organize your sources even if you get all of this content from the same website.
Why this matters: we use the concept of sources to package the content. If you are getting earnings call transcripts, filings and presentation slides from the same place, it might still be worth organizing them into three separate sources as, most likely, they will end up in different content packages at our end.
If you are unsure about what to do, you can always reach out at data@bigdata.com and we’ll be more than happy to help you!
Typically, the source ID is a UUID or an alphanumeric string.
The official canonical name for the source - we suggest you use something descriptive here. You won’t be able to change it later on.
Correct examples ✅
Timestamps
- we use ISO 8601 format in UTC: YYYY-MM-DDThh:mm:ssZ
- the created timestamp should reflect the moment the content itself came into existence - it should stay constant over time
- the published timestamp should indicate when that specific version of the document became publicly available. If you update and republish, update this timestamp accordingly.
- News article scraped from a website:
  - created = when the article first went live online
  - published = when your system scraped that version
  - If the article is updated later and you re-scrape, keep the original created value, but update published to reflect when the new version was captured.
- Earnings call transcript:
  - created = date/time when the call took place
  - published = when the transcript was generated
  - If you release a rough version right after the call and polish it later, created stays fixed, published gets updated.
- Podcast transcript:
  - created = when the episode aired
  - published = when the transcript was produced
  - Follow the same logic as with earnings call transcripts.
- SEC filing:
  - created = when it was originally filed with the SEC
  - published = when you processed the PDF and generated the structured document
  - If you later reprocess your archive (we know it happens), created remains the original filing date, published reflects the reprocessed version.
Correct example ✅
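As a hedged illustration of the timestamp rules, the sketch below formats timezone-aware datetimes as YYYY-MM-DDThh:mm:ssZ and models a re-scraped news article. The helper `to_bddf_timestamp` and the flat `created` / `published` dictionaries are assumptions made for the example. Where these fields sit inside a real BDDF file is not shown in this section.

```python
from datetime import datetime, timedelta, timezone


def to_bddf_timestamp(moment: datetime) -> str:
    """Convert an aware datetime to UTC and format it as YYYY-MM-DDThh:mm:ssZ."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before formatting")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# The article went live at 09:30 local time in a UTC+2 zone.
went_live = datetime(2024, 7, 14, 9, 30, tzinfo=timezone(timedelta(hours=2)))
created = to_bddf_timestamp(went_live)

# First scrape and a later re-scrape: created is fixed, published moves.
first_scrape = {
    "created": created,
    "published": to_bddf_timestamp(datetime(2024, 7, 14, 8, 0, tzinfo=timezone.utc)),
}
re_scrape = {
    "created": created,
    "published": to_bddf_timestamp(datetime(2024, 7, 16, 12, 0, tzinfo=timezone.utc)),
}
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently assuming local time is a common way to end up with timestamps that are off by a few hours.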
Incorrect examples ❌
Metadata
Each document can be mapped to a single primary entity. Typically, this is the company that the content is mainly about (e.g., earnings calls, press releases, job postings).
The entity ID is most useful if it comes from a structured taxonomy, because we map these IDs to our RavenPack entity universe, which improves our entity recognition and other linguistic features.
What makes sense to put in here:
- IDs from your internal Entity taxonomy - that’s something we can map to our IDs
- Public registries (e.g., LEI: Legal Entity Identifier or ISIN: International Securities Identification Number)
- Industry databases (e.g., FactSet, S&P Capital IQ, Crunchbase)
Make sure to provide the full name of the primary entity - especially when it’s a company. For example, we prefer “Apple Inc.” rather than just “Apple.”
Incorrect example ❌
- Primary entity is the same as the source ID
If you’re sending us the original file (e.g. a PDF of the annual report), enter the file name here. We’ve often seen people build logic based on filenames, so it’s good practice to have the file name inside the document as well.
If the original document is available at a public URL and you’re only sharing the link, you can leave this blank.
Correct example ✅
Please provide the URL of the original resource for the content included in the document. It can be the URL of a news article, a website, a link to a PDF resource, etc.
Please include the full URL when sharing assets like PDFs. If you’re using time-limited URLs, let us know so we can work together to find the best way for us to access the files. Especially when sending your full historical data, there’s a good chance some links might expire before we finish processing everything.
Correct example ✅
Incorrect example ❌
The IANA media type of the resource specified in document.metadata.url. This is especially relevant when the URL points to media assets like PDF documents, spreadsheets, audio or video files.
This helps us apply the proper rendering mechanisms so our users can access the original information.
Correct example ✅
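If your system only stores file names or URLs, one possible way to derive the media type is Python's standard `mimetypes` module, as in the sketch below. The helper names `media_type_for` and `looks_like_media_type` are hypothetical, and the shape check is a rough type/subtype test rather than a lookup against the IANA registry.

```python
import mimetypes
import re


def media_type_for(filename_or_url: str) -> str:
    """Guess the media type from the file extension; fail loudly if unknown."""
    guessed, _encoding = mimetypes.guess_type(filename_or_url)
    if guessed is None:
        raise ValueError(f"cannot determine media type for {filename_or_url!r}")
    return guessed


def looks_like_media_type(value: str) -> bool:
    """Rough shape check: 'type/subtype', so a bare 'pdf' is rejected."""
    return re.fullmatch(r"[a-z0-9!#$&^_.+-]+/[a-z0-9!#$&^_.+-]+", value.lower()) is not None


pdf_type = media_type_for("annual_report.pdf")
```

A check like `looks_like_media_type` would catch the most common mistake, which is sending a file extension in place of the full media type.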
Incorrect example ❌
- Using a bare file extension such as pdf instead of the IANA media type (application/pdf)
We expect the language parameter to follow the ISO 639-1 standard, using the two-character code (for example, en for English or es for Spanish).
Correct example ✅
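A quick pre-flight check for the language value might look like the sketch below. The function name `is_iso639_1_shaped` is hypothetical. It only verifies the two-letter lowercase shape and does not confirm that the code is actually assigned in ISO 639-1.

```python
import re


def is_iso639_1_shaped(code: str) -> bool:
    """True when the value is exactly two lowercase ASCII letters."""
    return re.fullmatch(r"[a-z]{2}", code) is not None


candidates = ["en", "es", "eng", "English"]
accepted = [code for code in candidates if is_iso639_1_shaped(code)]
```

Three-letter codes and full language names are rejected by this check, which matches the two-character requirement described above.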
Incorrect examples ❌
Any necessary copyright information for claiming the intellectual property for the content.
The document_type field is intended to capture the category or classification of the content being provided. This can include, but is not limited to, regulatory filings, financial reports, presentations, disclosures, or any other type of document.
Examples of document_type values might include:
- 10-K
- 8-K
- Annual Report
- Investor Presentation
- Sustainability Report
- Risk Disclosure
Note: This field is still being defined. If you have specific content types you’d like to map, please let us know. We’re actively working with providers to understand their content and define a set of values that can be used for consistent mapping.
Correct examples ✅
Additional metadata that is not modeled in the Bigdata Document Format can be included here for reference.
Custom metadata must be provided as an object, with each entry represented as a key-value pair.
Correct example ✅
Incorrect example ❌
- Providing it as one single object
Reporting Period
The time period the document refers to. This is mostly relevant for financial content like earnings reports, call transcripts, or filings. For example, if a document discusses earnings results, which period is it covering: Q1, Q2, or the full year?
Examples: Q1 2023, FY 2022
Fiscal year, for example: 2024
Fiscal period - an enum with the following allowed values: FY, H1, H2, Q1, Q2, Q3, Q4. If you leave this field empty, we’ll assume it’s FY.
Correct examples ✅
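The fiscal period rule (a fixed set of values, with FY assumed when the field is empty) can be sketched with an enum. The key names `fiscal_year` and `fiscal_period` and the helper `reporting_period` are hypothetical and chosen for illustration. Only the allowed values and the FY default come from the text above.

```python
from enum import Enum
from typing import Optional


class FiscalPeriod(str, Enum):
    FY = "FY"
    H1 = "H1"
    H2 = "H2"
    Q1 = "Q1"
    Q2 = "Q2"
    Q3 = "Q3"
    Q4 = "Q4"


def reporting_period(fiscal_year: int, fiscal_period: Optional[str] = None) -> dict:
    """Build a reporting period; an empty period falls back to FY."""
    # FiscalPeriod(...) raises ValueError for anything outside the enum.
    period = FiscalPeriod(fiscal_period) if fiscal_period else FiscalPeriod.FY
    return {"fiscal_year": fiscal_year, "fiscal_period": period.value}


q1_2023 = reporting_period(2023, "Q1")
fy_2022 = reporting_period(2022)
```

Passing a value outside the list, such as "Q5", raises a `ValueError` in this sketch instead of silently producing an invalid period.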
Incorrect example ❌
Codes
A list of identifiers or codes linked to entities mentioned in the document — such as company tickers, ISINs, LEIs, or even person-level IDs like FactSet identifiers. For example, if you know the ISINs for all companies mentioned in the document, include them here. These codes help improve our entity recognition and other linguistic features.
This is the place where you specify the type of the identifier provided, such as ISIN, CUSIP, Ticker etc.
This is where you put the actual value, for example MSFT
Correct example ✅
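As an illustration only, a codes list could pair each identifier type with its value. The key names `type` and `value` are assumptions, since this section does not show the exact field names. The optional helper `isin_check_digit_ok` applies the standard ISIN check-digit rule (Luhn over the letter-expanded digits), which can catch typos before sending.

```python
def isin_check_digit_ok(isin: str) -> bool:
    """Validate an ISIN's check digit with the Luhn algorithm."""
    isin = isin.strip().upper()
    if len(isin) != 12 or not (isin.isascii() and isin.isalnum()):
        return False
    # Letters expand to two digits (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


# Hypothetical layout: one entry per identifier, with its type and value.
codes = [
    {"type": "Ticker", "value": "MSFT"},
    {"type": "ISIN", "value": "US5949181045"},
]
valid_isins = [c["value"] for c in codes if c["type"] == "ISIN" and isin_check_digit_ok(c["value"])]
```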
Content
The content node is where the actual information of a document lives. It has two main parts:
- content.title: the title of the document. For example, the headline of a news article, the title of a PDF document, or the subject of a transcribed event.
- content.body: the main body of the content. Since a body usually has multiple pieces (paragraphs, sections, etc.), this is expressed as an array of contentBlock items.
A contentBlock represents a single piece of information inside the document. Each block tells us not only what the content is but also how it should be interpreted and where it belongs.
Title
Let’s start with one simple content block for the title. Here are the fields we expect:
The IANA media type for the block. We currently support:
- text/plain
- application/html
- text/markdown
This is the place for the title of the document
Correct examples ✅
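A title block could be assembled as in the sketch below. The key names `type` and `text` and the helper `make_title_block` are hypothetical, because this section does not show the exact field names. The list of supported media types is copied from the text above.

```python
# Media types listed as supported in this section.
SUPPORTED_MEDIA_TYPES = ("text/plain", "application/html", "text/markdown")


def make_title_block(text: str, media_type: str = "text/plain") -> dict:
    """Build a single content block for the document title."""
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported media type: {media_type!r}")
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("the title text must not be empty")
    return {"type": media_type, "text": cleaned}


title = make_title_block("  Q2 2024 Earnings Call  ")
```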
Body
Content blocks can be used to split text, with the goal of organizing content clearly into paragraphs. Please follow these guidelines to help decide when to create a new content block:
- when the blocks require different metadata (e.g., section, page, etc.)
- when the blocks use different content types (e.g., text/plain vs text/markdown)
The IANA media type for the block. We currently support:
- text/plain
- application/html
- text/markdown
This is where the text of the content block goes. Check the Tips & Tricks page for details on how to organize this field.
The role of the content block in the document. This field is optional but can help us determine its importance. For content.body.role you can use values such as NORMAL, FOOTER, or HEADER.
If you have images or other asset elements as a part of your documents, this is the place to put their URL.
Array of page numbers - typically it will be a single integer indicating the page of the original document (for example, a PDF) where this content block (paragraph) appears. If the paragraph spans pages, provide multiple page values in this array.
Correct example ✅
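The sketch below turns per-page text into one body block per paragraph. It is a rough illustration: the key names `type`, `text`, `role`, and `pages` and the helper `body_blocks_from_pages` are assumptions, and the blank-line paragraph split is just one possible convention.

```python
import re
from typing import Dict, List


def body_blocks_from_pages(page_texts: Dict[int, str]) -> List[dict]:
    """Create one block per paragraph, tagged with the page it came from."""
    blocks: List[dict] = []
    for page_number in sorted(page_texts):
        # Paragraphs are assumed to be separated by blank lines.
        for paragraph in re.split(r"\n\s*\n", page_texts[page_number]):
            paragraph = paragraph.strip()
            if paragraph:
                blocks.append({
                    "type": "text/plain",
                    "text": paragraph,
                    "role": "NORMAL",
                    "pages": [page_number],
                })
    return blocks


blocks = body_blocks_from_pages({
    1: "Revenue grew.\n\nMargins held steady.",
    2: "Outlook unchanged.",
})

# A paragraph that runs across a page break lists both pages.
spanning = {"type": "text/plain", "text": "A long paragraph.", "role": "NORMAL", "pages": [2, 3]}
```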
We’re looking for standardized section tagging that goes beyond splitting content into paragraphs. The goal is to semantically identify and label key sections that consistently appear across documents.
Below are some examples of common content types and the recurring sections typically found within them:
| Company Filings | Transcripts | Job Listings |
|---|---|---|
| Management Discussion | Question | About Employer |
| Risk Factors | Answer | Job Description |
| Forward-Looking Statements | | Benefits |
| | | Duties and Responsibilities |
If you have a hierarchical taxonomy of sections, use this field to specify the parent section.
| Document Type | Parent Section | Section |
|---|---|---|
| Job Listing | Experience | Desired Experience |
| Job Listing | Experience | Required Experience |
| Job Listing | Education | Desired Education |
| Job Listing | Education | Required Education |
| Transcript | Q&A | Question |
| Transcript | Q&A | Answer |
Correct examples ✅
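One way to keep section and parent section consistent is a lookup built from the hierarchy table. In the sketch below, the flat `section` and `parent_section` keys and the helper `tag_section` are assumptions for illustration. The actual field layout may differ.

```python
from typing import Dict

# Section -> parent section, taken from the hierarchy table in this section.
PARENT_OF: Dict[str, str] = {
    "Desired Experience": "Experience",
    "Required Experience": "Experience",
    "Desired Education": "Education",
    "Required Education": "Education",
    "Question": "Q&A",
    "Answer": "Q&A",
}


def tag_section(block: dict, section: str) -> dict:
    """Return a copy of the block labelled with its section and, if known, its parent."""
    tagged = dict(block)
    tagged["section"] = section
    parent = PARENT_OF.get(section)
    if parent is not None:
        tagged["parent_section"] = parent
    return tagged


question = tag_section({"text": "What drove the margin change?"}, "Question")
risk = tag_section({"text": "Our business is subject to risks."}, "Risk Factors")
```

Sections without a parent in the lookup, such as Risk Factors, are tagged with the section only.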
Metadata describing this content block. While we can think of many different use cases, we mainly use this field to identify LLM-generated content. We’ve introduced a generated_by property to flag if part of your content has been generated by an LLM - see an example below.
Correct examples ✅
- if this is a summary or an image description generated by an LLM, please label it like this:
content.body.section can be structured:
Correct examples ✅
Incorrect examples ❌
- Avoid using document-specific section names. We’re looking for generic, reusable labels. For example, use “About Company” instead of “About RavenPack”
- There is no requirement to distinguish the title and body via the section fields, since this is done via content.title and content.body
- Avoid stating the BDDF nodes as the section parents.

