Bigdata Document Format

This page gives you a clear overview of the Bigdata Document Format (BDDF) schema - it defines how your data should be structured and what’s required when submitting it in JSON format. Check out our Resource Hub for JSON schema downloads, sample files, and other useful resources.

Schema

schema.version

string

required

We use standard software versioning conventions, following the major.minor format.

The major version goes up when there are big changes that could break backward compatibility - like overhauled APIs or removed features. We don’t expect this to happen often… but hey, never say never.

The minor version increases when we add new features in a way that doesn’t disrupt existing integrations. You can expect to see a lot of these - especially as we expand the format with fields tailored to different content types like transcripts, podcasts, newsletters, market research, filings, and more.

We’re working on an interactive editor that will let you select a version from a dropdown and view the corresponding schema. Until that’s ready, stick with 1.3, which is our latest version (and is fully detailed on these pages), and put that into your BDDF files.

"schema": {
    "version": "1.3"
    }
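To make the major.minor convention concrete, here is a minimal Python sketch of how a producer or consumer might compare versions. The function names and the compatibility rule (same major version, minor version not newer than the supported one) are assumptions for illustration and are not defined by BDDF.

```python
def parse_schema_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' string such as '1.3' into two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(doc_version: str, supported: str = "1.3") -> bool:
    """Assumed rule: same major version, minor not newer than `supported`."""
    doc_major, doc_minor = parse_schema_version(doc_version)
    sup_major, sup_minor = parse_schema_version(supported)
    return doc_major == sup_major and doc_minor <= sup_minor
```

Comparing the parsed integers rather than the raw strings matters once a minor version reaches two digits: as text, "1.10" sorts before "1.3".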

Document

This section is all about document metadata — the key details that help describe what the document is, what it’s about, and how it should be handled.

document.id

string

required

Chances are, you already assign a unique ID to each document in your system - this is where that ID goes, no matter the format. Including it makes it much easier for us to trace, debug, or reprocess your documents if needed.

For reference, we use UUIDs on our end.

Correct examples ✅

"id": "CC1CB50246DB9E924D1390AE239B2A67"

"id": "DN20230420009740"

"id": "34649:2160555"

"id": "8c474d8f-17ed-46a1-b422-c9d5867482f2"

Revision

document.revision.chain_id

string

required

Typically the ID of the first document in the update chain - for all future updated versions of the same document, please provide the same chain_id so we know it’s an update, not a completely new document.

document.revision.sequence_id

string

required

The document sequence identifier within the chain. As you keep publishing updates, this field tells us which document version is newer: the higher the number, the newer the document. We’ve seen various ways to keep track of this, but the most common ones are a sequential auto-increment (sequence_id = 3, sequence_id = 4, sequence_id = 5…) and a Unix timestamp (sequence_id = 1720915200, sequence_id = 1721260800, sequence_id = 1721692800…). We don’t care what you use, as long as you follow the rule: the higher the number, the newer the document!

Let’s go through a few examples:

Correct examples ✅

Ideally, each document has a unique ID, and you keep the chain_id the same while increasing the sequence_id. For example, if you send us two documents:

"document": {
  "id": "6",
  "revision": {
       "chain_id": "6",
       "sequence_id": "1"
  }
},

and

"document": {
  "id": "11",
  "revision": {
       "chain_id": "6",
       "sequence_id": "2"
  }
},

We would treat the document with ID = 11 as an update to the document with ID = 6, because chain_id indicates it’s the same document chain and sequence_id indicates that document 11 is the newer one.

However, we will accept a revision with the same document ID, as long as the chain_id stays the same and sequence_id is higher. For example, your first document could look like this:

"document": {
  "id": "15",
  "revision": {
       "chain_id": "15",
       "sequence_id": "1"
  }
},

and the update like this:

"document": {
  "id": "15",
  "revision": {
       "chain_id": "15",
       "sequence_id": "2"
  }
},

Incorrect examples ❌

Chain ID not kept consistent - if you send us document updates please keep chain_id the same, otherwise we will not detect that it’s an update to an existing document but we will process it separately.
Sequence ID not sequential - please treat this number as a high watermark - the higher the number, the newer the document.
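The "higher number wins" rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not our actual processing code; the helper name latest_revisions is hypothetical, and it assumes every sequence_id is a string of digits.

```python
def latest_revisions(documents: list[dict]) -> dict[str, dict]:
    """Return the newest document for each chain_id.

    sequence_id is a string, so it is converted to int before comparing;
    compared as text, "10" would sort before "9".
    """
    latest: dict[str, dict] = {}
    for doc in documents:
        revision = doc["revision"]
        chain_id = revision["chain_id"]
        current = latest.get(chain_id)
        if current is None or (
            int(revision["sequence_id"]) > int(current["revision"]["sequence_id"])
        ):
            latest[chain_id] = doc
    return latest


docs = [
    {"id": "6", "revision": {"chain_id": "6", "sequence_id": "1"}},
    {"id": "11", "revision": {"chain_id": "6", "sequence_id": "2"}},
    {"id": "15", "revision": {"chain_id": "15", "sequence_id": "1"}},
]
newest = latest_revisions(docs)
print(newest["6"]["id"])  # prints 11
```

If an update arrives with a different chain_id, it simply lands in its own group, which is why it would be processed as a separate document.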

Source

Whether you’re a content aggregator, a web scraper, or a proprietary content creator, you likely have a system for organizing your content based on where it comes from and/or what it is - this is where your source taxonomy goes. Think site IDs, feed IDs, publication IDs, podcast show IDs - whatever identifiers you use to group content by source on your side.

If you don’t organize your content by where it comes from, you could consider categorizing it based on another dimension - such as document type or content theme. For example, if your data contains annual reports, interim reports, earnings call transcripts, and sustainability reports, you could use these categories to organize your sources even if you get all of this content from the same website.

Why this matters: we use the concept of sources to package the content. If you are getting earnings call transcripts, filings and presentation slides from the same place, it might still be worth organizing them into three separate sources as, most likely, they will end up in different content packages at our end. If you are unsure about what to do, you can always reach out at data@bigdata.com and we’ll be more than happy to help you!

document.source.id

string

required

Typically, this is a UUID or an alphanumerical string.

document.source.name

string

required

The official canonical name for the source - we suggest you use something descriptive here. You won’t be able to change it later on.

Correct examples ✅

"source": {
        "id": "Ravenpacknews",
        "name": "Ravenpack News"
        }

"source": {
        "id": "f3c2a76b-8b63-4f74-b0ce-4b9b7b8fc397",
        "name": "Ravenpack News"
        }

If you group by content type:

"source": {
        "id": "f9f6425a-5827-4745-876a-4317a97f4c6d",
        "name": "IR Website - Earnings"
        }

"source": {
        "id": "a4d83929-760e-4194-909f-f97311aa67d6",
        "name": "IR Website - Regulatory Filings"
        }

"source": {
        "id": "4c8eb41f-8976-42a9-89e8-1e212a1d759b",
        "name": "IR Website - Sustainability Reports"
        }

Timestamps

document.timestamps_utc.published

timestamp

required

document.timestamps_utc.created

timestamp

required

We’ve already covered this topic on the Quick Start Guide page so here is a quick reminder of what we wrote there:

we use ISO 8601 format in UTC: YYYY-MM-DDThh:mm:ssZ
the created timestamp should reflect the moment the content itself came into existence - it should stay constant over time
the published timestamp should indicate when that specific version of the document became publicly available. If you update and republish, update this timestamp accordingly.

Let’s walk through a few examples:

News article scraped from a website:
- created = when the article first went live online
- published = when your system scraped that version
- If the article is updated later and you re-scrape, keep the original created value, but update published to reflect when the new version was captured.
Earnings call transcript:
- created = date/time when the call took place
- published = when the transcript was generated
- If you release a rough version right after the call and polish it later, created stays fixed, published gets updated.
Podcast transcript:
- created = when the episode aired
- published = when the transcript was produced
- Follow the same logic as with earnings call transcripts.
SEC filing:
- created = when it was originally filed with the SEC
- published = when you processed the PDF and generated the structured document
- If you later reprocess your archive (we know it happens), created remains the original filing date, published reflects the reprocessed version.

We expect the timestamps to be in ISO 8601 format. Take a look at the examples below:

Correct example ✅

"timestamps_utc": {
  "published": "2021-02-20T00:00:00Z",
  "created": "2021-02-18T00:00:00Z"
}

Incorrect examples ❌

"timestamps_utc": {
  "published": "September 30, 2021, 2:45 PM",
  "created": "September 30, 2021, 2:45 PM"
}

"timestamps_utc": {
  "published": "17/10/2020 18:45",
  "created": "17/10/2020 18:45"
}

"timestamps_utc": {
  "published": "2023-05-01 12:45:00",
  "created": "2023-05-01 12:45:00"
}
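If your system stores timestamps as datetime objects or in local time, a small helper can produce and check the YYYY-MM-DDThh:mm:ssZ form. The following is a minimal Python sketch using only the standard library; the helper names are hypothetical, and the check covers only the exact pattern shown on this page.

```python
import re
from datetime import datetime, timedelta, timezone

BDDF_TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def to_bddf_timestamp(moment: datetime) -> str:
    """Convert a timezone-aware datetime to UTC and format it."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).strftime(BDDF_TIMESTAMP_FORMAT)


def is_bddf_timestamp(value: str) -> bool:
    """Check the shape first, then let strptime reject impossible dates."""
    if not _SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, BDDF_TIMESTAMP_FORMAT)
    except ValueError:
        return False
    return True


# A timestamp recorded at 14:45 in a UTC+2 timezone
local = datetime(2023, 5, 1, 14, 45, tzinfo=timezone(timedelta(hours=2)))
print(to_bddf_timestamp(local))  # prints 2023-05-01T12:45:00Z
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently assuming UTC for a value that was recorded in local time would shift the timestamp without any warning.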

Metadata

document.metadata.primary_entity_id

string

Each document can be mapped to a single primary entity. Typically, this is the company that the content is mainly about (e.g., in earnings calls, press releases, job postings).

The entity ID is most useful if it comes from a structured taxonomy, because we map it to our RavenPack entity universe, which improves our entity recognition and other linguistic features.

What makes sense to put in here:

IDs from your internal Entity taxonomy - that’s something we can map to our IDs
Public registries (e.g., LEI: Legal Entity Identifier or ISIN: International Securities Identification Number)
Industry databases (e.g., FactSet, S&P Capital IQ, Crunchbase)

document.metadata.primary_entity_name

string

Make sure to provide the full name of the primary entity - especially when it’s a company. For example, we prefer “Apple Inc.” rather than just “Apple.”

Incorrect example ❌

Primary entity is the same as the source ID

"source": {
  "id": "A94637",
  "name": "Sports Central"
},
"timestamps_utc": {
  "published": "2024-07-01T00:00:00Z",
  "created": "2024-07-01T00:00:00Z"
},
"metadata": {
  "primary_entity_id": "A94637",
  "primary_entity_name": "Sports Central"
}

document.metadata.filename

string

If you’re sending us the original file (e.g. a PDF of the annual report), enter the file name here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.

If the original document is available at a public URL and you’re only sharing the link, you can leave this blank.

Correct example ✅

"document": {
  "id": "7871fffb-7cca-4a9b-83d9-796c0hh7utfu",
  "revision": {
    "chain_id": "7871fffb-7cca-4a9b-83d9-796c0hh7utfu",
    "sequence_id": "1"
  },
  "source": {
    "id": "89032752705720",
    "name": "Source 1"
  },
  "metadata": {
    "filename": "7871fffb-7cca-4a9b-83d9-796c0hh7utfu.pdf"
  }
}

document.metadata.url

string

Please provide the URL of the original resource of the content included in the document. It can be the URL of a news article, a website, a link to a PDF resource etc. Please include the full URL when sharing assets like PDFs. If you’re using time-limited URLs, let us know so we can work together to find the best way for us to access the files. Especially when sending your full historical data, there’s a good chance some links might expire before we finish processing everything.

Correct example ✅

"metadata": {
  "url": "https://www.provider.com/the-environment-matters"
}

Incorrect example ❌

"metadata": {
      "url": "the-environment-matters"
}
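A quick way to catch relative paths before submission is to check that each URL has a scheme and a host. The sketch below uses Python's standard urllib.parse module; the helper name is hypothetical, and restricting the scheme to http and https is an assumption rather than a BDDF rule.

```python
from urllib.parse import urlparse


def is_absolute_url(url: str) -> bool:
    """Return True if url has an http(s) scheme and a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_absolute_url("https://www.provider.com/the-environment-matters"))  # prints True
print(is_absolute_url("the-environment-matters"))  # prints False
```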

document.metadata.media_type

string

The IANA media type of the resource specified in document.metadata.url. This is especially relevant when the URL points to media assets like PDF documents, spreadsheets, audio or video files.

This helps us apply the proper rendering mechanisms so our users can access the original information.

Correct example ✅

  "metadata": {
    "url": "https://my.company.com/assets/document.pdf",
    "filename": "document.pdf",
    "media_type": "application/pdf"
  }

Incorrect example ❌

Not using the IANA type but just putting in pdf:

  "metadata": {
    "url": "https://my.company.com/assets/document.pdf",
    "media_type": "pdf"
  }
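If you only have a filename or URL path, Python's standard mimetypes module can suggest the media type from the extension. This is a minimal sketch with a hypothetical helper name; the guess is based on the extension alone, so treat it as a starting point rather than a guarantee about the file's actual contents.

```python
import mimetypes
from typing import Optional


def guess_media_type(filename: str) -> Optional[str]:
    """Guess a media type from the file extension, or None if unknown."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


print(guess_media_type("document.pdf"))  # prints application/pdf
```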

document.metadata.language

string

We expect the language to be given as a two-character ISO 639-1 code. See some examples below:

Correct example ✅

"metadata": {
  "language": "en"
}

Incorrect examples ❌

"metadata": {
  "language": "english"
}

"metadata": {
  "language": "en-us"
}
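If your system stores locale tags such as en-us, you can reduce them to the two-letter form before writing the field. The Python sketch below is an assumption-laden illustration: the helper name is hypothetical, and it checks only the shape of the code, not whether it is actually registered in ISO 639-1.

```python
import re

_TWO_LETTERS = re.compile(r"[a-z]{2}")


def normalize_language(tag: str) -> str:
    """Reduce a tag such as 'en-us' or 'EN' to a lowercase two-letter code."""
    # Keep only the part before the first '-' or '_' (the language subtag)
    primary = re.split(r"[-_]", tag.strip())[0].lower()
    if not _TWO_LETTERS.fullmatch(primary):
        raise ValueError(f"cannot derive a two-letter code from {tag!r}")
    return primary


print(normalize_language("en-us"))  # prints en
```

Full language names such as "english" raise a ValueError here instead of being guessed, since mapping names to codes needs a proper lookup table.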

document.metadata.copyright

string

Any necessary copyright information for claiming the intellectual property for the content.

document.metadata.document_type

string

The document_type field is intended to capture the category or classification of the content being provided. This can include — but is not limited to — regulatory filings, financial reports, presentations, disclosures, or any other type of document.

Examples of document_type values might include:

10-K
8-K
Annual Report
Investor Presentation
Sustainability Report
Risk Disclosure

Note: This field is still being defined. If you have specific content types you’d like to map, please let us know. We’re actively working with providers to understand their content and define a set of values that can be used for consistent mapping.

Correct examples ✅

"metadata": {
  "document_type": "RNS-SEC-6-K"
}

"metadata": {
  "document_type": "RNS-annual-report"
}

document.metadata.custom

object

Additional metadata that is not modeled in the Bigdata Document Format can be included here for reference.

Custom metadata must be provided as a list of entries, with each entry represented as a key-value pair (an object with a key field and a value field).

Correct example ✅

"metadata": {
  "custom": [
    {
      "key": "expert.id",
      "value": "250"
    },
    {
      "key": "expert.name",
      "value": "John Smith"
    },
    {
      "key": "expert.title",
      "value": "President"
    }
  ]
}

Incorrect example ❌

Providing it as one single object

"metadata": {
  "data": {
    "expert": {
      "id":250,
      "name":"John Smith",
      "title":"President"
    }
  } 
}
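If your metadata already lives in a nested structure like the incorrect example above, it can be flattened into key-value entries mechanically. The following Python sketch shows one way to do it; the helper name is hypothetical, and joining nested keys with dots and converting values to strings are assumptions inferred from the correct example.

```python
def to_custom_entries(data: dict, prefix: str = "") -> list[dict]:
    """Flatten a nested dict into [{"key": ..., "value": ...}] entries."""
    entries: list[dict] = []
    for name, value in data.items():
        # Build dotted keys such as "expert.id" for nested values
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            entries.extend(to_custom_entries(value, key))
        else:
            entries.append({"key": key, "value": str(value)})
    return entries


nested = {"expert": {"id": 250, "name": "John Smith", "title": "President"}}
custom = to_custom_entries(nested)
print(custom[0])  # prints {'key': 'expert.id', 'value': '250'}
```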

Reporting Period

document.metadata.period

string

The time period the document refers to. This is mostly relevant for financial content like earnings reports, call transcripts, or filings. For example, if a document discusses earnings results, which period is it covering - Q1, Q2, or the full year?

Examples:

Q1 2023
FY 2022

document.metadata.reporting_period.fiscal_year

int

Fiscal year, for example: 2024

document.metadata.reporting_period.fiscal_period

enum

Fiscal period - an enum with the following allowed values: FY, H1, H2, Q1, Q2, Q3, Q4.

If you leave this field empty, we’ll assume it’s FY.

Correct examples ✅

"metadata": {
  "reporting_period": [
    {
      "fiscal_year": "2023",
      "fiscal_period": "Q1"
    }
  ]
}

"metadata": {
  "reporting_period": [
    {
      "fiscal_year": "2023",
      "fiscal_period": "FY"
    }
  ]
}

Incorrect example ❌

"metadata": {
  "reporting_period": [
    {
      "fiscal_year": "2023 Q1"
    }
  ]
}
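If your system stores combined labels like "2023 Q1", they need to be split into the two fields before submission. Here is a minimal Python sketch of that split; the function name is hypothetical, and it emits fiscal_year as an integer to follow the int type declared for the field, so adjust it if your integration quotes the year as a string like the examples above.

```python
import re

FISCAL_PERIODS = {"FY", "H1", "H2", "Q1", "Q2", "Q3", "Q4"}


def parse_reporting_period(label: str) -> dict:
    """Split a label like 'Q1 2023' or '2023 Q1' into year and period.

    A bare year such as '2023' falls back to 'FY', mirroring the default
    described for an empty fiscal_period.
    """
    year = None
    period = "FY"
    for token in label.upper().split():
        if re.fullmatch(r"\d{4}", token):
            year = int(token)
        elif token in FISCAL_PERIODS:
            period = token
        else:
            raise ValueError(f"unrecognised token {token!r} in {label!r}")
    if year is None:
        raise ValueError(f"no four-digit year found in {label!r}")
    return {"fiscal_year": year, "fiscal_period": period}


print(parse_reporting_period("2023 Q1"))  # prints {'fiscal_year': 2023, 'fiscal_period': 'Q1'}
```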

Codes

document.metadata.codes

string

A list of identifiers or codes linked to entities mentioned in the document — such as company tickers, ISINs, LEIs, or even person-level IDs like FactSet identifiers. For example, if you know the ISINs for all companies mentioned in the document, include them here. These codes help improve our entity recognition and other linguistic features.

document.metadata.codes.type

string

This is the place where you specify the type of the identifier provided, such as ISIN, CUSIP, Ticker etc.

document.metadata.codes.value

string

This is where you put the actual value, for example MSFT

Correct example ✅

"metadata": {
  "codes": [
    {
      "type": "Ticker",
      "value": "RP"
    },
    {
      "type": "ISIN",
      "value": "123456"
    }
  ]
}
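When identifiers are stored per type on your side, the codes list can be generated with a short loop. This Python sketch assumes a hypothetical input shape (a dict mapping each code type to a list of values) and reuses the placeholder values from the example above.

```python
def build_codes(identifiers: dict[str, list[str]]) -> list[dict]:
    """Turn {"Ticker": ["RP"], ...} into a list of type/value entries."""
    codes = []
    for code_type, values in identifiers.items():
        for value in values:
            cleaned = value.strip()
            if cleaned:  # skip blank values instead of emitting empty codes
                codes.append({"type": code_type, "value": cleaned})
    return codes


codes = build_codes({"Ticker": ["RP", " "], "ISIN": ["123456"]})
print(len(codes))  # prints 2
```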

Content

The content node is where the actual information of a document lives. The content node has two main parts:

content.title

contentBlock

The title of the document. For example, the headline of a news article, the title of a PDF document, or the subject of a transcribed event.

content.body

contentBlock[]

The main body of the content. Since a body usually has multiple pieces (paragraphs, sections, etc.), this is expressed as an array of contentBlock items.

A <contentBlock> represents a single piece of information inside the document. Each block tells us not only what the content is but also how it should be interpreted and where it belongs.

Title

Let’s start with one simple content block for the title. Here are the fields we expect:

content.title.content_type

string

required

The IANA media type for the block. We currently support:

text/plain
application/html
text/markdown

This lets us parse the content properly depending on its format.

content.title.value

string

required

This is the place for the title of the document

Correct examples ✅

"content": {
  "title": {
    "content_type": "text/plain",
    "value": "This is a Title",
    "role": "HEADING"
  }
}

"content": {
  "title": {
    "content_type": "text/plain",
    "value": "This is a Title",
    "role": "HEADING",
    "url": "https://www.example.com/logo.png"
  }
}

Body

Content blocks can be used to split text, with the goal of organizing content clearly into paragraphs. Please follow these guidelines to help decide when to create a new content block:

when the blocks require different metadata (e.g., section, page, etc.)
when the blocks use different content types (e.g., text/plain vs text/markdown)

Here are the fields we would expect in this node:

content.body.content_type

string

required

The IANA media type for the block. We currently support:

text/plain
application/html
text/markdown

This lets us parse the content properly depending on its format.

content.body.value

string

required

This is where text of the content block goes. Check the Tips & Tricks page for details on how to organize this field.

content.body.role

string

The role of the content block in the document. This field is optional but can help us determine its importance.

For content.body.role you can use values such as NORMAL, FOOTER, or HEADER.

content.body.url

string

If you have images or other asset elements as a part of your documents, this is the place to put their URL.

content.body.pages

string[]

Array of page numbers - typically it will be a single int indicating the page of the original document (for example PDF) where this content block (paragraph) is. In case the paragraph spans pages, provide multiple page values in this array.

Correct example ✅

"pages": [
  7,
  8
]

content.body.section

object

We’re looking for standardized section tagging that goes beyond splitting content into paragraphs. The goal is to semantically identify and label key sections that consistently appear across documents.

Below are some examples of common content types and the recurring sections typically found within them:

Company Filings	Transcripts	Job Listings
Management Discussion	Question	About Employer
Risk Factors	Answer	Job Description
Forward-Looking Statements		Benefits
		Duties and Responsibilities

Content block section node has the following fields:

content.body.section.name

string

content.body.section.parents

string[]

If you have a hierarchical taxonomy of sections, use this field to specify the parent section.

Here is an example of a section hierarchy:

Document Type	Parent Section	Section
Job Listing	Experience	Desired Experience
		Required Experience
	Education	Desired Education
		Required Education
Transcript	Q&A	Question
		Answer

Check the example below to see how it could be constructed for a transcript document.

Correct examples ✅

"body": [
  {
    "content_type": "text/plain",
    "value": "What drove revenue growth this quarter?",
    "role": "NORMAL",
    "section": {
      "name": "question",
      "parents": ["Q&A"]
    }
  },
  {
    "content_type": "text/plain",
    "value": "Revenue growth was primarily driven by strong demand in our core markets and improved operational efficiency.",
    "role": "NORMAL",
    "section": {
      "name": "answer",
      "parents": ["Q&A"]
    }
  }
]

content.body.section.metadata

object

Metadata describing this content block - while we can think of many different use cases, we mainly use this field to identify LLM generated content. We’ve introduced a generated_by property to flag if part of your content has been generated by an LLM - see an example below.

Correct examples ✅

If this is a summary or an image description generated by an LLM, please label it like this:

  "metadata": {
    "generated_by": "LLM"
  }

Here are a few examples of how content.body.section can be structured:

Correct examples ✅

"body": [
  {
    "content_type": "text/plain",
    "value": "Example Corp is a global company focused on delivering innovative solutions across a range of industries.",
    "role": "NORMAL",
    "section": {
      "name": "About Company"
    }
  }
]

"body": [
  {
    "content_type": "text/plain",
    "value": "This document is Example Corp's Form S-4 registration statement filed with the SEC on September 25, 2025, detailing an exchange offer to register and exchange certain outstanding senior notes for new registered notes with substantially the same terms, including terms of the exchange, resale restrictions, and related legal and financial information.",
    "role": "NORMAL",
    "section": {
      "name": "Filing Description"
    }
  }
]

"body": [
  {
    "content_type": "text/plain",
    "value": "What drove revenue growth this quarter?",
    "role": "NORMAL",
    "section": {
      "name": "Question",
      "parents": ["Q&A"]
    }
  },
  {
    "content_type": "text/plain",
    "value": "Revenue growth was primarily driven by strong demand in our core markets and improved operational efficiency.",
    "role": "NORMAL",
    "section": {
      "name": "Answer",
      "parents": ["Q&A"]
    }
  }
]
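When generating many blocks programmatically, a small builder keeps the structure consistent with the examples above. The following is a minimal Python sketch, not an official client: make_block and its parameters are hypothetical names, and the set of content types is copied from the list given for content.body.content_type.

```python
SUPPORTED_CONTENT_TYPES = {"text/plain", "application/html", "text/markdown"}


def make_block(value, section_name, parents=None, pages=None,
               content_type="text/plain", role="NORMAL"):
    """Build one contentBlock dict, leaving out optional fields that are unset."""
    if content_type not in SUPPORTED_CONTENT_TYPES:
        raise ValueError(f"unsupported content_type: {content_type}")
    block = {"content_type": content_type, "value": value, "role": role}
    section = {"name": section_name}
    if parents:
        section["parents"] = list(parents)
    block["section"] = section
    if pages:
        block["pages"] = list(pages)
    return block


body = [
    make_block("What drove revenue growth this quarter?", "Question", ["Q&A"]),
    make_block("Strong demand in our core markets.", "Answer", ["Q&A"], pages=[7, 8]),
]
print(body[0]["section"])  # prints {'name': 'Question', 'parents': ['Q&A']}
```

Passing generic labels such as "Question" and "Answer" through a single helper also makes it easier to keep section names reusable across documents.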

Incorrect examples ❌

Avoid using document-specific section names. We’re looking for generic, reusable labels. For example, use “About Company” instead of “About RavenPack”.

"body": [
  {
    "content_type": "text/plain",
    "value": "RavenPack is a leading provider of real-time analytics and data solutions for financial institutions worldwide. Headquartered in Marbella, Spain, the company leverages AI and natural language processing to turn unstructured data into actionable insights for quantitative and discretionary investors.",
    "section": {
      "name": "About RavenPack"
    }
  }
]

There is no requirement to distinguish the title and body via the section fields, since this is done via content.title and content.body

"body": [
  {
    "content_type": "text/plain",
    "value": "RavenPack is a leading provider of real-time analytics and data solutions for financial institutions worldwide. Headquartered in Marbella, Spain, the company leverages AI and natural language processing to turn unstructured data into actionable insights for quantitative and discretionary investors.",
    "section": {
      "name": "job-listing-body"
    }
  }
]

Stating the BDDF nodes as the section parents

"body": [
  {
    "content_type": "text/plain",
    "value": "RavenPack is a leading provider of real-time analytics and data solutions for financial institutions worldwide. Headquartered in Marbella, Spain, the company leverages AI and natural language processing to turn unstructured data into actionable insights for quantitative and discretionary investors.",
    "section": {
      "name": "body",
      "parents": [
        "document",
        "content"
      ]
    }
  }
]

Next Steps

Check out our Resource Hub for handy tips, JSON schema downloads, sample files, and other resources to help you make the most of your experience. If you haven’t already, check out the Validation Steps page to learn more about how we’ll analyze and validate your content.
