
Why It Matters

In commodity markets like crude oil, timely awareness of trending news can directly influence trading strategies, hedging decisions, and market forecasts. As news cycles accelerate, being able to identify not just what is being talked about but also its potential market impact is essential for staying ahead of price-moving developments.

What It Does

This workflow identifies, verifies, clusters, and summarizes the most relevant and impactful news trends in the crude oil market using the Bigdata API for retrieval and large language models for topic analysis. It produces both daily market reports and structured datasets that can be used for monitoring or backtesting.

How It Works

The notebook implements a four-step agentic workflow built on the Bigdata API:
  • Lexicon Generation of industry-specific jargon to maximize recall in news retrieval
  • Content Retrieval via the Bigdata API, splitting searches into daily windows and parallelizing keyword lookups for speed
  • Topic Clustering & Selection to verify, group, and summarize news into ranked trending topics, scoring each for trendiness, novelty, impact, and magnitude
  • Custom Report Generation in the form of a daily digest with a configurable ranking system, supported by granular news sources for verification

A Real-World Use Case

This cookbook illustrates the full workflow through a practical example: tracking daily stories in the crude oil market during the Israel-Iran tensions of June 2025. You’ll learn how to transform unstructured news into structured, ranked insights on market-moving narratives. Ready to get started? Let’s dive in!

Prerequisites

To run the Daily Digest Crude Oil workflow, you can choose between two options:
  • 💻 GitHub cookbook
    • Use this if you prefer working locally or in a custom environment.
    • Follow the setup and execution instructions in the README.md.
    • API keys are required:
      • Option 1: Follow the key setup process described in the README.md
      • Option 2: Refer to this guide: How to initialise environment variables
        • ❗ When using this method, you must manually add the OpenAI API key:
          # OpenAI credentials
          OPENAI_API_KEY = "<YOUR_OPENAI_API_KEY>"
          
  • 🐳 Docker Installation
    • A Docker setup is available as an alternative to a local installation.
    • It packages the environment in a container, which simplifies configuration for those who prefer Docker-based solutions.
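
Whichever option you choose, the notebook expects an OPENAI_API_KEY variable to exist before any LLM step runs. The following is a minimal sketch of one way to load it from the environment and fail early when it is missing. The helper name read_api_key and the placeholder key are illustrative assumptions, not part of the cookbook.

```python
import os


def read_api_key(name, env=None):
    """Return the key stored under `name`, raising early if it is missing.

    `read_api_key` is a hypothetical helper written for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value


# Demonstration with an explicit mapping and a placeholder value.
# In the notebook you would call read_api_key("OPENAI_API_KEY") instead,
# which reads from the real process environment.
OPENAI_API_KEY = read_api_key("OPENAI_API_KEY", {"OPENAI_API_KEY": "<YOUR_OPENAI_API_KEY>"})
```

Failing at load time keeps a missing key from surfacing later as an opaque authentication error in the middle of a long run.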

Setup and Imports

Below is the Python code required for setting up our environment and importing necessary libraries.
import os

from bigdata_client import Bigdata
from bigdata_client.models.search import DocumentType
from src.lexicon_generator import LexiconGenerator
from src.search_topics import search_by_keywords
from src.topics_extractor import (
    process_all_reports,
    run_process_all_trending_topics,
    run_add_advanced_novelty_scores,
    add_market_impact_to_df,
)
from src.report_generator import (
    prepare_data_for_report,
    generate_html_report,
    save_html_report,
)
from IPython.display import display
from IPython.core.display import HTML

# Define output file paths for our results
output_dir = "report"
os.makedirs(output_dir, exist_ok=True)

export_path = f"{output_dir}/daily_digest_crude_oil.csv"

Defining Your Daily Digest Context and Parameters

To perform a trending topics analysis, we need to define several key parameters:
  • Main Theme (main_theme): The main topic, asset class, or context to analyze (e.g. Crude Oil)
  • Point of View (point_of_view): Additional instructions for the LLM-based impact and magnitude scoring. Use this parameter to add your domain expertise and your own definition of financial materiality
  • Time Period (start_query and end_query): The date range over which to run the search
  • Document Type (document_type): Specify which documents to search over (transcripts, filings, news)
  • Model Selection (llm_model): The AI model used for semantic analysis and topic classification
  • Frequency (frequency): The frequency of the date ranges to search over. Supported values:
    • Y: Yearly intervals
    • M: Monthly intervals
    • W: Weekly intervals
    • D: Daily intervals (default)
  • Document Limit (document_limit): The maximum number of documents to return per query to Bigdata API
# ===== Context Definition =====
main_theme = 'Crude Oil'

point_of_view = "a crude oil trader, where price is influenced by supply-demand dynamics, geopolitical events, and market sentiment. \
As a trader, you focus on changes in production, inventories, and economic indicators from key markets."

# ===== Specify Time Range =====
start_query = '2025-06-20'
end_query = '2025-06-27'

# ===== Query Configuration =====
document_type = DocumentType.NEWS

# ===== LLM Specification =====
llm_model = "gpt-4o-mini"

# ===== Query Configuration =====
document_limit = 10 # Maximum number of documents to retrieve per query
frequency = 'D'  # Query frequency
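
To make the frequency setting concrete, here is a minimal sketch of how a date range could be split into daily windows. It uses only the standard library. The function daily_windows is hypothetical, and treating the end date as exclusive is an assumption; the cookbook's own search code may handle boundaries differently.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (window_start, window_end) ISO date pairs, one per day.

    Assumption for this sketch: `end` is exclusive.
    """
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current < last:
        following = current + timedelta(days=1)
        yield current.isoformat(), following.isoformat()
        current = following


windows = list(daily_windows("2025-06-20", "2025-06-27"))
print(len(windows))  # 7
print(windows[0])    # ('2025-06-20', '2025-06-21')
```

With document_limit applied per query, smaller windows raise the total number of documents that can be retrieved over the same period, at the cost of more API calls.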

Instantiating the Lexicon Generator

In this step, we generate the industry-specific jargon relevant to the crude oil market to ensure high recall in content retrieval.
lexicongenerator = LexiconGenerator(openai_key=OPENAI_API_KEY, model="gpt-4o", seeds=[123, 123456, 123456789, 456789, 789])

keywords = lexicongenerator.generate(theme=main_theme)
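
The list of seeds suggests that the generator is run several times and the outputs are combined. How LexiconGenerator merges them internally is not shown here, so the following is only an illustrative sketch of one plausible approach: a case-insensitive, order-preserving union. The function merge_keyword_runs and the sample keywords are assumptions.

```python
def merge_keyword_runs(runs):
    """Combine several keyword lists, dropping case-insensitive duplicates.

    Hypothetical helper: the first spelling seen for each keyword is kept.
    """
    seen = set()
    merged = []
    for run in runs:
        for keyword in run:
            cleaned = keyword.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


# Made-up outputs from two runs with different seeds.
runs = [["Brent", "WTI", "OPEC+"], ["wti", "crack spread", "Brent "]]
print(merge_keyword_runs(runs))  # ['Brent', 'WTI', 'OPEC+', 'crack spread']
```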

Content Retrieval from Bigdata Search API

In this section, we run a keyword search over news content with the Bigdata API. The search is split into daily windows, and the lookups for individual keywords run in multiple threads for speed. Given the list of market-specific keywords, you can use the search functionality in bigdata-research-tools, which is built on the Bigdata API, to run searches at scale against news documents.
results, daily_keyword_count = search_by_keywords(
    keywords=keywords,
    start_date=start_query,
    end_date=end_query,
    scope=document_type,
    freq=frequency,
    document_limit=document_limit)
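
The fan-out described above can be sketched with a thread pool that runs one request per (keyword, window) pair. This is not the implementation of search_by_keywords. The names search_all and fake_search are hypothetical, and fake_search stands in for a real API call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fake_search(keyword, window):
    """Stand-in for a real API call: returns one synthetic hit per request."""
    start, end = window
    return [{"keyword": keyword, "start": start, "end": end}]


def search_all(keywords, windows, search_fn, max_workers=8):
    """Run search_fn for every (keyword, window) pair in a thread pool."""
    jobs = list(product(keywords, windows))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves job order, so results line up with `jobs`.
        batches = pool.map(lambda job: search_fn(*job), jobs)
        return [hit for batch in batches for hit in batch]


hits = search_all(
    ["Brent", "OPEC+"],
    [("2025-06-20", "2025-06-21"), ("2025-06-21", "2025-06-22")],
    fake_search,
)
print(len(hits))  # 4
```

Threads suit this workload because each request spends most of its time waiting on the network rather than using the CPU.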

Topic Clustering and Summarization

In this step, we use a large language model to verify the retrieved news and cluster it into topics. A summarization step then selects the top trending topics for crude oil, and advanced analytics quantify the trendiness, novelty, impact, and magnitude of each topic.
semaphore_size = 1000 # Maximum number of concurrent requests to OpenAI API

# Apply verification layer to remove irrelevant news
filtered_reports = process_all_reports(results, llm_model, OPENAI_API_KEY, main_theme, semaphore_size)

# Perform topic modeling and clustering
flattened_trending_topics_df = run_process_all_trending_topics(
    unique_reports=filtered_reports,
    model=llm_model,
    start_query=start_query,
    end_query=end_query,
    api_key=OPENAI_API_KEY,
    main_theme=main_theme,
    batches=20 # Number of batches to process the reports in parallel
)
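
The semaphore_size parameter caps how many LLM requests are in flight at once. A minimal sketch of that pattern with asyncio.Semaphore follows. The helpers bounded_gather and fake_verify are hypothetical and only imitate a verification call; the cookbook's own functions may be structured differently.

```python
import asyncio


async def bounded_gather(items, worker, limit):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_verify(report):
    """Stand-in for an LLM relevance check: keeps reports mentioning oil."""
    await asyncio.sleep(0)
    return "oil" in report.lower()


reports = ["Oil prices jump", "Football results", "OPEC cuts oil output"]
flags = asyncio.run(bounded_gather(reports, fake_verify, limit=2))
print(flags)  # [True, False, True]
```

Inside a Jupyter notebook an event loop is already running, so you would write `flags = await bounded_gather(reports, fake_verify, limit=2)` instead of calling asyncio.run. A limit that exceeds your provider's rate limits can lead to throttled requests, so a lower value may be more reliable.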

Topic Scoring

Trendiness and Novelty Scores: We derive analytics related to the trendiness of the topic based on the news volume, and the novelty of the topic based on the changes in daily summaries.
flattened_trending_topics_df = run_add_advanced_novelty_scores(flattened_trending_topics_df, api_key=OPENAI_API_KEY, main_theme=main_theme)
Financial Materiality: We derive analytics related to the impact (Positive, Negative) and magnitude (High, Medium, Low) of the topics, inferring their market impact on crude oil prices.
flattened_trending_topics_df = add_market_impact_to_df(flattened_trending_topics_df, api_key=OPENAI_API_KEY, main_theme=main_theme, point_of_view=point_of_view)
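
To build intuition for the trendiness and novelty scores, here is a deliberately simple sketch. It is a stand-in, not the cookbook's method, which relies on an LLM. Here trendiness is assumed to be a topic's share of the day's documents, and novelty is assumed to be one minus the Jaccard word overlap between two consecutive daily summaries. Both function names and the sample inputs are illustrative.

```python
def trendiness(topic_docs, total_docs):
    """Share of the day's documents that belong to the topic."""
    return topic_docs / total_docs if total_docs else 0.0


def novelty(today, yesterday):
    """1 minus the Jaccard word overlap of two summaries (1.0 = entirely new)."""
    a = set(today.lower().split())
    b = set(yesterday.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


print(trendiness(12, 48))  # 0.25
print(novelty("ceasefire eases supply fears", "ceasefire eases supply fears"))  # 0.0
```

A word-overlap measure misses paraphrases, which is one reason an LLM-based comparison of summaries can be preferable in practice.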

Generate a Custom Daily Digest

In this step, we rank the topics. You can customize the ranking system to reorder the news by trendiness (news volume), novelty, and financial materiality (magnitude of the impact on crude oil prices).
specific_date = '2025-06-25'  # Example date, can be modified as needed
user_selected_ranking = ['novelty', 'volume', 'magnitude']  # User can modify this list
# impact_filter = 'positive_impact'  # Optional: use impact_filter to filter the report by impact

prepared_reports = prepare_data_for_report(flattened_trending_topics_df, user_selected_ranking, impact_filter=None, report_date=specific_date)

# Generate and display the HTML report for each date
for report in prepared_reports:
    html_content = generate_html_report(
        report['date'],
        report['day_in_review'],
        report['topics'],
        f'Daily {main_theme} market update'
    )
    display(HTML(html_content))
    save_html_report(html_content, report['date'], main_theme)
[Image: daily digest example]
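
The configurable ranking can be pictured as a multi-key sort in which the order of the user's list sets the priority of each criterion. The sketch below is illustrative only. The function rank_topics, the numeric mapping for magnitude, and the sample topics are assumptions; prepare_data_for_report may rank differently.

```python
MAGNITUDE_ORDER = {"High": 3, "Medium": 2, "Low": 1}  # assumed ordering


def rank_topics(topics, ranking):
    """Sort topics by the fields in `ranking`, highest values first."""

    def sort_key(topic):
        key = []
        for field in ranking:
            value = topic[field]
            if field == "magnitude":
                value = MAGNITUDE_ORDER.get(value, 0)
            key.append(-value)  # negate so larger values sort first
        return tuple(key)

    return sorted(topics, key=sort_key)


# Made-up topics for demonstration only.
topics = [
    {"topic": "Strait of Hormuz risk", "novelty": 0.9, "volume": 40, "magnitude": "High"},
    {"topic": "US inventory draw", "novelty": 0.9, "volume": 55, "magnitude": "Medium"},
    {"topic": "OPEC+ quota talk", "novelty": 0.4, "volume": 70, "magnitude": "Low"},
]
ranked = rank_topics(topics, ["novelty", "volume", "magnitude"])
print([t["topic"] for t in ranked])
# ['US inventory draw', 'Strait of Hormuz risk', 'OPEC+ quota talk']
```

Later criteria only break ties among topics that are equal on all earlier ones, so reordering the list can change the digest substantially.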

Export the Results

Export the data as a CSV file for further analysis or to share with the team.
flattened_trending_topics_df.to_csv(export_path, index=False)

Conclusion

The Daily Digest provides a comprehensive framework for identifying and quantifying trending topics in the crude oil market. By leveraging advanced information retrieval and LLM-powered analysis, this workflow transforms unstructured data into actionable market intelligence. Through the automated analysis of crude oil market dynamics, you can:
  1. Identify trending topics - Discover the most relevant and impactful news trends in the crude oil market through systematic analysis
  2. Assess market impact - Use scoring methodology to evaluate the potential impact and magnitude of news developments on crude oil prices
  3. Generate daily reports - Create professional HTML reports with ranked topics and comprehensive market summaries
  4. Export structured data - Obtain structured datasets for backtesting and further quantitative analysis
Whether you’re conducting market analysis, building trading strategies, or monitoring commodity exposure, the Daily Digest automates the research process while providing the depth required for professional market analysis.