How To Use In-Pipeline Data Aggregations for Smarter Document Analytics

By Dherik Barison
May 2025 | 15 min. read
At Datastreamer, we’re continually evolving our platform to help you extract more insights from your data. One of the most impactful capabilities we offer is support for in-pipeline data aggregations, a method that lets you group and analyze documents in real time, directly within your pipelines using the Datastreamer Searchable Storage Ingress component.
This enables faster, more efficient document analytics without relying on external systems.
What Are Aggregations?
Aggregations are powerful analytical operations that summarize large volumes of data without needing to return every individual document. Rather than sifting through thousands—or even millions—of entries, aggregations group data and run calculations across those groups to uncover trends, patterns, and summaries.
Here’s what you can do with them:
Understand Trends: Track how sentiment around your brand shifts over time.
Identify Patterns: Discover which topics most frequently appear alongside mentions of your product.
Summarize Data: Reveal the top categories, terms, or users across a dataset.
Some commonly used aggregation types include:
Terms Aggregation: Identify the most frequent values in a dataset, such as top hashtags or the most active users.
Date Histogram Aggregation: Organize data by time intervals to highlight trends—like daily post volumes or weekly changes in sentiment.
Significant Terms Aggregation: Pinpoint unusually common terms in a subset of your data when compared to the overall dataset. For instance, this can help detect spikes in mentions tied to emerging issues or viral events.
The Power of Aggregations Inside Your Pipeline
The real value of aggregations becomes clear when you use them as a starting point within a Datastreamer pipeline. Instead of simply returning results, the summarized output—also known as “buckets”—can be routed directly into the next stage of your pipeline for further processing, transformation, or action.
Imagine this:
You’re analyzing social media posts about BrandX. The pipeline first aggregates sentiment scores, then routes the results in two directions:
- A custom function evaluates whether there’s a spike in negative sentiment.
- A Firehose Egress sends daily topic summaries to your analytics platform.
If a sentiment spike is detected, a Webhook Egress triggers an alert in real time.
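As a sketch of that first branch, the spike check inside a custom function could compare the latest window’s average sentiment against the preceding baseline. The function name, bucket shape, and 30% drop threshold below are illustrative assumptions, not part of the Datastreamer API:

```python
# Illustrative sketch of a sentiment-spike check a custom function might run.
# Buckets are assumed to follow a date_histogram aggregation with an
# avg_sentiment sub-aggregation; the 30% drop threshold is an assumption.

def detect_sentiment_spike(buckets, drop_threshold=0.3):
    """Return True if the latest bucket's average sentiment fell by more
    than drop_threshold relative to the mean of the earlier buckets."""
    if len(buckets) < 2:
        return False
    history = [b["avg_sentiment"]["value"] for b in buckets[:-1]]
    baseline = sum(history) / len(history)
    latest = buckets[-1]["avg_sentiment"]["value"]
    return baseline > 0 and (baseline - latest) / baseline > drop_threshold


buckets = [
    {"avg_sentiment": {"value": 0.80}},
    {"avg_sentiment": {"value": 0.78}},
    {"avg_sentiment": {"value": 0.40}},
]
print(detect_sentiment_spike(buckets))  # → True (0.40 is well below the 0.79 baseline)
```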

By using aggregations inside your pipelines, you can:
- Automate Complex Reporting: Create daily summaries of critical metrics and route them to different teams or platforms.
- Trigger Proactive Actions: If a sudden spike in negative sentiment or a specific event is detected, your pipeline can automatically send alerts to the appropriate teams.
- Enrich Summarized Data: Pass aggregation results into custom functions or AI classifiers for deeper, more actionable insights.
These capabilities turn your pipeline into a dynamic system that doesn’t just analyze data; it acts on it.
Real-World Use Cases with Social Media Data
Here are some example use cases for leveraging the Datastreamer Searchable Storage Ingress. These examples assume you have already completed the following setup steps:
- Created a Datastreamer Searchable Storage using the Searchable Storage Egress component. This component ingests your data into the configured Storage instance.
- Populated the storage with data using the unified schema, made available through our Unify Transformer.
Keep in mind that your actual storage content may differ based on your specific data sources and transformation pipelines. Always confirm which fields are available by checking your instance in the Datastreamer Portal before implementing any of the examples below.
Across all examples, the pipeline follows a similar structure: the aggregation is performed against your Datastreamer Searchable Storage, and the results are routed through the Webhook Egress component, which then sends the output to an external API for further use.
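In code terms, that hand-off amounts to posting the aggregation result to an HTTP endpoint. The endpoint URL below is a placeholder; the payload shape would match whatever your Webhook Egress target expects:

```python
import json
import urllib.request

# Hypothetical sketch: package aggregation buckets as a JSON POST request,
# the way a Webhook Egress forwards results to an external API.
# The URL is a placeholder, not a real endpoint.

def build_webhook_request(buckets, url="https://example.com/hooks/aggregations"):
    body = json.dumps({"buckets": buckets}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# req = build_webhook_request([{"key": "sustainability", "doc_count": 2541}])
# urllib.request.urlopen(req)  # actually sends the POST
```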

1. Trending Hashtag Analysis
Scenario: Identify the top five trending hashtags in Instagram posts related to sustainability.
Configuration:
- Lucene query:
content.body:sustainability
- Aggregation JSON:
{
  "top_hashtags": {
    "terms": {
      "field": "content.hashtags.keyword",
      "size": 5
    }
  }
}
This configuration uses an Elasticsearch terms aggregation on the content.hashtags.keyword field. It identifies the five most frequent hashtags by counting exact matches across documents. The results are sorted by frequency, giving a clear view of the most prominent tags in the dataset.
The aggregation result might look like this:
{
  "buckets": [
    {"key": "sustainability", "doc_count": 2541},
    {"key": "eco-friendly", "doc_count": 1893},
    {"key": "green-living", "doc_count": 1672},
    {"key": "climate-action", "doc_count": 1421},
    {"key": "circulareconomy", "doc_count": 1289}
  ]
}
This output shows which sustainability-related hashtags are trending the most. For example, “sustainability” appeared in 2,541 posts, making it the most popular tag in the dataset. This type of aggregation helps you quickly surface what environmental topics are resonating with your audience right now.
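Downstream stages consume these buckets as plain JSON. A minimal sketch of pulling out the hashtag names in ranked order, using the bucket shape from the sample above:

```python
# Sketch: extract an ordered list of hashtags from a terms-aggregation
# result. Buckets arrive already sorted by doc_count, highest first.

def top_hashtags(result):
    return [bucket["key"] for bucket in result["buckets"]]


result = {
    "buckets": [
        {"key": "sustainability", "doc_count": 2541},
        {"key": "eco-friendly", "doc_count": 1893},
    ]
}
print(top_hashtags(result))  # → ['sustainability', 'eco-friendly']
```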
2. Sentiment Over Time Tracking
Scenario: Track weekly sentiment trends for a brand based on social media mentions.
Configuration:
- Lucene query:
content.body:BrandX
- Aggregation JSON:
{
  "weekly_sentiment": {
    "date_histogram": {
      "field": "content.published",
      "calendar_interval": "week"
    },
    "aggs": {
      "avg_sentiment": {
        "avg": {
          "field": "enrichment.sentiment.confidence"
        }
      }
    }
  }
}
This aggregation uses the date_histogram function to break content into weekly time intervals based on the content.published field. It then calculates the average sentiment confidence for each week using values from the enrichment.sentiment.confidence field.
By doing so, you generate a weekly sentiment trendline—one that helps you track how public perception of a brand shifts over time.
Sample output:
{
  "buckets": [
    {
      "key_as_string": "2024-05-01",
      "avg_sentiment": { "value": 0.78 },
      "doc_count": 142
    },
    {
      "key_as_string": "2024-05-08",
      "avg_sentiment": { "value": 0.62 },
      "doc_count": 89
    }
  ]
}
This output reveals how sentiment is changing week by week. For example, from May 1–7, 142 mentions showed an average sentiment score of 78% positive. In the following week, that dropped to 62%, based on 89 mentions. This kind of shift could indicate an issue worth investigating—such as a product concern or a change in public messaging.
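The week-over-week change described above can be computed directly from the buckets. A minimal sketch, assuming the bucket shape from the sample output:

```python
# Sketch: compute week-over-week sentiment change from date_histogram
# buckets shaped like the sample output above.

def weekly_changes(buckets):
    """Return (week, fractional_change_vs_previous_week) pairs."""
    pairs = []
    for prev, cur in zip(buckets, buckets[1:]):
        prev_val = prev["avg_sentiment"]["value"]
        cur_val = cur["avg_sentiment"]["value"]
        pairs.append((cur["key_as_string"], (cur_val - prev_val) / prev_val))
    return pairs


buckets = [
    {"key_as_string": "2024-05-01", "avg_sentiment": {"value": 0.78}, "doc_count": 142},
    {"key_as_string": "2024-05-08", "avg_sentiment": {"value": 0.62}, "doc_count": 89},
]
print(weekly_changes(buckets))  # sentiment fell about 20.5% in the week of May 8
```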
3. Influencer Impact Analysis
Scenario: Identify the most influential authors contributing to tech-related discussions.
Configuration:
- Lucene query:
content.body:AI
- Aggregation JSON:
{
  "influential_authors": {
    "significant_terms": {
      "field": "author.name.keyword",
      "background_filter": {
        "match_all": {}
      }
    }
  }
}
This aggregation uses the significant_terms function to highlight statistically significant authors. It analyzes the author.name.keyword field and compares term frequency within the query context (in this case, AI discussions) against the baseline frequency in the broader dataset, using a background_filter.
By detecting terms that appear disproportionately in filtered results compared to the entire dataset, this method reveals authors who are unusually prominent or active within a specific topic—such as AI.
Sample output:
{
  "buckets": [
    {
      "key": "TechAnalyst_AI",
      "doc_count": 89,
      "score": 0.045,
      "bg_count": 120
    },
    {
      "key": "FutureOfTech",
      "doc_count": 67,
      "score": 0.032,
      "bg_count": 85
    }
  ]
}
This output shows that FutureOfTech appeared 85 times across the full dataset (bg_count), with 67 of those appearances falling inside AI discussions (doc_count), a disproportionately high concentration in the topic. Meanwhile, TechAnalyst_AI demonstrates an even stronger AI focus, based on both its volume and its statistical score.
This kind of analysis helps identify high-signal contributors whose commentary could shape narratives within a particular domain.
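A quick way to read these buckets is to compute, per author, what share of their total posts falls inside the query context. This is a simple derived ratio for illustration, not the statistical score Elasticsearch itself reports:

```python
# Sketch: for each significant_terms bucket, compute what fraction of an
# author's total posts (bg_count) fall inside the filtered query context
# (doc_count). Bucket shape follows the sample output above.

def topic_focus(buckets):
    return {b["key"]: round(b["doc_count"] / b["bg_count"], 2) for b in buckets}


buckets = [
    {"key": "TechAnalyst_AI", "doc_count": 89, "bg_count": 120, "score": 0.045},
    {"key": "FutureOfTech", "doc_count": 67, "bg_count": 85, "score": 0.032},
]
print(topic_focus(buckets))  # → {'TechAnalyst_AI': 0.74, 'FutureOfTech': 0.79}
```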
Turn Aggregations Into Actionable Intelligence
Aggregations within pipelines convert raw data into decision-ready insights. By summarizing information at scale and passing it directly into automated workflows, you unlock powerful capabilities such as:
- Proactive Monitoring: Detect shifts in sentiment, rising hashtags, or sudden spikes in engagement before they escalate into crises or missed opportunities.
- Resource Optimization: Focus your analysts’ time on high-value anomalies, such as statistically significant influencers, rather than sifting through thousands of documents manually.
As shown in the examples above—whether you’re tracking sustainability trends, monitoring brand sentiment, or surfacing niche voices—Datastreamer pipelines turn aggregations into automation-ready insights. Instead of simply processing data, you’re building an always-on analytics engine that adapts in real time to your audience.
Advanced Use Cases
Custom Functions aren’t just for basic tasks—they also enable advanced capabilities like external data enrichment, document normalization, and schema-aware processing. Below are some common use cases where Custom Functions truly shine.
Enriching Documents with External Data
You can use Custom Functions to enrich your documents by connecting to third-party APIs or databases.
For example, if your document includes an address, you can call an external geocoding service and append the corresponding coordinates:
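A minimal sketch of such an enrichment step. The geocoding endpoint, its response shape, and the field names here are illustrative assumptions, not a real service:

```python
import json
import urllib.parse
import urllib.request

# Illustrative sketch only: the geocoding endpoint and its JSON response
# shape are placeholders. A custom function enriching a document with
# coordinates looked up from its address might look like this.

def geocode(address, endpoint="https://example.com/geocode"):
    """Call a (hypothetical) geocoding API and return (lat, lon)."""
    url = endpoint + "?" + urllib.parse.urlencode({"q": address})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["lat"], data["lon"]

def enrich_with_coordinates(document, geocoder=geocode):
    """Return a copy of the document with geo coordinates appended."""
    enriched = dict(document)
    if "address" in document:
        lat, lon = geocoder(document["address"])
        enriched["geo"] = {"latitude": lat, "longitude": lon}
    return enriched
```

Passing the geocoder as a parameter keeps the function testable: you can swap in a stub locally and the real API call in production.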
Ready to get started?
- Explore how it works by reviewing our Datastreamer Storage Egress and Datastreamer Storage Ingress components.
- Experiment with chaining aggregations to notifications using the Webhook Egress component.
- Connect with our team—we’ll help you design a pipeline that transforms your data into a strategic advantage.