A Guide to Building Custom Data Pipelines with Datastreamer

By Marvan Boff
June 2025 | 10 min. read
Introduction
In today’s data-driven world, organizations often struggle to manage and extract insights from large and complex datasets. Custom data pipelines, like those offered by Datastreamer, provide the essential infrastructure to handle data efficiently. This enables your team to focus on core business goals while transforming raw data into actionable insights.
To successfully navigate these challenges, it’s essential to:
Classify large volumes of data: By organizing data into meaningful categories, organizations can streamline analysis, improve searchability, and make faster, more informed decisions.
Move data efficiently: Transferring data across systems—whether on-premises or in the cloud—is critical for timely analysis and decision-making.
Scale operations effectively: As data volumes grow, scalable solutions become crucial. Datastreamer custom data pipelines offer built-in auto-scaling capabilities that dynamically adjust resources to meet workload demands.
In this article, we’ll explore these capabilities by walking through an example: bringing a sample dataset into a Datastreamer custom data pipeline and classifying the data for better insights.
If you’re new here, check out our Getting Started Guide to begin.
Pipeline Steps Explained
To build your custom data pipeline, you will follow three key steps. These steps are designed to efficiently ingest, process, and visualize your data, ensuring a smooth and flexible workflow.
Direct Data Upload: The first step is to bring data into your pipeline. This component feeds data directly into the system, supporting a variety of sources such as S3, GCS, PubSub, and more. With these flexible options, you can easily adapt your custom data pipeline to different data sources.
Product Sentiment Classifier: Next, each record in your dataset is analyzed. The classifier extracts brand names from short text inputs and assigns a sentiment score to each brand, along with a supporting reason. This step enhances the data’s value by adding meaningful context and insights.
Document Inspector: Finally, the processed data is visualized through the Document Inspector. This tool allows you to explore your data directly in the UI. Additionally, the Datastreamer catalog offers a variety of output destinations—such as Datastreamer Searchable Storage, Webhook, Google Cloud Storage, Azure Blob, S3, Elasticsearch, PubSub, and more—making your pipeline highly versatile.
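Conceptually, data flows through these three components in sequence: ingress, enrichment, then inspection. The following Python sketch models that flow purely for illustration; Datastreamer components are configured in the portal rather than written as code, and every name below is invented for the example.

def direct_data_upload(records):
    # Ingress stage: accept raw records (from the UI, an API call, S3, GCS, etc.).
    return list(records)

def product_sentiment_classifier(records):
    # Enrichment stage: attach a sentiment result to each record.
    for record in records:
        # A real classifier would analyze record["review_text"]; this
        # placeholder only shows where the enrichment gets written.
        record.setdefault("enrichment", {})["product_sentiment"] = {"sentiment": "unknown"}
    return records

def document_inspector(records):
    # Egress stage: surface the processed records for review.
    for record in records:
        print(record)

document_inspector(product_sentiment_classifier(direct_data_upload(
    [{"review_id": "001", "review_text": "Great product!"}]
)))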
About the Data
For this example, we’ll use a small, simple dataset containing two user reviews for a product. One review is positive, highlighting the product’s quality and features, while the other is negative, expressing dissatisfaction with the product’s performance.
This sample data is ideal for demonstrating how custom data pipelines can process and analyze user feedback—extracting sentiment, identifying key trends, and helping organizations make better decisions based on customer insights.
[
  {
    "review_id": "001",
    "product_id": "A12345",
    "reviewer": {
      "name": "John Doe"
    },
    "rating": 4.5,
    "review_date": "2024-12-18",
    "title": "Excellent quality and design",
    "review_text": "The product exceeded my expectations! The quality of the material is outstanding, and the design is sleek and modern. Delivery was prompt, and the customer service was responsive. Highly recommend!",
    "verified_purchase": true,
    "likes": 15
  },
  {
    "review_id": "002",
    "product_id": "B67890",
    "reviewer": {
      "name": "Jane Smith"
    },
    "rating": 2.0,
    "review_date": "2024-12-16",
    "title": "Not worth the price",
    "review_text": "The product is overpriced for what it offers. The build feels cheap, and it stopped working within a week. I contacted support, but the response was slow and unhelpful. Disappointed with this purchase.",
    "verified_purchase": true,
    "likes": 3
  }
]
Creating the Pipeline
To start building your custom data pipeline, follow these simple steps:
Access the Datastreamer Portal: First, navigate to the Datastreamer Portal. If you don’t have an account, you’ll need to create one.
Create a New Pipeline: After logging in, go to the Dynamic Pipeline section and click the New Pipeline button to begin.
Direct Data Upload
Once your pipeline is created, you’ll add the Direct Data Upload component. This acts as the data ingress point for your data pipeline. While the data source configuration is optional, setting it up allows you to define where your data originates, whether from product reviews, event logs, or another source.
By using Direct Data Upload, you enable your pipeline to ingest data through the UI or a secure HTTPS API, providing flexibility and control over your data flow.

Product Sentiment Classifier
Next, you’ll add the Product Sentiment Classifier. This component helps extract sentiment insights from text data.
Follow these steps to configure it:
Click the plus (+) button below the Direct Data Upload box.
From the menu, select Product Sentiment Classifier.
In the Target Text field, specify the field containing the text you want to analyze. For example, in this guide, we’ll use review_text.
The results of the operation will be saved to the destination path enrichment.product_sentiment. You can adjust this path based on your needs. If the specified field doesn’t exist yet, the component will automatically create it for you.
Finally, set the filter condition to define which data should be enriched. This step helps ensure that only relevant records are processed.
It’s important to note that any data that doesn’t match the criteria won’t be deleted; instead, the pipeline will simply skip over it. This is especially useful in cases where machine learning models only support certain languages.
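To make the destination path and filter behavior concrete, here is a minimal Python sketch of the idea (an illustration, not Datastreamer’s actual implementation): the result is written under a dot-separated path whose intermediate objects are created on the fly, and records that fail the filter pass through unchanged.

def write_to_path(record, path, value):
    # Create nested objects along a dot-separated path, then set the value.
    node = record
    keys = path.split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

def enrich(record, target_text="review_text",
           destination="enrichment.product_sentiment"):
    if target_text not in record:
        # Filter condition not met: skip the record, never delete it.
        return record
    result = {"sentiment": "positive"}  # stand-in for the model's real output
    write_to_path(record, destination, result)
    return record

print(enrich({"review_id": "001", "review_text": "Great quality!"}))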
By using the Product Sentiment Classifier within your data pipeline, you can quickly extract insights from unstructured text, making it easier to identify trends and inform business decisions.

Document Inspector
The final step in building your custom data pipeline is to add the Document Inspector. This component allows you to view and explore the content directly in the user interface (UI). No additional configuration is required, making it easy to complete your pipeline setup.
By incorporating the Document Inspector, you can review the enriched data—such as sentiment analysis results—in a user-friendly format. This final step helps ensure your pipelines deliver insights that are easy to access and act upon.
Your completed pipeline should look similar to the example shown in the image below.

Deploy Your Pipeline
Once your pipeline is ready, it’s time to deploy. Start by saving your pipeline. Then, click the Deploy button to spin up all the resources required for your pipeline to run.
📘 Did you know? Your pipeline runs in a dedicated, isolated environment to ensure performance and security. For more details, check out our documentation on Isolation.
Upload the Data
Once your pipeline has been deployed, you’re ready to upload your data. To begin, click the Upload button on the Direct Data Upload component.
You can either copy and paste the JSON dataset shown in the About the Data section above or save it as a JSON file and upload it. For more advanced integrations, use the Code button to generate a curl command for the API.
By following this step, you’ll feed your custom pipeline with data, allowing the pipeline to process, enrich, and ultimately deliver insights from your dataset.
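If you’d rather script this step, a rough Python equivalent of the generated curl command is sketched below. The endpoint URL, bearer-token auth, and file name are placeholders rather than Datastreamer’s documented API; copy the real values from the Code button in the portal.

import json
import requests

INGEST_URL = "https://example.invalid/pipeline-ingest"  # placeholder; use the URL from the Code button
API_KEY = "YOUR_API_KEY"  # placeholder credentials; the real auth scheme may differ

with open("reviews.json") as f:  # the sample dataset saved as a file
    records = json.load(f)

response = requests.post(
    INGEST_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=records,
)
response.raise_for_status()
print(f"Uploaded {len(records)} records")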

After a few seconds, you should see your content in the Document Inspector, as shown in the image below. The enriched documents for the two reviews will look like this:

{
  "review_id": "001",
  "product_id": "A12345",
  "reviewer": {
    "name": "John Doe"
  },
  "rating": 4.5,
  "review_date": "2024-12-18",
  "title": "Excellent quality and design",
  "review_text": "The product exceeded my expectations! The quality of the material is outstanding, and the design is sleek and modern. Delivery was prompt, and the customer service was responsive. Highly recommend!",
  "verified_purchase": true,
  "likes": 15,
  "enrichment": {
    "product_sentiment": {
      "brands": [],
      "confidence": 0.95,
      "entities": [
        "product",
        "quality",
        "material",
        "design",
        "customer service",
        "delivery"
      ],
      "reason": "The customer expressed high satisfaction with the product's quality, design, delivery, and customer service.",
      "sentiment": "positive"
    }
  }
}
{
  "review_id": "002",
  "product_id": "B67890",
  "reviewer": {
    "name": "Jane Smith"
  },
  "rating": 2,
  "review_date": "2024-12-16",
  "title": "Not worth the price",
  "review_text": "The product is overpriced for what it offers. The build feels cheap, and it stopped working within a week. I contacted support, but the response was slow and unhelpful. Disappointed with this purchase.",
  "verified_purchase": true,
  "likes": 3,
  "enrichment": {
    "product_sentiment": {
      "brands": [],
      "confidence": 0.95,
      "entities": [
        "product",
        "support"
      ],
      "reason": "The customer expresses negative sentiment due to the product being overpriced, poorly built, malfunctioning, and unhelpful support.",
      "sentiment": "negative"
    }
  }
}
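Once documents carry the enrichment, downstream consumers (a webhook receiver, for example) can read the result from the destination path configured earlier. A minimal Python sketch, assuming documents shaped like the output above:

doc = {
    "review_id": "002",
    "enrichment": {
        "product_sentiment": {"sentiment": "negative", "confidence": 0.95}
    },
}

# Read the classifier output from the configured destination path.
result = doc["enrichment"]["product_sentiment"]
print(doc["review_id"], result["sentiment"], result["confidence"])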
What’s Next?
Once your custom data pipeline is up and running, there are several ways to extend its capabilities:
Add more classifiers: Enhance your data enrichment by integrating additional machine learning operations. Datastreamer supports a variety of pre-integrated NLP models, enabling deeper contextual understanding and reducing the time spent on manual analysis.
Filter and route data: Utilize Datastreamer’s filtering and routing tools to manage data flow more effectively. By creating multiple pipelines with complex routing, you can send data to the right destinations while filtering out unnecessary information. This ensures that only the most relevant data reaches your applications.
Leverage searchable storage and aggregations: Store and analyze your data more efficiently with Datastreamer’s Searchable Storage. This fully managed solution lets you query and aggregate your data directly within your pipelines. By leveraging labels and classifiers, you can unlock advanced analytics—such as trend detection and sentiment analysis—without needing external tools.
Conclusion
In summary, Datastreamer’s pipelines offer a comprehensive solution for organizations looking to efficiently classify large datasets, enable smooth data movement, and scale operations with ease. By leveraging these pipelines, businesses can stay focused on their core objectives without spending months building complex data infrastructure from scratch.
This approach not only simplifies data management but also unlocks the ability to extract meaningful insights from complex datasets, driving better decisions and business outcomes.
Want to build smarter, more efficient data pipelines?
Explore Datastreamer’s pipeline components or talk to our experts to learn how custom data platforms can help you extract insights, scale with ease, and power your business decisions.