Overview

Bullet ...

How is this useful

How Bullet is used is largely determined by the data source it consumes. Depending on what kind of data you put Bullet on, the types of queries you run on it and your use-cases will change. As a look-forward query system with no persistence, you will not be able to repeat your queries on the same data. The next time you run your query, it will operate on the different data that arrives after that submission. If this usage pattern is what you need and you are looking for a light-weight system that can tap into your streaming data, then Bullet is for you!

Example: How Bullet is used at Yahoo

Bullet is used in production internally at Yahoo by having it sit on a subset of raw user engagement events from Yahoo sites and apps. This lets Yahoo developers automatically validate their instrumentation code end-to-end in their Continuous Delivery pipelines. Validating instrumentation is critical since it powers pretty much all decisions and products including machine learning, corporate KPIs, analytics, personalization, targeting.

This instance of Bullet also powers other use-cases such as letting analysts validate assumptions about data, product managers verify launches instantly, debug issues and outages, or simply explore and play around with the data.

Blog post

Here is a link to our blog post condensing most of this information if you want to take a look.


Quick Start

See Quick Start to set up Bullet using a local Storm topology. You will generate some synthetic streaming data that you can then query with Bullet.

Setting up Bullet on your streaming data

To set up Bullet on a real data stream, you need:

  1. To setup the Bullet Backend on a stream processing framework. Currently, we support Bullet on Storm:
    1. Plug in your source of data. See Getting your data into Bullet for details
    2. Consume your data stream
  2. The Web Service set up to convey queries and return results back from the backend
  3. To choose a PubSub implementation that connects the Web Service and the Backend. We currently support Kafka on any Backend and Storm DRPC for the Storm Backend.
  4. The optional UI set up to talk to your Web Service. You can skip the UI if all your access is programmatic

Schema in the UI

The UI also needs an endpoint that provides your data schema to help with query building. The Web Service you set up provides a simple file based schema endpoint that you can point the UI to if that is sufficient for your needs.


Querying in Bullet

Bullet queries allow you to filter, project and aggregate data. It lets you fetch raw (the individual data records) as well as aggregated data.

Termination conditions

A Bullet query terminates and returns whatever has been collected so far when:

  1. A maximum duration is reached. In other words, a query runs for a defined time window.
  2. A maximum number of records is reached (only applicable for queries that are fetching raw data records and not aggregating).

Filters

Bullet supports two kinds of filters:

Filter Type Meaning
Logical filter Allow you to combine filter clauses (Logical or Relational) with logical operations like AND, OR and NOTs
Relational filters Allow you to use comparison operations like equals, not equals, greater than, less than, regex like etc, on fields

Projections

Projections allow you to pull out only the fields needed and rename them when you are querying for raw data records.

Aggregations

Aggregations allow you to perform some operation on the collected records.

The current aggregation types that are supported are:

Aggregation Meaning
GROUP The resulting output would be a record containing the result of an operation for each unique value combination in your specified fields
COUNT DISTINCT Computes the number of distinct elements in the fields. (May be approximate)
LIMIT or RAW The resulting output would be at most the number specified in size.
DISTRIBUTION Computes distributions of the elements in the field. E.g. Find the median value or various percentile of a field, or get frequency or cumulative frequency distributions
TOP K Returns the top K most frequently appearing values in the column

Currently we support GROUP aggregations with the following operations:

Operation Meaning
COUNT Computes the number of the elements in the group
SUM Computes the sum of the non-null values in the provided field for all elements in the group
MIN Returns the minimum of the non-null values in the provided field for all the elements in the group
MAX Returns the maximum of the non-null values in the provided field for all the elements in the group
AVG Computes the average of the non-null values in the provided field for all the elements in the group

Results

The Bullet Web Service returns your query result as well as associated metadata information in a structured JSON format. The UI can display the results in different formats.


Approximate computation

It is often intractable to perform aggregations on an unbounded stream of data and still support arbitrary queries. However, it is possible if an exact answer is not required and the approximate answer's error is exactly quantifiable. There are stochastic algorithms and data structures that let us do this. We use Data Sketches to perform aggregations such as counting uniques, and will be using Sketches to implement some future aggregations.

Sketches let us be exact in our computation up to configured thresholds and approximate after. The error is very controllable and quantifiable. All Bullet queries that use Sketches return the error bounds with Standard Deviations as part of the results so you can quantify the error exactly. Using Sketches lets us address otherwise hard to solve problems in sub-linear space. We uses Sketches to compute COUNT DISTINCT, GROUP, DISTRIBUTION and TOP K queries.

We also use Sketches as a way to control high cardinality grouping (group by a natural key column or related) and rely on the Sketching data structure to drop excess groups. It is up to you setting up Bullet to determine to set Sketch sizes large or small enough for to satisfy the queries that will be performed on that instance of Bullet.

Architecture

End-to-End Architecture

Overall Architecture

The image above shows how the various pieces of the Bullet interact at a high-level. All these layers are modular and pluggable. You can choose an implementation for the Backend and the PubSub (or create your own). The core of Bullet is abstracted into a library that can be reused to implement the Backend, Web Service and PubSub layers in a platform agnostic manner.


Backend

High Level Architecture

The Bullet Backend can be split into three main conceptual sub-systems:

  1. Request Processor - receives queries, adds metadata and sends it to the rest of the system
  2. Data Processor - reads data from a input stream, converts it to an unified data format and matches it against queries
  3. Combiner - combines results for different queries, performs final aggregations and returns results

The core of Bullet querying is not tied to the Backend and lives in a core library. This allows you implement the flow shown above in any stream processor you like. We are currently working on Bullet on Spark Streaming.

PubSub

The PubSub is responsible for transmitting queries from the API to the Backend and returning results back from the Backend to the clients. It decouples whatever particular Backend you are using with the API. We currently provide a PubSub implementation using Kafka as the transport layer. You can very easily implement your own by defining a few interfaces that we provide.

In the case of Bullet on Storm, there is an additional simplified option using Storm DRPC as the PubSub. This layer is planned to only support a request-response model for querying in the future.

DRPC PubSub

This was how Bullet was first implemented in Storm. Storm DRPC provided a really simple way to communicate with Storm that we took advantage of. We provide this as a legacy adapter or for users who use Storm but don't want a PubSub layer.

Web Service and UI

The rest of the pieces are just the standard other two pieces in a full-stack application:

The Bullet Web Service is built using Spring Boot in Java and the UI is built in Ember.

The Web Service can be deployed as a standalone Java application (a JAR file) or easily rebuilt as a WAR to deploy your favorite servlet container like Jetty. The UI is a client-side application that can be served using Node.js

Want to know more?

In practice, the backend is implemented using the basic components that the Stream processing framework provides. See Storm Architecture for details.