At ngrok, we manage an extensive data lake with an engineering team of one (me!).
This article is a look at how we built it, what we learned, as well as some selective deep dives I found interesting enough to be worth sharing in more detail, since they'll bridge the gap between what people usually understand by the term "data engineering" and how we run data here at ngrok.
Some of this might even be useful (or, at the very least, interesting!) for your own data platform endeavors, whether your team is big or small.
Data we store
First of all, as an ngrok user, it's important to understand what data we store and what we use it for, especially when talking about building a data platform. I want to be as clear and upfront as possible about this, since talking about storing and processing data will always raise valid privacy concerns.
My colleague Ari recently wrote in-depth about what personal data of customers we store, and you can always find up-to-date compliance data in our Trust Center.
What we store
But to give you a more concise overview, we generally store and process:
- Data from our globally distributed Postgres instances, which contain customer data, such as account IDs, user-to-account mappings, certificates, domains, and similar data.
- Data from our metering and usage cluster, which tracks the usage of resources.
- Subscription and payment information (excluding credit card information).
- Third-party data, such as support interactions from Zendesk.
- Metadata about the movement of events through the various components that make up ngrok's infrastructure.
- Purpose-built product signals, as well as real-time events (more on those two later!).
Note that we do not store any data about the traffic content flowing through your tunnels—we only ever look at metadata. While you have the ability to enable full capture mode of all your traffic and can opt in to this service, we never store or analyze this data in our data platform. Instead, we use Clickhouse with a short data retention period in a completely separate platform with strong access controls to store this information and make it available to customers.
I hope this demystifies our data processing efforts a bit, so we can talk about the engineering side of things.
Why data engineering is different at ngrok (and probably not what you think)
We hired the first full-time data person (it's me!) in the summer of 2023. Before we filled that position, the data platform was set up by our CTO, Peter. Because of that, our data engineering (DE) team (or rather, our data role) is part of the Office of the CTO and effectively works horizontally across our engineering organization.
As an artifact of having such a small team, our DE work is much closer aligned to holistic backend engineering work than the term "data engineering" often implies.
While I'm primarily responsible for the Data Platform and all associated services and tooling, I regularly find myself working on the actual ngrok products (as opposed to "just" the data lake). That includes architecture proposals and designs that span multiple teams and services and are mostly tangentially related to the "core" data work. And yes, I do write a fair bit of Go because of it.
As such, the majority of the data modeling work (i.e., SQL) is done by subject matter experts, which is very different from DE roles in many other organizations. In other words, I write very little SQL on a day-to-day basis and won't usually be the person who creates a data model.
Within those subject matter experts, some people write reusable, well-structured dbt models, while other people focus on ad hoc analysis (based on these models) via our BI tooling in Superset. It's worth noting that our entire organization has access to Superset and most data models and sources!
For instance, our growth team builds a large amount of our actual data models and understands the business, product, and finance side much better than I do. They're much better equipped to write logical, complex, and maintainable data models that properly answer business questions.
In addition to that, we have a small-but-mighty infrastructure team that owns and runs our core infrastructure, such as Kubernetes, as well as the developer tools that are instrumental in keeping every engineering team (and, of course, ngrok itself!) running smoothly.
This particular setup—viewing DE as a very technical, general-purpose distributed systems SWE discipline and letting the people who know best what real-world scenarios they want to model own that modeling—makes our setup work in practice.
Our data platform architecture
Now that you know how our (data) engineering org is structured, let's talk about what we actually built and maintain, and how it evolved over time.
ngrok as a product is a large, distributed, worldwide networking system with very high uptime guarantees, large traffic volumes, and a large set of inherently eventually consistent (and often ephemeral) data that can be difficult to define. Think about how a TCP connection works, and what steps are involved in making and routing it!
In other words, even our simple architecture was a bit more complex than "just copy your Postgres database somewhere to query it offline."
ngrok’s data architecture in the past
Despite that, our original architecture was utilitarian and relied more heavily on AWS tools than our modern architecture, which is very open-source focused.
Its primary goal was to get the most important data sets we needed—Postgres, important external data, as well as some events—to effectively run finance reporting, abuse work, and support.
On the batch ingestion side, we used Airbyte open source on Kubernetes to ingest third-party data via their respective APIs. We used the ngrok OAuth module to do authentication (as we do for all our open-source services that need an ingress controller).
Airbyte wrote JSON files, where we resolved the schema with manual runs of a Glue crawler and several Python scripts to create the target schemas, as well as another Glue job to create the target schema as Iceberg.
At the time, we did not have an orchestrator available and relied on Glue's internal schedules. This meant we had no alerting or any integration with on-call tools.
We used AWS DMS here to get our core Postgres data, writing parquet data to S3. This was a once-a-day batch job.
On the streaming side, we streamed event metadata via AWS Firehose, writing JSON data to two different S3 locations.
For analytics, all our data was (and still is) eventually stored as Apache Iceberg and generally queried via AWS Athena, although the legacy architecture did have some datasets that were based on raw JSON in the mix. We used AWS Glue as a meta store.
Our SQL models were actually SQL views directly in Athena, with no version control or lineage, that were created directly in production and queried via Preset (which is the managed cloud version of Superset).
Expensive queries and unreasonable models
Our eventing system, which is core to understanding system behavior, relied on a very expensive AWS Firehose pipeline, as the way we split and sorted events required us to both create JSON data (creating hundreds of TiB of data), as well as maintain data-platform-specific Go code in otherwise purely customer-facing services (see the later section on Apache Flink, Scala, and Protobuf).
Some of the data became straight up impossible to query (or very expensive), as queries would time out despite tuning with partitions and other tricks. The entire system was on borrowed time from the start.
It was also difficult to impossible to reason about our models, since we lacked any of dbt's (or a comparable tool's) creature comforts, such as lineage, documentation, version control, auditing, tests, and so on.
Without expecting you to be able to grok the details here, imagine getting asked why a certain field looks suspicious (if not to say, wrong) at the very end of this lineage tree:
…without having this lineage tree available, of course.
In a similar vein, not having a central orchestrator, alerting, and auditing for our data jobs was an operational challenge (you can learn more about how we solved those two issues here).
Our data stack was also not integrated very deeply with our Go monorepo and tooling, missing things like Datadog monitors and metrics, good test coverage, or style guides and enforcement via CI (see the Working in a Go monorepo section).
Lastly (at least for the scope of this article), Airbyte and Glue have been a challenge to get right, but we'll tell you how we did a few sections from now.
ngrok’s data architecture now
Our up-to-date data platform is based more heavily around open-source tools we self-host on Kubernetes, dogfooding ngrok, with some AWS native tools in the mix.
To solve these challenges, a simplified, modern view of our current architecture looks like this.
All our batch ingestion is now run and orchestrated via Dagster, which I've written about previously. We still use Airbyte and still use ngrok to do so, but write directly to Glue and maintain our schemas as Terraform by querying the Glue API.
For streaming data (which is where most of our volume and complexity comes from), we now run Apache Flink to consume Protobuf messages directly from Kafka, rather than relying on Firehose and internal services. We'll also cover this in more detail in a bit.
Our database ingestion still uses DMS, but now mostly relies on streaming writes, which are faster and more efficient (when reacting to a support request, you don't want yesterday's data!).
For analytics, we rely heavily on dbt now, as well as self-host the open-source version of Apache Superset. We also added a hosted version of the dbt docs, of course also dogfooded behind an ngrok endpoint.
Technical deep dives and problem-solving
While we cannot get into all the details of all the challenges we solved in the past 12 or so months, here are some challenges I found especially interesting as a software engineer.
Collaborating on data and infra in a Go monorepo
Most of ngrok's code base is written in Go and lives in a monorepo. We run Bazel for build tooling, as well as Nix as a package manager. This allows us to have reproducible developer environments, as well as reasonably fast compile, build, and, by proxy, CI times.
As most of our data infrastructure is written in Python and Scala, we had to adapt our workflow to this environment, as it is important to us to integrate the data environment with the rest of the engineering organization at ngrok.
Speaking from experience, having a completely separate data engineering team or department will eventually result in a fragmented engineering environment, with many bespoke paths that are not applicable to all team members, usually causing large support and maintenance burdens for individual teams (e.g., maintaining two or more iterations of a CI/CD system).
Having one deployment system all engineers use is much easier and can be maintained by one infrastructure team.
I find this is often an artifact of DE roles not being equipped with the necessary knowledge of more generic SWE tools, and general SWEs not being equipped with knowledge of data-specific tools and workflows.
Speaking of which, especially in smaller companies, equipping all engineers with the technical tooling and knowledge to work on all parts of the platform (including data) is a huge advantage, since it allows people not usually on your team to help on projects as needed. Standardized tooling is a part of that equation.
For instance, we have an internal, Go-based developer CLI called nd that does a lot of heavy lifting (think "abstract kubectl commands for half a dozen clusters"). We also use it to run diffs between a developer's branch and expected state, for instance to enforce formatting and code styles. Our CI runners run NixOS.
So, for our data work, enforcing standards around dbt models involved a custom Nix package for shandy-sqlfmt, which we use as a standard for formatting all our dbt models, as well as integration into our nd tool, so developers (as well as CI) have access to nd sql fmt, just as they have nd go fmt.
While this does add additional work for me, it ensures data tooling is never the "odd one out" and ramping onto data work (or vice versa) is much less of a cognitive shift.
Other integrations we've added over time include:
- Bespoke, up-to-date Python tooling (not only for our data tools), such as poetry and ruff, as well as enforcement of style and static analysis via CI.
- Smart sbt caches for Scala, since Scala + Bazel is not something we've explored in depth.
- Datadog monitors and metrics, including custom metric interfaces for all our data tools.
- Integration with our on-call tooling, such as Slack alerts, OpsGenie integration, and others.
- Various custom Nix derivations.
Wrestling schemas and structs between Airbyte and Glue
A more "data-specific" challenge we've dealt with is complex schemas in Airbyte that often don't match the actual data or are otherwise incompatible with our query engine, which is something I'm sure a lot of you are familiar with.
With a team of one, I can't reasonably write individual jobs for individual tables or sources that handle all corner cases, as we simply have too large and diverse a set of data sources. Myself and others have to rely on code generation and automated processing. This holds true for all data tools, not just Airbyte.
Originally, we wrote JSON files to S3, which supported the arbitrary data and schema changes that might happen, and ran AWS Glue crawlers on top of these files to discover the schema and create "raw" tables.
JSON is conceptually nice for this, since it can deal with arbitrary schemas. For example, using a parquet writer to S3 would rely on the source schema being 100% correct and has to deal with an array of limitations. Glue crawlers, on paper, support table versions and table evolutions.
But we quickly realized that these crawlers were very unreliable, especially with changing schemas or differences between dev and prod. This resulted in schemas that either couldn't be queried outright or reported incorrect data.
We experimented with custom schema detection logic, which gave us more control over parameters like sample size, look-back windows, and corner cases, but found that burdensome to manage, despite using existing libraries.
Part of this was AWS Glue's odd way of storing structs, which are (misleadingly so) depicted as arbitrarily deep JSON objects in the Glue web UI:
{
"payment_method_configuration_details": {
"id": "string",
"parent": "string"
}
}
Whereas the API describes these fields as:
{
"Name": "payment_method_configuration_details",
"Type": "struct"
}
This matches describe table and show create table statements via Athena:
CREATE EXTERNAL TABLE `...`(
`status` string COMMENT 'from deserializer',
`payment_method_configuration_details` struct<id:string,parent:string> COMMENT 'from deserializer',
This very custom, bespoke way of describing structs meant that standard "JSON to JSON schema" parsers would not work out of the box. While it is possible to work around some of this, it became a very convoluted problem, given that some AWS APIs are notoriously convoluted themselves. The Glue API expects the struct< syntax, for instance.
It's also arguably not a problem we should need to solve.
So, we settled on using the Airbyte Glue destination connector, which creates the tables directly in Glue based on the reported schema of the source. This removes the additional hop (and point of failure) of running a crawler entirely and ensures we get at least a valid Glue table (albeit not necessarily a valid Athena table).
But it still does not solve the issue of fields being reported incorrectly at times, usually directly by the API.
For instance, the table stripe.charges cannot be queried due to Athena returning a TYPE_NOT_FOUND: Unknown type: row error. Trying to get a DDL will yield a java.lang.IllegalArgumentException: Error: name expected at the position 204. Keep in mind that this was entirely set up by Airbyte, with no human involvement or custom code run yet.
Position 204, for those who are curious, looks like this: total_count:decimal(38)>:boolean:struct<>:
This struct<> field can't be queried in Athena.
To solve this, we now have a post-processing step that turns each struct field into a custom tree data structure, maps or removes invalid types at arbitrary depth (such as struct<>), and creates a Terraform representation of each table, so we get uniform environments between dev and prod.
It also performs an additional step, creating a flattened table that maps each nested field into a flat field with the appropriate type. We do this to improve compatibility with Iceberg and make queries more ergonomic for users.
This works by querying the Glue API and some simple DSA. For instance, a field might present one way in the Glue UI, but really contain the following data, as reported to Glue (note the object under subscription_items):
{
  "pending_update": {
    "trial_end": "int",
    "expires_at": "int",
    "trial_from_plan": "boolean",
    "subscription_items": [
      {
        "id": "string",
        "price": {
          "id": "string",
          "type": "string",
          // ...
        }
      }
    ]
  }
}
If any of these fields is invalid, we map them to valid names or remove them.
Take, for instance, a struct<>, or simply a field that has a name that's invalid in Athena but valid in Glue. Invalid names like "street-name" or "1streetname" need to be escaped with " in Athena, but cannot be used in nested fields, which are very common in JSON.
Airbyte also doesn't have a bigint type, but Athena does; if a schema reports an Athena Integer type, we generally map it to a bigint to be safe, since a value >= 2^32 will cause Athena queries to fail. We also normalize other types, such as decimal(38).
All of this results in a parsed (stripped) tree, with types fastened to the nodes, which we can now traverse and mutate, e.g. to change names, types, or set deletion tombstones.
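To make that concrete, here is a minimal sketch of such a tree rewrite in Scala. The node type, rule set, and flattening helper are simplified, hypothetical stand-ins for our actual implementation:

// Hypothetical, simplified node of the schema tree parsed from Glue's struct< syntax.
sealed trait SchemaNode
final case class Leaf(name: String, tpe: String) extends SchemaNode
final case class Branch(name: String, children: List[SchemaNode]) extends SchemaNode

object SchemaRewrite {
  // Normalize a single type: widen integers, tombstone empty structs, and so on.
  def normalizeType(tpe: String): Option[String] = tpe match {
    case "struct<>" => None           // deletion tombstone: drop the field entirely
    case "integer"  => Some("bigint") // values >= 2^32 would fail in Athena otherwise
    case other      => Some(other)
  }

  // Rewrite names that are valid in Glue but invalid in Athena (e.g. "street-name").
  def normalizeName(name: String): String =
    name.replaceAll("[^A-Za-z0-9_]", "_")

  // Recursively rewrite the tree, dropping tombstoned nodes.
  def rewrite(node: SchemaNode): Option[SchemaNode] = node match {
    case Leaf(n, t) => normalizeType(t).map(Leaf(normalizeName(n), _))
    case Branch(n, cs) =>
      val kept = cs.flatMap(rewrite(_).toList)
      if (kept.isEmpty) None else Some(Branch(normalizeName(n), kept))
  }

  // Flatten nested fields into "parent_child" columns for the target table,
  // e.g. pending_update.billing_cycle_anchor -> pending_update_billing_cycle_anchor.
  def flatten(node: SchemaNode, prefix: String = ""): List[(String, String)] = node match {
    case Leaf(n, t)    => List((prefix + n, t))
    case Branch(n, cs) => cs.flatMap(flatten(_, prefix + n + "_"))
  }
}

From a tree like this, emitting the Terraform representation for both the raw and flattened tables is a straightforward traversal.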
This would produce a field in the "raw" table as:
columns {
name = "pfinishing_refresh"
type = "struct
As well as several fields in the target table, mapping to the aforementioned flat fields:
columns {
  name = "pending_update_billing_cycle_anchor"
  type = "bigint"
  parameters = {
    "iceberg.field.current": "true",
    "iceberg.field.id": "67",
    "iceberg.field.optional": "true"
  }
}
This way, we reap several advantages:
- We can rely on the schema reported by the source as much as possible by directly writing from Airbyte to Glue.
- We have generated code that we can version control for both the raw as well as the target table.
- We remove any potential differences between dev and prod, since we rely on the sources' reported schema.
And, perhaps most crucially, it makes this process manageable for our specific team setup. Our next step is to terraform the entire Airbyte process.
While this is arguably just as complicated as running custom JSON schema parsers, we found that investing the time into building a proper data structure once and adjusting the ruleset where needed down the line was very much worth it, rather than trying to beat Glue crawlers or external JSON parser libraries into submission.
Scaling Apache Flink, Scala, and Protobuf to 650 GB/day
Another technically challenging, albeit interesting, component is our streaming integration with our core codebase.
We stream a subset of our Protobuf messages to Apache Iceberg via Kafka and Flink and make them available to query—in fact, it's one of our core sources to fight abuse (more on that in a second). Our Flink jobs are all written in Scala 3, relying on flink-extended/flink-scala-api.
Between our normal services, we communicate via gRPC and talk Protobuf, which perhaps isn't the most common format in the data world. However, hooking our data processing tools directly into Protobuf has several advantages:
- Protobuf's schema evolution is one-directional; Apache Iceberg actually supports a lot more evolutions than Protobuf does, making sure our schemas stay in sync with the business' schemas
- Old messages are always compatible with newer ones, meaning we never run into distributed ordering issues (i.e., processing a message with an older schema for a new table)
- Protobuf, albeit oddly opinionated, is relatively straightforward and has great Scala support via scalapb
- It's a very efficient format on the wire, making it so that our pipelines can process tens or hundreds of thousands of events per second
Our main job consumes an average of ~9,000-15,000 events/s at about ~691 bytes per message. Over time, this works out to roughly ~1,000,000,000 events or ~650 GB per day, split across ~55 message types, each with a different schema. Our other, more specialized streaming jobs are in the lower millions a day.
From a technical perspective, this was a fun challenge to solve. We accomplish the entire process by, in essence, sending a wrapper message called SubscriptionEvent that contains an EventType (an enum) that describes the content of the wrapper, and a []byte field that contains the actual target Protobuf (one of the aforementioned 55 message types). Newer pipelines skip that wrapper message, but this system predates the data platform.
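As a rough sketch of the wrapper-and-route idea, here is a simplified Scala stand-in; the class, field, and event names here are hypothetical, and the real classes are generated by scalapb (the actual build steps follow below):

// Hypothetical stand-ins for the scalapb-generated wrapper and inner messages.
sealed trait InnerEvent
final case class EndpointCreated(accountId: String) extends InnerEvent
final case class EndpointDeleted(accountId: String) extends InnerEvent

object EventType extends Enumeration {
  val ENDPOINT_CREATED, ENDPOINT_DELETED = Value // ... one per message type (~55 total)
}

final case class SubscriptionEvent(eventType: EventType.Value, payload: Array[Byte])

// Route on the enum and parse the payload into the concrete message type.
// In the real pipeline this is where scalapb's parseFrom runs, and where a
// Flink output tag is emitted so each type flows to its own Iceberg sink.
def decode(event: SubscriptionEvent): InnerEvent = event.eventType match {
  case EventType.ENDPOINT_CREATED => EndpointCreated(new String(event.payload)) // stand-in for parseFrom
  case EventType.ENDPOINT_DELETED => EndpointDeleted(new String(event.payload)) // stand-in for parseFrom
}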
- We generate all Protobuf Scala classes with some custom mappings to ensure all types are compatible with Iceberg (one-ofs, for instance, are not!), using scalapb.
- We generate a slightly customized avro schema (mostly to add metadata), serializers, deserializers, and so on, for each target message and store them as Scala objects (so this isn't done at runtime).
In the pipeline:
- We read the raw wrapper message from Kafka.
- We split it by the EventType and parse the target Protobuf type by parsing the []byte field on the message, and produce an output tag to route messages by type.
- We use the previously generated avro schema and the FlinkWriter for Iceberg to write the data, based on the job's checkpoint interval.
Or, in Scala terms:
pipeline[A <: GeneratedMessage: TypeInformation : EventTyper : SerializableTimestampAssigner : ProtoHandler]
The combination of having nested messages and incompatible types made it impossible to use any of the built-in Protobuf parsers that exist within the Flink ecosystem. In fact, I had to customize scalapb.
One of the key elements in this is a typeclass called ProtoHandler that provides a ProtoHandle:
trait ProtoHandle[
A,
B <: GeneratedMessage: EventTyper: AvroMeta
] extends Serializable {
type ValueType = A
def derivedSchema: AvroSchema
protected def encoder: MetaRecordEncoder[A]
def encode(msg: A, schema: AvroSchema, metadata: AvroMetadata): Record
// …
}
A ProtoHandle is a generic structure that's responsible for schema parsing and encoding, for each Protobuf.
For efficiency reasons, front-loading a lot of heavy lifting to compile time, we code-generate all known ProtoHandles based on the enum:
case object CAComp extends CompanionComp[CA] {
private val T = CA
private val coreSchema: Schema = AvroSchema[CA]
private val schema: Schema = coreSchema.withMetadata
private val enc: AvroEncoder[CA] = AvroEncoder[CA]
private val ti: TypeInformation[CA] = deriveTypeInformation[CA]
override val handle: ProtoHandle[CA, SubscriptionEvent] = SubscriptionEventProtoHandle(
T.messageCompanion,
schema,
enc,
ti
)
}
These objects (static, in Java terms) derive the avro schema using com.sksamuel.avro4s.AvroSchema, as well as the encoder using com.sksamuel.avro4s.Encoder, which uses magnolia under the hood.
We now have one correct, working avro schema + encoder for each possible Protobuf message.
Note: Some of the type signatures look a bit wild—that's an artifact of the nested Protobufs and can only be partially simplified.
For instance, a ProtoHandle needs to know its own type (A) as well as its wrapper type (B) to correctly derive schemas and encoders, but for providing a concrete handler to a downstream implementation, we don't need to know about the concrete type of A anymore.
We also had to work around some other fun JVM limitations, such as ClassTooLarge errors, which did not make some of these class and type hierarchies easier to read.
This can be improved, but this is also one of the results of a small team.
We can then expose this mapping as Map[FlinkEventTag, ProtoHandle[_, SubscriptionEvent]] to the job.
For each mapping, we can now:
- Create the table or update its schema during the job's startup by relying on Iceberg's schema evolution.
- Add a sink dynamically to a Ref of an incoming data stream.
Or, expressed in code:
trait FlinkWriter[A, B: IcebergTableNamer] extends Serializable {
// Builds/updates tables and returns a cached map of all remote schemas
def buildTablesAndCacheSchemas(): Map[B, AvroSchema]
// Sink distributor. Mutates the DataStream[A] ref
def addSinksToStream(
cfg: Config,
schemas: Map[B, AvroSchema],
env: Ref[DataStream[A]]
): List[DataStreamSink[Void]]
}
Note: We do not (yet!) use an effect system and Ref is just a hint that we mutate a reference in a function: type Ref[A] = A.
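For illustration, here is a rough sketch of how a job might wire these pieces together; the method and parameter names are assumptions layered on top of the trait above, not our actual job code:

// Hypothetical wiring: build/update all Iceberg tables up front, cache their
// avro schemas, then attach one sink per message type to the decoded stream.
def wireJob(
  writer: FlinkWriter[SubscriptionEvent, FlinkEventTag],
  cfg: Config,
  stream: Ref[DataStream[SubscriptionEvent]]
): List[DataStreamSink[Void]] = {
  val schemas: Map[FlinkEventTag, AvroSchema] = writer.buildTablesAndCacheSchemas()
  // Since Ref[A] = A, this "mutates" the stream reference while adding
  // the per-type routing and sinks described above.
  writer.addSinksToStream(cfg, schemas, stream)
}

Each sink then commits to Iceberg on the job's checkpoint interval, as described above.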
Fighting abuse with meta signals
One of the ways we actually use this data is to understand, fight, and ultimately prevent abusive behavior like the state-backed Pioneer Kitten group.
While a lot of our processes around abuse detection are automated, some require human intervention.
Oftentimes, we'll get abuse reports (via abuse@ngrok.com) and need to validate that they are correct before we take action (such as banning an account). To do that, we can check that the reported abusive events match a certain account's behavior on our platform.
If we suspect IP 198.51.100.1 of having hosted a phishing scam on port 443 via the URL anexampleofabusewedontwant.ngrok.app on 2024-09-01, we can query our metadata similar to this:
select case
    when event_type = 'ENDPOINT_CREATED' then 'created'
    when event_type = 'ENDPOINT_DELETED' then 'deleted' else 'unknown'
  end as action,
  account_id,
  event_timestamp,
  url,
  geo_location
from meta_events__audit_event_endpoint e
where account_id = 'bad_actor_id' and -- … other filters
This will give us a full sequence of events, including what tunnels an account started and stopped, when, and from where. Based on this data, we can take action, such as suspending the account if the report is correct.
On a similar note, in the event an account gets banned incorrectly and reaches out to our support team, they can query similar tables to run the same report in reverse and unban users.
Build your own ngrok data platform (or work on ours!)
While there are a lot of topics we didn't cover in this article, I hope it provided both a good high-level overview of data engineering work at ngrok, as well as some details on specific challenges we faced and solved.
To do your own digging on ngrok data, try exploring event subscriptions or Traffic Inspector to get insight into your own traffic data flowing through ngrok.
Or, if you prefer to work on and with our actual data platform, we're currently hiring a Senior Software Engineer, Trust & Abuse!
We'd love to chat about what additional challenges and improvements you'd want to dig into as an ngrokker. And, as always, you can ask questions about our data platform and beyond on our community repo.