JSONs. Packets that Fill Most of Data Pipelines.

Michał Poła
6 min read · Jan 22, 2022

You can easily look up the term JSON on Google. This post takes a slightly different angle, inspired by one project and one particular conversation. I tried to pack things as densely as possible, mainly to expose the context in which JSON is used and why I brought it up. A quick look at its architecture is included as well. If you are a curious business person or an entry-level developer or data scientist, it may interest you. An IT pro will probably be bored.

Photo by Pietro Jeng on Unsplash

Why you might care

There are more and more people who are curious about what’s under the hood of data management tools and frameworks. I like addressing those needs.

There’s this thing called the Modern Data Stack (MDS). Its advent picked up the pace of data democratization across companies, since it’s quite easy to implement. That doesn’t mean it’s some kind of magic sprinkle that makes fantastic ideas pop up just like that. It means the data-driven quest may start without expensive procedures beforehand.

Usually the story starts with the central part of the MDS, which is a Data Warehouse. A more recent alternative is the so-called Data Lakehouse, but let’s not diverge from the simpler picture. The Data Warehouse, as the name suggests, stands almost exactly in the middle between the producers or sources and the final users. However, I’ll keep that for a different, more detailed story.

I had a project where the overall architecture was already well understood, so special attention was paid to the dynamics of the system, especially to a funny-looking file popping up here and there that didn’t look like a well-known Excel-style table. Because data is supposed to be put in those elegant structures, right? Not necessarily. The file was JSON.

Let’s use an analogy. Let’s start describing a cardiovascular system (the data stack including connecting pipes) from the blood part (JSON). Or even better: red blood cells specifically. Erythrocytes carry oxygen, JSON files carry data.

Just the basics

Formally, it’s JavaScript Object Notation. You shouldn’t pay too much attention to the full name. You don’t stop and ponder the PDF acronym before opening a file, do you? What’s worth noting for a fraction of a second, though, is the JavaScript part. JavaScript is the ubiquitous programming language for building interactive websites. Since it is extremely popular, there was no other destiny for JSON.

JSON is a standard. Standard means it’s widely accepted. Why?

Because it’s easy to use. Why?

Because it’s light in terms of weight (memory), easy to read by humans and easy to parse by machines.

It is most often compared to the XML and CSV formats. Both have the same purpose: storing data in a logical way so it can be moved around the net. CSV is too restrictive for super fast applications, and XML is noisy with its syntax. Thanks to the folks from GeeksforGeeks.org, I can put a JSON vs. XML sample right away.

{"Geeks":[
{ "firstName":"Vivek", "lastName":"Kothari" },
{ "firstName":"Suraj", "lastName":"Kumar" },
{ "firstName":"John", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Gregory" }
]}

and…

<Geeks>
<Geek><firstName>Vivek</firstName><lastName>Kothari</lastName></Geek>
<Geek><firstName>Suraj</firstName><lastName>Kumar</lastName></Geek>
<Geek><firstName>John</firstName><lastName>Smith</lastName></Geek>
<Geek><firstName>Peter</firstName><lastName>Gregory</lastName></Geek>
</Geeks>

Even if you are seeing this for the first time, there is a chance that you’ve just developed a preference for one of them. The first one is JSON, the other is XML. It doesn’t mean XML is bad. The relation between JSON and XML simply has a Darwinian nature.
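To make the comparison a bit more tangible, here is a minimal sketch in Python that reads a shortened version of both samples with the standard library only. The variable names are mine, and the sample is trimmed to two people to keep it short.

```python
import json
import xml.etree.ElementTree as ET

# Shortened versions of the two samples above (two people instead of four).
json_text = (
    '{"Geeks":['
    '{"firstName":"Vivek","lastName":"Kothari"},'
    '{"firstName":"Suraj","lastName":"Kumar"}'
    ']}'
)
xml_text = (
    "<Geeks>"
    "<Geek><firstName>Vivek</firstName><lastName>Kothari</lastName></Geek>"
    "<Geek><firstName>Suraj</firstName><lastName>Kumar</lastName></Geek>"
    "</Geeks>"
)

# JSON: one call gives ready-to-use lists and dictionaries.
people_from_json = json.loads(json_text)["Geeks"]

# XML: the tree has to be walked to rebuild the same structure.
root = ET.fromstring(xml_text)
people_from_xml = [
    {child.tag: child.text for child in geek}
    for geek in root.findall("Geek")
]

print(people_from_json == people_from_xml)  # same data either way
print(len(json_text), len(xml_text))        # characters needed by each format
```

Both formats carry the same information here. The difference is in how much text travels and how much work it takes to get the data back out.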

Let’s expand on its architecture a bit.

  • JSON is built from two primary parts: keys and their associated values.
  • A key is always a string enclosed in double quotation marks.
  • A value has more freedom. It can be a number, a boolean, a string, an array, an object, or null.
  • Key/value pairs follow a clear rule: the key is followed by a colon, which is followed by the value.
  • Key/value pairs are separated with commas.
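The rules above can be checked with a few lines of Python. This is only a sketch: the helper name `is_valid_json` and the sample strings are made up for illustration, and Python’s built-in `json` module stands in for any JSON parser.

```python
import json

def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Follows every rule: quoted keys, colons, commas between pairs.
valid = '{"city": "Warsaw", "temp_c": 3.5}'

# Each of these breaks exactly one rule.
broken = {
    "key without quotes": '{city: "Warsaw"}',
    "single quotes instead of double": "{'city': 'Warsaw'}",
    "missing colon": '{"city" "Warsaw"}',
    "missing comma": '{"city": "Warsaw" "temp_c": 3.5}',
}

print(is_valid_json(valid))
for reason, text in broken.items():
    print(reason, "->", is_valid_json(text))
```

The strictness is a feature: a parser never has to guess what the sender meant.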

Take a look at data types:

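  • boolean: one of two alternatives, true or false,
  • number: an integer or a decimal (floating-point) value,
  • string: text in double quotation marks,
  • array: an ordered list of values,
  • object: an unordered collection of key/value pairs (an associative array),
  • null: an explicitly empty value.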
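Here is a small sketch of how those types look once a program reads them, using Python’s standard `json` module. The weather-style document is invented for illustration. Note that JSON numbers may also be decimals, and that JSON has a null value for “nothing here”.

```python
import json

# A made-up document that uses every JSON data type once.
document = """
{
  "is_raining": false,
  "temperature_c": 3.5,
  "humidity_pct": 81,
  "city": "Warsaw",
  "hourly_temps_c": [3.5, 3.1, 2.8],
  "wind": {"speed_kph": 12, "direction": "NW"},
  "alert": null
}
"""

parsed = json.loads(document)

# Map each key to the name of the Python type the parser chose for its value.
type_names = {key: type(value).__name__ for key, value in parsed.items()}

for key, name in type_names.items():
    print(f"{key}: {name}")
```

In Python, a JSON object becomes a `dict`, an array becomes a `list`, and null becomes `None`. Other languages make analogous choices.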

JSON is winning because it’s very simple. It’s just that: no advanced technology hidden inside, nothing special about it.

Photo by Luke Chesser on Unsplash

Back to the context

Data almost always needs to be moved from one place to another. The usual case is from a server to a client. Have you ever wondered what package arrives in your weather app whenever you check it? Now you know.

An app is a client. You, as a user launching it, send a request to a remote server to collect the data you’re interested in (weather variables for a given location). Your app programmatically calls someone to collect the data for you. The alternative? Checking the weather on a website and seeing it for yourself. Well, guess what? The website calls the same place to get the data. Please notice I used two terms: “programmatically” and “an app” (short for application). I’m building another concept here for you.

We are used to Graphical User Interfaces. We move around websites using a visual interface. It’s not needed when machines talk to one another. You simply establish rules of communication, and the visual part goes away. Such a set of rules is called an Application Programming Interface (API). For instance, your weather app calls the API of a weather forecast provider. Guess what format the payload comes back in? I’ve mentioned that already. Of course, it’s JSON.
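The exchange between the app and the API can be sketched without any network at all. Everything below is hypothetical: `fake_weather_api` stands in for the provider’s server, `weather_app` for the client, and the field names are invented. A real app would send an HTTP request where this sketch makes a plain function call.

```python
import json

def fake_weather_api(city):
    """Stand-in for a remote server: returns the response body as a JSON string."""
    readings = {"Warsaw": {"temp_c": 3.5, "condition": "cloudy"}}
    if city not in readings:
        return json.dumps({"error": f"unknown city: {city}"})
    return json.dumps({"city": city, **readings[city]})

def weather_app(city):
    """Stand-in for the client: asks the 'server' and turns JSON into a message."""
    body = fake_weather_api(city)  # in a real app this would be an HTTP request
    data = json.loads(body)        # the payload arrives as text and is parsed
    if "error" in data:
        return data["error"]
    return f"{data['city']}: {data['temp_c']} C, {data['condition']}"

print(weather_app("Warsaw"))
print(weather_app("Atlantis"))
```

The only thing that crosses the boundary between the two functions is a JSON string, which is the whole point.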

To wrap up the technological aspect, I need to mention another place where JSON is used as frequently as in the API sphere.

There is a specific type of database: NoSQL. The name is usually read as Not-Only-SQL, a relatively new paradigm for building powerful databases.

Traditional database design is based on restrictive rules for combining rows and columns.

NoSQL databases are not bound by those rigorous rules and, thanks to that, they can achieve high elasticity and very high speeds. A user waiting a second or longer for a mobile app to respond is basically a lost customer. To limit latency, a NoSQL database is placed at the serving part. And JSON documents are often used in this kind of system.
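To give a feel for the idea, here is a toy sketch of a document store in Python. It is not how any real NoSQL database is implemented; the class `TinyDocumentStore` and the keys are hypothetical. It only shows that each record is a JSON document looked up by a key, and that two documents don’t have to share the same fields.

```python
import json

class TinyDocumentStore:
    """Toy key -> JSON document store kept in memory (illustration only)."""

    def __init__(self):
        self._documents = {}

    def put(self, key, document):
        # Documents are stored as JSON text, with no fixed set of columns.
        self._documents[key] = json.dumps(document)

    def get(self, key):
        raw = self._documents.get(key)
        return None if raw is None else json.loads(raw)

store = TinyDocumentStore()
store.put("user:1", {"name": "Vivek", "devices": ["phone", "tablet"]})
store.put("user:2", {"name": "Suraj", "newsletter": True})

print(store.get("user:1"))
print(store.get("user:2"))
print(store.get("user:3"))  # missing key
```

A row-and-column table would need a column for every possible field. Here each document simply carries whatever it has.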

I’ll touch on the Modern Data Stack context one final time. It’s the background of most present-day data architecture stories.

Unfortunately, a blog piece is not a good platform for presenting complex ideas at once. Yet I won’t sleep well without emphasizing the context which helps in grounding new knowledge.

I intend to expand on the subject in future posts. For now, here is a simplistic architecture of what all the recent fuss is about.

source: the author

This is a big abstraction, yet perfectly fine as an intro for some readers. You collect data and store it. You used to perform Extract-Transform-Load (ETL) processes. Now they’re more often ELT processes (“T” and “L” switched) or ETLT processes (Transform performed both before and after the Load part).
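The ELT order can be sketched in a few lines of Python. This is a simplified illustration under loud assumptions: an in-memory SQLite database plays the role of the warehouse, and the table names, the `run_elt` function, and the weather payloads are all invented. Real stacks use dedicated loading and transformation tools.

```python
import json
import sqlite3

# Extract: pretend these JSON payloads were pulled from a source system.
raw_payloads = [
    '{"city": "Warsaw", "temp_c": 3.5}',
    '{"city": "Krakow", "temp_c": 1.0, "wind_kph": 12}',
]

def run_elt(payloads):
    warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse

    # Load: store the payloads untouched, exactly as they arrived.
    warehouse.execute("CREATE TABLE raw_weather (payload TEXT)")
    warehouse.executemany(
        "INSERT INTO raw_weather VALUES (?)", [(p,) for p in payloads]
    )

    # Transform: only now shape the raw JSON into a tidy table.
    warehouse.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
    stored = warehouse.execute("SELECT payload FROM raw_weather").fetchall()
    for (payload,) in stored:
        record = json.loads(payload)
        warehouse.execute(
            "INSERT INTO weather VALUES (?, ?)",
            (record["city"], record["temp_c"]),
        )

    return warehouse.execute(
        "SELECT city, temp_c FROM weather ORDER BY city"
    ).fetchall()

print(run_elt(raw_payloads))
```

In an ETL flow the shaping step would happen before anything is stored. With ELT the raw JSON is kept, so it can be transformed again later in a different way.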

The Data Warehouse is the focal point. Often this is the place where the job of a data engineer ends and teams of interested parties hook into the system and launch their own procedures.

Since data must be put in a specific context to prove its value, modeling procedures are performed. They feed the final parts, where the curated data displays its potential in front of interested parties.

Again, I will expand on the subject in future posts.

I hope you got a good grip on the meaning of the JSON format, both by itself and as a part of the data ecosystem. You encountered the API idea and now have an abstract concept of what the Modern Data Stack looks like.

Michał Poła

Analytics engineer. Quantum information enthusiast.