Jun 10
Build a JSON Parser & Query Tool

Building a JSON parser & query tool with Go

Welcome to dora, the JSON query tool! As of right now, I don't suggest actually using dora, as it is a work in progress, and you will be better off using the standard library's encoding/json or another third-party package. I also haven't completed full support for the spec, in favor of writing this post first: I still need to apply all of the escape rules for strings and add support for exponential notation.

The main reason I wanted to build this was that I’ve never had to worry about how it works, and I don’t like magic.

I’d like the code/comments themselves to tell a deeper technical story while I stay a bit higher level from within the post.

Note: This covers a small corner of these topics, especially when it comes to parsing. Keep that in mind, as there are many other techniques and processes. With that being said, I’ll do my best to explain the parts needed to understand dora.

Alright, let’s get started!

Here are some questions to keep in mind. I’m hoping you’re able to answer these to some degree by the end of this post.

What is a JSON parser?

What is lexical analysis?

What is an abstract syntax tree?

How can you query against data like that?


What is a JSON parser?

JSON, or JavaScript Object Notation, is a lightweight data-interchange format. It's extremely popular, especially within the web application space. JSON is completely language independent, so we can produce or consume it from any language, but that also means every language needs a tool to consume and make sense of the data.

This is where JSON parsers come in. Most languages (especially newer ones) have some sort of built-in support/library for JSON. Languages have different data structures, so each one maps JSON to these structures a bit differently.


Lexical analysis (aka scanning, aka tokenization)

This is the first task that dora completes, and it is generally the first step any programming language implementation completes as well. Source code, or in this case some JSON, is on its own just bytes representing a string, with no real way to interpret or manipulate the data productively.

{ "someKey": "someValue", "anotherKey": [1, 2, "three"] }

So, what we are going to do is iterate over the bytes and break them into a series of tokens, discarding whitespace and other unnecessary noise along the way. (Comments would be stripped here too, but JSON doesn't support them.) Before we can decide how the parser will handle syntax, we must have tokens to work with.

The end result is a collection of tokens, representing the source code. A token in dora looks like this:
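The struct that originally appeared here is missing from this copy; a minimal sketch of what dora's Token might look like, with fields reconstructed from the description that follows (exact names are an assumption):

```go
package main

import "fmt"

// Type identifies the kind of token (brace, bracket, string, number, etc.).
type Type int

// Token is one lexical unit of the input JSON.
type Token struct {
	Type    Type   // which token type this is
	Literal string // the raw text of the token, e.g. "{" or "someKey"
	Line    int    // line number in the input, for error reporting
	Start   int    // index in the input where the token begins
	End     int    // index in the input where the token ends
}

func main() {
	tok := Token{Literal: "{", Line: 1, Start: 0, End: 1}
	fmt.Printf("%+v\n", tok)
}
```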

The token Type can be any of the token types listed at the top of pkg/token/token.go:
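The actual list lives at the top of pkg/token/token.go; a representative subset might look like this (the constant names here are my guess, not the real list):

```go
package main

import "fmt"

// Type enumerates the token kinds a JSON lexer needs.
type Type int

const (
	LeftBrace    Type = iota // {
	RightBrace               // }
	LeftBracket              // [
	RightBracket             // ]
	Colon                    // :
	Comma                    // ,
	String                   // "someKey"
	Number                   // 10, 8.1
	True                     // true
	False                    // false
	Null                     // null
	EOF                      // end of input
)

func main() {
	fmt.Println(LeftBrace, RightBracket, EOF)
}
```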

The token's Literal holds the actual value of the token: for example {, }, [, ], a comma, "someKey", 10, etc. The Line field is simply for better error reporting, while the Start and End fields represent the indexes within the JSON where an item lives (I'll expand on this towards the end).

The approach to this lives inside lexer.go. We iterate through the bytes with some defined rules and create tokens from the source. We end up with a []Token that represents the program from start to finish.
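This is not dora's lexer, but a heavily simplified sketch of the same single-pass idea: walk the bytes, skip whitespace, and emit a token for each structural character or literal (string escapes and error handling omitted):

```go
package main

import "fmt"

// lex does a single left-to-right pass over the input bytes,
// emitting one token per structural character or literal.
// A real lexer like dora's also tracks line numbers, produces
// typed tokens, and handles string escapes properly.
func lex(input string) []string {
	var tokens []string
	for i := 0; i < len(input); i++ {
		c := input[i]
		switch {
		case c == ' ' || c == '\t' || c == '\n':
			// whitespace carries no meaning in JSON: skip it
		case c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',':
			tokens = append(tokens, string(c))
		case c == '"':
			// scan to the closing quote (escape rules omitted for brevity)
			j := i + 1
			for j < len(input) && input[j] != '"' {
				j++
			}
			tokens = append(tokens, input[i:j+1])
			i = j
		default:
			// scan a bare literal: a number, true, false, or null
			j := i
			for j < len(input) && input[j] != ',' && input[j] != '}' && input[j] != ']' && input[j] != ' ' && input[j] != '\n' {
				j++
			}
			tokens = append(tokens, input[i:j])
			i = j - 1
		}
	}
	return tokens
}

func main() {
	fmt.Println(lex(`{ "someKey": [1, "three"] }`))
}
```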

Here is a good link on lexical analysis you can follow up with if you’d like.


On to parsing!

So what the heck is an AST (Abstract Syntax Tree)? It is a tree that represents the abstract syntactic structure of a language. In other words, it gives our tokens meaning and structure. Here is a thorough definition of doing it within the scope of building a compiler.

Our job parsing JSON is much simpler than parsing most programming languages. We have much less to orchestrate. We essentially just need to build a representation of it in memory so we can query against the data.

I was inspired by this AST explorer for the shape of the tree. I found it to be a really nice tool for helping visualize them in general. You can switch between a bunch of different languages and see what their ASTs look like. It should help bridge some gaps in your head around what such a thing looks like.

The method we use for parsing is called recursive descent parsing. It is not the only possible parsing method, nor the most efficient, but it works well for our use case. Recursive descent is a top-down approach: the tree is constructed from the top down, the input is read from left to right, and we make a single pass over it. The technique is called recursive because JSON's grammar is recursive in nature, and the parser mirrors that structure with functions that call each other (and themselves).

So, what does JSON’s grammar look like? We’ll have to look at the spec! It has a really nice visual representation of the grammar as well. Here is a rough estimate of what the BNF (Backus-Naur form) might look like:
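The grammar snippet originally shown here is missing from this copy; a rough BNF sketch consistent with the numbered description below would be:

```
<json>     ::= <value>
<value>    ::= <object> | <array> | <boolean> | <string> | <number> | <null>
<object>   ::= "{" [ <property> { "," <property> } ] "}"
<property> ::= <string> ":" <value>
<array>    ::= "[" [ <value> { "," <value> } ] "]"
<boolean>  ::= "true" | "false"
<null>     ::= "null"
```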

So, what is this now? Well, if we start at the top, we can talk about it like this:

  1. JSON is a <value>.
  2. A <value> is one, and only one, of the following: <object>, <array>, <boolean>, <string>, <number>, <null>.
  3. An <array> must start with a left bracket [ and can contain zero or more <value>s. When there are multiple <value>s, they must be comma-separated. Arrays must end with a right bracket ].
  4. An <object> must start with a left brace { and can contain zero or more <property> items. When there are multiple <property> items, they must be comma-separated. Objects must end with a right brace }.
  5. A <property> is a <string> followed by a : followed by a <value>.

This is a nice guideline for building the AST. We now know what valid JSON must look like. Hopefully, this helps for visualizing how some of that recursion we talked about comes into play here. For a basic JSON object like this:

{ "property1": { "another": "object" } }

You have the base JSON object, which is a <value>. That <value> is an <object>, which has a <property> inside ("property1"). The property's right-hand side is another <value>, and that <value> is another object! The same idea applies to arrays, and to combinations of the two.

Here is the implementation:
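The real implementation is in dora's parser; since the original code block is missing here, this is a condensed recursive-descent sketch over pre-lexed string tokens (dora works on typed tokens and builds AST nodes, and it handles errors, which this sketch does not):

```go
package main

import "fmt"

// parser holds the token stream and the current position in it.
type parser struct {
	tokens []string
	pos    int
}

func (p *parser) current() string { return p.tokens[p.pos] }

// parseValue switches on the current token and dispatches to the
// right parsing method: this is the recursive entry point.
func (p *parser) parseValue() interface{} {
	switch p.current() {
	case "{":
		return p.parseObject()
	case "[":
		return p.parseArray()
	default:
		v := p.current() // a literal: string, number, true, false, null
		p.pos++
		return v
	}
}

func (p *parser) parseObject() map[string]interface{} {
	obj := map[string]interface{}{}
	p.pos++ // consume "{"
	for p.current() != "}" {
		key := p.current()
		p.pos += 2                // consume the key and the ":" (assumes valid input)
		obj[key] = p.parseValue() // may recurse into nested objects/arrays
		if p.current() == "," {
			p.pos++
		}
	}
	p.pos++ // consume "}"
	return obj
}

func (p *parser) parseArray() []interface{} {
	var arr []interface{}
	p.pos++ // consume "["
	for p.current() != "]" {
		arr = append(arr, p.parseValue())
		if p.current() == "," {
			p.pos++
		}
	}
	p.pos++ // consume "]"
	return arr
}

func main() {
	p := &parser{tokens: []string{"{", `"a"`, ":", "[", "1", ",", "2", "]", "}"}}
	fmt.Println(p.parseValue())
}
```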

As you can see, there is a fair amount going on in there. The main takeaways are:

  1. We recursively parse from the root.
  2. The entry point for parsing values is parseValue().
  3. parseValue() switches on the current token type and calls the appropriate parsing method.
  4. Each value type gets a set of states, almost like a little state machine. The tree can expand in the middle of a value: for example, while parsing an object we might reach a property key that points to another object, which means we have to pause parsing the outer object while we dive into the inner one. As you can imagine, any number of objects or arrays can be in different states at different times.

The different states are defined in ast.go:
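The code from ast.go is missing in this copy; a condensed sketch of the states and node types, with names reconstructed from the surrounding description (so treat them as assumptions):

```go
package main

import "fmt"

// state tracks where we are while parsing a single object or array,
// so nested values can pause the outer one and resume it cleanly.
type state int

const (
	ObjStart    state = iota // expect "{"
	ObjOpen                  // expect a key or "}"
	ObjProperty              // expect "," or "}"
	ArrayStart               // expect "["
	ArrayOpen                // expect a value or "]"
	ArrayValue               // expect "," or "]"
)

// Root types: JSON is either an object or an array at the top.
const (
	ObjectRoot = iota
	ArrayRoot
)

// Value is implemented by every AST node.
type Value interface{}

// RootNode is the top of the tree: its Value gets type-asserted
// into an Object or Array depending on Type.
type RootNode struct {
	Type  int // ObjectRoot or ArrayRoot
	Value Value
}

// Object and Array record Start/End byte offsets into the input,
// which later lets queries slice values out of the original JSON.
type Object struct {
	Children   []Property
	Start, End int
}

type Array struct {
	Values     []Value
	Start, End int
}

type Property struct {
	Key   string
	Value Value
}

func main() {
	root := RootNode{Type: ObjectRoot, Value: Object{Children: []Property{{Key: "a", Value: "1"}}}}
	fmt.Printf("%+v\n", root)
}
```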

As you can see, this is also where we define our AST types. There is a root node which holds a Type (ObjectRoot or ArrayRoot) and the corresponding value. That value gets type-asserted into an array or object, which serves as the root item in the JSON.

If you look at the rest of the types you will see that each of them uses our Value interface in some way. These are the types that we build the in-memory tree from.

The last piece of dora's current functionality is the Client method GetByPath. Dora currently supports a very simple query syntax: GetByPath is a public method on a dora Client that takes a query string. There are a few basic rules:

  1. All queries must start with $ followed by either a . or [] (depending on the root Type)
  2. Access objects with .
  3. Access arrays by index with [i]

Example:
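The original example snippet is missing from this copy; following the three rules above, queries look roughly like this (the paths themselves are made up for illustration):

```
$.users[0].name    root is an object: walk keys with "." and arrays with [i]
$[2].someKey       root is an array: start with $[i]
```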

So in other words, what you get is something similar to the way you would access the data in JavaScript (except without bracket [] notation for accessing objects).

Now we have to parse and execute a user’s query against our AST. The lexer and parser for a dora query are a bit simpler than the original ones for the JSON itself. It’s just a couple of functions:

The idea is to parse the query into a slice of queryToken. Each query token holds an accessType so we know whether we’re looking at an object or an array, along with a keyReq and indexReq. That way when we’re iterating through our []queryToken, we know what to look for in our tree!
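A sketch of what that queryToken might look like, reconstructed from the field names mentioned above (the accessType constants are my own illustration):

```go
package main

import "fmt"

// accessType distinguishes object access (.key) from array access ([i]).
type accessType int

const (
	ObjectAccess accessType = iota
	ArrayAccess
)

// queryToken is one step in a parsed query like $.users[0].name.
type queryToken struct {
	accessType accessType
	keyReq     string // the key to look up when accessing an object
	indexReq   int    // the index to use when accessing an array
}

func main() {
	// $.users[0] would parse into roughly:
	toks := []queryToken{
		{accessType: ObjectAccess, keyReq: "users"},
		{accessType: ArrayAccess, indexReq: 0},
	}
	fmt.Println(toks)
}
```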

Here is the query implementation (just a couple of functions again):
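The real functions are in dora's source; since the code block is missing from this copy, here is a simplified sketch of the execution idea, walking a tree one query step at a time (plain maps and slices stand in for the AST nodes, and the step type is my own simplification):

```go
package main

import "fmt"

// step is one parsed query token: either an object key or an array index.
type step struct {
	key   string // set for object access
	index int    // set for array access
	isArr bool
}

// execute narrows the current value one step at a time until the
// query is exhausted; a real implementation also reports errors
// for missing keys or out-of-range indexes.
func execute(tree interface{}, steps []step) interface{} {
	cur := tree
	for _, s := range steps {
		if s.isArr {
			cur = cur.([]interface{})[s.index]
		} else {
			cur = cur.(map[string]interface{})[s.key]
		}
	}
	return cur
}

func main() {
	tree := map[string]interface{}{
		"users": []interface{}{
			map[string]interface{}{"name": "ada"},
		},
	}
	// $.users[0].name
	fmt.Println(execute(tree, []step{{key: "users"}, {index: 0, isArr: true}, {key: "name"}}))
}
```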

I ended up using a jump variable instead of keeping track of a read position, current token, next token, etc. That seemed like overkill, since we can just parse each query token and assign a jump to the next one.

If the value being queried is a Literal, it is simply returned: "null", "true", "false", "some random value", "10", "8.1", etc. If the requested value is an object or array itself, it returns the full array or object as a JSON string.

Earlier I mentioned I would talk more about why we have a Start and End field on our Object and Array types from pkg/ast/ast.go. We set those values as we are scanning & parsing and they point to the start and end indexes of the object or array in the actual input JSON. This gives us the ability to slice and return a value from the original JSON.
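For example, with hypothetical offsets recorded during parsing, returning the nested object is just a slice of the original input:

```go
package main

import "fmt"

func main() {
	input := `{ "someKey": { "a": 1 } }`
	// Suppose parsing recorded Start=13, End=23 for the inner object.
	start, end := 13, 23
	fmt.Println(input[start:end]) // the inner object, verbatim from the input
}
```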

You made it!

I wanted to leave a list of resources to explore further if you have any interest! If you got this far and you like this type of stuff, the next step would be diving into writing a toy language of some sort.

  1. My introduction to this type of work was building monkey-lang. The original design is from the books Writing an Interpreter in Go and Writing a Compiler in Go by Thorsten Ball. I highly recommend those books. I wanted to share a quote from Thorsten's site, from the founder of HashiCorp: "Compilers was the most surprisingly useful university course I ever took. Learning to write a parser and runtime for a toy language helps take away a lot of 'magic' in various parts of computer science. I recommend any engineer who isn't familiar with lexers, parsers, and evaluators to read Thorsten's book." - Mitchell Hashimoto, Founder of HashiCorp
  2. If you prefer to build a language with Java or C instead of Go, I highly recommend https://craftinginterpreters.com by Bob Nystrom. He recently completed the last few chapters, which I haven't yet had the chance to go through.
  3. JSONPath, which also gave me inspiration for the query syntax (although dora currently supports only a very small subset of it).
  4. The JSON spec
  5. A BNF representation of JavaScript
  6. A simple recursive descent parser
  7. Link to the source code

Follow me on twitter @LamsonScribner for more content & tutorials on software engineering!
