Handling JSON with jq
Intro
jq is a power tool for working with JSON on the command-line. It’s
incredibly useful for working with JSON data such as documents and API
responses. It’s one of several popular tools for manipulating JSON,
including JMESPath and Jsonnet. Although I have become particularly fond
of jq, you might want to look into JMESPath for its integration into the
Amazon Web Services (AWS) and Azure command-line tools.
If you don’t feel like installing jq on your machine, you can follow along
on jq play.
Extracting Data
We can extract data from a JSON document like the one below. Let’s say this
data is stored in data.json.
[
{
"name": "Ada",
"reports": [
"Boris",
"Becky",
"Booker"
]
},
{
"name": "Booker",
"reports": [
"Carly",
"Chris"
]
},
{
"name": "Carly",
"reports": [
"Diana",
"David"
]
}
]
Let’s say we want all of the supervisors’ names:
% jq '. [] .name' < data.json
"Ada"
"Booker"
"Carly"
If we don’t want the quotes, we can use --raw-output (or -r), a
command-line flag that outputs raw values rather than JSON-encoded data:
% jq '. [] .name' -r < data.json
Ada
Booker
Carly
We can get all of the reports:
% jq '. [] .reports []' -r < data.json
Boris
Becky
Booker
Carly
Chris
Diana
David
We can get all of the employees together by combining the two filters:
% jq '. [] .reports [], . [] .name' -r < data.json
Boris
Becky
Booker
Carly
Chris
Diana
David
Ada
Booker
Carly
We can pipe an API response into jq to pretty-print it. We can also use
this in scripting scenarios.
% curl http://dummy.restapiexample.com/api/v1/employees | jq .
[
{
"id": "41393",
"employee_name": "Fernando",
"employee_salary": "4061",
"employee_age": "68",
"profile_image": ""
},
{
"id": "41394",
"employee_name": "SLKJDLKJ",
"employee_salary": "32",
"employee_age": "43",
"profile_image": ""
},
...
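Because jq can also report results through its exit status, it slots nicely into shell conditionals. Here’s a small sketch using the --exit-status (-e) flag; the record is a simplified, hand-written version of the API output above, with a numeric age:

```shell
# -e (--exit-status) makes jq exit non-zero when its last output is
# false or null, so we can branch on JSON content in a script.
response='{"employee_name": "Fernando", "employee_age": 68}'

if echo "$response" | jq -e '.employee_age > 65' > /dev/null; then
  echo "eligible for retirement"
fi
```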
Transformations
jq has powerful transformation capabilities. Using the example above, we can
build new objects from the data:
% jq '. [] | { firstName: .name }' < data.json
{
"firstName": "Ada"
}
{
"firstName": "Booker"
}
{
"firstName": "Carly"
}
The map filter allows us to process array items in place. It works similarly
to the map higher-order function found in many other languages.
% jq 'map({firstName: .name})' < data.json
[
{
"firstName": "Ada"
},
{
"firstName": "Booker"
},
{
"firstName": "Carly"
}
]
We can filter the results by applying selectors to each object of an array.
We do this using the select filter.
% jq 'map(select(.name == "Ada"))' < data.json
[
{
"name": "Ada",
"reports": [
"Boris",
"Becky",
"Booker"
]
}
]
And we can combine selectors with transformations:
% jq 'map(select(.name == "Ada") | { firstName: .name })' < data.json
[
{
"firstName": "Ada"
}
]
Functions
jq comes with a bunch of built-in functions we can use to transform and
filter data, such as reversing arrays:
% jq ' . | reverse[] |.name' -r < data.json
Carly
Booker
Ada
Filtering based on content using the startswith function:
% jq ' . [] | .name | select(. | startswith("A")) ' -r < data.json
Ada
String interpolation using the \(...) escape sequence:
% jq ' map("Hi, \(.name)")[]' -r < data.json
Hi, Ada
Hi, Booker
Hi, Carly
And so many other functions you should investigate in the manual.
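For example, length counts the elements of an array (or the keys of an object), and keys lists an object’s keys. A quick sketch against the same data.json:

```shell
# length counts array elements; keys returns an object's keys, sorted.
cat > data.json <<'EOF'
[
  {"name": "Ada", "reports": ["Boris", "Becky", "Booker"]},
  {"name": "Booker", "reports": ["Carly", "Chris"]},
  {"name": "Carly", "reports": ["Diana", "David"]}
]
EOF

jq -c 'map(.reports | length)' < data.json   # [3,2,2]
jq -c '.[0] | keys' < data.json              # ["name","reports"]
```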
Working with AWS Results
We can use jq to work with AWS API data. Assuming you have an AWS account,
you can install awscli to work with the AWS APIs from the command-line. One
of the simplest things you can do is list publicly available AWS Elastic
Compute Cloud (EC2) images. These are images you can use to create new
virtual machines. The command takes a long time to run, so we’ll save the
results in a local file.
aws ec2 describe-images --output json > aws_images.json
The image data looks something like this:
{
"Images": [
{
"Architecture": "i386",
"CreationDate": "",
"ImageId": "aki-00806369",
"ImageLocation": "karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091001-test-04.manifest.xml",
"ImageType": "kernel",
"Public": true,
"OwnerId": "099720109477",
"State": "available",
"BlockDeviceMappings": [],
"Hypervisor": "xen",
"RootDeviceType": "instance-store",
"VirtualizationType": "paravirtual"
},
{
"Architecture": "i386",
"CreationDate": "",
"ImageId": "aki-00896a69",
"ImageLocation": "karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091002-test-04.manifest.xml",
"ImageType": "kernel",
"Public": true,
"OwnerId": "099720109477",
"State": "available",
"BlockDeviceMappings": [],
"Hypervisor": "xen",
"RootDeviceType": "instance-store",
"VirtualizationType": "paravirtual"
},
// ... more
]
}
I’ve cut out most of the results since, when I ran it, there were 123,521 public images. We can limit the results to the first 1000 images with the following:
cat aws_images.json | jq '.Images |= .[0:1000]' > short_aws_images.json
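The |= update-assignment operator used here runs a filter on the value at a path and writes the result back in place. Here’s a minimal sketch with a made-up document:

```shell
# .Images |= .[0:2] slices the array under "Images" in place, keeping
# only the first two elements.
echo '{"Images": [1, 2, 3, 4, 5]}' | jq -c '.Images |= .[0:2]'
# prints {"Images":[1,2]}
```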
If we want to get a list of the providers of Ubuntu images, and the versions
of those images, we can use the .ImageLocation property of each image
dictionary, and use a regular expression to parse that location, capturing
information about the Ubuntu release and version. The capture function
allows us to use named capture groups to transform the results into a new
JSON object.
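Before tackling the full command, here is capture on its own with a toy input:

```shell
# capture() matches a regex against the input string and returns an
# object with one key per named capture group.
echo '"Ada Lovelace"' | jq -c 'capture("(?<first>\\w+) (?<last>\\w+)")'
# prints {"first":"Ada","last":"Lovelace"}
```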
cat short_aws_images.json \
| jq '.Images[] | select(.ImageLocation | contains("ubuntu")) | .ImageLocation | {ImageLocation: .} + capture("^(?<provider>\\w+)\/.*ubuntu(?:-(?<release>trusty|xenial|bionic|eoan))?-(?<version>(?:[\\d\\.]+|daily))")'
The command is a little hard to read, but we can save it as a filter file
that can be loaded via the --from-file (or -f) flag. We’ll save it as
ubuntu.jq_filter with the contents below. One big advantage of storing the
filter in a file is that we can use comments to clarify what we are trying
to do.
.Images[] # unpack the array at the `Images` key
| select(.ImageLocation | contains("ubuntu")) # select all objects where the `ImageLocation` key contains ubuntu
| .ImageLocation # grab the `ImageLocation` key value from each object
| {ImageLocation: .} # Define a new object with the ImageLocation key and value
# then add that to the object returned by this capture, which will be an object
# where each key-value pair corresponds to a named capture group, like:
# { "provider": "abc", "release": "trusty", "version": "12.0.0" }
+ capture("(?# we match the start of the string)^(?# then, we grab the provider, which is all the text before the first forward-slash)(?<provider>\\w+)\/(?# after the first forward-slash, we ignore everything until the last occurrence of ubuntu).*ubuntu(?# next, we may optionally have a release name )(?:-(?<release>trusty|xenial|bionic|eoan))?-(?# finally, we have a numeric version or the word daily)(?<version>(?:[\\d\\.]+|daily))")
Unfortunately, the regular expression is a bit hard to read. jq uses the
Oniguruma regular expression library, which seems to allow comments defined
using a (?#...) group. Although this is better than having no comments, it
would be even better to be able to use free-spacing to allow multi-line
regular expressions with comments. I’m not sure whether the problem is that
Oniguruma doesn’t support this or that there’s something lacking in jq’s
implementation. Below, I show the regular expression using free-spacing,
with additional comments and the escape characters removed.
^ # we match the start of the string
# then, we grab the provider, which is all the text before the first forward-slash
(?<provider>\w+)
/
# after the first forward-slash, we ignore everything until the last occurrence of ubuntu
.*ubuntu
# next, we may optionally have a release name
(?:- # a dash is required
(?<release>trusty|xenial|bionic|eoan))?
- # then a dash
# finally, we have a numeric version or the word daily
(?<version>
(?:[\d\.]+ # the numeric version may contain digits and periods
|daily)) # or it can be the word "daily"
Then, we can use the filter file like so:
cat short_aws_images.json \
| jq -f ubuntu.jq_filter
{
"ImageLocation": "099720109477/ubuntu/images/ebs/ubuntu-trusty-14.04-i386-server-20180627",
"provider": "099720109477",
"release": "trusty",
"version": "14.04"
}
{
"ImageLocation": "099720109477/ubuntu/images-testing/ebs-ssd/ubuntu-xenial-daily-amd64-server-20170823.1",
"provider": "099720109477",
"release": "xenial",
"version": "daily"
}
{
"ImageLocation": "099720109477/ubuntu/images-testing/hvm-ssd/ubuntu-xenial-daily-amd64-server-20181102",
"provider": "099720109477",
"release": "xenial",
"version": "daily"
}
// more ...
Working with Elasticsearch results
Elasticsearch is a distributed full-text search engine built on Apache
Lucene. It has an easy-to-use API for issuing queries and retrieving data.
This great article on useful Elasticsearch queries has some example data
that we’ll use here. We’ll save it into a file called books.json.
{
"hits": [
{
"_index": "bookdb_index",
"_type": "book",
"_id": "4",
"_score": 1.3278645,
"_source": {
"title": "Solr in Action",
"authors": [
"trey grainger",
"timothy potter"
],
"summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
"publish_date": "2014-04-05",
"num_reviews": 23,
"publisher": "manning"
}
},
{
"_index": "bookdb_index",
"_type": "book",
"_id": "1",
"_score": 1.2871116,
"_source": {
"title": "Elasticsearch: The Definitive Guide",
"authors": [
"clinton gormley",
"zachary tong"
],
"summary": "A distibuted real-time search and analytics engine",
"publish_date": "2015-02-07",
"num_reviews": 20,
"publisher": "oreilly"
}
}
]
}
Let’s try two things with the data. First, we want to transform the
original result into a JSON array of the authors of all books. We’ll start
by retrieving the array referenced by the hits key. Then, we replace each
“hit” in the array with the value of its _source.authors key. Next, we
flatten the resulting multi-dimensional array. Finally, we sort the array
and return only unique results. This last step is not strictly necessary
but typically makes the output cleaner.
cat books.json | jq '.hits | map(._source.authors) | flatten | sort | unique'
[
"clinton gormley",
"timothy potter",
"trey grainger",
"zachary tong"
]
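To see what each stage contributes, here is the same pipeline run on a toy nested array:

```shell
# flatten merges nested arrays into one; sort orders the result; unique
# drops duplicates (and also sorts, which makes the sort step optional).
echo '[["b", "a"], ["a", "c"]]' | jq -c 'flatten | sort | unique'
# prints ["a","b","c"]
```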
Let’s try obtaining the total number of reviews of all of the books.
cat books.json | jq '.hits | map(._source.num_reviews) | add'
First, we get the array referenced by the hits key. Then we map over that
array, returning the number of reviews. Finally, we add all of those
values. The answer is 43.
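The add step at the end is just jq’s built-in sum over an array; on its own:

```shell
# add sums the elements of a numeric array.
echo '[23, 20]' | jq 'add'
# prints 43
```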
Conclusion
jq is an incredibly useful program that makes working with JSON data easy.
It’s the missing tool for manipulating JSON results from the command-line,
and it’s great in scripts, especially when dealing with APIs that return
JSON. jq is an incredibly rich tool with its own language and idioms.
Beyond filtering and transforming data and saving filters in files, there
are lots of things I didn’t talk about, including working with streaming
JSON, defining functions, and creating your own modules of custom filters.
While it’s not the only JSON tool you should look at, it will go a long way
in filling that empty spot in your toolbox.