PRACTICAL GREMLIN: An Apache TinkerPop Tutorial

1. INTRODUCTION

This second edition of Practical Gremlin is an evolving guide to the Apache TinkerPop Gremlin graph query and traversal language, built around real examples using real-world graph data. It is written for developers and practitioners who want to learn Gremlin by doing, with a focus on practical patterns, working traversals, and lessons learned from applying graph technology to real problems.

Feedback on this book is very much encouraged and welcomed! Please open GitHub issues as appropriate.

1.1. Welcome to the second edition

Practical Gremlin was first published in 2017. In the years since, the Apache TinkerPop graph computing framework has continued to evolve. While it was possible to keep the first edition mostly up to date and to publish periodic updates, there comes a time when so much has changed that it makes more sense to take a step back and release a second edition. The book has been updated to include the most recent new features added to the Gremlin query language and related technologies. Material such as discussions of migrating from old versions of Gremlin, and of how to work around features that were lacking then but have since been added, has been removed. The following paragraph was used to introduce the first edition. It remains as true today as it was then.

The title of this book could equally well be '"A getting started guide for users of graph databases and the Gremlin query language featuring hints, tips, and sample queries"'. It turns out that is a bit too long to fit on one line for a heading but in a single sentence that describes the focus of this work pretty well.

— Practical Gremlin first edition
October 2017

As with the first edition, I have resisted the urge to cover every single feature of TinkerPop one after the other in a reference manual fashion. Instead, what I have tried to do is capture the learning process that I myself have gone through using what I hope is a sensible flow from getting started to more advanced topics. To get the most from this book, I recommend having the Gremlin Console open, with the air route sample data loaded, as you follow along. I have not assumed that anyone reading this book has any prior knowledge of Apache TinkerPop, the Gremlin query language, or related tools. Everything you need to get started is introduced in Chapter 2.

I hope people continue to find what follows useful. It definitely remains a work in progress as the Apache TinkerPop framework, and in particular, the Gremlin query language, continues to evolve.

The book is available in multiple formats including PDF, HTML, ePub, and MOBI. Those versions, along with sample code and data, can be found at the project’s home on GitHub. You will find a summary of everything that is available in the "Introducing the book sources, sample programs, and data" section.

1.2. How this book came to be

I forget exactly when, but sometime early in 2016 I started compiling a list of notes, hints, and tips, initially for my own benefit. My notes were full of things I had found poorly explained elsewhere while using graph databases and especially while using Apache TinkerPop, Gremlin, and JanusGraph. Over time that document continued to grow and had effectively become a book in all but name. After some encouragement from colleagues, I decided to release my notes as a 'living book' in an open source venue so that anyone who is interested can read it. It is definitely aimed at programmers and data scientists, but I hope it is also consumable by anyone using the Gremlin graph query and traversal language to work with graph databases.

I have included a large number of code examples and sample queries along with discussions of best practices and more than a few lessons I learned the hard way that I hope you will find informative. I call it a 'living book' as my goal is to regularly make updates as I discover things that need adding while also trying to keep the content as up to date as possible as Apache TinkerPop itself evolves.

I remain extremely grateful to all those who have encouraged me to keep going with this adventure. Keeping up with a moving target requires a fair bit of work, but it remains a lot of fun!

Kelvin R. Lawrence
First draft (first edition): October 5th, 2017
Final draft (first edition): May 4th, 2022
First draft (second edition): January 1, 2026
Current draft (second edition): 2026-02-26 16:36:54 UTC

1.3. Providing feedback

Please let me know about any mistakes you find in this material, and also, please feel free to provide feedback of any sort. Suggested improvements are especially welcome. A good way to provide feedback is by opening an issue in the GitHub repository located at https://github.com/krlawrence/graph. You are currently reading revision v2-002-preview of the book.

The change history contains details of everything that has been added over time and can be found at this location: https://github.com/krlawrence/graph/blob/main/ChangeHistory.md

I am grateful to those who have already taken the time to review the manuscript and open issues or submit pull requests.

1.4. Some words of thanks

No open source project can succeed without dedicated contributors and equally dedicated users. Apache TinkerPop continues to be not just a technology, but a vibrant community as well. This book would have no audience but for the continued hard work, and interest, of that community.

As always, special thanks should go to Marko Rodriguez, Daniel Kuppitz, Stephen Mallette, and others, who created TinkerPop and Gremlin, and drove its evolution for many years. I’m also grateful to Stephen for his help in putting this second edition of the book together.

Inspiration as to what topics people are interested in often comes from seeing the active discussions on-line at venues such as StackOverflow and the Gremlin Users Google Group. Since the first edition of this book was released, Apache TinkerPop now also has an active Discord server where many exciting topics get discussed daily.

I continue to be grateful for the contributions of my former colleagues, Graham Wallis, Jason Plurad, and Adam Holley, who helped refine and improve several of the example queries contained in the first edition of this book. Gremlin is definitely a bit of a team sport. We spent many fun hours discussing the best way to handle different types of queries and traversals!

Lastly, I would like to thank everyone who has submitted feedback and ideas via e-mail, GitHub issues, and pull requests. That is the best part about this being a 'living book': we can continue to improve and evolve it just as the technology it is about continues to evolve. Your help and support are very much appreciated.

1.5. Thoughts on the Second Edition

Except for this section, references to "I" in this book refer to the book’s original author, Kelvin Lawrence. As the editor and co-author of the Second Edition, I didn’t feel as though "I" should become "We" anywhere, as I believe that much of the appeal of this book lies in Kelvin relating his personal discoveries on his journey with Gremlin. I’ve tried to preserve that feeling and voice as much as possible in my many additions and edits.

As I write this section, I realize that I’ve authored many lines of TinkerPop’s Reference Documentation, Tutorials, and Recipes over the years and, as a result, I tend to think there is a certain completeness to the documentation that is officially offered. I think that’s a valuable aspect of the TinkerPop project, yet I’ve also always recognized the importance and impact of this book as a resource for Gremlin users. My personal opinion is that its approach and organization are what allow it to make that helpful impact. Where TinkerPop Documentation has its completeness presented all at once, this book gives you a clear, ramped process for learning Gremlin so that you can make use of that completeness in the official documentation.

In 2023, TinkerPop was in the midst of releasing 3.6.x and 3.7.x release lines which had introduced a number of important new features. Kelvin and I started realizing that Practical Gremlin was falling behind TinkerPop’s continued evolution significantly. With even bigger sets of changes expected on the horizon with 4.0, it felt as though it was time to make an organized effort to produce an updated Second Edition.

We officially announced that work had started on it in September 2023 and with the big release of 3.8.0 in November 2025, we believed it time to polish up the draft and make an official publication of the new edition. As with the First Edition, we expect the book to continue to be a work-in-progress and will make minor publications as new TinkerPop releases appear.

As you read this edition, please keep in mind that it reflects the state of Gremlin and TinkerPop at a particular point in time, but it is also part of an ongoing conversation with the community. Your questions, suggestions, and examples help shape where the book goes next. If you find places where it could be clearer, more helpful, or more complete, your feedback is welcome and will help guide future updates.

Stephen Mallette
Second edition editor and co-author

1.6. What is this book about?

This book is about learning to think in Gremlin by working through practical examples. It focuses on how to express real graph questions as traversals, how to read and reason about Gremlin code, and how to apply common patterns to your own graphs. As the book evolves, new examples and refinements continue to be added, but the core goal remains the same: to help you build an intuitive understanding of how Gremlin works.

In this book you will find real examples featuring real-world graph data. That data, along with sample code, example applications, and many other items, is available for download from the GitHub project. The graph, 'air-routes', is a model of the world airline route network between 3,504 airports, including 50,637 routes. The examples presented will work unmodified with the air-routes.graphml file loaded into the Gremlin Console running with a TinkerGraph. How to set that environment up is covered in the "Download, install, and launch the Gremlin Console" section below.
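
To give a rough feel for what a GraphML file such as air-routes.graphml contains, here is a minimal Python sketch that counts vertices and edges by label using only the standard library. The tiny inline sample is made up for illustration and is not the real data. The 'labelV' and 'labelE' key names are an assumption about how labels are stored in the file, so it is worth checking the file itself before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML XML namespace, in the form ElementTree expects.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

# Made-up sample data. The 'labelV' / 'labelE' keys are an assumption
# about how vertex and edge labels are stored.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="sample" edgedefault="directed">
    <node id="1"><data key="labelV">airport</data><data key="code">AUS</data></node>
    <node id="2"><data key="labelV">airport</data><data key="code">DFW</data></node>
    <node id="3"><data key="labelV">airport</data><data key="code">LHR</data></node>
    <node id="4"><data key="labelV">country</data><data key="code">US</data></node>
    <edge id="10" source="1" target="2"><data key="labelE">route</data></edge>
    <edge id="11" source="2" target="3"><data key="labelE">route</data></edge>
    <edge id="12" source="4" target="1"><data key="labelE">contains</data></edge>
  </graph>
</graphml>"""


def label_of(element, key):
    """Return the text of the <data> child with the given key, if present."""
    for data in element.findall(GRAPHML_NS + "data"):
        if data.get("key") == key:
            return data.text
    return "(unlabeled)"


def count_labels(graphml_text):
    """Count vertices and edges in a GraphML document, grouped by label."""
    root = ET.fromstring(graphml_text)
    vertices = Counter(label_of(n, "labelV") for n in root.iter(GRAPHML_NS + "node"))
    edges = Counter(label_of(e, "labelE") for e in root.iter(GRAPHML_NS + "edge"))
    return vertices, edges


vertex_counts, edge_counts = count_labels(SAMPLE)
print(dict(vertex_counts))
print(dict(edge_counts))
```

Note that not every vertex in the sample is an airport and not every edge is a route, which is why the counts are grouped by label rather than taken as simple totals.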

The examples in this book have been tested using Apache TinkerPop release 3.8.0.

TinkerGraph is an 'in-memory' graph, meaning nothing gets saved to disk automatically. It is included as part of the Apache TinkerPop download. The goal of this tutorial is to allow someone with little to no prior knowledge to get up and going quickly using the Gremlin Console and the 'air-routes' graph. Later in the book we discuss writing standalone applications in other programming languages such as Java, Groovy, and Python.

The first few sections of the book focus on showing some basic Gremlin queries that are both useful and yet easy to understand. By the end of Chapter 3 you should have a basic understanding of how to explore the air-routes graph using commonly used Gremlin steps. Chapters 4, 5, and 6 explore Gremlin in more depth.

How this book is organized

Chapter 1 – INTRODUCTION

We start our journey with a brief introduction to Apache TinkerPop and a quick look at why graph databases are of interest to us. We also discuss how the book is organized and where to find additional materials, such as sample code and data sets.

Chapter 2 – GETTING STARTED

Many of the examples throughout the book use the Gremlin Console and TinkerGraph, and both are introduced in this chapter. We also introduce the air-routes example graph - air-routes.graphml - used throughout the book.

Chapter 3 – WRITING GREMLIN QUERIES

Now that the basics have been covered, things start to get a lot more interesting! It’s time to start writing Gremlin queries. We briefly explore how we could have built the 'air-routes' graph using a relational database, and then look at how SQL and Gremlin are both similar in some ways, and very different in others. We then introduce several of the key Gremlin query language '"steps"'. We focus on exploring the graph rather than changing it in this chapter.

Chapter 4 – BEYOND BASIC QUERIES

Having now introduced Gremlin in some detail, we introduce the Gremlin steps that can be used to create, modify, and delete data. We present a selection of best practices and start to explore some more advanced query writing.

Chapter 5 – MISCELLANEOUS QUERIES AND THE RESULTS THEY GENERATE

Using the Gremlin steps introduced in Chapters 3 and 4, we are now ready to use what we have learned so far and write queries that analyze the air-routes graph in more depth and answer more complicated questions. The material presented includes a discussion of analyzing distances, route distribution, and writing geospatial queries.

Chapter 6 – MOVING BEYOND THE GREMLIN CONSOLE

The next step in our journey is to move beyond the Gremlin Console and take a look at interacting with a TinkerGraph using Java and Groovy applications.

Chapter 7 – INTRODUCING GREMLIN SERVER

Our journey so far has focused on working with graphs in a "directly attached" fashion. We now introduce Gremlin Server as a way to deploy and interact with remotely hosted graphs.

Chapter 8 – COMMON GRAPH SERIALIZATION FORMATS

Having introduced Gremlin Server, we take a look at some common Graph serialization file formats along with coverage of how to use them in the context of TinkerPop enabled graphs. We take a close look at the TinkerPop GraphSON (JSON) format, which is used extensively when using Gremlin queries in conjunction with a Gremlin Server.

Chapter 9 – FURTHER READING

Our journey to explore Apache TinkerPop and Gremlin concludes with a look at useful sources of further reading. We present links to useful websites where you can find tools and documentation for many of the topics and technologies covered in this book.

1.7. Introducing the book sources, sample programs, and data

All work related to this project is being done in the open at GitHub. A list of where to find the key components is provided below. The examples in this book make use of a sample graph called 'air-routes' which contains a graph based on the world airline route network between 3,504 airports. The sample graph data, quite a bit of sample code, and some larger demo applications can all be found at the same GitHub location that hosts the book manuscript. You will also find releases of the book in various formats (HTML, PDF, DocBook/XML, MOBI, and EPUB) at the same GitHub location.

The sample programs include standalone Java, Groovy, Python, and Ruby examples as well as many examples that can be run from the Gremlin Console. There are some differences between using Gremlin from a standalone program and from the Gremlin Console. The sample programs demonstrate several of these differences. The sample applications area contains a full example HTML and JavaScript application that lets you explore the 'air-routes' graph visually. The home page for the GitHub project includes a README.md file to help you navigate the site. Below are some links to various resources included in this book.

Where to find the book, samples, and data

Project home

https://github.com/krlawrence/graph

Book manuscript in AsciiDoc format

This file can be viewed using the GitHub web interface. It will always represent the very latest updates.
https://github.com/krlawrence/graph/tree/main/book

Latest PDF and HTML snapshots

These files are regularly updated to reflect any significant changes. These are the only generated formats that are updated outside the full release cycle. The PDF version includes pagination as well as page numbering and is produced using an A4 page size. The HTML version does not include these features. Otherwise, they are more or less identical.
https://kelvinlawrence.net/book/PracticalGremlin.pdf
https://kelvinlawrence.net/book/PracticalGremlin.html

Official book releases in multiple formats

Official releases include AsciiDoc, HTML, PDF, ePub, MOBI, and DocBook versions as well as snapshots of all the samples and other materials in a single package available through GitHub Releases. The ePub and MOBI versions are really intended to be read using e-reader devices, and for that reason use a white background for all source code highlighting to make it easier to read on monochrome devices.
I recommend using the PDF version if possible as it has page numbering. If you prefer reading the book as if it were a web page, then by all means use the HTML version. You will just not get any pagination or page numbers. The DocBook format can be read using tools such as Yelp on Linux systems, but is primarily included so that people can use it to generate other formats that I do not already provide. The MOBI and ePub versions may require you to change the font size you use on your device to make things easier to read.
https://github.com/krlawrence/graph/releases

Sample data (air-routes.graphml)

https://github.com/krlawrence/graph/tree/main/sample-data

Sample code

https://github.com/krlawrence/graph/tree/main/sample-code

Example applications

https://github.com/krlawrence/graph/tree/main/demos

Change history

If you want to keep up with the changes being made, this is the file to keep an eye on.
https://github.com/krlawrence/graph/blob/main/ChangeHistory.md

1.8. Apache TinkerPop Evolution

Over the last 15 years, TinkerPop, and especially Gremlin, have evolved substantially from their earliest versions. What we now know as Apache TinkerPop is the result of an open source project created in 2009 and moved to the Apache Software Foundation (ASF) in 2015, after the final release of TinkerPop version 2. The first official release of Apache TinkerPop 3.0 came in July 2015, with the project being promoted to Apache’s "top-level" status the following year. After a decade of continuous releases for TinkerPop 3.0, the project released a beta version of 4.0 in January 2025 for early evaluation.

We focus this second edition of the book on the semantics of Gremlin as of TinkerPop release 3.8.0.

If you are new to TinkerPop and Gremlin, you can probably skip the next few sections. They appeared in a slightly modified form, as part of the First Edition, and provided a way to highlight the arrival of key new features. These notes have been left in the Second Edition as there are still people using older versions of Gremlin, and it can be useful to have a list like this to cross-reference.

The complete Apache TinkerPop change history can be found at https://github.com/apache/tinkerpop/blob/master/CHANGELOG.asciidoc

Graph database engines that support Apache TinkerPop often take a while to move up to new releases, and it’s always a good idea to verify the exact level the database you are using supports.

This version of the book covers features of Gremlin available as part of the TinkerPop 3.8.0 release. As appropriate, notes and examples have been added that show other ways to perform tasks that new features may simplify. In some cases, notes have been added to point out when more recent features first appeared.

1.8.1. TinkerPop 3.4

A major update to Apache TinkerPop, version 3.4.0, was released in January 2019, and a number of point releases followed.

Full details of all the new features added in the TinkerPop 3.4.x releases can be found at the following link: https://github.com/apache/tinkerpop/blob/3.4-dev/CHANGELOG.asciidoc

1.8.2. TinkerPop 3.5

Apache TinkerPop 3.5.0 was released in May 2021. This update introduced a number of improvements in areas such as Gremlin client drivers, the Gremlin Server, and overall bug fixes. The release also improved the Gremlin query language in some key areas. Some features that had been declared deprecated in earlier releases were finally removed as part of the 3.5.0 update. If you have queries and code that still use these deprecated features, as part of an upgrade to the 3.5.x level, you will need to make the appropriate changes.

The main breaking change to be aware of is that 'Order.incr' and 'Order.decr' were removed from the Gremlin language. The newer 'Order.asc' and 'Order.desc' must be used instead. The examples in this book and those in the sample-code folder have been updated to reflect these changes.

In January 2022, the TinkerPop 3.5.2 release added a native datetime operator to the Gremlin language such that dates can be added without needing programming language specific constructs. This is useful when sending Gremlin queries as text strings.

Full details of all the new features added in the TinkerPop 3.5.x releases can be found at the following link: https://github.com/apache/tinkerpop/blob/3.5-dev/CHANGELOG.asciidoc

1.8.3. TinkerPop 3.6

Apache TinkerPop 3.6.0 was released in April 2022. Coming almost exactly a year after the initial 3.5.0 release, this is one of the most significant TinkerPop releases since TinkerPop 3.4.0 appeared in January 2019. The release contains many improvements, including several new Gremlin steps, designed to make commonly performed tasks much easier. Notable improvements include:

New 'mergeV' and 'mergeE' steps that make "create if not exist" type queries, sometimes referred to as "upserts", much easier to write. Over time, these steps will replace the use of the 'fold…coalesce' pattern and will also replace the various "map injection" patterns that can be used to create multiple vertices and edges in a single query.
A new 'TextP.regex' predicate that allows regular expressions to be used when comparing strings.
The 'property' step can now be given a map of key/value pairs so that several properties can be created at once.
A new 'element' step that can be used to find the parent element (vertex or edge) of a property.
A new 'call' step that lays the foundation enabling Gremlin queries to call other endpoints. This opens up many types of interesting use cases such as query federation, and looking up values from other services.
A lot of effort has been put into removing unnecessary exceptions by filtering out parts of traversals instead of failing with an error. This is especially so in the case of 'by' modulators that now filter when a value does not exist rather than throw an exception. This work began as part of the TinkerPop 3.5.2 update and is completed as of TinkerPop 3.6.0.
A new 'fail' step that can be used to abort a query in a controlled way.

Full details of all the new features added in the TinkerPop 3.6.x releases can be found at the following link: https://github.com/apache/tinkerpop/blob/3.6-dev/CHANGELOG.asciidoc

1.8.4. TinkerPop 3.7

TinkerPop 3.7.0 was released in July 2023 and, with the follow-on release of 3.7.1 a few months later, introduced a large expansion of the Gremlin language, providing long-awaited features for manipulating strings, collections, and dates. There were other major features as well, such as TinkerGraph gaining some simple transactional features and the ability for properties to be returned on elements from Gremlin Server, rather than only getting references. Notable improvements include:

New Gremlin steps for working with strings: 'asString', 'concat', 'length', 'toLower', 'toUpper', 'trim', 'lTrim', 'rTrim', 'reverse', 'replace', 'split', 'substring', and 'format'.
New Gremlin steps for working with collections: 'any', 'all', 'product', 'merge', 'intersect', 'combine', 'conjoin', 'difference', 'disjunct', and 'reverse'.
New Gremlin steps for working with dates: 'asDate', 'dateAdd', and 'dateDiff'.
The 'union' step became available as a start step.
Improved syntax for specifying cardinality directly within a 'Map' for use with 'mergeV'.
TinkerGraph gained support for simple transactions.
Graph elements like 'Vertex' and 'Edge' can now be returned from Gremlin Server with their properties attached using the 'materializeProperties' option.

Full details of all the new features added in the TinkerPop 3.7.x releases can be found at the following link: https://github.com/apache/tinkerpop/blob/3.7-dev/CHANGELOG.asciidoc

1.8.5. TinkerPop 3.8

TinkerPop 3.8.0 was released November 2025 and established a wide mix of features and improvements to Gremlin semantics designed to enhance language consistency. Some of these changes do lead to behaviors that break from previous versions of TinkerPop, but are expected to stay consistent with TinkerPop 4.0, which should help ease the migration there when 4.0 eventually releases. Here are some of the important highlights to consider from this release:

New Gremlin steps for type conversions: 'asBool' and 'asNumber'.
Added the new 'typeOf' predicate to make it possible to filter traverser objects given their data type.
The 'none' step was renamed to 'discard', and 'none' has been modified to take 'P' as an argument, whose behavior is now a complement to 'any' and 'all' steps.
The minimum Java version supported is JDK11.
The default implementation for the 'date' data type for Java is 'OffsetDateTime' rather than 'java.util.Date', which means that steps like 'asDate' and helpers like the 'datetime' function will return 'OffsetDateTime' whose string representation is an ISO 8601 format.
Creation of 'g' has been simplified slightly by replacing more verbose construction options of 'traversal().withEmbedded(…)' and 'traversal().withRemote(…)' with a single 'traversal().with(…)'.
The 'split' method splits a string to characters if given an empty string as a separator argument.
The Gremlin language removed syntax for 'Vertex', made use of 'new' optional, included support for 'withoutStrategies', and removed 'Map' key restrictions related to reserved words.
The semantics for the 'choose' step changed to match the first option only, pass through traversers when the choice is unproductive or the determined choice unmatched, and introduces a new 'Pick.unproductive' option.
Gremlin no longer follows Groovy by defaulting float values to 'BigDecimal' – they go to 'double' instead.
The 'valueMap', 'propertyMap', 'groupCount', 'dedup', 'sack', 'sample', 'aggregate' steps can no longer take 'by' modulators beyond the number each step expects.
Included the air-routes dataset, as is used in this book, as official sample data in the distribution.
The form of 'has' that takes a key and a 'Traversal' as an argument has been removed to prevent confusion around its functionality and to prepare for a revised implementation of it in 4.0.0.
The semantics for 'limit', 'skip', and 'range' with 'global' scope are tracked per iteration inside 'repeat'.
The semantics for 'limit', 'skip', and 'range' with 'local' scope will no longer automatically unwrap a single item in a list if there is only one object present.
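
The change from 'BigDecimal' to 'double' noted in the list above can affect the results of simple arithmetic and comparisons. The following is only an analogy written in Python, not Gremlin itself. It assumes that Python's 'float' (an IEEE 754 double) and 'decimal.Decimal' behave closely enough to Java's 'double' and 'BigDecimal' to show the kind of difference to watch out for.

```python
from decimal import Decimal

# Binary floating point (double): 0.1 and 0.2 cannot be represented exactly,
# so their sum is not exactly equal to 0.3.
double_sum = 0.1 + 0.2

# Arbitrary-precision decimal arithmetic keeps these values exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(double_sum == 0.3)              # False
print(decimal_sum == Decimal("0.3"))  # True
```

If you have queries that compare floating point values for exact equality, it may be worth reviewing them when moving up to 3.8.0.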

Full details of all the new features added in the TinkerPop 3.8.x releases can be found at the following link: https://github.com/apache/tinkerpop/blob/3.8-dev/CHANGELOG.asciidoc

1.8.6. TinkerPop 4.0

TinkerPop 4.0.0 has released a beta version for early evaluation of specific features and remains under active development. We mention it here for awareness but do not intend to cover any features it offers until its full official release. We will only note that the intention of the TinkerPop Community is for 4.0.0 to have the same semantics for Gremlin as 3.8.0. As a result, all the Gremlin examples provided in this second edition of Practical Gremlin will work equally well for both versions.

1.9. So what is a graph database and why should I care?

This book is mainly intended to be a tutorial in working with graph databases and related technology using the Gremlin query language. However, it is worth spending just a few moments to summarize why it is important to understand what a graph database is, what some good use cases for graphs are and why you should care in a world that is already full of all kinds of SQL and NoSQL databases. In this book we are going to be discussing 'directed property graphs'. At the conceptual level, these types of graphs are quite simple to understand. You have three basic building blocks: vertices, edges, and properties. Vertices represent "things" such as people or places. Edges represent connections between those vertices, and properties are information added to the vertices and edges as needed. The 'directed' part of the name means that any edge has a direction. It goes 'out' from one vertex and 'in' to another. You will sometimes hear people use the word 'digraph' as shorthand for 'directed graph'. Consider the relationship "Kelvin knows Jack". This could be modeled as a vertex for each of the people and an edge for the relationship as follows.

Kelvin — knows → Jack

Note the arrow which implies the direction of the relationship. If we wanted to record the fact that Jack also admits to knowing Kelvin, we would need to add a second edge from Jack to Kelvin. Properties could be added to each person to give more information about them. For example, my age might be a property on my vertex.

It turns out that Jack really likes cats. We might want to store that in our graph as well so we could create the relationship:

Jack — likes → Cats

Now that we have a bit more in our graph, we could answer the following question: "who does Kelvin know that likes cats?"

Kelvin — knows → Jack — likes → Cats

This is a simple example, but hopefully you can already see that we are modeling our data the way we think about it in the real world. Armed with this knowledge, you now have all the basic building blocks you need to start thinking about how you might model things you are familiar with as a graph.
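
The small graph described above can be sketched in a few lines of plain Python. This is only an illustration of the building blocks (vertices, directed edges, and properties) and is not how TinkerPop or any graph database is actually implemented. The class and method names, and the 'kind' property, are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: str  # name of the vertex the edge goes out from
    in_v: str   # name of the vertex the edge comes in to
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """A deliberately tiny directed property graph (illustrative only)."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, name, **properties):
        self.vertices[name] = Vertex(name, properties)

    def add_edge(self, out_v, label, in_v, **properties):
        self.edges.append(Edge(label, out_v, in_v, properties))

    def out(self, name, label):
        """Follow outgoing edges with the given label and return the names reached."""
        return [e.in_v for e in self.edges if e.out_v == name and e.label == label]


graph = PropertyGraph()
graph.add_vertex("Kelvin", kind="person")  # 'kind' is a hypothetical property
graph.add_vertex("Jack", kind="person")
graph.add_vertex("Cats", kind="animal")
graph.add_edge("Kelvin", "knows", "Jack")
graph.add_edge("Jack", "likes", "Cats")

# "Who does Kelvin know that likes cats?"
cat_fans = [person for person in graph.out("Kelvin", "knows")
            if "Cats" in graph.out(person, "likes")]
print(cat_fans)  # ['Jack']

# Edges are directed: nothing here says that Jack knows Kelvin.
print(graph.out("Jack", "knows"))  # []
```

Recording that Jack also knows Kelvin would need a second edge added with graph.add_edge("Jack", "knows", "Kelvin").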

So getting back to the question of "why should I care?", well, if something looks like a graph, then wouldn’t it be great if we could model it that way? Many things in our everyday lives center around things that can very nicely be represented in a graph. Things such as your social and business networks, the route you take to get to work, the phone network, and airline route choices for trips you need to take are all great candidates. There are also many great business applications for graph databases and algorithms. These include recommendation systems, crime prevention, and fraud detection to name but three.

The reverse is also true. If something does not feel like a graph, then don’t try to force it to be. Your videos are probably doing quite nicely living in the object store where you currently have them. A sales ledger system built using a relational database is probably doing just fine where it is, and likewise a document store is quite possibly just the right place to be storing your documents. So "using the right tool for the job" remains as valid a phrase here as elsewhere. Where graph databases come into their own is when the data you are storing is intrinsically linked by its very nature, the air-routes network used as the basis for all the examples in this book being a perfect example of such a situation.

Those of you who looked at graphs as part of a computer science course are correct if your reaction was "Surely graphs have been around for ages, why is this considered new?". Indeed, Leonhard Euler is credited with demonstrating the first graph problem and inventing the whole concept of "Graph Theory" all the way back in 1736 when he investigated the now famous "Seven Bridges of Koenigsberg" problem.

If you want to read a bit more about graph theory and its present-day application, you can find a lot of good information online. Here’s a Wikipedia link to get you started: https://en.wikipedia.org/wiki/Graph_theory

So, given Graph Theory is anything but a new idea, why is it that only recently we are seeing a massive growth in the building and deployment of graph database systems and applications? At least part of the answer is that computer hardware and software have reached the point where you can build large big data systems that scale well for a reasonable price. In fact, it’s even easier than ever to build large systems because you don’t have to buy the hardware that your system will run on when you use the cloud.

While you can certainly run a graph database on your laptop (I do just that every day), the reality is that in production, at scale, they are big data systems. Large graphs commonly have many billions of vertices and edges in them, taking up petabytes of data on disk. Graph algorithms can be both compute- and memory-intensive, and it is only fairly recently that deploying the necessary resources for such big data systems has made financial sense for more everyday uses in business, and not just in government or academia. Graph databases are becoming much more broadly adopted across the spectrum, from high-end scientific research to financial networks and beyond.

Another factor that has really helped start this graph database revolution is the availability of high-quality open source technology. There are a lot of great open source projects addressing everything from the databases you need to store the graph data, to the query languages used to traverse them, all the way up to visually displaying graphs as part of the user interface layer. In particular, it is so-called 'property graphs' where we are seeing the broadest development and uptake. In a property graph, both vertices and edges can have properties (effectively, key-value pairs) associated with them. There are many styles of graph that you may end up building, and there have been whole books written on these various design patterns, but the property graph technology we will focus on in this book can support all the most common usage patterns. If you hear phrases such as 'directed graph' and 'undirected graph', or 'cyclic' and 'acyclic' graph, and many more as you work with graph databases, a quick online search will get you to a place where you can get familiar with that terminology. A deep discussion of these patterns is beyond the scope of this book, and it’s in no way essential to have a full background in graph theory to get productive quickly.
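To make the idea of a property graph a little more concrete, here is a minimal sketch of two vertices joined by one edge, each carrying key-value properties. It is written for the Gremlin Console and TinkerGraph, both of which are introduced in Chapter 2, so there is no need to run it yet. The labels and property names mirror the air-routes data used later in the book, and the distance value is purely illustrative.

```groovy
// Assumes the Gremlin Console, where TinkerGraph and traversal() are pre-imported.
graph = TinkerGraph.open()
g = traversal().with(graph)

// Two vertices, each with key-value properties.
aus = g.addV('airport').property('code', 'AUS').property('city', 'Austin').next()
dfw = g.addV('airport').property('code', 'DFW').property('city', 'Dallas').next()

// One directed edge that also carries a property (illustrative distance in miles).
g.addE('route').from(aus).to(dfw).property('dist', 190).iterate()

// Read a property back from a vertex and from the edge.
city = g.V().has('code', 'AUS').values('city').next()
dist = g.E().hasLabel('route').values('dist').next()
println "city=${city} dist=${dist}"
```

The point to notice is simply that the edge holds data of its own ('dist'), just as the vertices do.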

A third, and equally important, factor in the growth we are seeing in graph database adoption is the low barrier to entry for programmers. As you will see from the examples in this book, someone wanting to experiment with graph technology can download the Apache TinkerPop package and, as long as Java 11 or higher is installed, be up and running with zero configuration (other than unzipping the files) in as little as five minutes. Graph databases do not force you to define schemas or specify the layout of tables and columns before you can get going and start building a graph. Programmers also seem to find the graph style of programming quite intuitive as it closely models the way they think of the world.

Graph database technology should not be viewed as a "rip and replace" technology, but as very much complementary to other databases that you may already have deployed. One common use case is for the graph to be used as a form of smart index into other data stores. This is sometimes called having a polyglot data architecture.

1.10. A word about terminology

The words 'node' and 'vertex' are synonymous when discussing a graph. This book will prefer the term 'vertex' and 'vertices' as the Apache TinkerPop documentation almost exclusively uses it when discussing Gremlin queries and other concepts. We will only see a shade of the term 'node' when we refer to "supernodes", relating to vertices of an especially high degree, as this term has wide acceptance independently of TinkerPop. By the same token, this book will be consistent with the official TinkerPop documentation when discussing the connections between vertices, where we will use the term 'edge' or the plural form, 'edges'. In other books and articles you may also see terms like 'relationship' or 'arc' used. Again, these terms are synonymous in the context of graphs.

2. GETTING STARTED

In this chapter you will set up the Gremlin Console, load the air-routes sample graph, and run your first Gremlin traversals against it. You will learn how to start and stop the console, adjust a few useful settings, and verify that the sample data has been loaded correctly. By the end of the chapter, you should be comfortable using the console as a workspace for experimenting with the examples used throughout the rest of the book.

2.1. What is Apache TinkerPop?

Apache TinkerPop is a graph computing framework and top-level project hosted by the Apache Software Foundation. The homepage for the project is located at this URL: https://tinkerpop.apache.org/

The project includes the following components:

Gremlin

A graph traversal (query) language

Gremlin Console

An interactive shell for working with local or remote graphs.
https://tinkerpop.apache.org/docs/current/reference/#gremlin-console

Gremlin Server

Allows hosting of graphs remotely via an HTTP/WebSocket connection.
https://tinkerpop.apache.org/docs/current/reference/#gremlin-server

TinkerGraph

A small in-memory graph implementation that is great for learning.
https://tinkerpop.apache.org/docs/current/reference/#tinkergraph-gremlin

Programming Interfaces

A set of programming interfaces written in Java
https://tinkerpop.apache.org/javadocs/current/full/

Documentation

A user guide, a tutorial and programming API documentation.
https://tinkerpop.apache.org/docs/current/
https://tinkerpop.apache.org/docs/current/reference/

Useful Recipes

A set of examples or "recipes" showing how to perform common graph-oriented tasks using Gremlin queries.
https://tinkerpop.apache.org/docs/current/recipes/

The programming interfaces allow providers of graph databases to build systems that are TinkerPop enabled and allow application programmers to write programs that talk to those systems.

Any TinkerPop enabled graph database can be accessed using the Gremlin query language and corresponding API. We can also use the TinkerPop API to write client code, in languages like Java, that can talk to a TinkerPop enabled graph. For most of this book we will be working within the Gremlin Console with a local graph. However, in Chapters 7 and 8 we take a look at Gremlin Server and some other TinkerPop enabled environments. Most of Apache TinkerPop has been developed using Java, but there are also bindings available for many other programming languages such as Groovy, Python, Go, JavaScript, and C#. These bindings help make Gremlin feel comfortable to you, as you can work with Gremlin in the idioms of the programming language that you are most familiar with.

Even though this book focuses on Gremlin written with Groovy, you should remember that whatever examples you see in Groovy can easily be converted to any other supported programming language. As long as you understand the idioms of the language you are using, converting should be straightforward. For example, Python prefers snake case, whereas Groovy prefers camel case. Therefore, a Groovy query of 'g.addV("person")' simply becomes 'g.add_v("person")' in Python. You will read more about Gremlin translation in the "Translating Gremlin to different programming languages" section.

The queries used as examples in this book have been tested with Apache TinkerPop version 3.8.0 as well as some prior releases where appropriate. Tests were performed using the TinkerGraph in-memory graph and the Gremlin Console, as well as other TinkerPop enabled graph stores.

2.2. The Gremlin Console

The Gremlin Console is an interactive shell for running Gremlin traversals against a graph. In this book it serves as the main workspace for experimenting with the air-routes graph, trying out examples, and exploring variations of your own. It is based on the Groovy Shell, and if you have used any of the other console environments such as those found with Scala, Python, and Ruby, you will feel right at home here. The console offers a low overhead (you can set it up in seconds) and a low barrier to entry as a way to start to play with graphs on your local computer. The console can actually work with graphs that are running locally or remotely, but for the majority of this book we will keep things simple and focus on local graphs.

To follow along with this tutorial, you will need to have installed the Gremlin Console or have access to a TinkerPop/Gremlin enabled graph store such as TinkerGraph or JanusGraph.

Regardless of the environment you use, if you work with Apache TinkerPop enabled graphs, the console should always be installed on your machine!

2.2.1. Download, install, and launch the Gremlin Console

You can download the Gremlin Console from the official Apache TinkerPop website at https://tinkerpop.apache.org/

It only takes a few minutes to get the console installed and running. You just download the ZIP, 'unzip' it, and you are all set. TinkerPop also requires Java to be at version 11 or higher.

For more information on the compatibility of Gremlin with versions of Java and other languages, please see the official documentation.

The console download also includes all the JAR files that are needed to write a standalone Java or Groovy TinkerPop application, but that is a topic for later!

When you start the console, you will be presented with a banner/logo and a prompt that will look something like this. Don’t worry about the plugin messages yet; we will talk about those a bit later.

$ ./gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin>

You can get a list of the available commands by typing ':help'. Note that a colon prefixes all commands to the console itself. This enables the console to distinguish them as special and different from actual Gremlin and Groovy commands.

gremlin> :help

For information about Groovy, visit:
    http://groovy-lang.org

Available commands:
  :help       (:h  ) Display this help message
  ?           (:?  ) Alias to: :help
  :exit       (:x  ) Exit the shell
  :quit       (:q  ) Alias to: :exit
  import      (:i  ) Import a class into the namespace
  :display    (:d  ) Display the current buffer
  :clear      (:c  ) Clear the buffer and reset the prompt counter
  :show       (:S  ) Show variables, classes or imports
  :inspect    (:n  ) Inspect a variable or the last result with the GUI object browser
  :purge      (:p  ) Purge variables, classes, imports or preferences
  :edit       (:e  ) Edit the current buffer
  :load       (:l  ) Load a file or URL into the buffer
  .           (:.  ) Alias to: :load
  :save       (:s  ) Save the current buffer to a file
  :record     (:r  ) Record the current session to a file
  :history    (:H  ) Display, manage and recall edit-line history
  :alias      (:a  ) Create an alias
  :grab       (:g  ) Add a dependency to the shell environment
  :register   (:rc ) Register a new command with the shell
  :doc        (:D  ) Open a browser window displaying the doc for the argument
  :set        (:=  ) Set (or list) preferences
  :uninstall  (:-  ) Uninstall a Maven library and its dependencies from the Gremlin Console
  :install    (:+  ) Install a Maven library and its dependencies into the Gremlin Console
  :plugin     (:pin) Manage plugins for the Console
  :remote     (:rem) Define a remote connection
  :submit     (:>  ) Send a Gremlin script to Gremlin Server
  :bytecode   (:bc ) Gremlin bytecode helper commands
  :cls        (:C  ) Clear the screen.

For help on a specific command type:
    :help command

Of all the commands listed above, ':clear' (':c' for short) is important to remember. If the console starts acting strangely, or you find yourself stuck with a prompt like "......1>", typing ':clear' will reset things nicely.

It is worth noting that as mentioned above, the console is based on the Groovy Shell, and as such you can enter valid Groovy code directly into the console. So as well as using it to experiment with graphs and Gremlin, you can use it as, for example, a desktop calculator should you so desire!

gremlin> 2+3
==>5

gremlin> a = 5
==>5

gremlin> println "The number is ${a}"
The number is 5

gremlin> for (a in 1..5) {print "${a} "};println()
1 2 3 4 5

If you want to see lots of examples of the output from running various queries, you will find plenty in the "MISCELLANEOUS QUERIES AND THEIR RESULTS" section of this book where we have tried to go into more depth on various topics.

Mostly, you will run the console in its interactive mode. However, you can also pass the name of a file as a command line parameter, preceded by the '-e' flag, and Gremlin will execute the file and exit. For example, if you had a file called "mycode.groovy", you could execute it directly from your command line window or terminal window as follows:

$ ./gremlin.sh -e mycode.groovy

If you want to have the console run your script and not exit afterward, you can use the '-i' option instead of '-e'.
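As a minimal sketch, a script such as the hypothetical 'mycode.groovy' might contain something like the following. It assumes the console's default imports are in effect when the script runs, so that 'TinkerGraph' and 'traversal()' are available without explicit import statements. TinkerGraph itself is introduced later in this chapter, and the single vertex created here is purely illustrative.

```groovy
// mycode.groovy (hypothetical contents)
// Create an empty in-memory graph and a traversal source for it.
graph = TinkerGraph.open()
g = traversal().with(graph)

// Add a single illustrative vertex.
g.addV('airport').property('code', 'AUS').iterate()

// Print a result so that something is visible before the console exits.
total = g.V().count().next()
println "Vertex count: ${total}"
```

Run with '-e', a script like this prints its output and returns you to the shell. Run with '-i', the variables 'graph' and 'g' should remain available at the console prompt for further experimentation.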

You can get help on all the command line options for the console by typing './gremlin.sh --help'. You should get back some help text that looks like this:

$ ./gremlin.sh --help

Usage: gremlin.sh [-CDhlQvV] [-e=<SCRIPT ARG1 ARG2 ...>]... [-i=<SCRIPT ARG1
                  ARG2 ...>...]...
  -C, --color     Disable use of ANSI colors
  -D, --debug     Enabled debug Console output
  -e, --execute=<SCRIPT ARG1 ARG2 ...>
                  Execute the specified script (SCRIPT ARG1 ARG2 ...) and close
                    the console on completion
  -h, --help      Display this help message
  -i, --interactive=<SCRIPT ARG1 ARG2 ...>...
                  Execute the specified script and leave the console open on
                    completion
  -l              Set the logging level of components that use standard logging
                    output independent of the Console
  -Q, --quiet     Suppress superfluous Console output
  -v, --version   Display the version
  -V, --verbose   Enable verbose Console output

If you ever want to check which version of TinkerPop you have installed, you can enter the following command from inside the console.

// What version of the console am I running?
gremlin>  Gremlin.version()
==>3.8.0

One thing that is not at all obvious is that the console quietly imports a large number of Java Classes and Enums on your behalf as it starts up. This makes writing queries within the console simpler. However, as we shall explore in the "Important Classes and Enums to be aware of" section later, once you start writing standalone programs in Java or other languages, you need to actually know what the console did on your behalf. Reading through that section will help familiarize you with the classes you need to import to your application code.
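To give a feel for what the console does on your behalf, here is a minimal sketch of a standalone Groovy script that spells out a few of those imports itself. It assumes the TinkerPop JAR files that come with the console download are on the classpath. The imports shown are only the ones this small example needs, not the full list that the console provides, and the airport data is illustrative.

```groovy
// Imports that the Gremlin Console would normally supply automatically.
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph
import org.apache.tinkerpop.gremlin.process.traversal.P
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal

graph = TinkerGraph.open()
g = traversal().with(graph)

// Illustrative data only.
g.addV('airport').property('code', 'AUS').property('runways', 2).iterate()
g.addV('airport').property('code', 'DFW').property('runways', 7).iterate()

// Outside the console, predicates such as gt() must be qualified as P.gt()
// unless they are statically imported.
busy = g.V().has('runways', P.gt(2)).values('code').toList()
println busy

graph.close()
```

Inside the console the same query could be written with a bare 'gt(2)', which is exactly the kind of convenience that is easy to overlook until it is no longer there.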

2.2.2. Saving output from the Gremlin Console to a file

Sometimes it is useful to save part or all of a Gremlin Console session to a file so that you can review it later, compare different traversals, or share it with others. The console provides a simple way to start and stop recording, which captures both the commands you type and the results that are displayed.

In the following example, we turn session recording on using ':record start mylog.txt' which will force all commands entered and their output to be written to the file 'mylog.txt' until the command ':record stop' is entered. The command 'g.V().count().next()' just counts how many vertices are in the graph. We will explain the Gremlin graph traversal and query language in detail starting in the next section.

gremlin> :record start mylog.txt
Recording session to: "mylog.txt"

gremlin> g.V().count().next()
==>3749
gremlin> :record stop
Recording stopped; session saved as: "mylog.txt" (157 bytes)

If we were to look at the 'mylog.txt' file, this is what it now contains.

// OPENED: Tue Sep 12 10:43:40 CDT 2017
// RESULT: mylog.txt
g.V().count().next()
// RESULT: 3749
:record stop
// CLOSED: Tue Sep 12 10:43:50 CDT 2017

For the remainder of this book we are not going to show the 'gremlin>' prompt or the '==>' output identifier as part of each example, just to reduce clutter a bit. You can assume that each command was entered and tested using the console, however.

If you want to learn more about the console itself, you can refer to the official TinkerPop documentation and, even better, have a play with the console and the built-in help.

2.2.3. Setting up console preferences

There are a number of preferences that can be established within the Gremlin Console to make it more suitable for your needs. The ':set' command is used to establish various preference values. Let’s look at a few helpful configurations.

The first option to know is 'max-iteration'. The console will only display the first 100 lines of output for any command by default. If you’d like to see more, you would need to increase this value.

:set max-iteration 1000

Set the 'max-iteration' to '-1' to have no limit in the number of lines displayed.

If you are on a system that can display colors, there are a wide range of color options you can modify to suit your needs. The various color settings take a comma-separated combination of a foreground, background and attribute.

:set error.color black,bg_black,underline

If you’d like to remove console configurations, you can use the ':purge preferences' command.

The full list of available preferences can be found in the Apache TinkerPop reference documentation https://tinkerpop.apache.org/docs/current/reference/#console-preferences

2.3. Introducing TinkerGraph

TinkerGraph is the in-memory reference implementation that ships with Apache TinkerPop and is included with the Gremlin Console download. It runs inside a single JVM process, stores its data in memory, and is designed primarily for learning, experimentation, and small test graphs, but does have use cases for production workloads in some scenarios.

This book was mostly developed using TinkerGraph. The nice thing about TinkerGraph is that for learning and testing things you can run everything you need on your laptop or desktop computer and be up and running very quickly. We will also explain how to get started with the console and TinkerGraph a bit later in this section.

TinkerPop defines a number of capabilities that a graph store should support. Some are optional, others are not. If supported, you can query any TinkerPop enabled graph store to see which features are supported using a command such as 'graph.features()' once you have established the 'graph' object. We will look at how to do that soon. The following list shows the features supported by TinkerGraph. This is what you would get back should you call the 'features' method provided by TinkerGraph. We have arranged the list in two columns to aid readability. Don’t worry if not all of these terms make sense right away – we’ll get there soon!

Output from graph.features()

> GraphFeatures                          > VertexPropertyFeatures
>-- ConcurrentAccess: false              >-- UserSuppliedIds: true
>-- ThreadedTransactions: false          >-- StringIds: true
>-- Persistence: true                    >-- RemoveProperty: true
>-- Computer: true                       >-- AddProperty: true
>-- Transactions: false                  >-- NumericIds: true
> VariableFeatures                       >-- CustomIds: false
>-- Variables: true                      >-- AnyIds: true
>-- LongValues: true                     >-- UuidIds: true
>-- SerializableValues: true             >-- Properties: true
>-- FloatArrayValues: true               >-- LongValues: true
>-- UniformListValues: true              >-- SerializableValues: true
>-- ByteArrayValues: true                >-- FloatArrayValues: true
>-- MapValues: true                      >-- UniformListValues: true
>-- BooleanArrayValues: true             >-- ByteArrayValues: true
>-- MixedListValues: true                >-- MapValues: true
>-- BooleanValues: true                  >-- BooleanArrayValues: true
>-- DoubleValues: true                   >-- MixedListValues: true
>-- IntegerArrayValues: true             >-- BooleanValues: true
>-- LongArrayValues: true                >-- DoubleValues: true
>-- StringArrayValues: true              >-- IntegerArrayValues: true
>-- StringValues: true                   >-- LongArrayValues: true
>-- DoubleArrayValues: true              >-- StringArrayValues: true
>-- FloatValues: true                    >-- StringValues: true
>-- IntegerValues: true                  >-- DoubleArrayValues: true
>-- ByteValues: true                     >-- FloatValues: true
> VertexFeatures                         >-- IntegerValues: true
>-- AddVertices: true                    >-- ByteValues: true
>-- DuplicateMultiProperties: true       > EdgePropertyFeatures
>-- MultiProperties: true                >-- Properties: true
>-- RemoveVertices: true                 >-- LongValues: true
>-- MetaProperties: true                 >-- SerializableValues: true
>-- UserSuppliedIds: true                >-- FloatArrayValues: true
>-- StringIds: true                      >-- UniformListValues: true
>-- RemoveProperty: true                 >-- ByteArrayValues: true
>-- AddProperty: true                    >-- MapValues: true
>-- NumericIds: true                     >-- BooleanArrayValues: true
>-- CustomIds: false                     >-- MixedListValues: true
>-- AnyIds: true                         >-- BooleanValues: true
>-- UuidIds: true                        >-- DoubleValues: true
> EdgeFeatures                           >-- IntegerArrayValues: true
>-- RemoveEdges: true                    >-- LongArrayValues: true
>-- AddEdges: true                       >-- StringArrayValues: true
>-- UserSuppliedIds: true                >-- StringValues: true
>-- StringIds: true                      >-- DoubleArrayValues: true
>-- RemoveProperty: true                 >-- FloatValues: true
>-- AddProperty: true                    >-- IntegerValues: true
>-- NumericIds: true                     >-- ByteValues: true
>-- CustomIds: false
>-- AnyIds: true
>-- UuidIds: true

TinkerGraph is really useful while learning to work with Gremlin and great for testing things out. One common use case where TinkerGraph can be invaluable is to create a sub-graph of a large graph and work with it locally. TinkerGraph can even be used in production deployments if an in-memory graph fits the bill. Typically, TinkerGraph is used to explore static (unchanging) graphs, but you can also use it from a programming language like Java and mutate its contents if you want to. However, TinkerGraph does not support some of the more advanced features you will find in implementations like JanusGraph, such as an advanced transaction system (though, as of 3.7.0, it does have basic transactions that are helpful in certain use cases) and external indexes. One other thing worth noting in the list above is that the 'UserSuppliedIds' option is set to true for vertex and edge ID values. This means that if you load a graph file, such as a GraphML format file, that specifies ID values for vertices and edges, then TinkerGraph will honor those IDs and use them. As we shall see later, this is not the case with some other graph database systems.
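Individual features can also be tested programmatically rather than by reading through the whole list. The following minimal sketch, intended for the Gremlin Console, checks two of the features listed above and then shows a user-supplied ID being honored. The ID value and the property used are illustrative only.

```groovy
graph = TinkerGraph.open()
g = traversal().with(graph)

// Ask about specific features instead of printing the full list.
features = graph.features()
userIds = features.vertex().supportsUserSuppliedIds()
transactions = features.graph().supportsTransactions()
println "UserSuppliedIds=${userIds} Transactions=${transactions}"

// Supply our own ID when creating a vertex; TinkerGraph keeps it.
v = g.addV('airport').property(T.id, 3).property('code', 'AUS').next()
println "Vertex ID is ${v.id()}"
```

A check like this can be handy in code that needs to run against more than one TinkerPop enabled graph store, since a store that does not support user-supplied IDs will assign its own.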

When running in the console, support for TinkerGraph should be on by default. If for any reason you find it to be off, you can enable it by issuing the following command.

:plugin use tinkerpop.tinkergraph

Once the TinkerGraph plugin is enabled, you will need to close and re-load the Gremlin Console. After doing that, you can create a new TinkerGraph instance from the console as follows:

g = TinkerGraph.open().traversal()

which is shorthand for

graph = TinkerGraph.open()
g = traversal().with(graph)

The shorthand is helpful to save a bit of typing, but you lose reference to the graph instance, which might be helpful when accessing 'graph.features()', creating indices, or initiating close operations on the graph itself. The longer form generally tends to be preferable for this reason. You will read more about this in the "Deep dive on traversal terminology" section.
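As a minimal sketch of why holding on to the 'graph' reference is useful, the following uses the longer form and then calls methods that belong to the TinkerGraph instance itself rather than to 'g'. It is intended for the Gremlin Console, and the choice of 'code' as the indexed key is only an example.

```groovy
graph = TinkerGraph.open()
g = traversal().with(graph)

// Create an index on the 'code' property of vertices (TinkerGraph-specific).
graph.createIndex('code', Vertex.class)
indexed = graph.getIndexedKeys(Vertex.class)
println "Indexed vertex keys: ${indexed}"

// Queries still go through the traversal source.
g.addV('airport').property('code', 'AUS').iterate()
found = g.V().has('code', 'AUS').count().next()
println "Matches: ${found}"

// Release the graph when finished with it.
graph.close()
```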

In some cases you will want to pass parameters to the 'open' method providing more information on how the graph is to be configured. We will explore those options later on. The variable called 'g' created above is known as a 'graph traversal source' and will be used throughout the book at the start of each query we write.

Throughout the remainder of this book the variable name 'g' will be used for any object that represents an instance of a graph traversal source object.

2.4. Introducing the air-routes graph

The examples in this book use a sample dataset called the air-routes graph. It models airports as vertices, flight routes as edges, and includes properties such as airport codes, city and country names, geographic coordinates, and distances between airports. This graph is large enough to be interesting but still small enough to explore comfortably from the Gremlin Console.

The air-routes.graphml file can be downloaded from the sample-data folder located in the GitHub repository at the following URL: https://github.com/krlawrence/graph/tree/main/sample-data

While the air-routes graph was built from actual real-world data, routes are added and deleted by airlines all the time, so please don’t use this graph to plan your next vacation or business trip! However, as a learning tool we hope you will find it useful and easy to relate to. If you feel so inclined, you can load the file into a text editor and examine how it is laid out. As you work with graphs, you will want to become familiar with popular graph serialization formats. Two common ones are GraphML and GraphSON. The latter is a JSON format that is defined by Apache TinkerPop and heavily used in that environment. GraphML is widely recognized by TinkerPop and many other tools as well, such as Gephi, a popular open source tool for visualizing graph data. A lot of graph ingestion tools also still use comma-separated values (CSV) format files.

We will briefly look at loading and saving graph data in Sections 2 and 4. We take a look at different ways to work with graph data stored in text format files including importing and exporting graph data in the "COMMON GRAPH SERIALIZATION FORMATS" section towards the end of the book.

The 'air-routes' graph contains several vertex types that are specified using labels. The most common ones are 'airport' and 'country'. There are also vertices for each of the seven continents ('continent') and a single 'version' vertex that we provided as a way to test which version of the graph you are using.

Routes between airports are modeled as edges. These edges carry the 'route' label and include the distance between the two connected airport vertices as a property called 'dist'. Connections between countries and airports are modeled using an edge with a 'contains' label.
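The shape of this model can be sketched on a tiny hand-built graph that borrows the same labels. This is not the real data: the two airports, the single country and continent, and the distance value are illustrative assumptions, and the direction of the 'contains' edges (pointing at the airport) is assumed for the purposes of the sketch. Counting elements by label shows how the different vertex and edge types sit side by side.

```groovy
graph = TinkerGraph.open()
g = traversal().with(graph)

// Vertices for each label used in the model (illustrative values only).
aus = g.addV('airport').property('code', 'AUS').next()
dfw = g.addV('airport').property('code', 'DFW').next()
us  = g.addV('country').property('code', 'US').next()
na  = g.addV('continent').property('code', 'NA').next()

// A 'route' edge in each direction, carrying the distance as 'dist'.
g.addE('route').from(aus).to(dfw).property('dist', 190).iterate()
g.addE('route').from(dfw).to(aus).property('dist', 190).iterate()

// 'contains' edges link a country or continent to its airports.
[us, na].each { parent ->
  [aus, dfw].each { airport ->
    g.addE('contains').from(parent).to(airport).iterate()
  }
}

// Count vertices and edges by label.
vertexCounts = g.V().groupCount().by(label).next()
edgeCounts = g.E().groupCount().by(label).next()
println vertexCounts
println edgeCounts
```

Once the real air-routes graph is loaded, the same two 'groupCount()' traversals offer a quick way to see how many vertices and edges of each type it contains.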

Each airport vertex has many properties associated with it, giving various details about that airport, including its IATA and ICAO codes, its description, the city it is in, and its geographic location.

Specifically, each airport vertex has a unique ID, a label of 'airport' and contains the following properties. The word in parentheses indicates the type of the property.

 type    (string) : Vertex type. Will be 'airport' for airport vertices
 code    (string) : The three letter IATA code like AUS or LHR
 icao    (string) : The four letter ICAO code or none. Example KAUS or EGLL
 desc    (string) : A text description of the airport
 region  (string) : The geographical region like US-TX or GB-ENG
 runways (int)    : The number of available runways
 longest (int)    : Length of the longest runway in feet
 elev    (int)    : Elevation in feet above sea level
 country (string) : Two letter ISO country code such as US, FR or DE.
 city    (string) : The name of the city the airport is in
 lat     (double) : Latitude of the airport
 lon     (double) : Longitude of the airport

We can use Gremlin once the air route graph is loaded to show us what properties an airport vertex has. As an example, here is what the Austin airport vertex looks like. We will explain the steps that make up the Gremlin query shortly. First, we need to dig a little bit into how to load the data and configure a few preferences.

// Query the properties of the AUS airport vertex (ID 3)
g.V().has('code','AUS').valueMap(true).unfold()

id=3
label=airport
type=[airport]
code=[AUS]
icao=[KAUS]
desc=[Austin Bergstrom International Airport]
region=[US-TX]
runways=[2]
longest=[12250]
elev=[542]
country=[US]
city=[Austin]
lat=[30.1944999694824]
lon=[-97.6698989868164]

Even though the airport vertex label is 'airport', we chose to also have a property called 'type' that contains the same string, 'airport'. This was done to aid with indexing when working with other graph database systems and is explained in more detail later in this book.

You may have noticed that the values for each property are represented as lists (or arrays if you prefer), even though each list only contains one element. The reasons for this will be explored later in this book, but the quick explanation is that this is because TinkerPop allows us to associate a list of values with any vertex property. We will explore ways that you can take advantage of this capability in the "Attaching multiple values (lists or sets) to a single property" section.
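A small sketch makes the list wrapping easier to see. It builds one illustrative vertex in an empty TinkerGraph rather than using the air-routes data, and the 'alias' property is a hypothetical name invented for the example. The 'by(unfold())' modulator is one way to unwrap single values when the lists get in the way.

```groovy
graph = TinkerGraph.open()
g = traversal().with(graph)
g.addV('airport').property('code', 'AUS').property('city', 'Austin').iterate()

// valueMap() wraps every value in a list, even when there is only one.
wrapped = g.V().has('code', 'AUS').valueMap().next()
println wrapped

// Unfolding each value removes the list wrapper.
unwrapped = g.V().has('code', 'AUS').valueMap().by(unfold()).next()
println unwrapped

// With 'list' cardinality a property can genuinely hold several values.
g.V().has('code', 'AUS').
  property(list, 'alias', 'Bergstrom').
  property(list, 'alias', 'ABIA').iterate()
aliases = g.V().has('code', 'AUS').valueMap('alias').next()
println aliases
```

The last result is the reason for the wrapping: because any vertex property may hold more than one value, 'valueMap()' returns a list for every key so that the shape of the result is always the same.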

The full details of all the features contained in the 'air-routes' graph can be learned by reading the comments at the start of the air-routes.graphml file or reading the README.txt file.

The graph currently contains a total of 3,749 vertices and 57,645 edges. Of these, 3,504 vertices are airports, and 50,637 of the edges represent routes. While in big data terms this is really a tiny graph, it is plenty big enough for us to build up and experiment with some interesting Gremlin queries.

Lastly, here are some statistics and facts about the 'air-routes' graph. If you want to see a lot more statistics check the README.txt file that is included with the 'air-routes' graph.

Air Routes Graph (v1.0, 2025-Oct-22) contains:
  3,504 airports
  50,637 routes
  237 countries (and dependent areas)
  7 continents
  3,749 total vertices
  57,645 total edges

Additional observations:
  Longest route is between SIN and JFK (9,526 miles)
  Shortest route is between WRY and PPW (2 miles)
  Average route distance is 1,212.918 miles.
  Longest runway is 18,045ft (BPX)
  Shortest runway is 1,300ft (SAB)
  Average number of runways is 1.42123
  Furthest North is LYR (latitude: 78.2461013793945)
  Furthest South is USH (latitude: -54.8433)
  Furthest East is SVU (longitude: 179.341003418)
  Furthest West is TVU (longitude: -179.876998901)
  Closest to the Equator is MDK (latitude: 0.0226000007242)
  Closest to the Greenwich meridian is LDE (longitude: -0.006438999902457)
  Highest elevation is DCY (14,472 feet)
  Lowest elevation is GUW (-72 feet)
  Maximum airport vertex degree (routes in and out) is 620 (FRA)
  Region with the most airports: US-AK (150)
  Country with the most airports: United States (586)
  Continent with the most airports: North America (989)
  Average degree (airport vertices) is 28.902
  Average degree (all vertices) is 28.891

Here are the Top 15 airports sorted by overall number of routes (in and out). In graph terminology this is often called the degree of the vertex or just 'vertex degree'.

    POS   ID  CODE  TOTAL      DETAILS

      1   52   FRA  (620)  out:310 in:310
      2  161   IST  (618)  out:309 in:309
      3   51   CDG  (587)  out:293 in:294
      4   70   AMS  (568)  out:283 in:285
      5   80   MUC  (541)  out:270 in:271
      6   18   ORD  (529)  out:265 in:264
      7    8   DFW  (506)  out:253 in:253
      8   64   PEK  (497)  out:248 in:249
      9   58   DXB  (496)  out:248 in:248
     10    1   ATL  (484)  out:242 in:242
     11  102   DME  (465)  out:232 in:233
     12   50   LGW  (464)  out:232 in:232
     13   49   LHR  (442)  out:221 in:221
     14   31   DEN  (434)  out:217 in:217
     15   84   MAN  (431)  out:216 in:215

Throughout this book you will find Gremlin queries that can be used to generate many of these statistics.

The source code in this section comes from the 'graph-stats.groovy' sample located in: https://github.com/krlawrence/graph/tree/main/sample-code/groovy.

2.4.1. Updated versions of the air route data

Over time the air-routes graph has evolved, and several versions of the data files are included with the book sources. The examples in this book assume that you are using the 1.0 version of the air-routes data. Unless a section explicitly asks you to load a different file, you should use that default version when following along.

Even though no further releases of the dataset are expected, the GitHub repository still maintains a "latest" version. It is also 1.0. You can download it from https://github.com/krlawrence/graph/blob/main/sample-data/air-routes-latest.graphml

2.5. Loading the air-routes graph using the Gremlin Console

The easiest way to load the air-routes graph is to use the prepackaged version of the dataset that comes with TinkerGraph. We can demonstrate this in the Gremlin Console as follows.

graph = TinkerFactory.createAirRoutes()
g = traversal().with(graph)

That’s it! The graph is now loaded with the 1.0 version of the dataset. These two lines replace a longer set of manual steps, which we will walk through next so that you can understand the details.

The following code loads the air-routes graph from within the console. You can either put the commands into a file and use ':load' to load and run it, or enter each line into the console manually. These commands will set up the console environment, create a TinkerGraph, and load the air-routes.graphml file into it. Some extra console features are also enabled.

These commands create an in-memory TinkerGraph which will use LONG values for the vertex, edge, and vertex property IDs. As part of loading a graph, we need to set up a 'graph traversal source' object called 'g' which we will then refer to in our subsequent queries of the graph. We discussed the ':set max-iteration' command in the "Setting up console preferences" section.

If you are using a different graph environment and GraphML import is supported, you can still load the air-routes.graphml file by following the instructions specific to that system. Once loaded, the queries below should still work either unchanged or with minor modifications.

There is a file called load-air-routes.groovy, available in the /sample-data directory, that contains the commands shown below. https://github.com/krlawrence/graph/tree/main/sample-data

load-air-routes.groovy

conf = new BaseConfiguration()
conf.setProperty("gremlin.tinkergraph.vertexIdManager","LONG")
conf.setProperty("gremlin.tinkergraph.edgeIdManager","LONG")
conf.setProperty("gremlin.tinkergraph.vertexPropertyIdManager","LONG")
graph = TinkerGraph.open(conf)
g = traversal().with(graph)

// Change the path below to point to wherever you put the graphml file
g.io('/mydata/air-routes.graphml').read()
:set max-iteration 1000

Setting the ID manager as shown above is important. If you do not do this, by default, when using TinkerGraph, ID values will have to be specified as strings such as '"3"' rather than just the numeral '3'.

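If you have saved these commands in the load-air-routes.groovy file, you can run them from within the console using the ':load' command as follows.

:load load-air-routes.groovy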

As a best practice, you should use the full path to the location where the GraphML file resides if at all possible to make sure that the GraphML reading code can find it.

Once you have the console up and running and have the graph loaded, you can copy and paste queries from this book directly into it to see them run.

Once the 'air-routes' graph is loaded, you can enter the following command, and you will get back information about the graph. In the case of a TinkerGraph you will get back a useful message telling you how many vertices and edges the graph contains. Note that the contents of this message will vary from one graph system to another and should not be relied upon as a way to keep track of vertex and edge counts. We will look at some other ways of counting things a bit later.

// Tell me something about my graph
graph.toString()

When using TinkerGraph, the message you get back will look something like this.

tinkergraph[vertices:3749 edges:57645]
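If you would rather not depend on the format of that message, one portable alternative is to ask Gremlin itself for the counts. The short sketch below assumes that 'g' is the traversal source created earlier in this section and that the 1.0 version of the air-routes data is loaded. The expected values are the totals listed in the graph statistics earlier in this chapter. Bear in mind that on a very large graph a full count like this can be expensive.

```groovy
// Assumes 'g' was created as shown earlier: g = traversal().with(graph)

// Count every vertex in the graph
g.V().count()    // 3749 with air-routes 1.0

// Count every edge in the graph
g.E().count()    // 57645 with air-routes 1.0
```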

2.6. Turning off some of the Gremlin Console’s output

Sometimes the Gremlin Console displays more output than is desirable, especially when you are assigning a result to a variable and are not interested in seeing every result echoed back. An easy way to prevent this is to add an empty list ";[]" to the end of your query as follows.

a=g.V().has('code','AUS').out().toList();[]

2.7. A word about indexes and schemas

Before going much further, it is worth briefly mentioning indexes and schemas. For now, you will be working with TinkerGraph, which does not enforce a formal schema and offers only simple in-memory indexing. That is sufficient for the examples in this chapter, but later chapters will return to these topics in more detail and show how indexes and schemas can have a big impact on query performance and data quality.

As most of the examples in this book are intended to work just fine with only a basic TinkerGraph, the subject of indexes is not covered in detail until Chapter 6 "MOVING BEYOND THE GREMLIN CONSOLE". However, as TinkerGraph does have some indexing capability, we have also included some discussion of it in the "Introducing TinkerGraph indexes" section. You should always refer to the specific documentation for the graph system you are using to decide what you need to do about creating an index and schema for your graph.

In general, for any graph database, regardless of whether it is optional or not, use of an index should be considered a best practice.

When working with TinkerGraph, there is no need to define a schema ahead of time. The types of each property are derived at creation time. This is a really convenient feature and allows us to get productive and do some experimenting really quickly.

In production systems, especially those where the graphs are large, the task of creating and managing indexes may include the use of additional software components, such as Apache Solr or Elasticsearch.

3. WRITING GREMLIN QUERIES

Now that you hopefully have the 'air-routes' graph loaded, it’s time to start writing some queries!

Chapter 3 is focused on queries that simply read from an existing graph. If you are more interested in adding new vertices, edges and properties or modifying existing properties, you may want to jump to Chapter 4 and in particular the "Adding vertices, edges, and properties" section.

In this chapter we will begin to look at the Gremlin query language. We will start off with a quick look at how Gremlin and SQL differ and are yet in some ways similar, then present some fairly basic queries and finally get into some more advanced concepts. Hopefully, each set of examples presented, building upon things previously discussed, will be easy to understand.

3.1. Introducing Gremlin

Gremlin is the name of the graph traversal and query language that TinkerPop provides for working with property graphs. Gremlin can be used with any graph store that is Apache TinkerPop enabled. Gremlin is a fairly imperative language but has some more declarative constructs as well. Using Gremlin, we can traverse a graph looking for values, patterns and relationships; we can add or delete vertices and edges; we can create sub-graphs; and lots more.

3.1.1. A quick look at Gremlin and SQL

While it is not required to know SQL to be productive with Gremlin, if you do have some experience with SQL, you will notice many of the same keywords and phrases being used in Gremlin. As a simple example, the SQL and Gremlin examples below both show how we might count the number of airports there are in each country using firstly a relational database and secondly a property graph.

When working with a relational database, we might decide to store all the airport data in a single table called 'airports'. In a basic case (the air-routes graph actually stores a lot more data than this about each airport) we could set up our airports table so that it had entries for each airport as follows.

ID   CODE  ICAO  CITY             COUNTRY
---  ----  ----  ---------------  ----------
1    ATL   KATL  Atlanta          US
3    AUS   KAUS  Austin           US
8    DFW   KDFW  Dallas           US
47   YYZ   CYYZ  Toronto          CA
49   LHR   EGLL  London           UK
51   CDG   LFPG  Paris            FR
52   FRA   EDDF  Frankfurt        DE
55   SYD   YSSY  Sydney           AU

We could then use a SQL query to count the distribution of airports in each country as follows.

SELECT country,count(country) FROM airports GROUP BY country;

We can do this in Gremlin using the 'air-routes' graph with a query like the one below (we will explain what all of this means later on in the book).

g.V().hasLabel('airport').groupCount().by('country')

You will discover that Gremlin provides its own flavor of several constructs that you will be familiar with if you have used SQL before, but again, prior knowledge of SQL is in no way required to learn Gremlin.

One thing you will not find when working with a graph using Gremlin is the concept of a SQL 'join'. Graph databases by their very nature avoid the need to join things together (as things that need to be connected already are connected), and this is a core reason why, for many use cases, graph databases are a great choice and can be more performant than relational databases.

Graph databases are usually a good choice for storing and modeling networks. The 'air-routes' graph is an example of a network graph. A social network is, of course, another good example. Networks can be modeled using relational databases too, but as you explore the network and ask questions like "who are my friends' friends?" in a social network or "where can I fly to from here with a maximum of two stops?" things rapidly get complicated and result in the need for multiple 'joins'.

As an example, imagine adding a second table to our relational database called routes. It will contain three columns representing the source airport, the destination airport and the distance between them in miles (SRC, DEST and DIST). It would contain entries that looked like this (the real table would, of course, have thousands of rows, but this gives a good idea of what the table would look like).

SRC  DEST  DIST
---  ----  ----
ATL  DFW   729
ATL  FRA   4600
AUS  DFW   190
AUS  LHR   4901
BOM  AGR   644
BOM  LHR   4479
CDG  DFW   4933
CDG  FRA   278
CDG  LHR   216
DFW  FRA   5127
DFW  LHR   4736
LHR  BOM   4479
LHR  FRA   406
YYZ  FRA   3938
YYZ  LHR   3544

If we wanted to write a SQL query to calculate the ways of traveling from Austin (AUS) to Agra (AGR) with two stops, we would end up writing a query that looked something like this:

SELECT a1.code,r1.dest,r2.dest,r3.dest FROM airports a1
  JOIN routes r1 ON a1.code=r1.src
  JOIN routes r2 ON r1.dest=r2.src
  JOIN routes r3 ON r2.dest=r3.src
  WHERE a1.code='AUS' AND r3.dest='AGR';

Using our 'air-routes' graph database, the query can be expressed quite simply as follows:

g.V().has('code','AUS').out().out().out().has('code','AGR').path().by('code')

Adding or removing hops is as simple as adding or removing one or more of the 'out' steps, which is a lot simpler than having to add additional 'join' clauses to our SQL query. This is a simple example, but as queries get more and more complicated in heavily connected data sets like networks, the SQL queries get harder and harder to write whereas, because Gremlin is designed for working with this type of data, expressing a traversal remains fairly straightforward.

We can go one step further with Gremlin and use 'repeat' to express the concept of 'three times' as follows.

g.V().has('code','AUS').repeat(out()).times(3).has('code','AGR').path().by('code')

Gremlin also has a 'repeat … until' construct that we will see used later in this book. When combined with the 'emit' step, 'repeat' provides a nice way of getting back any routes between a source and destination no matter how many hops it might take to get there.
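As a preview, the sketch below shows roughly what those two constructs can look like. It assumes the air-routes graph is loaded and that 'g' is defined as before. The 'simplePath' step (which stops a traverser from revisiting a vertex it has already been to) and the 'limit' step have not been introduced yet and are used here only to keep the search bounded, so treat the exact shape of these queries as illustrative rather than definitive.

```groovy
// Keep going out from AUS until AGR is reached, however many hops that takes.
// simplePath() prevents cycles and limit(5) stops after five routes are found.
g.V().has('code','AUS').
      repeat(out().simplePath()).
        until(has('code','AGR')).
      limit(5).
      path().by('code')

// Explore up to three hops from AUS, emitting every vertex reached along the way,
// and keep only the traversers that have arrived at AGR.
g.V().has('code','AUS').
      repeat(out().simplePath()).
        emit().
        times(3).
      has('code','AGR').
      limit(5).
      path().by('code')
```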

Again, don’t worry if some of the Gremlin steps shown here are confusing. We will cover them all in detail a bit later. The key point to take away from this discussion of SQL and Gremlin is that for data that is very connected, graph databases provide a great way to store that data, and Gremlin provides a nice and fairly intuitive way to traverse that data efficiently.

One other point worthy of note is that every vertex and every edge in a graph has a unique ID. Unlike in the relational world where you may or may not decide to give a table an ID column, this is not optional with graph databases. In some cases the ID can be a user-provided ID, but more commonly it will be generated by the graph system when a vertex or edge is first created. If you are familiar with SQL, you can think of the ID as a primary key of sorts if you want to. Every vertex and edge can be accessed using its ID. Just as with relational databases, graph databases can be indexed, and any of the properties contained in a vertex or an edge can be added to the index and can be used to find things efficiently. In large graph deployments this greatly speeds up the process of finding things as you would expect. We look more closely at IDs in the "Working with IDs" section.
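To make that concrete, here is a small sketch of looking a vertex up by its ID and of asking a vertex for its ID. It assumes the air-routes graph was loaded with LONG ID managers as shown in section 2.5, in which case the DFW airport has the ID 8 (as the Top 15 table in Chapter 2 indicates).

```groovy
// Look up a vertex directly by its ID and return its airport code
g.V(8).values('code')            // DFW

// Go the other way: find a vertex by a property value and ask for its ID
g.V().has('code','DFW').id()     // 8
```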

3.2. Some fairly basic Gremlin queries

A graph 'query' is often referred to as a 'traversal' as that is what we are in fact doing. We are traversing the graph from a starting point to an ending point. Traversals consist of one or more 'steps' (essentially methods) that are chained together.

As we start to look at some simple traversals, here are a few 'steps' that you will see used a lot. Firstly, you will notice that almost all traversals start with either a 'g.V()' or a 'g.E()'. Sometimes there will be parameters specified along with those steps, but we will get into that a little later. You may remember from when we looked at how to load the 'air-routes' graph in Chapter 2, we used the following instruction to create a graph traversal source object for our loaded 'graph'.

g = traversal().with(graph)

Once we have a graph traversal source object, we can use it to start exploring the graph. The 'V' step returns vertices and the 'E' step returns edges. You can also use a 'V' step in the middle of a traversal as well as at the start, but we will examine those uses a little later. The 'V' and 'E' steps can also take parameters indicating which set of vertices or edges we are interested in. That usage is explained in the "Working with IDs" section.

If it helps with remembering, you can think of 'g.V()' as meaning "looking at all the vertices in the graph" and 'g.E()' as meaning "looking at all the edges in the graph". We then add additional steps to narrow down our search criteria.

The other steps we need to introduce are the 'has' and 'hasLabel' steps. They can be used to test for a certain label or property having a certain value. We will encounter a lot of different Gremlin steps as we explore various Gremlin queries throughout the book, including many other forms of the 'has' step, but these few examples are enough to get us started.

You can refer to the official Apache TinkerPop documentation for full details on all the graph traversal steps that are used in this tutorial. With this tutorial we have not tried to teach every possible usage of every Gremlin step and method. Rather, we have tried to provide a good and approachable foundation in writing many different types of Gremlin queries using an interesting and real-world graph.

The latest TinkerPop documentation is always available at this URL: https://tinkerpop.apache.org/docs/current/reference/

Below are some simple queries against the 'air-routes' graph to get us started. It is assumed that the 'air-routes' graph has been loaded already per the instructions above. The query below will return any vertices that have the 'airport' label.

// Find vertices that are airports
g.V().hasLabel('airport')

This query will return the vertex that represents the Dallas Fort Worth (DFW) airport.

// Find the DFW vertex
g.V().has('code','DFW')

The next two queries combine the previous two into a single query. The first one just chains the queries together. The second shows a form of the 'has' step that we have not looked at before that takes an additional label value as its first parameter.

// Combining those two previous queries (two ways that are equivalent)
g.V().hasLabel('airport').has('code','DFW')

g.V().has('airport','code','DFW')

Here is what we get back from the query. Notice that this is the Gremlin Console’s way of telling us we got back the 'Vertex' with an ID of 8.

v[8]

So, what we actually got back from these queries was a TinkerPop 'Vertex' data structure. Later in this book we will look at ways to store that value into a variable for additional processing. Remember that even though we are working with a Groovy environment while inside the console, everything we are working with here, at its core, is Java code. So we can use the 'getClass' method from Java to introspect the object. Note the call to 'next' which turns the result of the traversal into an object we can work with further.

g.V().has('airport','code','DFW').next().getClass()

class org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex

The 'next' step that we used above is one of a series of steps that the TinkerPop documentation describes as 'terminal steps'. We will see more of these 'terminal steps' in use throughout this book. As mentioned above, a terminal step essentially ends the graph traversal and returns a concrete object that you can work with further in your application. We will see 'next' and other related steps used in this way when we start to look at using Gremlin from a standalone program a bit later on. We could even add a call to 'getMethods()' at the end of the query above to get back a list of all the methods and their types supported by the 'TinkerVertex' class. If you’d like to jump ahead to learn more about terminal steps, you can see more examples in the "Assigning query results to a variable with a terminal step" section.
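As a brief illustration of terminal steps in use, the sketch below shows 'next' alongside two others, 'hasNext' and 'toList'. It assumes the air-routes 1.0 graph is loaded with LONG IDs as described in Chapter 2. The figure of 59 airports in France is taken from the counting examples later in this chapter, and the variable names are purely illustrative.

```groovy
// next() returns the first result as a plain object (here a Vertex)
v = g.V().has('airport','code','DFW').next()
v.id()       // 8
v.label()    // airport

// hasNext() reports whether the traversal has any results at all
g.V().has('airport','code','DFW').hasNext()    // true

// toList() collects every result into a list; ';[]' suppresses the console output
codes = g.V().has('airport','country','FR').values('code').toList();[]
codes.size() // 59 with air-routes 1.0
```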

3.2.1. Retrieving property values from a vertex

There are several different ways of working with vertex properties. We can add, delete and query properties for any vertex or edge in the graph. We will explore each of these topics in detail over the course of this book. Initially, let’s look at a couple of simple ways that we can look up the property values of a given vertex.

// What property values are stored in the DFW vertex?
g.V().has('airport','code','DFW').values()

Here is the output that the query returns. Note that we just get back the values of the properties when using the 'values' step; we do not get back the associated keys. We will see how to do that later in the book.

US
DFW
13401
Dallas
607
KDFW
-97.0380020141602
airport
US-TX
7
32.896800994873
Dallas/Fort Worth International Airport

The 'values' step can take parameters that tell it to only return the values for the provided key names. The queries below return the values of some specific properties.

// Return just the city name property
g.V().has('airport','code','DFW').values('city')

Dallas

// Return the 'runways' and 'icao' property values.
g.V().has('airport','code','DFW').values('runways','icao')

KDFW
7

3.2.2. Does a specific property exist on a given vertex or edge?

You can test to see if a property exists as well as testing for it containing a specific value. To do this, we can just provide 'has' with the name of the property we are interested in. This works equally well for both vertex and edge properties.

// Find all edges that have a 'dist' property
g.E().has('dist')

// Find all vertices that have a 'region' property
g.V().has('region')

// Find all the vertices that do not have a 'region' property
g.V().hasNot('region')

// The above is shorthand for
g.V().not(has('region'))
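A quick way to see the effect of these existence tests is to combine them with counting. The sketch below assumes the air-routes graph is loaded. The 'count' and 'groupCount' steps are covered properly in the next two sections, so treat this purely as a preview. No expected output is shown, as the exact numbers depend on the data you have loaded.

```groovy
// How many vertices have a 'region' property?
g.V().has('region').count()

// Which kinds of vertices do not have one? Group them by label.
g.V().hasNot('region').groupCount().by(label)
```

Whatever the data, the number of vertices that have the property plus the number that do not should add up to the total number of vertices in the graph.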

3.2.3. Counting things

A common need when working with graphs is to be able to count "how many of something" there are in the graph. We will look in the next section at other ways to count groups of things, but first, let’s look at some examples of using the 'count' step to count various things in our 'air-routes' graph. First, let’s find out how many vertices in the graph represent airports.

// How many airports are there in the graph?
g.V().hasLabel('airport').count()

3504

Now, looking at edges that have a 'route' label, let’s find out how many flight routes are stored in the graph. Note that the 'outE' step looks at outgoing edges. In this case we could also have used the 'out' step instead. The various ways that you can look at outgoing and incoming edges are discussed in the "Starting to walk the graph" section that is coming up soon.

// How many routes are there?
g.V().hasLabel('airport').outE('route').count()

50637

You could shorten the above a little as follows, but this would cause more vertices to be examined as we do not first filter out all vertices that are not airports.

// How many routes are there?
g.V().outE('route').count()

50637

You could also do it this way, but starting by looking at all the edges in the graph is generally considered bad form as property graphs tend to have a lot more edges than vertices.

// How many routes are there?
g.E().hasLabel('route').count()

50637

We have not yet looked at the 'outE' step used above. We will look at it very soon, however, in the "Starting to walk the graph" section.
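As noted earlier in this section, the 'out' step could have been used in place of 'outE' for this count. A minimal sketch of that variation, again assuming the air-routes 1.0 graph is loaded, is shown here. Each outgoing route edge leads to exactly one vertex, so counting the vertices reached gives the same answer as counting the edges.

```groovy
// Count routes by counting the vertices reached over outgoing 'route' edges
g.V().hasLabel('airport').out('route').count()    // 50637, same as the outE form
```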

3.2.4. Counting groups of things

Sometimes it is useful to count how many of each type (or group) of things there are in the graph. This can be done using the 'group' and 'groupCount' steps. While for a very large graph it is not recommended to run queries that look at all vertices or all of the edges in a graph, for smaller graphs this can be quite useful. For the air-routes graph we could easily count the number of different vertex and edge types in the graph as follows.

// How many of each type of vertex are there?
g.V().groupCount().by(label)

If we were to run the query, we would get back a map where the keys are label names and the values are the counts for the occurrence of each label in the graph.

[continent:7,country:237,version:1,airport:3504]

There are other ways we could write the query above that will yield the same result. One such example is shown below.

// How many of each type of vertex are there?
g.V().label().groupCount()

[continent:7,country:237,version:1,airport:3504]

We can also run a similar query to find out the distribution of edge labels in the graph. An example of the type of result we would get back is also shown.

// How many of each type of edge are there?
g.E().groupCount().by(label)

[contains:7008,route:50637]

As before, we could rewrite the query as follows.

// How many of each type of edge are there?
g.E().label().groupCount()

[contains:7008,route:50637]

By way of a side note, the examples above are shorthand ways of writing something like this example which also counts vertices by label.

// As above but using group()
g.V().group().by(label).by(count())

[continent:7,country:237,version:1,airport:3504]

There are often a number of different ways to write Gremlin that will obtain the same result. See the "A word about idiomatic Gremlin" section for more details.

We can be more selective in how we specify the groups of things that we want to count. In the examples below we first count how many airports there are in each country. This will return a map of key:value pairs where the key is the country code and the value is the number of airports in that country. As the fourth and fifth examples show, we can use 'select' to pick just a few values from the whole group that got counted. Of course, if we only wanted a single value, we could just count the airports connected to that country directly, but the last two examples are intended to show that you can count a group of things and still selectively only look at part of that group.

// How many airports are there in each country?
g.V().hasLabel('airport').groupCount().by('country')

// How many airports are there in each country? (look at country first)
g.V().hasLabel('country').group().by('code').by(out().count())

We can easily find out how many airports there are in each continent using 'group' to build a map of continent codes and the number of airports in that continent. The output from running the query is shown below also.

// How many airports are there in each continent?
g.V().hasLabel('continent').group().by('code').by(out().count())

[EU:605,AS:971,NA:989,OC:305,AF:321,AN:0,SA:313]

These queries show how 'select' can be used to extract specific values from the map that we have created. Again, you can see the results we get from running the query.

// How many airports are there in France (having first counted all countries)
g.V().hasLabel('airport').groupCount().by('country').select('FR')

59

// How many airports are there in France, Greece and Belgium respectively?
g.V().hasLabel('airport').groupCount().by('country').select('FR','GR','BE')

[FR:59,GR:39,BE:5]
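If only one country is of interest, there is no need to count every country first. The sketch below shows two ways this might be done, assuming the air-routes 1.0 graph is loaded. The first filters airports on their 'country' property. The second starts from the country vertex and assumes, as the earlier country-first query does, that each country vertex has an outgoing edge to every airport it contains.

```groovy
// Count only the airports whose 'country' property is FR
g.V().has('airport','country','FR').count()    // 59

// Start at the France country vertex and count what it connects to
g.V().has('country','code','FR').out().count()
```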

The 'group' and 'groupCount' steps are invaluable when you want to count groups of things or collect things into a group using some selection criteria. You will find a lot more examples of grouping and counting things in the section called "Counting more things".

3.3. Starting to walk the graph

So far we have mostly just explored queries that look at properties on a vertex or count how many things we can find of a certain type. Where the power of a graph really comes into play is when we start to 'walk' or 'traverse' the graph by looking at the connections (edges) between vertices. The term 'walking the graph' is used to describe moving from one vertex to another vertex via an edge. Typically, when using the phrase 'walking a graph', the intent is to describe starting at a vertex, traversing one or more vertices and edges, and ending up at a different vertex or, sometimes, back where you started in the case of a 'circular walk'. It is straightforward to traverse a graph in this way using Gremlin. The journey we took while on our 'walk' is often referred to as our 'path'. There are also cases when all you want to do is return edges or some combination of vertices and edges as the result of a query, and Gremlin allows this as well. We will explore a lot of ways to modify the way a graph is traversed in the upcoming sections.

The table below gives a brief summary of all the steps that can be used to 'walk' or 'traverse' a graph using Gremlin. You will find all of these steps used in various ways throughout the book. Think of a graph traversal as moving through the graph from one place to one or more other places. These steps tell Gremlin which places to move to next as it traverses a graph for you.

To better understand these steps, it is worth defining some terminology. One vertex is considered to be 'adjacent' to another vertex if there is an edge connecting them. A vertex and an edge are considered 'incident' if they are connected to each other.

Table 1. Where to move next while traversing a graph

Step     Description
-------  ------------------------------------------------
out *    Outgoing adjacent vertices.
in *     Incoming adjacent vertices.
both *   Both incoming and outgoing adjacent vertices.
outE *   Outgoing incident edges.
inE *    Incoming incident edges.
bothE *  Both outgoing and incoming incident edges.
outV     Outgoing vertex.
inV      Incoming vertex.
otherV   The vertex that was not the vertex we came from.

Note that the steps labeled with a '*' can optionally take the name of one or more edge labels as a parameter. If omitted, all relevant edges will be traversed.
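To illustrate how the vertex steps and the edge steps in the table relate to one another, the sketch below pairs them up. It assumes the air-routes graph is loaded and uses the Austin (AUS) airport as a starting point. These are illustrative equivalences rather than recommendations; the individual steps are explained in the sections that follow.

```groovy
// 'out' moves straight to the adjacent vertices...
g.V().has('airport','code','AUS').out('route').count()

// ...which is the same as stepping onto each outgoing edge and then
// moving to the vertex that the edge points at
g.V().has('airport','code','AUS').outE('route').inV().count()

// 'in' is likewise equivalent to 'inE' followed by 'outV'
g.V().has('airport','code','AUS').in('route').count()
g.V().has('airport','code','AUS').inE('route').outV().count()

// 'bothE' followed by 'otherV' reaches the same vertices as 'both'
g.V().has('airport','code','AUS').both('route').count()
g.V().has('airport','code','AUS').bothE('route').otherV().count()
```

Each pair of queries should return the same number, whatever that number happens to be for the data you have loaded.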

3.3.1. Some simple graph traversal examples

To get us started, in this section we will look at some simple graph traversal examples that use some of the steps that were just introduced. The 'out' step is used to find the vertices connected to a given vertex by an outgoing edge, and the 'outE' step is used when you want to examine the outgoing edges from a given vertex. Conversely, the 'in' and 'inE' steps can be used to look for incoming vertices and edges. The 'outE' and 'inE' steps are especially useful when you want to look at the properties of an edge as we shall see in the "Examining the edge between two vertices" section. There are several other steps that we can use when traversing a graph to move between vertices and edges. These include 'bothE', 'bothV' and 'otherV'. We will encounter those in the "Other ways to explore vertices and edges using 'both', 'bothE', 'bothV' and 'otherV'" section.

So let’s use a few examples to help better understand these graph traversal steps. The first query below does a few interesting things. Firstly, we find the vertex representing the Austin airport (the airport with a property of 'code' containing the value 'AUS'). Having found that vertex, we then go 'out' from there. This will find all the vertices connected to Austin by an outgoing edge. Having found those airports, we then ask for the values of their 'code' properties using the 'values' step. Finally, the 'fold' step puts all the results into a list for us. This just makes it easier for us to inspect the results in the Gremlin Console.

// Where can I fly to from Austin?
g.V().has('airport','code','AUS').out().values('code').fold()

Here is what you might get back if you were to run this query in your console.

[PHL,PDX,DTW,OKC,ONT,CLT,CUN,MEM,CVG,IND,MCI,DAL,STL,ABQ,MKE,MDW,OMA,TUL,PVR,NAS,LIT,
 JAX,PVD,SMF,BHM,SDF,BUF,BOI,LBB,ECP,HRL,RNO,CMH,DSM,CZM,AMA,BTR,CHS,GDL,GRR,LIR,PNS,
 SJD,TYS,VPS,XNA,HDN,SFB,BUR,ASE,BZN,BKG,PIE,ATL,BNA,BOS,BWI,DCA,DFW,FLL,IAD,IAH,JFK,
 LAX,MCO,MIA,MSP,ORD,PHX,RDU,SEA,SFO,SJC,TPA,SAN,LGB,SNA,SLC,LAS,DEN,SAT,YYZ,HNL,YVR,
 MSY,LHR,EWR,LGW,HOU,FRA,ELP,AMS,CLE,YYC,OAK,MEX,TUS,PIT]

All edges in a graph have a label. However, one thing we did not do in the previous query was specify a label for the 'out' step. If you do not specify a label, you will get back any connected vertex regardless of its edge label. In this case it does not cause us a problem as airports only have one type of outgoing edge, labeled 'route'. However, in many cases, in graphs you create or are working with, your vertices may be connected to other vertices by edges with differing labels, so it is good practice to get into the habit of specifying edge labels as part of your Gremlin queries. So we could change our query just a bit by adding a label reference on the 'out' step as follows.

// Where can I fly to from Austin?
g.V().has('airport','code','AUS').out('route').values('code').fold()

Despite having just stated that consistently using edge labels in queries is a good idea (unless you truly do want to get back all edges or all connected vertices), we will break our own rule quite a bit in this book. The reason for this is purely to save space and shorten the queries we present.

Here are a few more simple queries similar to the previous one. The first example can be used to answer the question, "Where can I fly to from Austin, with one stop on the way?". Note that, as written, coming back to Austin will be included in the results as this query does not rule it out!

// Where can I fly to from Austin, with one stop on the way?
g.V().has('airport','code','AUS').out('route').out('route').values('code')
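If you would prefer not to see Austin in those results, one possible approach is sketched below. It assumes the air-routes graph is loaded. The 'neq' (not equal) predicate and the 'dedup' step, which removes duplicate results, have not been introduced yet, so consider this a preview rather than the definitive way to write the query.

```groovy
// One-stop destinations from Austin, excluding Austin itself
// and removing any airport that would otherwise appear more than once
g.V().has('airport','code','AUS').
      out('route').out('route').
      has('code',neq('AUS')).
      values('code').dedup()
```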

This query uses an 'in' step to find all the routes that come into the London City Airport (LCY) and returns the IATA codes of the airports those routes come from.

// What routes come in to LCY?
g.V().has('airport','code','LCY').in('route').values('code')

This query is perhaps a bit more interesting. It finds all the routes from London Heathrow airport in England that go to an airport in the United States and returns the IATA codes of those airports.

// Flights from London Heathrow (LHR) to airports in the USA
g.V().has('code','LHR').out('route').has('country','US').values('code')

3.3.2. What vertices and edges did I visit? — Introducing 'path'

A Gremlin method (often called a step) that you will see used a lot in this book is 'path'. After you have done some graph walking using a query, you can use 'path' to get a summary back of where you went. A simple example of a 'path' step being used is shown below. Throughout the book you will see numerous examples of 'path' being used including in conjunction with one or more 'by' steps to specify how the path result should be formatted.

This particular query will return the vertices and outgoing edges starting at the London City (LCY) airport vertex. You can read this query like this: "Start at the LCY vertex, find all outgoing edges and also find all the vertices that are on the other ends of those edges". The 'inV' step gives us the vertex at the other end of the outgoing edge.

// This time, for each route, return both vertices and the edge that connects them.
g.V().has('airport','code','LCY').outE().inV().path()

If you run that query as-is, you will get back a series of results that look like this. This shows that there is a route from vertex 88 to vertex 77 via an edge with an ID of 15142.

[v[88],e[15142][88-route->77],v[77]]

While this result is useful, we might want to return something more human-readable such as the IATA codes for each airport and perhaps the distance property from the edge that tells us how far apart the airports are. We could add some 'by' modulators to our query to do this. The Apache TinkerPop documentation uses the term 'modulator' to describe steps that are not really independent steps but instead alter the behavior of the steps that they are associated with.

A 'modulator' is a step that influences the behavior of the step that it is associated with. Examples of such modulator steps are 'by' and 'as'.

Take a look at the modified form of the query shown below and an example of the results that it will now return. If this is not fully clear yet, don’t panic. Both 'path' and 'by' are used a lot throughout this book.

g.V().has('airport','code','LCY').outE().inV().
      path().by('code').by('dist')

When you run this modified version of the query, you will receive a set of results that look like the following line.

[LCY,456,GVA]

The 'by' modulator steps are processed in a round-robin fashion. If there are not enough modulators specified for the total number of elements in the path, Gremlin just loops back around to the first 'by' step and so on. So even though there were three elements in the path that we wanted to have formatted, we only needed to specify two 'by' modulators. This is because the first and third elements in the path are of the same type, namely airport vertices, and we wanted to use the same property name, 'code', in each of those cases. If we instead wanted to reference a different property name for each element of the path result, we would need to specify three explicit 'by' modulator steps. This would be required if, for example, we wanted to reference the 'city' property of the third element in the path rather than its 'code'.

The 'by' modulator steps are processed in a round-robin fashion in cases where there are more results to apply them to than the number of 'by' modulators specified.

The example above is equivalent to this longer form of the same query.

g.V().has('airport','code','LCY').outE().inV().
      path().by('code').by('dist').by('code')
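
If it helps to see the round-robin rule outside of Gremlin, the plain Python sketch below models it. This is only an illustration and not how TinkerPop implements 'path'. The dictionaries standing in for vertices and edges, and the 'format_path' helper, are assumptions made for this example.

```python
from itertools import cycle


def format_path(path, by_keys):
    """Apply property keys to path elements in round-robin order."""
    # cycle() restarts the list of keys whenever it runs out, which is
    # the wrap-around behavior described for 'by' modulators.
    return [element[key] for element, key in zip(path, cycle(by_keys))]


# Hypothetical stand-ins for two airport vertices and the route edge.
lcy = {"code": "LCY", "city": "London"}
gva = {"code": "GVA", "city": "Geneva"}
route = {"dist": 456}

path = [lcy, route, gva]

print(format_path(path, ["code", "dist"]))          # two keys, wraps around
print(format_path(path, ["code", "dist", "code"]))  # three explicit keys
```

Both calls should produce the same list, because the third path element simply reuses the first key when only two are supplied.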

The example below shows a case where three different 'by' modulators are used. This time the third 'by' modulator step references the 'city' property rather than the airport 'code'. As you can see from the sample output, this time the city name 'Geneva' appears rather than the airport code 'GVA'.

g.V().has('airport','code','LCY').outE().inV().
      path().by('code').by('dist').by('city')

[LCY,456,Geneva]

Sometimes it is necessary to use a 'by' modulator that has no parameter as shown below. This is because the element in the path is not a vertex or edge containing multiple properties but rather a single value, in this case, an integer.

g.V().has('airport','code','LCY').out().limit(5).
      values('runways').
      path().by('code').by('code').by()

The results show the codes for the airports we visited along with a number representing the number of runways the second airport has.

[LCY,CDG,4]
[LCY,FRA,4]
[LCY,DUB,2]
[LCY,FCO,3]
[LCY,AMS,6]

It is also possible to use a traversal inside a 'by' modulator. Such traversals are known as '"anonymous traversals"', and they are discussed in greater detail in the "Deep dive on traversal terminology" section.

For now, it is enough to know that they allow us to do things like combine multiple values as part of a path result. The example below finds five routes that start in Austin and creates a path result containing the airport code and city name for both the source and destination airports. In this case, the anonymous traversal contained within the 'by' modulator is applied to each element in the path.

g.V(3).out().limit(5).path().by(values('code','city').fold())

[[AUS,Austin],[PHL,Philadelphia]]
[[AUS,Austin],[PDX,Portland]]
[[AUS,Austin],[DTW,Detroit]]
[[AUS,Austin],[OKC,Oklahoma City]]
[[AUS,Austin],[ONT,Ontario]]

To demonstrate that just about any arbitrary traversal can be placed inside the 'by' modulator, here is one more example that counts the number of outgoing routes for the source and destination airports as part of generating the 'path' result.

g.V(3).out().limit(5).path().by(outE().count())

[98,147]
[98,80]
[98,145]
[98,33]
[98,23]

3.3.3. Modifying a 'path' using 'from' and 'to' modulators

The 'from' and 'to' modulators for the 'path' step enable us to be more selective and return only part of the path of a traversal rather than all of it.

First, look at the example below. In this case we have just used the same 'path' constructs used in the prior examples. The query returns the first 10 routes found starting at Austin (AUS) with one stop on the way.

g.V().has('airport','code','AUS').out().out().path().by('code').limit(10)

As expected, the results show each airport that was visited.

[AUS,PHL,SXM]
[AUS,PHL,RIC]
[AUS,PHL,LEX]
[AUS,PHL,ISP]
[AUS,PHL,SWF]
[AUS,PHL,DSM]
[AUS,PHL,MYR]
[AUS,PHL,CAE]
[AUS,PHL,CHA]
[AUS,PHL,CHS]

Given that every journey starts in Austin, we might not want the AUS airport code to be part of the returned results. We might just want to capture the places that we ended up visiting after leaving Austin. This can be achieved by labeling the parts of the traversal that we care about using 'as' steps and then using 'from' and 'to' modulators to tell the 'path' step what we are interested in. Take a look at the modified version of the query below.

g.V().has('airport','code','AUS').out().as('a').out().as('b').
      path().by('code').from('a').to('b').limit(10)

This time AUS is not included in the 'path' results.

[PHL,SXM]
[PHL,RIC]
[PHL,LEX]
[PHL,ISP]
[PHL,SWF]
[PHL,DSM]
[PHL,MYR]
[PHL,CAE]
[PHL,CHA]
[PHL,CHS]

Because we did in fact want the rest of the path after skipping the AUS part, we could have left off the 'to' modulator and written the query as follows.

g.V().has('airport','code','AUS').out().as('a').out().
      path().by('code').from('a').limit(10)

As you can see, the results are the same as before.

[PHL,SXM]
[PHL,RIC]
[PHL,LEX]
[PHL,ISP]
[PHL,SWF]
[PHL,DSM]
[PHL,MYR]
[PHL,CAE]
[PHL,CHA]
[PHL,CHS]

There are a lot of ways that 'from' and 'to' can be used. By way of one final example, let’s create a version of the query with three 'out' steps. Note that a bit later we will see how 'repeat' can be used when the same steps need to be used repeatedly like this, but that is not important to this specific example.

g.V().has('airport','code','AUS').out().out().out().
      path().by('code').limit(10)

As expected, we now have an additional stop added to each of the journeys.

[AUS,PHL,SXM,NEV]
[AUS,PHL,SXM,AXA]
[AUS,PHL,SXM,EIS]
[AUS,PHL,SXM,EUX]
[AUS,PHL,SXM,SAB]
[AUS,PHL,SXM,ATL]
[AUS,PHL,SXM,BOS]
[AUS,PHL,SXM,FLL]
[AUS,PHL,SXM,IAD]
[AUS,PHL,SXM,JFK]

Let’s now modify the query to limit which parts of the path are returned.

g.V().has('airport','code','AUS').out().as('a').out().as('b').out().
      path().by('code').from('a').to('b').limit(10)

As you can see, only the parts of the journey that we selected have been returned.

[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]
[PHL,SXM]

We could also have written the query as shown below to only show the results of each path up to a certain point.

g.V().has('airport','code','AUS').out().out().as('b').out().
      path().by('code').to('b').limit(10)

This time only the first three airports visited are included in each result.

[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
[AUS,PHL,SXM]
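
One way to picture what 'from' and 'to' are doing is as a slice over a labeled path. The Python sketch below models that idea under a simplifying assumption that each label is used only once. The 'sub_path' helper and the way the path is represented as (labels, value) pairs are inventions for this illustration, not TinkerPop APIs.

```python
def sub_path(path, from_label=None, to_label=None):
    """Return the values between two labeled positions, both inclusive."""
    def index_of(label):
        # Position of the step carrying the given label.
        return next(i for i, (labels, _) in enumerate(path) if label in labels)

    start = 0 if from_label is None else index_of(from_label)
    end = len(path) - 1 if to_label is None else index_of(to_label)
    return [value for _, value in path[start:end + 1]]


# Models: V().has(...AUS).out().as('a').out().as('b').out()
path = [
    (set(), "AUS"),
    ({"a"}, "PHL"),
    ({"b"}, "SXM"),
    (set(), "NEV"),
]

print(sub_path(path, from_label="a", to_label="b"))  # between 'a' and 'b'
print(sub_path(path, to_label="b"))                  # start of path up to 'b'
print(sub_path(path, from_label="a"))                # from 'a' to the end
```

Leaving out 'from' means the slice begins at the start of the path, and leaving out 'to' means it runs to the end.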

By way of a side note, in cases like this where more than one of the results is identical, you may want to remove the duplicates. That is where the 'dedup' step is useful. You will find coverage of 'dedup' in the "Removing duplicates - introducing 'dedup'" section. However, as a little taste test, let’s add a 'dedup' step to the end of our previous query and see what happens.

g.V().has('airport','code','AUS').out().out().as('b').out().
      path().by('code').to('b').limit(10).dedup()

[AUS,PHL,SXM]

As you can see, all the duplicate results have now been removed. Hopefully, this gives you a good basic understanding of the 'path' step. You will see it used a lot throughout the remainder of this book. However, there are a few things to be aware of when using 'path'. Those concerns are explained in the "A warning that path finding can be memory and CPU intensive" section a bit later.

Does an edge exist between two vertices? You can use the 'hasNext' step to check if an edge exists between two vertices and get a Boolean (true or false) value back. The first query below will return true because there is an edge (a route) between AUS and DFW. The second query will return false because there is no route between AUS and SYD.

g.V().has('code','AUS').out('route').has('code','DFW').hasNext()

true

g.V().has('code','AUS').out('route').has('code','SYD').hasNext()

false

3.3.4. Using 'as', 'select' and 'project' to refer to traversal steps

Sometimes it is useful to be able to remember a point of a traversal by giving it a name (label) and refer to it later on in the same query. The query below uses an 'as' step to attach a label at two different parts of the traversal, each representing different vertices that were found. A 'select' step is later used to refer back to them.

g.V().has('code','DFW').as('from').out().
      has('region','US-CA').as('to').
      select('from','to')

This query, while a bit contrived, and in this case probably a poor substitute for using 'path', returns the following results.

[from:v[8],to:v[151]]
[from:v[8],to:v[181]]
[from:v[8],to:v[244]]
[from:v[8],to:v[384]]
[from:v[8],to:v[605]]
[from:v[8],to:v[865]]
[from:v[8],to:v[872]]
[from:v[8],to:v[877]]
[from:v[8],to:v[13]]
[from:v[8],to:v[23]]
[from:v[8],to:v[24]]
[from:v[8],to:v[26]]
[from:v[8],to:v[28]]
[from:v[8],to:v[42]]

In the example above, only the vertices themselves were selected. We can also use a 'by' modulator to specify which property to retrieve from the selected vertices.

g.V().has('code','DFW').as('from').out().
      has('region','US-CA').as('to').
      select('from','to').by('code')

This time the results contain the airport codes.

[from:DFW,to:ONT]
[from:DFW,to:PSP]
[from:DFW,to:SMF]
[from:DFW,to:FAT]
[from:DFW,to:BUR]
[from:DFW,to:BFL]
[from:DFW,to:MRY]
[from:DFW,to:SBA]
[from:DFW,to:LAX]
[from:DFW,to:SFO]
[from:DFW,to:SJC]
[from:DFW,to:SAN]
[from:DFW,to:SNA]
[from:DFW,to:OAK]

While the prior example was perhaps not ideal, it does show how 'as' and 'select' work. For completeness, here is the same query but using 'path'. You will see both the 'select' and 'path' steps used a lot throughout this book.

g.V().has('code','DFW').out().
      has('region','US-CA').
      path().by('code')

This query produces the following results. Notice that this time the results do not have labels associated with them but are otherwise the same.

[DFW,ONT]
[DFW,PSP]
[DFW,SMF]
[DFW,FAT]
[DFW,BUR]
[DFW,BFL]
[DFW,MRY]
[DFW,SBA]
[DFW,LAX]
[DFW,SFO]
[DFW,SJC]
[DFW,SAN]
[DFW,SNA]
[DFW,OAK]

While the 'path' step is a lot more convenient, in some cases it can be costly in terms of memory and CPU usage, so it is worth remembering these alternative techniques using 'as' and 'select'. That topic is discussed in more detail in the "A warning that path finding can be memory and CPU intensive" section.

You can also give a point of a traversal multiple names and refer to each later on in the traversal/query as shown below.

g.V().has('type','airport').limit(10).as('a','b','c').
      select('a','b','c').
        by('code').by('region').by(out().count())

The 'project' step can achieve the same results as obtained from the combination of 'as' and 'select' steps. The example below shows the previous query, rewritten to use 'project' instead of 'as' and 'select'.

g.V().has('type','airport').limit(10).
      project('a','b','c').
        by('code').by('region').by(out().count())

This query, and the prior query, would return the following results.

[a:ATL,b:US-GA,c:242]
[a:ANC,b:US-AK,c:41]
[a:AUS,b:US-TX,c:98]
[a:BNA,b:US-TN,c:75]
[a:BOS,b:US-MA,c:143]
[a:BWI,b:US-MD,c:91]
[a:DCA,b:US-DC,c:96]
[a:DFW,b:US-TX,c:253]
[a:FLL,b:US-FL,c:158]
[a:IAD,b:US-VA,c:158]

In the prior example we gave our variables simple names like 'a' and 'b'. However, it is sometimes useful to give our traversal variables and named steps more meaningful names, and it is perfectly OK to do that. Let’s rewrite the query to use some more descriptive variable names.

g.V().has('type','airport').limit(10).
      project('IATA','Region','Routes').
        by('code').by('region').by(out().count())

When we run the modified query, here is the output we get.

[IATA:ATL,Region:US-GA,Routes:242]
[IATA:ANC,Region:US-AK,Routes:41]
[IATA:AUS,Region:US-TX,Routes:98]
[IATA:BNA,Region:US-TN,Routes:75]
[IATA:BOS,Region:US-MA,Routes:143]
[IATA:BWI,Region:US-MD,Routes:91]
[IATA:DCA,Region:US-DC,Routes:96]
[IATA:DFW,Region:US-TX,Routes:253]
[IATA:FLL,Region:US-FL,Routes:158]
[IATA:IAD,Region:US-VA,Routes:158]

The 'project' step can be applied to inputs other than graph elements. It can actually operate on any incoming traverser. For example, when 'project' follows 'valueMap' the incoming traverser is a 'Map' object.

g.V().has('code','IAD').valueMap().
  project('r','ct').by('runways').by(count(local))

[r:[4],ct:12]

The prior example shows how 'project' can access the "runways" key and count the number of entries in the 'Map'.

3.3.5. Traits of 'by' modulators

We’ve seen enough use of the 'by' modulator now to dive deeper into its general behavior. As we learned earlier, a 'by' modulator is a form of step that influences the behavior of the step that it is associated with. Moreover, the 'by' modulators are processed in a round-robin fashion in cases where there are more results to apply them to than the number of 'by' modulators specified.

In addition to those basic definitions, there are some other points worth exploring. First, let’s take a look at the list of steps that support 'by' modulation, as 'by' cannot be used on all steps:

'aggregate'

'cyclicPath'

'dedup'

'group'

'groupCount'

'math'

'order'

'path'

'project'

'propertyMap'

'sack'

'sample'

'select'

'simplePath'

'store'

'tree'

'valueMap'

'where'

Next, let’s revisit two of the steps that 'by' is most commonly used with, 'path' and 'project'.

g.V().has('airport','code','AUS').
  out().out().
  path().by('code').
  limit(10)

[AUS,PHL,SXM]
[AUS,PHL,RIC]
[AUS,PHL,LEX]
[AUS,PHL,ISP]
[AUS,PHL,SWF]
[AUS,PHL,DSM]
[AUS,PHL,MYR]
[AUS,PHL,CAE]
[AUS,PHL,CHA]
[AUS,PHL,CHS]

g.V().has('type','airport').limit(10).
      project('a','b','c').
        by('code').
        by('region').
        by(outE().count())

[a:ATL,b:US-GA,c:242]
[a:ANC,b:US-AK,c:41]
[a:AUS,b:US-TX,c:98]
[a:BNA,b:US-TN,c:75]
[a:BOS,b:US-MA,c:143]
[a:BWI,b:US-MD,c:91]
[a:DCA,b:US-DC,c:96]
[a:DFW,b:US-TX,c:253]
[a:FLL,b:US-FL,c:158]
[a:IAD,b:US-VA,c:158]

In both of the prior examples, the 'by' modulators were 'productive', which means that when the 'by' is used by the modulated step, it generates a result. When the 'by' does not emit a result, it is said to be 'unproductive', which introduces an important aspect of Gremlin semantics. Let’s modify the first example with 'path' to make it unproductive by introducing an invalid property key, referring to it as '"cde"' instead of '"code"':

g.V().has('airport','code','AUS').
  out().out().
  path().by('cde').
  limit(10)

// no results

The prior example returns no results. The unproductive 'by' applied to 'path' forces it to behave a bit like a filter. Since 'path' can’t use '"cde"' to access a valid property key, it simply drops that traverser to prevent an error. Now, let’s modify the second example to make it unproductive for some cases, particularly those where the number of outgoing edges is 100 or fewer:

g.V().has('type','airport').limit(10).
      project('a','b','c').
        by('code').
        by('region').
        by(outE().count().is(gt(100)))

[a:ATL,b:US-GA,c:242]
[a:ANC,b:US-AK]
[a:AUS,b:US-TX]
[a:BNA,b:US-TN]
[a:BOS,b:US-MA,c:143]
[a:BWI,b:US-MD]
[a:DCA,b:US-DC]
[a:DFW,b:US-TX,c:253]
[a:FLL,b:US-FL,c:158]
[a:IAD,b:US-VA,c:158]

As you can see, the 'project' step will not emit a key for '"c"' when a modulator is unproductive. Each step will have its own semantics for what it does with an unproductive 'by', but typically there is a form of filtering operation that occurs akin to what we’ve seen in the prior example. We will see more examples of 'by' introducing these kinds of behaviors as we learn new steps in future sections.
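
The behavior of 'project' with an unproductive 'by' can be modeled in a few lines of plain Python. This is a sketch of the semantics described above, not TinkerPop's implementation. The 'project' function here is a hypothetical helper, and returning 'None' is simply our stand-in for a 'by' that emits nothing.

```python
def project(element, keys, by_modulators):
    """Build a map from keys, skipping any key whose 'by' yields nothing."""
    result = {}
    for key, by in zip(keys, by_modulators):
        value = by(element)
        if value is not None:  # None stands in for an unproductive 'by'
            result[key] = value
    return result


# Route counts taken from the earlier 'project' output.
airports = [
    {"code": "ATL", "region": "US-GA", "routes": 242},
    {"code": "ANC", "region": "US-AK", "routes": 41},
]

by_modulators = [
    lambda a: a["code"],
    lambda a: a["region"],
    # Models by(outE().count().is(gt(100))): nothing emitted at 100 or fewer.
    lambda a: a["routes"] if a["routes"] > 100 else None,
]

for airport in airports:
    # Print only the keys present in each projected map.
    print(sorted(project(airport, ["a", "b", "c"], by_modulators)))
```

The first airport keeps all three keys, while the second loses the 'c' key because its route count does not pass the filter.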

Let’s look at another example with 'path' where the 'by' modulator is not productive.

g.V().has('airport','code','AUS').
  outE().inV().outE().inV().
  path().by('dist').
  limit(10).
  sum(local)

// no results

The above example is similar to the earlier one but expands the path Gremlin travels to include 'Edge' objects. In this case, the query seeks to sum the value of the 'dist' properties on edges in the path. The 'Vertex' objects do not have a 'dist' property key and will therefore be unproductive. One way to change that is to ensure that vertices produce a value with their 'by' modulator, by supplying a 'constant(0)' for those cases.

g.V().has('airport','code','AUS').
  outE().inV().outE().inV().
  path().
    by(constant(0)).
    by('dist').
  limit(10).
  sum(local)

3102
1628
1948
1559
1557
2399
1904
1952
2069
1980

With that small adjustment, vertices in the path can produce a '0' integer value which can be summed with the dist property value from the edges.
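
To make the arithmetic concrete, here is a small Python sketch of the same idea. It models a path that alternates vertices and edges, applies the 'by' functions round-robin, and drops the path if any 'by' is unproductive. The helper names and the distances of 1000 and 500 are hypothetical values chosen for illustration and are not taken from the air-routes data.

```python
from itertools import cycle


def path_sum(path, by_modulators):
    """Sum the 'by' results for a path, or return None if the path is filtered."""
    values = []
    for element, by in zip(path, cycle(by_modulators)):
        value = by(element)
        if value is None:
            return None  # an unproductive 'by' filters out the whole path
        values.append(value)
    return sum(values)


def dist(element):
    # Models by('dist'): vertices have no 'dist' key, so this yields None.
    return element.get("dist")


def constant_zero(element):
    # Models by(constant(0)): always productive.
    return 0


# vertex, edge, vertex, edge, vertex (hypothetical distances)
path = [{"code": "AUS"}, {"dist": 1000}, {"code": "PHL"},
        {"dist": 500}, {"code": "SXM"}]

print(path_sum(path, [dist]))                 # path filtered out
print(path_sum(path, [constant_zero, dist]))  # 0 + 1000 + 0 + 500 + 0
```

Because the modulators are applied by position, the vertices always receive 'constant_zero' and the edges always receive 'dist' in this alternating path.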

3.3.6. Using multiple 'as' steps with the same label

It is actually possible, using the 'as' step, to give more than one part of a traversal the same label (name). In the example below, the label ''a'' is used twice, but you will notice that when the label is selected, only the last item added is returned.

g.V(1).as('a').V(2).as('a').select('a')

v[2]

There are some special keywords that can be used in conjunction with the 'select' step in cases like this one. These keywords are 'first', 'last' and 'all', and their usage is shown below.

g.V(1).as('a').V(2).as('a').select(first,'a')

v[1]

g.V(1).as('a').V(2).as('a').select(last,'a')

v[2]

g.V(1).as('a').V(2).as('a').select(all,'a')

[v[1],v[2]]
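
The three keywords can be thought of as different ways of reading a list of everything that was given the same label. The Python sketch below models that view. The 'select' function and the (labels, value) history are hypothetical constructs for illustration only, and the default of "last" mirrors the behavior shown in the first example of this section.

```python
def select(history, label, pop="last"):
    """Pick labeled values from a traverser's history."""
    matches = [value for labels, value in history if label in labels]
    if pop == "first":
        return matches[0]
    if pop == "last":
        return matches[-1]
    if pop == "all":
        return matches
    raise ValueError("pop must be 'first', 'last' or 'all'")


# Models: g.V(1).as('a').V(2).as('a')
history = [({"a"}, "v[1]"), ({"a"}, "v[2]")]

print(select(history, "a"))           # default behaves like 'last'
print(select(history, "a", "first"))
print(select(history, "a", "all"))
```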

Here is another example of a query that labels two different parts of a traversal with the same ''a'' label. As you can see from the results, only the second one is used because of the 'last' keyword provided on the 'select' step.

g.V().has('code','AUS').as('a').
      out().as('a').limit(10).
      select(last,'a').by('code').fold()

[PHL,PDX,DTW,OKC,ONT,CLT,CUN,MEM,CVG,IND]

Here is the same query but using the 'first' keyword this time as part of the 'select' step.

g.V().has('code','AUS').as('a').
      out().as('a').limit(10).
      select(first,'a').by('code').fold()

[AUS,AUS,AUS,AUS,AUS,AUS,AUS,AUS,AUS,AUS]

Note that when the same name is used to label more than one step, the data structure created by Gremlin is essentially a List. As such, the 'by' modulator cannot be used when the 'all' keyword is used on the 'select' step. To get the values of each element in the list, we can use an 'unfold' step as shown below.

g.V().has('code','AUS').as('a').
      out().as('a').limit(10).
      select(all,'a').unfold().values('code').fold()

[AUS,AUS,AUS,AUS,AUS,AUS,AUS,AUS,AUS,AUS,
 PHL,PDX,DTW,OKC,ONT,CLT,CUN,MEM,CVG,IND]

Keywords such as 'all', 'first' and 'last' are discussed further in the "Important Classes and Enums to be aware of" section later on in the book.

3.3.7. Returning selected parts of a path

Sometimes, even using the 'from' and 'to' modulating steps along with a 'path' step will not give you the results you are interested in. Using a 'select' step and some 'as' steps in a similar way to the example in the previous section, we can select specific parts of a traversal’s "path". Consider the query below that finds a route from Los Angeles (LAX) and returns the path.

g.V().has('code','LAX').
      out().
      out().
      out().
      out().
      out().
      limit(1).
      path().by('code')

[LAX,BER,SVG,KSU,OSL,BOS]

Now, imagine we want to just return every other stop as the result from our query. The example below shows how to do just that.

g.V().has('code','LAX').
      out().as('stop').
      out().
      out().as('stop').
      out().
      out().as('stop').
      limit(1).
      select(all,'stop').
      unfold().
      values('code').fold()

[BER,KSU,BOS]

3.3.8. Examining the edge between two vertices

Sometimes, it is the edge between two vertices that we are interested in examining and not the vertices themselves. Typically, this is because we want to look at one or more properties associated with that edge. By way of an example, let’s imagine we wanted to know how many miles the flight is between Miami (MIA) and Dallas Fort Worth (DFW). In our air-routes graph, the distances between vertices are stored using a property called 'dist' on any edge that has a 'route' label. We can use the 'outE' and 'inV' steps to find the edge connecting Miami and Dallas. We can also use the 'select' and 'as' steps that we just learned about to help with this task. Take a look at the query below. This will find the outgoing 'route' edge from MIA to DFW, store it in the traversal variable 'e' and at the end of the query use 'select' to return it as the result of the query.

g.V().has('code','MIA').outE().as('e').inV().has('code','DFW').select('e')

If we were to run the query, we would get back something similar to this.

e[4266][16-route->8]

So we found the 'route' edge that connects the vertex with an ID of 16 (MIA) with the airport that has an ID of 8 (DFW). While interesting, this is not exactly what we set out to achieve. What we actually are interested in is the distance property of that edge, so we can see how far it is from Miami to Dallas Fort Worth. We need to add one additional step to our query that will look at the 'dist' property of the edge. Let’s modify our query to do that.

g.V().has('code','MIA').outE().as('e').
      inV().has('code','DFW').select('e').values('dist')

If we run the query again, we get back what we were looking for. We can see that it is 1,120 miles from Miami to Dallas Fort Worth.

As a side note, we could have written the query using 'inE' and 'outV' and achieved the same result by looking at the edge from Dallas to Miami.

g.V().has('code','MIA').inE().as('e').
      outV().has('code','DFW').select('e').values('dist')

1120

Throughout the remainder of the book you will find lots of examples that use steps such as 'outE', 'inE', 'outV' and 'inV'.

3.4. Limiting the amount of data returned

It is sometimes useful, especially when dealing with large graphs, to limit the amount of data that is returned from a query. As shown in the examples below, this can be done using the 'limit' and 'tail' steps. A little later in this book we also introduce the 'coin' step that allows a pseudo random sample of the data to be returned.

// Only return the FIRST 20 results
g.V().hasLabel('airport').values('code').limit(20)

// Only return the LAST 20 results
g.V().hasLabel('airport').values('code').tail(20)

Depending upon the implementation, it is probably more efficient to write the query like this, with 'limit' coming before 'values' to guarantee fewer airports are initially returned, but it is also possible that an implementation would optimize both the same way.

// Only return the FIRST 20 results
g.V().hasLabel('airport').limit(20).values('code')

Note that 'limit' provides a shorthand alternative to 'range'. The previous example could have been written as follows.

// Only return the FIRST 20 results
g.V().hasLabel('airport').range(0,20).values('code')

We can also limit a traversal by specifying a maximum amount of time that it is allowed to run for. The following query is restricted to a maximum limit of ten milliseconds. The query looks for routes from Austin (AUS) to London Heathrow (LHR). All the parts of this query are explained in detail later on in this book, but we think what they do is fairly clear. The 'repeat' step is explained in detail in the "Shortest paths (between airports) - introducing 'repeat'" section.

// Limit the query to however much can be processed within 10 milliseconds
g.V().has('airport','code','AUS').
      repeat(timeLimit(10).out()).until(has('code','LHR')).path().by('code')

Here is what the query above returned when run on our laptop.

[AUS,LHR]
[AUS,PHL,LHR]
[AUS,PDX,LHR]
[AUS,DTW,LHR]
[AUS,CLT,LHR]
[AUS,NAS,LHR]

If we give the query another 10 milliseconds to run, so 20 in total, you can see that a few more routes were found.

// Limit the query to 20 milliseconds
g.V().has('airport','code','AUS').
      repeat(timeLimit(20).out()).until(has('code','LHR')).path().by('code')

[AUS,LHR]
[AUS,PHL,LHR]
[AUS,PDX,LHR]
[AUS,DTW,LHR]
[AUS,CLT,LHR]
[AUS,NAS,LHR]
[AUS,CHS,LHR]
[AUS,ATL,LHR]
[AUS,BNA,LHR]
[AUS,BOS,LHR]
[AUS,BWI,LHR]
[AUS,DFW,LHR]

3.4.1. Retrieving a range of vertices

Gremlin provides various ways to return a sequence of vertices. We have already seen the 'limit' and 'range' steps used in the previous section to return the first 20 elements of a query result. We can also use the 'range' step to select different ranges of vertices by giving a starting offset and an ending offset. The 'range' offsets are zero-based. The starting offset is inclusive and the ending offset is exclusive.

// Return the first two airport vertices found
g.V().hasLabel('airport').range(0,2)

v[1]
v[2]

The starting value given to a 'range' step does not have to be '0'. In the example below we ask for the 4th, 5th and 6th results found by specifying a range of '"(3,6)"'.

// Return the fourth, fifth and sixth airport vertices found (zero based)
g.V().hasLabel('airport').range(3,6)

v[4]
v[5]
v[6]

Here is an example of how we can use the index '-1' to mean '"until the end of the list"'. This is similar to the convention used in many programming languages when working with arrays and lists.

// Return all the remaining vertices after skipping the first 3500
g.V().range(3500,-1)
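
If the inclusive/exclusive offsets are easy to mix up, the Python sketch below may help. It models 'range' with 'itertools.islice', treating '-1' as "no upper bound". The 'gremlin_range' helper and the vertex IDs 1 through 10 are assumptions for this illustration.

```python
from itertools import islice


def gremlin_range(stream, low, high):
    """Yield items at zero-based positions low (inclusive) to high (exclusive)."""
    stop = None if high == -1 else high  # -1 means "until the end"
    return islice(stream, low, stop)


vertex_ids = range(1, 11)  # hypothetical vertex IDs 1..10

print(list(gremlin_range(vertex_ids, 0, 2)))   # first two items
print(list(gremlin_range(vertex_ids, 3, 6)))   # 4th, 5th and 6th items
print(list(gremlin_range(vertex_ids, 7, -1)))  # everything after the first 7
```

Seen this way, 'limit(n)' corresponds to a range of '(0,n)'.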

Here is another example that uses the 'range' step, this time looking only at vertices with a label of 'country'. Notice how this time we found vertices with much higher ID values.

g.V().hasLabel('country').range(0,2)

v[3505]
v[3506]

There is no guarantee as to which airport vertices will be selected as this depends upon how they are stored by the back end graph. Using TinkerGraph, the airports will most likely come back in the order they are put into the graph. This is not likely to be the case with other graph stores such as JanusGraph. So do not rely on any ordering expectations when using 'range' to process sets of vertices.

You can use a 'skip' step as an alternative to 'range' in some cases. The 'skip' step can be used whenever you would otherwise use 'range' where the second parameter would be '-1' meaning "all remaining".

The two examples below will produce the same results.

g.V().has('region','US-TX').skip(5).fold()

g.V().has('region','US-TX').range(5,-1).fold()

Here is the output you might get from running either query.

[v[39],v[186],v[273],v[278],v[289],v[314],v[356],v[357],v[358],v[361],v[368],
 v[370],v[390],v[394],v[404],v[405],v[423],v[426],v[428],v[1118],v[3313],v[3416]]

To prove that the 'skip' and 'range' steps used above worked, we can run the query again without them and look at the results. You will notice that the first five vertices listed were not included as part of the results from the prior queries.

g.V().has('region','US-TX').fold()

[v[3],v[8],v[11],v[33],v[38],v[39],v[186],v[273],v[278],v[289],v[314],v[356],v[357],
 v[358],v[361],v[368],v[370],v[390],v[394],v[404],v[405],v[423],v[426],v[428],
 v[1118],v[3313],v[3416]]

You can also use the 'local' keyword to have 'skip' work on an incoming collection within a traversal. The example below, while contrived, applies skip to the list generated by the 'fold' step.

g.V().has('region','US-TX').fold().skip(local,3)

[v[33],v[38],v[39],v[186],v[273],v[278],v[289],v[314],v[356],v[357],v[358],v[361],
 v[368],v[370],v[390],v[394],v[404],v[405],v[423],v[426],v[428],v[1118],v[3313],
 v[3416]]

There are many other ways to specify a range of values using Gremlin. You will find several additional examples in the "Testing values and ranges of values with P" section.

3.4.2. Working with the end of a stream

When we want to take items from the end of the stream, we typically use the 'tail' step. With 'tail', you specify the number of elements to take from the end of the stream.

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code').
  tail(1).
  values('code').fold()

[JFK]

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code').
  tail(2).
  values('code').fold()

[IAD,JFK]

In the prior examples, we use 'order' to ensure consistent results for demonstration, and you will often need 'order' in your own queries for the same reason. Note, however, that if you are using 'order', it might be better to rewrite the first query as follows:

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code',desc).
  limit(1).
  values('code').fold()

[JFK]

By using a 'desc' order, we can move JFK to the front of our stream and simply take the first item with 'limit'. This likely speeds up the query considerably, as most graph databases should optimize the 'order', and we likely spare Gremlin from having to iterate the entire stream to get to the last item with 'tail'. Depending upon your requirements, you might find that you could use the same tactic for the second query, but note that the results differ slightly.

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code',desc).
  limit(2).
  values('code').fold()

[JFK,IAD]

In the prior example you can see that we get the same two results, but they are in reversed order. You would have to reverse the 'order' to get the same result as the first query. In these examples, we are dealing with a small dataset with a small set of five results, so the performance implications are not big for either approach, but for larger scales you may find yourself making choices with these patterns that can have significant impact.

Sometimes you may find that you want to take the second-to-last item in the stream. Finding the penultimate element of the stream just means taking the first item found in the list of the last two:

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code').
  tail(2).limit(1).
  values('code').fold()

[IAD]

Gremlin does not have a step for getting all elements of the stream except for the last one. The following pattern allows you to do this by using 'aggregate' to hold the 'tail' of the stream as a side effect, which you can then use later to filter it away.

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code').
  fold().
  sideEffect(tail(local).unfold().
             local(aggregate('t'))).
  unfold().where(neq('t')).by().by(unfold()).
  values('code').fold()

[DAL,DCA,DFW,IAD]

The pattern demonstrated in the prior examples shows how flexible Gremlin can be. With a solid command of the steps Gremlin offers and a clever application of them, you can accomplish a great many things.
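
When the stream is already ordered by a known key, as it is here, a simpler sketch of the same idea is possible. It reuses the 'desc' tactic shown earlier: reverse the order, 'skip' the first item (which is the last item of the original ordering), and then restore the original order. This assumes the cost of sorting twice is acceptable for your data.

g.V().has('airport','code', within('IAD','DAL','JFK','DCA','DFW')).
  order().by('code',desc).
  skip(1).
  order().by('code').
  values('code').fold()

[DAL,DCA,DFW,IAD]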

3.5. Removing duplicates - introducing 'dedup'

It is often desirable to remove duplicate values from query results. The 'dedup' step allows us to do this. If you are already familiar with Groovy collections, the 'dedup' step is similar to the 'unique' method that Groovy provides. In the example below, the number of runways for every airport in England is queried. Note that in the returned results, there are many duplicate values.

g.V().has('region','GB-ENG').values('runways').fold()

[2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,2,1,2,2,1,3,1,3,3,4,1,1]

If we only wanted a set of unique values in the result, we could rewrite the query to include a 'dedup' step. This time the query results only include one of each value.

g.V().has('region','GB-ENG').values('runways').dedup().fold()

[2,1,3,4]
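
If all you need is the number of distinct values, one possible sketch is to follow 'dedup' with the 'count' step introduced earlier. Given the four unique values shown above, this should return 4.

g.V().has('region','GB-ENG').values('runways').dedup().count()

4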

It is also possible to use a 'by' modulator to specify how 'dedup' should be applied. In the example below we only return one airport for each unique number of runways.

g.V().has('region','GB-ENG').dedup().by('runways').
      values('code','runways').fold()

[LHR,2,LCY,1,BLK,3,LEQ,4]

There is one more form of the 'dedup' step. In this form, one or more strings representing labeled steps are provided as parameters. First, take a look at the query below. It finds vertex 'V(3)' and labels it ''a''. It then finds vertex 'V(4)' and labels it ''c''. Next it finds all the vertices connected to V(4) and labels those ''b''. Only the first 10 are retrieved. Lastly, a 'select' step is used to return the results. As expected, vertices 3 and 4 are present in all the results.

g.V(3).as('a').V(4).as('c').both().as('b').limit(10).
  select('a','b','c')

[a:v[3],b:v[1],c:v[4]]
[a:v[3],b:v[3],c:v[4]]
[a:v[3],b:v[5],c:v[4]]
[a:v[3],b:v[6],c:v[4]]
[a:v[3],b:v[7],c:v[4]]
[a:v[3],b:v[8],c:v[4]]
[a:v[3],b:v[9],c:v[4]]
[a:v[3],b:v[10],c:v[4]]
[a:v[3],b:v[11],c:v[4]]
[a:v[3],b:v[12],c:v[4]]

Taking the same query and adding a 'dedup' step that references the ''a'' and ''c'' labels removes all duplicate results that include those vertices, so this time, even though a 'limit' of 10 is used, we only actually get one result back.

g.V(3).as('a').V(4).as('c').both().as('b').limit(10).
  dedup('a','c').select('a','b','c')

[a:v[3],b:v[1],c:v[4]]

A bit later we will take a look at the concept of 'local' scope when working with traversals. There are some examples of 'local' scope being used in conjunction with 'dedup' in the "Using 'local' scope with collections" section.

It is also possible to use 'sets' to achieve similar results as we shall see in some of the following sections such as the "Introducing 'toList', 'toSet', 'bulkSet' and 'fill'" section that is coming up soon.

3.6. Using 'valueMap' to explore the properties of a vertex or edge

A call to 'valueMap' will return all the properties of a vertex or edge as a map of key:value pairs. In Java terms this is a 'Map'. You can also select which properties you want 'valueMap' to return if you do not want them all. Each element in the map can be addressed using the name of the key. By default, the ID and label are not included in the map unless a parameter of 'true' is provided.

The query below will return the keys and values for all properties associated with the Austin airport vertex.

// Return all the properties and values the AUS vertex has
g.V().has('code','AUS').valueMap().unfold()

If you are using the Gremlin Console, the output from running the previous command should look something like this. The 'unfold' step at the end of the query is used to make the results easier to read.

country=[US]
code=[AUS]
longest=[12250]
city=[Austin]
elev=[542]
icao=[KAUS]
lon=[-97.6698989868164]
type=[airport]
region=[US-TX]
runways=[2]
lat=[30.1944999694824]
desc=[Austin Bergstrom International Airport]

Notice how each key, like 'country', is followed by a value returned as an element of a list. This is because it is possible (for vertices but not for edges) to provide more than one property value for a given key by encoding them as a list or as a set. We discuss how to better control this output in the paragraphs that follow.

Here are some more examples of how 'valueMap' can be used. If a parameter of 'true' is provided, then the results returned will include the ID and label of the element being examined.

// If you also want the ID and label, add a parameter of true
g.V().has('code','AUS').valueMap(true).unfold()

id=3
label=airport
country=[US]
code=[AUS]
longest=[12250]
city=[Austin]
elev=[542]
icao=[KAUS]
lon=[-97.6698989868164]
type=[airport]
region=[US-TX]
runways=[2]
lat=[30.1944999694824]
desc=[Austin Bergstrom International Airport]

You can also include 'true' along with requesting the map for specific properties. The next example will just return the ID, label and 'region' property.

// If you want the ID, label and a specific field like the region, you can do this
g.V().has('code','AUS').valueMap(true,'region')

[id:3,region:[US-TX],label:airport]

If you only need the keys and values for specific properties to be returned, it is recommended to pass the names of those properties as parameters to the 'valueMap' step so it does not return a lot more data than you need. Think of this as the difference, in the SQL world, between selecting just the columns you are interested in from a table rather than doing a 'SELECT *'.

As shown above, you can specify which properties you want returned by supplying their names as parameters to the 'valueMap' step. For completeness, it is worth noting that you can also use a 'select' step to refine the results of a 'valueMap'.

// You can 'select' specific fields from a value map
g.V().has('code','AUS').valueMap().select('code','icao','desc')

[code:[AUS],icao:[KAUS],desc:[Austin Bergstrom International Airport]]

As an additional example of Gremlin’s flexibility, note that you can also restrict the selected keys to all but those you specify.

g.V('3').valueMap().as('vm').unfold().
  filter(select(keys).is(without('city','desc')))

country=[US]
code=[AUS]
longest=[12250]
elev=[542]
icao=[KAUS]
lon=[-97.6698989868164]
type=[airport]
region=[US-TX]
runways=[2]
lat=[30.1944999694824]

If you are reading the output of queries that use 'valueMap' from the console, it is sometimes easier to read the output if you add an 'unfold' step to the end of the query as follows. The 'unfold' step will unbundle a collection for us. You will see it used in many parts of this book.

g.V().has('code','AUS').valueMap(true,'code','icao','desc','city').unfold()

code=[AUS]
city=[Austin]
icao=[KAUS]
id=3
label=airport
desc=[Austin Bergstrom International Airport]

You can also use 'valueMap' to inspect the properties associated with an edge. In this example, the edge with an ID of 5161 is examined. As you can see, the edge represents a route and has a distance ('dist') property with a value of 4663 miles.

g.E(5161).valueMap(true)

[id:5161,label:route,dist:4663]

There are other ways to control the results that a 'valueMap' step returns using the 'with' modulator.

The valueMap configuration options are described in the official documentation at the following link https://tinkerpop.apache.org/docs/current/reference/#valuemap-step.

Instead of using 'valueMap(true)' to include the ID and label of an element (a vertex or an edge) in the results, the new 'with(WithOptions.tokens)' construct can now be used as shown below.

g.V().has('code','SFO').valueMap().with(WithOptions.tokens).unfold()

id=23
label=airport
country=[US]
longest=[11870]
code=[SFO]
city=[San Francisco]
lon=[-122.375]
type=[airport]
elev=[13]
icao=[KSFO]
region=[US-CA]
runways=[4]
lat=[37.6189994812012]
desc=[San Francisco International Airport]

All the possible values that can be specified using WithOptions can be found in the official Apache TinkerPop Javadoc documentation https://tinkerpop.apache.org/javadocs/current/full/org/apache/tinkerpop/gremlin/process/traversal/step/util/WithOptions.html.

You can still include the ID and label in the results, along with a subset of the properties, by explicitly naming the property keys you are interested in. In the example below, only the 'code' property is requested.

g.V().has('code','SFO').valueMap('code').with(WithOptions.tokens).unfold()

id=23
label=airport
code=[SFO]

You can use additional 'WithOptions' qualifiers to select just the labels.

g.V().has('code','SFO').
      valueMap('code').with(WithOptions.tokens,WithOptions.labels).
      unfold()

label=airport
code=[SFO]

In the same way you can choose to just have the ID value returned without the label.

g.V().has('code','SFO').
      valueMap('code').with(WithOptions.tokens,WithOptions.ids).
      unfold()

id=23
code=[SFO]

As discussed earlier in this section, the property values returned by 'valueMap' are by default represented as lists even if there is only a single property value present.

You can very easily request that these values be returned as single values, not wrapped in lists. This can be done using a 'by' step modulator as shown below.

g.V().has('code','SFO').valueMap().by(unfold()).unfold()

Notice how all the values, such as the city name, "San Francisco", are now just simple strings or numeric values and not a single value wrapped in a list of length one.

country=US
code=SFO
longest=11870
city=San Francisco
elev=13
icao=KSFO
lon=-122.375
type=airport
region=US-CA
runways=4
lat=37.6189994812012
desc=San Francisco International Airport
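
The 'by' modulator can also be combined with the 'with' modulator shown earlier. As a small sketch, the query below asks for the ID and label along with two named properties, with the property values unwrapped from their lists. The order in which the keys are displayed may vary.

g.V().has('code','SFO').
      valueMap('code','city').with(WithOptions.tokens).
      by(unfold())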

There are additional 'WithOptions' settings we can use to change how properties with meta-properties are returned by 'valueMap'. This is covered later as part of the "Using 'unfold' and 'WithOptions' with Meta-Properties" section.

3.7. An alternative to 'valueMap' - introducing 'elementMap'

The 'elementMap' step is similar in many ways to the 'valueMap' step but makes some things a little easier. When using 'valueMap', you need to explicitly request that the ID and label of a vertex or an edge are included in query results. This is not necessary when using 'elementMap'.

g.V().has('code','AUS').elementMap().unfold()

id=3
label=airport
country=US
code=AUS
longest=12250
city=Austin
elev=542
icao=KAUS
lon=-97.6698989868164
type=airport
region=US-TX
runways=2
lat=30.1944999694824
desc=Austin Bergstrom International Airport

As with 'valueMap', you can request only certain property values be included in the resulting map. Note, however, that the property values are not returned as list members. This is a key difference from 'valueMap'. In fact, if the value for a given property is a list or set containing multiple values, 'elementMap' will only return the first member of that list or set. If you need to return 'set' or 'list' cardinality values, you should use 'valueMap' instead.

g.V().has('code','AUS').elementMap('city')

[id:3,label:airport,city:Austin]

The biggest difference between 'elementMap' and 'valueMap' becomes apparent when looking at edges. For a given edge, as well as the ID and label and properties, information about the incoming and outgoing vertices is also returned.

g.V(3).outE().limit(1).elementMap()

[id:3840,label:route,IN:[id:45,label:airport],OUT:[id:3,label:airport],dist:1430]

A similar result could be generated using 'valueMap' as shown below, but it is definitely a bit more work.

g.E(5161).project('v','IN','OUT').
            by(valueMap(true)).
            by(inV().union(id(),label()).fold()).
            by(outV().union(id(),label()).fold())

[v:[id:5161,label:route,dist:4663],IN:[132,airport],OUT:[1,airport]]

To make the output look even closer to the results returned by 'elementMap', we could decide to add some additional 'project' steps.

g.E(5161).project('v','IN','OUT').
            by(valueMap(true)).
            by(project('id','label').
              by(inV().id()).
              by(inV().label())).
            by(project('id','label').
              by(outV().id()).
              by(outV().label())).
            unfold()

The results of running the query are shown below. We added an 'unfold' step to the query just to make the results a little easier to read.

v={id=5161, label=route, dist=4663}
IN={id=132, label=airport}
OUT={id=1, label=airport}

3.8. Assigning query results to a variable with a terminal step

It is extremely useful to be able to assign the results of a query to a variable. The example below stores the results of the 'valueMap' call shown above into a variable called 'aus'.

// Store the properties for the AUS airport in the variable aus.
aus=g.V().has('code','AUS').valueMap().next()

It is necessary to add a call to 'next' to the end of the query in order for this to work. Forgetting to add the call to 'next' is a very commonly made mistake by people getting used to the Gremlin query language. The call to 'next' terminates the traversal part of the query and generates a concrete result that can be stored in a variable. We refer to 'next' as a "terminal step". There are other terminal steps such as 'toList' and 'toSet' that also perform this traversal termination action. We will see those steps used later on.

Once you have some results in a variable, you can refer to it as you would in any other programming language. For now let’s just use the Groovy 'println' to display the results of the query that we stored in 'aus'. We will take a deeper look at the use of variables with Gremlin, and at mixing Java and Groovy code with your queries, later in the book in the "Making Gremlin even Groovier" section.

// We can now refer to aus using key:value syntax
println "The AUS airport is located in " + aus['city'][0]

The AUS airport is located in Austin

The property values in the map are stored as lists. Even if there is only one value for a given key, we still have to add the '[0]' when referencing it; 'aus['city']' on its own returns the whole list. We will explore why property values are stored in this way in the "Attaching multiple values (lists or sets) to a single property" section.
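
If you would rather not index into a list each time, one option is to combine the 'by(unfold())' modulator from the 'valueMap' section with 'next'. The following is a minimal sketch that assumes every property you read holds a single value.

// Unwrap each property value before storing the map
aus=g.V().has('code','AUS').valueMap().by(unfold()).next()

// No [0] is needed now
println "The AUS airport is located in " + aus['city']

The AUS airport is located in Austin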

As a side note, the 'next' step can take a parameter value that tells it how much data to return. For example, if you want the next three vertices from a query like the one below, you can add a call to 'next(3)' at the end of the query. Note that doing this turns the result into an ArrayList. Each element in the list will contain a vertex.

verts=g.V().hasLabel('airport').next(3)

v[1]
v[2]
v[3]

We can call the Java 'getClass' method to verify the type of the values returned.

verts.getClass()

class java.util.ArrayList

verts.get(1).getClass()

class org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex

When using the Gremlin Console, you can check to see what variables you have defined using the command ':show variables'.
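
Calling 'next' on a traversal that has no results throws a 'NoSuchElementException'. When writing Groovy or Java, one way to guard against that is the 'tryNext' terminal method, which returns a Java 'Optional'. The sketch below is illustrative only; the variable name 'found' is arbitrary.

// tryNext wraps the first result, if there is one, in a java.util.Optional
found=g.V().has('airport','code','AUS').tryNext()

// Only use the value if the traversal actually produced one
if (found.isPresent()) println "Found " + found.get()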

3.8.1. Introducing 'toList', 'toSet', 'bulkSet' and 'fill'

It is often useful to return the results of a query as a list or as a set. One way to do this is to use 'toList' or 'toSet' methods. Below you will find an example of each. The call to 'join' is used just to make the results easier to read on a single line.

// Create a list of runway counts in Texas
listr = g.V().has('airport','region','US-TX').
              values('runways').toList().join(',')

2,7,5,3,4,3,3,3,3,4,2,3,2,3,2,2,3,2,1,3,2,3,4,3,4,2,1

Now let’s create a set and observe the different results we get back.

// Create a set of runway counts in Texas (no duplicates)
setr = g.V().has('airport','region','US-TX').
             values('runways').toSet().join(',')

1,2,3,4,5,7

As a side note, in many cases we can use the 'dedup' step to remove duplicates from a result. However, it is worth knowing that a set can be created as a result type as in some cases this can be highly useful. The example below performs the same 'runways' query using a 'dedup' step. We added an 'order' step so that it is easier to compare the results with the previous query.

// Create a list of runway counts in Texas (no duplicates)
g.V().has('airport','region','US-TX').
      values('runways').dedup().order().fold()

[1,2,3,4,5,7]

Finally, let’s create the list again, but without the call to 'join', as that creates a single string result, which is not what we want in this case.

listr = g.V().has('airport','region','US-TX').
        values('runways').toList()

The variable can now be used as you would expect.

listr[1]
7

listr.size()
27

listr[1,3]
7
3

TinkerPop also provides a third method called 'toBulkSet' that can be used to create a collection, a 'bulkSet', at the end of a traversal. The difference between a 'bulkSet' and a 'set' is that 'bulkSet' is a so-called 'weighted set'. A 'bulkSet' stores every value but includes a count of how many times each value occurs. Let’s look at a few examples. First, we can check that the 'bulkSet' does indeed contain all the values.

setb= g.V().has('airport','region','US-TX').values('runways').toBulkSet().join(',')

2,2,2,2,2,2,2,2,7,5,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,1,1

A 'bulkSet' offers some additional methods that we can call. One of these is 'uniqueSize' that will tell us how many unique values are present.

setb= g.V().has('airport','region','US-TX').values('runways').toBulkSet()

// How many unique values are in the set?
setb.uniqueSize()
6

// How many total values are present?
setb.size()
27

The 'asBulk' method returns a map of key/value pairs where the key is the number and the value is the number of times that number appears in the set.

setb.asBulk()

2=8
7=1
5=1
3=11
4=4
1=2
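
If you only need the counts and not the 'bulkSet' itself, a similar tally can be sketched inside the traversal using the 'groupCount' step from the "Counting groups of things" section. The counts should match those above, although the order of the keys in the resulting map may differ.

// Count how many Texas airports have each number of runways
g.V().has('airport','region','US-TX').groupCount().by('runways')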

There is another way to store the results of a query into a collection. This is achieved using the 'fill' method. Unlike 'toList' and the other methods that we just looked at, 'fill' will store the results into a pre-existing variable. The query below defines a list called 'a' and stores the results of the query into it. This will produce the same result as using 'toList'.

a = []
g.V().has('airport','region','US-TX').values('runways').fill(a)

a.size()
27

a[1,3]
7
3

We can define a variable that is a set and use 'fill' to achieve the same result as using 'toSet'.

s = [] as Set
g.V().has('airport','region','US-TX').values('runways').fill(s)

println s

[2, 7, 5, 3, 4, 1]

3.8.2. Ignoring query results with 'iterate'

There are a number of cases where the results of a query are not of interest to you. A common situation where this happens is when you have a graph mutation query and are only interested in persisting those changes to the database but are not interested in doing anything with the output that the traversal produces.

g.addV('airport').property('code','AUS').as('aus').
  addV('airport').property('code','DFW').as('dfw').
  addE('route').from('aus').to('dfw').iterate()

In the example above, you can see that we add two vertices and an edge between them, with the result of the traversal being the edge. Had we used 'next' as the terminal step, the query would have returned the newly created edge. With the use of 'iterate' as the terminal step, the query has no value returned. The vertices are added to the graph along with the edge, and the edge itself is discarded. Additional calls to 'next' after 'iterate' will result in a 'NoSuchElementException'.

The use of 'iterate' can bring some performance improvements because it signals to the graph that you are not interested in the results, which could save some processing costs (e.g., serialization, network).
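
To make the contrast with 'next' concrete, here is a minimal sketch that runs the same mutation against a separate, empty TinkerGraph so that the air-routes data is left untouched. The variable names 'scratch' and 'sg' are placeholders chosen only for this illustration.

// 'scratch' and 'sg' are hypothetical names used only for this sketch
scratch = TinkerGraph.open()
sg = traversal().with(scratch)

// With 'next' the newly created edge is returned and can be kept
e = sg.addV('airport').property('code','AUS').as('aus').
       addV('airport').property('code','DFW').as('dfw').
       addE('route').from('aus').to('dfw').next()

e.label()

route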

3.9. Deep dive on traversal terminology

In Gremlin, we use the word "traversal" often, and there is a fair bit of terminology associated with it. Understanding that terminology will greatly help with your understanding of Gremlin. To get started, let’s return to how we create "g", known as the 'GraphTraversalSource', which is the connection to the graph and the object from which we write Gremlin to query our data.

// create the Graph object which in this case is a TinkerGraph
graph = TinkerGraph.open()

// create the GraphTraversalSource object
g = traversal().with(graph)

In looking at the above code, you might wonder where the 'traversal()' method comes from. It is a static method of the 'AnonymousTraversalSource' class, and it is considered good form to statically import it so that it can be called without the class name. Calling 'traversal()' constructs an anonymous traversal source, which is like having a graph traversal source without a connection to a graph. Calling 'with' on this object binds it to a particular graph data source, and then "g" is ready for you to write your queries.

You may see examples in this book, blog posts or other documentation that use a shorthand for the creation of 'g' like this:

g = TinkerGraph.open().traversal()

While you can take this approach, it does preclude you from having reference to the 'Graph' object. While this is typically not a problem for TinkerGraph, it can be an issue for other TinkerPop-enabled graphs that might require that reference to release resources or access provider-specific APIs.

Another tempting shorthand is to skip the creation of "g" like this:

graph = TinkerGraph.open()
graph.traversal().V()

If you take this approach, you unnecessarily create 'GraphTraversalSource' objects which will just end up in garbage collection. Typically, you create 'g' once and then reuse it. One of the other benefits to creating 'g' once is that it allows you to set configurations on the graph traversal source once rather than for each traversal you execute.

Configuration steps are used to set up options on the graph traversal source. These steps are prefixed with 'with' and work in a builder pattern style as follows:

graph = TinkerGraph.open()
g = traversal().with(graph)

// construct a GraphTraversalSource configured with ReadOnlyStrategy and a
// List side effect named "d"
g = g.withStrategies(ReadOnlyStrategy.instance()).
      withSideEffect('d', [1,2,3])
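
As a rough illustration of what such configuration does, the sketch below builds a read-only source over an empty TinkerGraph and then attempts a mutation. The variable name 'ro' is made up for this example, and the exact wording of the error message may vary between TinkerPop versions.

// 'ro' is a hypothetical name for a read-only traversal source
ro = traversal().with(TinkerGraph.open()).
       withStrategies(ReadOnlyStrategy.instance())

// The strategy is applied when the traversal is iterated,
// so the attempt to add a vertex is rejected at that point
try {
  ro.addV('airport').iterate()
} catch (Exception ex) {
  println "Mutation rejected: " + ex.getMessage()
}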

Once 'g' is configured, you can then use start steps to spawn a 'Traversal'. In TinkerPop, a "traversal" is a term we consider synonymous with "query". In the context of querying graphs, the word "traversal" implies movement which aligns with the visual image of traversing about the vertices and edges of a graph to get the data that you need.

While Gremlin has dozens of steps, not all of them can be used to spawn a 'Traversal'. Only those found on the 'GraphTraversalSource' can be used to do that. The most commonly used start step is 'V', but there are a number of others, like 'E' to start by traversing edges or 'addV' to add a new vertex to the graph. It is important to realize that a 'Traversal' object spawned from 'g' does little on its own and requires that a terminal step be added to get the query results. This point was introduced in the "Assigning query results to a variable with a terminal step" section, but it is worth revisiting again as it is a common mistake made by new Gremlin users. It is easy to illustrate this point clearly with an example in Java.
