Graph Database Queries: Cypher and Gremlin

Key Insights

  • Cypher uses declarative ASCII-art pattern matching that reads like SQL, making it intuitive for developers familiar with relational databases, while Gremlin employs imperative graph traversals with method chaining similar to functional programming.
  • For database portability across multiple graph databases, Gremlin is the better choice as an Apache TinkerPop standard, whereas Cypher offers superior readability and is now available beyond Neo4j through openCypher implementations.
  • Query performance depends more on proper indexing and graph modeling than language choice—both Cypher and Gremlin can achieve similar performance when optimized correctly.

Introduction to Graph Query Languages

Graph databases store data as nodes and edges, representing entities and their relationships. Unlike relational databases that rely on JOIN operations to connect data across tables, graph databases make relationships first-class citizens. This fundamental difference demands specialized query languages designed for traversing connected data.

Two query languages dominate the graph database landscape: Cypher and Gremlin. Cypher, created by Neo4j, uses a declarative approach with pattern-matching syntax. Gremlin, part of Apache TinkerPop, takes an imperative approach with graph traversals. Understanding both languages is essential for working with modern graph databases.

Consider a simple social network graph:

// Cypher representation
CREATE (alice:User {name: 'Alice', age: 30})
CREATE (bob:User {name: 'Bob', age: 28})
CREATE (post:Post {title: 'Graph Databases 101', content: '...'})
CREATE (alice)-[:FOLLOWS]->(bob)
CREATE (alice)-[:AUTHORED]->(post)
CREATE (bob)-[:LIKES]->(post)

This same structure in Gremlin would be created through traversal steps, which we’ll explore shortly.
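
As a language-neutral reference point, the same nodes and relationships can be sketched as plain Python data. This is only an illustrative model of the logical structure; the variable names and the helper functions `out` and `in_` are assumptions made for this sketch, not part of any graph database API.

```python
# Hypothetical in-memory model of the sample graph. Real engines use
# specialised storage; this only mirrors the logical structure.
nodes = {
    "alice": {"label": "User", "name": "Alice", "age": 30},
    "bob": {"label": "User", "name": "Bob", "age": 28},
    "post": {"label": "Post", "title": "Graph Databases 101"},
}

# Each edge is a (source, relationship type, target) triple.
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("alice", "AUTHORED", "post"),
    ("bob", "LIKES", "post"),
]


def out(node_id, rel_type):
    """Return ids reachable from node_id over outgoing edges of rel_type."""
    return [dst for src, rel, dst in edges if src == node_id and rel == rel_type]


def in_(node_id, rel_type):
    """Return ids that point at node_id over edges of rel_type."""
    return [src for src, rel, dst in edges if dst == node_id and rel == rel_type]


followed_by_alice = out("alice", "FOLLOWS")
likers_of_post = in_("post", "LIKES")
```

Both query languages ultimately ask questions of this shape: start at some nodes, follow edges of a given type in a given direction, and read properties off whatever is reached.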

Cypher Fundamentals

Cypher’s syntax resembles ASCII art, making relationship patterns visually intuitive. Nodes are represented in parentheses (), relationships in square brackets [], and arrows -> indicate direction. This visual approach significantly reduces the cognitive load when reading queries.

The basic query pattern follows a MATCH-WHERE-RETURN structure:

// Find all users that Alice follows
MATCH (alice:User {name: 'Alice'})-[:FOLLOWS]->(following:User)
RETURN following.name, following.age

// Find posts liked by Bob's followers
MATCH (bob:User {name: 'Bob'})<-[:FOLLOWS]-(follower)-[:LIKES]->(post:Post)
WHERE post.createdAt > datetime('2024-01-01')
RETURN post.title, count(follower) AS likes
ORDER BY likes DESC

Creating data in Cypher uses the CREATE or MERGE keywords:

// Create a new user
CREATE (charlie:User {name: 'Charlie', age: 35})

// Run as a separate statement: MERGE creates the relationship only if it does not already exist
MATCH (alice:User {name: 'Alice'})
MATCH (charlie:User {name: 'Charlie'})
MERGE (alice)-[:FOLLOWS]->(charlie)

// Create with RETURN to get the created node
CREATE (newPost:Post {title: 'Learning Cypher', createdAt: datetime()})
RETURN newPost
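
The difference between CREATE and MERGE is easy to miss, so here is a minimal Python sketch of the two behaviours. The functions `create` and `merge` are hypothetical stand-ins that only model the semantics on a list of triples; they are not driver calls.

```python
# Relationships stored as (source, type, target) triples.
edges = []


def create(src, rel, dst):
    """Like CREATE: always adds a new relationship, even if one exists."""
    edges.append((src, rel, dst))


def merge(src, rel, dst):
    """Like MERGE: adds the relationship only when no match exists."""
    if (src, rel, dst) not in edges:
        edges.append((src, rel, dst))


merge("alice", "FOLLOWS", "charlie")
merge("alice", "FOLLOWS", "charlie")   # no-op, the pattern already matches
merged_count = len(edges)

create("alice", "FOLLOWS", "charlie")  # duplicates the relationship
created_count = len(edges)
```

Running MERGE twice leaves one relationship, while a later CREATE adds a second, parallel one. That is why MERGE is the usual choice for idempotent imports.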

Filtering with WHERE clauses allows complex conditions:

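// COUNT { } subqueries require Neo4j 5; size() over a bare pattern is no longer accepted there
MATCH (u:User)-[:AUTHORED]->(p:Post)
WHERE u.age > 25
  AND p.createdAt > datetime() - duration({days: 7})
WITH u, p, COUNT { (p)<-[:LIKES]-() } AS totalLikes
WHERE totalLikes > 10
RETURN u.name, p.title, totalLikes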

Gremlin Fundamentals

Gremlin takes a fundamentally different approach. Instead of declaring what pattern you want to find, you imperatively describe how to traverse the graph. Think of it as giving step-by-step directions through the graph structure.

The basic Gremlin pattern starts with g.V() for vertices or g.E() for edges, then chains traversal steps:

// Find all users that Alice follows
g.V().hasLabel('User').has('name', 'Alice')
  .out('FOLLOWS')
  .hasLabel('User')
  .values('name', 'age')

// Find posts liked by Bob's followers
g.V().hasLabel('User').has('name', 'Bob')
  .in('FOLLOWS')
  .out('LIKES')
  .hasLabel('Post')
  .has('createdAt', gt(datetime('2024-01-01')))
  .group()
    .by('title')
    .by(count())
  .order(local).by(values, desc)
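
To make the step-by-step model concrete, the following Python sketch imitates how each chained step transforms the current set of traversers. The `Traversal` class, its method names, and the extra user Carol are assumptions invented for illustration; the sketch is not the TinkerPop API and ignores labels, laziness, and optimisation.

```python
class Traversal:
    """Toy stand-in for a Gremlin traversal: tracks the current vertex ids."""

    def __init__(self, vertices, edges, current=None):
        self.vertices = vertices
        self.edges = edges
        # Like g.V(): start from every vertex unless told otherwise.
        self.current = list(vertices) if current is None else current

    def _step(self, current):
        # Every step returns a new traversal holding the surviving traversers.
        return Traversal(self.vertices, self.edges, current)

    def has(self, key, value):
        return self._step(
            [v for v in self.current if self.vertices[v].get(key) == value]
        )

    def out(self, label):
        return self._step(
            [dst for v in self.current
             for src, lbl, dst in self.edges if src == v and lbl == label]
        )

    def in_(self, label):
        return self._step(
            [src for v in self.current
             for src, lbl, dst in self.edges if dst == v and lbl == label]
        )

    def values(self, key):
        return [self.vertices[v][key] for v in self.current]


vertices = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},  # hypothetical extra user
    4: {"title": "Graph Databases 101"},
}
edges = [(1, "FOLLOWS", 2), (3, "FOLLOWS", 2), (3, "LIKES", 4)]

# Posts liked by Bob's followers, written as a chain of steps.
liked_titles = (
    Traversal(vertices, edges)
    .has("name", "Bob")
    .in_("FOLLOWS")
    .out("LIKES")
    .values("title")
)
```

Each call narrows or moves the working set, which is the sense in which a Gremlin query describes how to walk the graph rather than which pattern to match.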

Creating vertices and edges in Gremlin:

// Add a new user
g.addV('User')
  .property('name', 'Charlie')
  .property('age', 35)

// Create relationship between existing vertices
g.V().hasLabel('User').has('name', 'Alice')
  .as('a')
  .V().hasLabel('User').has('name', 'Charlie')
  .addE('FOLLOWS').from('a')

Gremlin’s functional style shines with filters and transformations:

// cutoff is a timestamp seven days in the past, computed by the client
// and bound as a query parameter
g.V().hasLabel('User')
  .has('age', gt(25))
  .where(
    out('AUTHORED')
      .hasLabel('Post')
      .has('createdAt', gt(cutoff))
  )
  .project('name', 'postCount', 'followerCount')
    .by('name')
    .by(out('AUTHORED').count())
    .by(in('FOLLOWS').count())

Comparative Query Patterns

Let’s examine common graph operations side-by-side to highlight the differences in approach.

Friend-of-friend recommendations (people you might know):

// Cypher: declarative pattern matching
MATCH (me:User {name: 'Alice'})-[:FOLLOWS]->()-[:FOLLOWS]->(recommendation:User)
WHERE NOT (me)-[:FOLLOWS]->(recommendation) AND me <> recommendation
RETURN recommendation.name, count(*) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 10

// Gremlin: imperative traversal
g.V().hasLabel('User').has('name', 'Alice').as('me')
  .out('FOLLOWS').aggregate('direct')   // remember who Alice already follows
  .out('FOLLOWS')
  .where(neq('me'))                      // exclude Alice herself
  .where(without('direct'))              // exclude users she already follows
  .groupCount().by('name')
  .order(local).by(values, desc)
  .limit(local, 10)
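
The logic both queries express can be sketched in a few lines of Python: walk two hops, drop the starting user and anyone already followed, then count how many paths reach each candidate. The `follows` data and the `recommend` function are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical sample data: who follows whom.
follows = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Dave", "Erin", "Alice"},
    "Carol": {"Dave"},
    "Dave": set(),
    "Erin": set(),
}


def recommend(me, limit=10):
    """Rank two-hop candidates by the number of connecting paths."""
    direct = follows[me]
    counts = Counter(
        candidate
        for friend in direct
        for candidate in follows[friend]
        # Skip the user themselves and anyone they already follow.
        if candidate != me and candidate not in direct
    )
    return counts.most_common(limit)


recommendations = recommend("Alice")
```

Dave is reachable through both Bob and Carol, so he ranks above Erin, who is reachable only through Bob.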

Shortest path queries:

// Cypher: built-in shortest path function
MATCH path = shortestPath(
  (alice:User {name: 'Alice'})-[:FOLLOWS*]-(bob:User {name: 'Bob'})
)
RETURN length(path) AS distance, nodes(path)

// Gremlin: path traversal with repeat
// both() ignores edge direction, matching the undirected Cypher pattern above
g.V().hasLabel('User').has('name', 'Alice')
  .repeat(both('FOLLOWS').simplePath())
  .until(has('name', 'Bob'))
  .path()
  .limit(1)
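
Conceptually, an unweighted shortest-path query is a breadth-first search. The sketch below shows that idea on a small hypothetical graph; the `follows` data and the `shortest_path` function are assumptions for illustration, and edges are treated as directed for brevity (an undirected search would also follow them in reverse).

```python
from collections import deque

# Hypothetical sample data: directed FOLLOWS edges.
follows = {
    "Alice": ["Carol", "Dave"],
    "Carol": ["Erin"],
    "Dave": ["Bob"],
    "Erin": ["Bob"],
    "Bob": [],
}


def shortest_path(start, goal):
    """Breadth-first search: the first path that reaches goal is a shortest one."""
    queue = deque([[start]])
    visited = {start}  # plays the role of simplePath(): never revisit a vertex
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in follows.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is unreachable from start


path = shortest_path("Alice", "Bob")
```

Because the search expands one hop at a time, the two-hop route through Dave is found before the three-hop route through Carol and Erin.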

Aggregation and grouping:

// Cypher: SQL-like aggregation
MATCH (u:User)-[:AUTHORED]->(p:Post)<-[:LIKES]-(liker)
RETURN u.name, count(DISTINCT p) AS posts, count(liker) AS totalLikes
ORDER BY totalLikes DESC

// Gremlin: functional grouping
g.V().hasLabel('User')
  .project('name', 'posts', 'totalLikes')
    .by('name')
    .by(out('AUTHORED').count())
    .by(out('AUTHORED').in('LIKES').count())
  .order().by(select('totalLikes'), desc)

Performance Considerations

Both languages require proper indexing for optimal performance. Indexes dramatically improve lookup operations on property values.

// Cypher: create indexes
CREATE INDEX user_name FOR (u:User) ON (u.name)
CREATE INDEX post_created FOR (p:Post) ON (p.createdAt)

// Profile query performance
PROFILE
MATCH (u:User {name: 'Alice'})-[:FOLLOWS*2]->(friend)
RETURN friend.name

// Gremlin: index creation (database-specific)
// For JanusGraph:
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
mgmt.buildIndex('byUserName', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.commit()

// Inspect the traversal plan with explain(); use profile() instead for runtime metrics
g.V().hasLabel('User').has('name', 'Alice')
  .out('FOLLOWS').out('FOLLOWS')
  .explain()

Key performance principles apply to both languages:

  1. Filter early: Apply WHERE clauses or has() steps as early as possible to reduce the working set
  2. Use indexes: Always index properties used in frequent lookups
  3. Limit traversal depth: Unbounded traversals can explode exponentially
  4. Avoid Cartesian products: Be careful with multiple MATCH clauses or complex traversals
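
Two of these principles are easy to quantify with a rough Python sketch. The data, `scan_lookup`, `name_index`, and `frontier_size` are all hypothetical and only model the cost difference; real engines use more sophisticated index structures and planners.

```python
# Hypothetical data set: 10,000 user vertices.
users = [{"id": i, "name": f"user{i}"} for i in range(10_000)]


def scan_lookup(name):
    """Without an index: examine vertices one by one until a match."""
    examined = 0
    for user in users:
        examined += 1
        if user["name"] == name:
            return user, examined
    return None, examined


# With an index: a precomputed name -> vertex map, one probe per lookup.
name_index = {user["name"]: user for user in users}


def frontier_size(branching, depth):
    """Upper bound on vertices reached after `depth` hops when every
    vertex has `branching` outgoing edges."""
    return branching ** depth


found, examined = scan_lookup("user9999")
indexed = name_index["user9999"]
three_hops = frontier_size(50, 3)
```

In the worst case the scan touches every vertex, while the index answers with a single probe. With 50 follows per user, a three-hop traversal can already reach 125,000 vertices, which is why bounding depth matters.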

Choosing Between Cypher and Gremlin

Your choice depends on several factors:

Choose Cypher when:

  • You work primarily with Neo4j
  • Your team has a SQL background
  • Readability and maintainability are priorities
  • You need rapid development with intuitive syntax
  • You are building internal tools where portability isn’t critical

Choose Gremlin when:

  • You require database portability (JanusGraph, Amazon Neptune, Azure Cosmos DB)
  • Your team prefers functional programming paradigms
  • You are building applications that may switch graph databases
  • You need fine-grained control over traversal execution
  • You work within the Apache TinkerPop ecosystem

Learning curve: Cypher is generally easier to learn, especially for developers with SQL experience. Gremlin requires understanding functional programming concepts and graph traversal mechanics.

Ecosystem: Neo4j’s ecosystem around Cypher is mature with excellent tooling, drivers, and documentation. Gremlin benefits from TinkerPop’s vendor-neutral standard and works across multiple databases.

Expressiveness: Both languages can express the same graph operations, but some queries feel more natural in one versus the other. Complex pattern matching often reads cleaner in Cypher, while algorithmic traversals may be clearer in Gremlin.

Conclusion

Cypher and Gremlin represent two philosophies for querying graph data: declarative pattern matching versus imperative traversal. Cypher’s ASCII-art syntax makes it approachable and readable, ideal for teams transitioning from SQL. Gremlin’s functional traversal approach offers portability and precise control over graph walking.

Neither language is objectively superior—the choice depends on your specific requirements. For Neo4j-centric projects prioritizing developer experience, Cypher excels. For multi-database strategies requiring portability, Gremlin is the pragmatic choice. Many organizations find value in understanding both, as graph database adoption continues growing and the lines between them blur through initiatives like openCypher and GQL standardization efforts.

The future likely holds convergence through the ISO GQL standard, published in 2024 as ISO/IEC 39075, but until vendor support matures, mastering both Cypher and Gremlin gives you maximum flexibility in the graph database landscape.
