GraphQL Queries in Python

Taking a script from rough sketch to maintainable production code.

I recently had the challenge of migrating images from one online platform to another. This became particularly interesting because the platform we needed to move the images to exposed a GraphQL API (the “QL” stands for “Query Language”, and “API” is short for Application Programming Interface).

Let’s dive into the world of GraphQL and how we can use it in Python.

Contents

· Getting Started
· Our First GraphQL Query
· Preparing Pagination
· Extracting GraphQL Queries to Separate Files
· Splitting up logic
· Handling Errors
· A Finished Product
· Conclusion and Next Steps

Getting Started

The images that need to be migrated are associated with particular products. But not all of the products are available on the destination platform. It makes sense, then, to start by asking the destination platform which products are available.

Our First GraphQL Query

The SKU (Stock Keeping Unit) code of each product is common across both platforms, so we can start by querying the destination platform’s API for the SKUs of all products. The basic query looks like this:

query {
  products {
    sku
  }
}

One of the great things about GraphQL is how instantly readable the syntax is. We can easily see that this is a query to request the SKUs of all products.

Querying in Python

We can use a Python library to execute this query. Doing so requires a bit of setup. We can use pip to install the gql (short for Graph Query Language) and aiohttp libraries. To handle the configuration parameters required by the API, it would also be useful to have the python-dotenv library.
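
Assuming a standard Python 3 setup, the installation might look something like this (gql 3 was still a pre-release at the time, so we pin a specific alpha version):

pip3 install gql==3.0.0a6 aiohttp python-dotenv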

We can then create a simple script to instantiate the GraphQL client:

from dotenv import dotenv_values
from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport

env = dotenv_values()
headers = {"Authorization": f"Bearer {env['DEST_TOKEN']}"}
transport = AIOHTTPTransport(url=env['DEST_URL'], headers=headers)
client = Client(transport=transport)

Let’s walk through this quickly. We’re starting with the library for handling .env files, which are a great way of keeping configuration parameters such as passwords out of your code. Our .env file for this project looks something like this:

DEST_URL="https://destination-api.example.com"
DEST_TOKEN="super_secret_generated_token"

Our code can use these values to set up the headers and HTTP transport layer used by the gql client, which then makes queries for us like this:

...
products_query = gql("""
  query {
    products {
      sku
    }
  }
""")
response = client.execute(products_query)
print(len(response['products']))

The gql function prepares the query string. The client then sends the query and parses the JSON (JavaScript Object Notation) response into a dictionary containing the key ‘products’, whose value is a list of dictionaries holding the product fields we’ve requested — in this case, SKUs. The response should look something like this:

{'products': [{'sku': '12345'}, {'sku': '23456'}, {'sku': '34567'}]}

But looking at the length of that list and comparing it to the number of expected products (numbering in the thousands), it’s immediately clear that not all the products are being returned. What’s wrong?

Preparing Pagination

To avoid sending a lot of information all at once, which might overwhelm both the requesting client and the responding server, many APIs will only provide a small portion of the available entries, for example 100 records at a time. The documentation of this product API shows that it accepts these parameters:

  • “first” — allowing us to ask for the first n items.
  • “skip” — the number of records to skip over before providing a response.

In GraphQL syntax we can pass these parameters like this:

query {
  products(first: 100, skip: 100) {
    sku
  }
}

This would return us 100 products, starting with the 101st product. In other words, it returns the second “page” of records. But how many records are there? Do we need to prepare a separate query for each set? That would get pretty messy. Fortunately, GraphQL allows us to pass parameters to our query and use them in place of previously hard-coded values like this:

query ($first: Int!, $skip: Int!) {
  products(first: $first, skip: $skip) {
    sku
  }
}

GraphQL parsers understand that any word prefixed with a “$” is a variable. Adding the exclamation point to the parameter type (Int!) makes these required parameters, which means we’ll get an error if we forget to include them.

We can now easily loop through all our products like this:

...
products_query = gql("""
  query ($first: Int!, $skip: Int!) {
    products(first: $first, skip: $skip) {
      sku
    }
  }
""")
page_size = 100
skip = 0

while True:
    vars = {"first": page_size, "skip": skip}
    response = client.execute(products_query, variable_values=vars)
    skip += page_size

    if not response['products']:
        break
    # fetch images and upload etc...

Here we create an infinite loop and keep asking for more products, increasing the “skip” value by the page size on each iteration. When no products are returned, we break our loop.

Extracting GraphQL Queries to Separate Files

An obvious “code smell” is code written in another language wrapped up in a string variable. We can easily extract our GraphQL query into its own file, named something like products_query.graphql, and load it with a function:

def load_query(path):
    with open(path) as f:
        return gql(f.read())

products_query = load_query('products_query.graphql')

This reads much more nicely than having a multi-line string containing our query. If the query file isn’t available our script will crash, but that’s actually for the best; there’s no point in executing an empty query. Keeping the query in its own file also pays off when we need to add more properties to it later on.
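
For instance, a hypothetical later version of products_query.graphql that also fetches a product name (assuming the API exposes such a field) would only touch that file:

query ($first: Int!, $skip: Int!) {
  products(first: $first, skip: $skip) {
    sku
    # hypothetical extra field, assuming the API exposes it
    name
  }
}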

Splitting up logic

Our script is already getting pretty long. Let’s try encapsulating our GraphQL logic in its own class.

from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport

class ProductProvider:
    def __init__(self, conf):
        url = conf['DEST_URL']
        headers = {"Authorization": f"Bearer {conf['DEST_TOKEN']}"}
        transport = AIOHTTPTransport(url=url, headers=headers)
        self._client = Client(transport=transport)
        self._query = self._load_query('products_query.graphql')

    def get_products(self, page_size, skip):
        v = {"first": page_size, "skip": skip}
        return self._client.execute(self._query, variable_values=v)

    def _load_query(self, path):
        with open(path) as f:
            return gql(f.read())

This is ok, but it still means our script will have to handle pagination. A much nicer technique would be to use Python’s yield keyword, making our function into a generator.

Generators - Python Wiki

We can then return products one by one and still handle pagination within our function.

...
def get_products(self, page_size=100):
    skip = 0
    while True:
        v = {"first": page_size, "skip": skip}
        products = self._client.execute(
            self._query,
            variable_values=v
        )
        if not products['products']:
            return

        skip += page_size
        for product in products['products']:
            yield product
...

Including the optional page_size parameter makes it obvious that pagination is occurring and is handled within the function.
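
A minimal usage sketch, assuming the ProductProvider class above and a conf loaded from our .env file, shows that the caller no longer has to think about pages at all:

from dotenv import dotenv_values
from product_provider import ProductProvider

conf = dotenv_values()
provider = ProductProvider(conf)

# page_size is optional; 50 here is an arbitrary choice to show it can be tuned.
for product in provider.get_products(page_size=50):
    print(product['sku'])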

Handling Errors

One nice finishing touch is some query error handling. Looking at the gql library, we can see that query errors are raised as a TransportQueryError, which can carry multiple error messages.

gql/exceptions.py at master · graphql-python/gql

After some experimentation we can see that we usually only care about the first of those error messages, so we can add a method that wraps client.execute and transforms that error into a custom exception.

...
from gql.transport.exceptions import TransportQueryError

class ProductProviderError(Exception):
    pass

class ProductProvider:
...
    def _execute(self, query, vars):
        try:
            return self._client.execute(query, variable_values=vars)
        except TransportQueryError as err:
            raise ProductProviderError(err.errors[0]['message'])

We can then easily catch and print exceptions like this:

from product_provider import ProductProvider, ProductProviderError
...
try:
    for product in product_provider.get_products():
        print(product)
except ProductProviderError as err:
    print(err)

A Finished Product

With the encapsulation of our GraphQL calls into a custom class, we have a very simple script to run.

#!/usr/bin/env python3

from dotenv import dotenv_values
from product_provider import ProductProvider, ProductProviderError

conf = dotenv_values()
product_provider = ProductProvider(conf)

try:
	for product in product_provider.get_products():
		print(product)
except ProductProviderError as err:
	print(err)

Our ProductProvider class is a bit more complicated but encapsulates the API’s pagination and the gql library’s exceptions. It also provides the potential for reuse.

from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport
from gql.transport.exceptions import TransportQueryError

class ProductProviderError(Exception):
	pass

class ProductProvider:
	def __init__(self, conf):
		headers = {"Authorization": f"Bearer {conf['DEST_TOKEN']}"}
		transport = AIOHTTPTransport(url=conf['DEST_URL'], headers=headers)
		self._client = Client(transport=transport)
		self._sku_query = self._load_query('products_query.graphql')

	def get_products(self, page_size=100):
		skip = 0
		while True:
			vars = {"first": page_size, "skip": skip}
			products = self._execute(self._sku_query, vars)
			if not products['products']:
				return

			skip += page_size
			for product in products['products']:
				yield product

	def _load_query(self, path):
		with open(path) as f:
			return gql(f.read())

	def _execute(self, query, variable_values):
		try:
			return self._client.execute(query, variable_values=variable_values)
		except TransportQueryError as err:
			raise ProductProviderError(err.errors[0]['message'])

Our query is in its own file for easy readability, updating and reusability.

query ($first: Int!, $skip: Int!) {
	products(first: $first, skip: $skip) {
		sku
	}
}

Finally, for dependency management, our requirements.txt file should look something like this:

gql~=3.0.0a6
aiohttp~=3.7
python-dotenv~=0.18

To install them all at once, we can run pip3 install -r requirements.txt

Conclusion and Next Steps

Good development practices are usually iterative. We create a rough idea that works and then make it more readable, robust and reusable.

Our script started out very rough, but since we could quickly prove that the concepts worked, we could follow up with best practices and we were able to distribute responsibility and create a reusable component.

We’ll look at a way of handling image downloads and uploads for this project in a future post.

Want more Python tips?

I Solved a Codility Challenge with One Line of Python

Some recommendations about useful tools:

Tools for New Developers — Part 1



