2019-06-21

Rebuilding an existing API server with a modern BFF

mobile

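I'm Minero Aoki from the engineering department. Until last year I mainly worked on building data analytics systems, but lately, for some reason, I've been doing service development on the recipe service. Today I'll talk about Orcha, the BFF (Backends for Frontends) we introduced in the course of that work: why we introduced it and how it is implemented.

How we came to introduce Orcha

Let me start with the background and motivation that led to Orcha.

It started with a project I joined last year to develop a bookmark-like service. We decided to add a new microservice to implement it, and that came with a few requirements (constraints).

The first was that if we decided to pull out, we could do so quickly and cleanly.

The second was that we wanted to avoid increasing the number of API calls from the smartphone app as much as possible.

Look at Figure 1. If we add a new microservice (API) alongside the existing API server and call it from the smartphone app, the part we are adding stays cleanly separated and is easy to implement. But that increases the number of API calls from the app.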

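f:id:mineroaoki:20190621215345j:plain — Figure 1: Simply adding a service increases the number of API calls

For example, the top page of the Cookpad app already calls more than 10 APIs, so we really don't want to add any more calls if we can avoid it.

That said, we didn't want to modify the existing API server (Pantry) either. As Figure 2 shows, if we change Pantry to call the new service, we can fold everything into a single API call. But Pantry is the "world's largest monolithic Rails application" described in an earlier article, a remarkable piece of work in which, for reasons nobody quite understands, merely touching it triples the development time. If at all possible, we wanted to finish development without touching Pantry at all.

f:id:mineroaoki:20190621215427j:plain — Figure 2: Modifying Pantry achieves the goal, but at a price

In short, we didn't want more API calls, so ideally we would implement this by adding values to an existing API. But we didn't want to touch Pantry to do it.

We don't want more API calls... we want to modify an existing API... but we don't want to touch Pantry...

Orcha is what came out when these three wishes underwent a mysterious demon fusion.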

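Orcha: a BFF for Cookpad

Figure 3 shows the architecture after introducing Orcha. As you can see, Orcha sits between the reverse proxy and Pantry, the existing API server, and provides APIs specialized for the smartphone app.

f:id:mineroaoki:20190621215234j:plain — Figure 3: Orcha's architecture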

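Since the original goal this time was to add information from the new service to an existing API, Orcha first calls Pantry's API and then inserts information from the new service into the JSON it gets back. In this case Orcha works like a "high-powered sed for JSON".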
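The "sed for JSON" idea can be sketched roughly as follows. This is only an illustration under assumptions: Orcha itself is written in Java, and the function name `injectFields`, the JSON shape, and the field names (`recipes`, `id`, `bookmarked`) are all hypothetical, chosen just to show an upstream response being enriched with values from a second source.

```swift
import Foundation

enum SketchError: Error { case unexpectedShape }

// Hypothetical helper: merges extra per-recipe fields (from a new service)
// into a Pantry-style JSON response. The JSON layout is an assumption.
func injectFields(into pantryJSON: Data,
                  extras: [String: [String: Any]]) throws -> [String: Any] {
    guard var root = try JSONSerialization.jsonObject(with: pantryJSON) as? [String: Any],
          let recipes = root["recipes"] as? [[String: Any]] else {
        throw SketchError.unexpectedShape
    }
    // Add the extra fields to each recipe that has an entry in `extras`;
    // recipes without an entry pass through unchanged.
    root["recipes"] = recipes.map { recipe -> [String: Any] in
        guard let id = recipe["id"] as? String, let extra = extras[id] else {
            return recipe
        }
        return recipe.merging(extra) { _, new in new }
    }
    return root
}

// Stand-in for the response returned by the existing API server.
let pantryResponse = Data(
    #"{"recipes":[{"id":"1","title":"Curry"},{"id":"2","title":"Ramen"}]}"#.utf8)

// Stand-in for data fetched from the new service, keyed by recipe ID.
let merged = try injectFields(into: pantryResponse,
                              extras: ["1": ["bookmarked": true]])
```

The client still makes one call and receives one JSON document; only the layer in the middle knows that two systems contributed to it.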

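Orcha is a system whose main purpose is to provide APIs specialized for Cookpad's iOS/Android apps, so you can also call it a BFF (Backends for Frontends). A BFF is an API server specialized for a particular client, such as a smartphone app or a web frontend. Its purpose is to provide a dedicated API tailored closely to one client rather than a general-purpose API. For more on BFFs, please read these articles.

Incidentally, at first I was thinking less "BFF" and more "let's build an orchestration layer!", so I lopped off the front of "Orchestration Layer" and named it Orcha.

It doesn't cover every API

Because of how Orcha came about, it does not currently provide every API the smartphone app needs. It provides only a subset of app-specific APIs, such as the top page API that returns all the data for the app's top page. For the remaining APIs, the reverse proxy still sends requests directly to Pantry.

The first reason for this halfway introduction is that, had this project not gone well, we planned to pull out by throwing away the new service together with Orcha. If you are going to throw the whole thing away, it is naturally easier when Orcha serves only the bare minimum of APIs.

The second reason is a passive one: there's no particular benefit in routing through Orcha. Calling existing APIs via Orcha would just add a few milliseconds of latency with little to show for it. If I had to name a benefit, it's that I wouldn't have to explain "only some requests go through it" when writing this article. And if a benefit does appear in the future and we want to route through Orcha, we can simply change it then. So, initially, we chose the option with less to implement.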

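Orcha's implementation design

After some deliberation, I chose Java as Orcha's implementation language. It uses Spring WebFlux and Project Reactor to implement asynchronous request handling.

The first reason for choosing Java and Spring was performance. Pantry is a resource hog: one ECS task (roughly equivalent to one server) has 3 CPU cores and 4 GB of memory, and at daily peak time more than 150 ECS tasks are needed. We really wanted to avoid standing up another server that devours resources at the same rate.

We also have to pay attention to latency. Placing Orcha in front of Pantry means that whatever latency Orcha incurs is added directly to the existing latency. Unless Orcha's latency is kept as small as possible, it will significantly hurt the usability of the smartphone app. To avoid that, we should do things like parallelizing API calls to multiple systems.

Furthermore, we can't predict which APIs we will end up calling, so some of them may be very slow. We need an architecture that won't exhaust its workers and grind to a halt even then. At that point, asynchronous I/O is really the only option.

When you want an asynchronous request framework and high runtime efficiency, the usual choices are the JVM family or Go. In the end we settled on Java and Spring for three reasons: we had used Java many times before; the improvements in Java 8 and Java 9 plus the arrival of Lombok removed any notable complaints about the language; and high-quality AWS SDKs and DB drivers are available.

To be precise, when we started the limited release the request volume was very low, so we implemented synchronous request handling with Spring WebMVC. Later, once the full release was decided, we switched to Spring WebFlux API by API and made them asynchronous. This is where Spring Framework's strengths paid off most: it has both synchronous and asynchronous frameworks, and the two can coexist.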

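Unifying authentication

In terms of performance, there is one more point we took care of when introducing Orcha: unifying authentication.

Figure 4 shows the authentication path of the Cookpad app before Orcha. In short, a system called AuthCenter handles all authentication, and each microservice performs authentication independently.

f:id:mineroaoki:20190621215506j:plain — Figure 4: Authentication until now

That had not been a big problem until now, because basically one API was handled by one system, so the number of API requests equaled the number of authentications.

After introducing Orcha, however, each API request triggers two or more authentications. In the worst case, the number of requests to AuthCenter could suddenly more than double (Figure 5).

f:id:mineroaoki:20190621215527j:plain — Figure 5: Authentication if Orcha is introduced without any thought

AuthCenter is already the most heavily requested service in the company, and frankly its DB is running pretty tight, so suddenly doubling the requests could bring it down. That would be unacceptable, so along with Orcha's full release we implemented unified authentication using an ID token (Figure 6).

f:id:mineroaoki:20190621215614j:plain — Figure 6: Unified authentication using an ID token

It works like this. Orcha, which receives the request first, passes the access token to AuthCenter for verification and receives an ID token containing metadata for authorization and so on. Instead of the access token, Orcha attaches the ID token it received from AuthCenter to its requests to each server.

The ID token is JSON in JWT format, signed with a private key. The public key corresponding to that private key is shared with all services in the company, so each service can verify that the token was definitely issued by AuthCenter. In other words, by performing just that verification locally, a service can complete authentication without asking AuthCenter every time.

I ~~couldn't be bothered~~ didn't have much expertise in this area, so I tossed everything, spec decisions included, over to id:koba789, our invincible do-everything engineer, and he implemented it. I expect id:koba789 will write up the details at some point.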

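Integrating with other systems through the service mesh

All communication between Orcha and its upstream systems goes through Cookpad's standard service mesh. A service mesh is not something you use specifically because of a BFF, but I personally felt several of its benefits this time, so I'd like to describe them.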

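First, the automatic retry feature of the service mesh. You can set a timeout per endpoint; when a request times out it is retried automatically, and if that still fails, traffic is stopped for a while (circuit breaker). This is extremely handy. At first I doubted retries would ever actually happen, but when I tried it they were happening every minute, and I changed my mind. Also, when a failure produces a flood of errors, the circuit breaker kicks in and prevents congestion, which greatly improves availability.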
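To make the retry and circuit breaker behavior concrete, here is a minimal sketch. It is not how Cookpad's service mesh is implemented (there the mesh does this outside the application); the names `CircuitBreaker` and `callWithRetry`, the failure threshold, and the attempt count are all assumptions, and the "stop for a while, then try again" cool-down is deliberately left out to keep the example short.

```swift
// A minimal circuit breaker: it opens after N consecutive failures.
// Threshold and naming are illustrative assumptions.
struct CircuitBreaker {
    let failureThreshold: Int
    private(set) var consecutiveFailures = 0

    init(failureThreshold: Int) {
        self.failureThreshold = failureThreshold
    }

    // While open, callers should stop sending requests upstream.
    var isOpen: Bool { consecutiveFailures >= failureThreshold }

    mutating func record(success: Bool) {
        consecutiveFailures = success ? 0 : consecutiveFailures + 1
    }
}

enum CallError: Error { case timeout, circuitOpen }

// Tries `operation` up to `maxAttempts` times, giving up early if the
// breaker is open. Every outcome is reported to the breaker.
func callWithRetry<T>(breaker: inout CircuitBreaker,
                      maxAttempts: Int,
                      operation: () throws -> T) throws -> T {
    var lastError: Error = CallError.circuitOpen
    for _ in 0..<maxAttempts {
        if breaker.isOpen { throw CallError.circuitOpen }
        do {
            let value = try operation()
            breaker.record(success: true)
            return value
        } catch {
            breaker.record(success: false)
            lastError = error
        }
    }
    throw lastError
}

// Usage: the first attempt "times out", the automatic retry succeeds.
var breaker = CircuitBreaker(failureThreshold: 3)
var attempts = 0
let value = try callWithRetry(breaker: &breaker, maxAttempts: 2) { () throws -> String in
    attempts += 1
    if attempts == 1 { throw CallError.timeout }
    return "ok"
}
```

A single transient timeout is absorbed by the retry, while a run of consecutive failures opens the breaker so that a struggling upstream is not hammered with more traffic.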

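Second, client-side load balancing is easy to set up. Some of Orcha's upstream systems speak gRPC, and whether it was plain HTTP or gRPC with client-side load balancing, the configuration on our side was almost the same, which made things very easy.

Finally, metrics for communication with other systems are collected and visualized automatically (Figure 7). Strictly speaking, this is not a feature of the service mesh itself but "one of the features that is easy to implement once you have a service mesh". At a glance you can see not only the number of errors in your own system but also how many 500s are coming from each upstream system and which system's communication is getting slow, so it was enormously useful for performance and incident investigations. When I show this screen to engineers at other companies they get extraordinarily envious. I think a service mesh is worth implementing for this screen alone.

f:id:mineroaoki:20190621215640p:plain — Figure 7: Monitoring screen for communication with other systems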

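Evaluation after introducing Orcha

That covers how Orcha came about and the details of its design. With that in mind, here are the results and my evaluation so far.

First, I think we achieved the original goal of "adding a new service quickly, in a way that is easy to pull out of, without increasing the number of API calls" without trouble. For Orcha and the new service combined, I got from infrastructure setup to a first working version by myself in one week. That would have been impossible had we developed in Pantry. Also, an engineer who just joined as a new graduate is now developing Orcha, and that is going smoothly too. That would also be unthinkable with Pantry.

Second, performance, which I thought about quite seriously, also looks fine judging from the numbers after the full release. Currently, even at peak time, all processes combined handle every request comfortably with 1 CPU core and 8 GB of memory. It runs on ECS (Docker), of course, and autoscaling is configured, so the number of ECS tasks goes up and down automatically as needed. As a pain point peculiar to asynchronous processing, we did hit problems right after the full release such as "memory leaking at a tremendous rate", but that was resolved early on (it was a timeout setting issue).

Third, I'm also satisfied with the choice of Java and Spring. Spring has many good points, but first of all it is convenient that, by default, application settings can be configured transparently through both a file (application.yml) and environment variables. In development we want to provide handy defaults and sample settings, so we commit application.yml to the repository; in production, where Docker is a given, we set everything via environment variables. Being able to do this easily was a big help. And naturally, configuration items can be automatically mapped to an object with a single annotation and injected via DI.

Another advantage is that when you want to add some small extra feature, there is almost always a library for it. For example, I wanted a simple reverse proxy that runs only in the development environment, and I could implement it just by adding Spring Cloud Gateway and writing a little application.yml. The richness of the library ecosystem is impressive.

Overall, I'm satisfied with both the architecture and the implementation design at this point. The next checkpoint will be when the smartphone app engineers start working on it.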

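Orcha's development roadmap

Finally, here is what I'm currently thinking about Orcha's future roadmap.

The immediate goal is to make Orcha a more complete aggregation layer. Specifically, that means stripping out all the API code in the existing API server (Pantry) that effectively functions as an aggregation layer and moving it into Orcha.

Aggregation-style APIs are few relative to the whole, but their implementations are complex, so there is quite a lot of code. The goal for now is to complete this move so that smartphone app developers can work on Orcha themselves.

Once the aggregation-style APIs have been moved, the remaining APIs should all be resource-handling APIs, and I'd like to split those into small systems and move them to gRPC... but who knows how many more years that will take. There is no end in sight.

Summary

In this article I covered the motivation, implementation, and evaluation of "Orcha", the BFF newly added to Cookpad's recipe service. Personally, what I think went best this time is that we managed both to improve the existing system and to add a new feature. Usually these two are in conflict and you end up agonizing over which to prioritize. With Orcha, unusually, we found a move that kills two birds with one stone, and I'm very satisfied.

And now, to finish, the usual.

We are looking for colleagues to help us take apart the world's largest monolith. Whether you love Rails more than three meals a day, or want to rewrite Rails apps in Java and wipe them from the face of the earth, or (unrelated to today's topic) you are a data engineer, or you want to build AWS Rube Goldberg machines out of S3, SQS, and Lambda, we are hiring all of you. If you're interested, please apply via the site below.

Cookpad careers site: https://info.cookpad.com/careers/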

2019-06-17

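How we overhauled the large, legacy, and critical batch behind recipe search

Hello, this is Arai from the membership business department. Auto Chess is eating all of my free time.

I have previously posted articles about service development on this blog*1, but this one is about a system overhaul. Cookpad has a large-scale batch known as the search batch, and we recently succeeded in replacing it. In this article I'll describe the problems in the old system, the characteristics of the new one, and how the development actually went.

Background

Cookpad's recipe search uses full-text search with Apache Solr as the search server. Long ago, it apparently used Tritonn to run full-text search against dedicated tables in MySQL, and a batch called the "search batch" already existed back then. Put simply, this batch generates the search metadata called the "search index". It collects data from the relevant sources, runs processing such as word segmentation and score calculation to generate the search index, and today it is a group of batches that goes as far as uploading the result to Solr.

This search batch had been in use for more than 10 years, and as the data (fields) we wanted to use as search metadata grew year after year, it became a legacy system with all sorts of problems. Because it is so important to the service, nobody had dared to make drastic changes, and this project aimed to renew it entirely*2.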

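Problems with the old search batch

It depends on multiple DBs and services

The search batch needs to collect not only recipe information but also various metadata tied to recipes and information aggregated by other batches, so it depended on a wide range of services and DBs. In terms of DBs, there was main, used by the recipe service; cookpad_summary, which stores aggregated data; search_summary, which stores search and R&D data; and so on. As for services, it did things like calling the cooking video service's API to find out whether a cooking video is attached to a recipe. Given that new businesses keep being launched, this kind of dependency was expected to keep growing.

It lives in cookpad_all

The old search batch lived in a repository called cookpad_all, which contains the recipe web service, its admin screens, related batches, the API for the mobile apps, and more, and model logic was shared among those services. That in itself is not a problem for a service of modest size. In Cookpad's case, however, the codebase had grown to a scale where it was hard to understand the dependencies between services created by shared logic, and something like "a change to the recipe service unintentionally changed the batch's behavior" could happen. As a result, new members in particular had too many things to consider when changing the code, which severely reduced productivity and raised the risk of introducing bugs.

It is needlessly Rails

The batches in cookpad_all are called kuroko, and because they were implemented in Rails, the old search batch was too. But what this batch actually does is "collect and process large amounts of data", which does not play to Rails's strength of "quickly developing user-facing web applications". In the actual implementation, most of the logic consisted of "creating large numbers of ActiveRecord instances just to fetch data", and the overhead was conspicuous.

Its responsibility is too large

In the old search batch, every field of the search index was generated inside a single method. As a result, adding a new field or editing an existing one always meant touching that method, which was a maintainability problem.

For example, suppose you add a new field by following the existing logic in that method. Cookpad has several models that represent a "recipe", so the "recipe" model used by the existing logic could differ from the one the new field's logic should have referred to, and the result would not behave as intended. Problems like this did occur.

In addition, members whose main job was not search development on the recipe service sometimes modified the search batch, for example when an R&D initiative wanted to add a field to the search index. So from a stakeholder perspective as well, the batch was being edited for multiple reasons, and the system did not satisfy the "single responsibility principle".

Its execution time is too long

In the old search batch, the processing that generates all fields ran serially, and combined with the Rails implementation this made the execution time very long. As long as the batch kept its structure, that time was expected to keep growing as fields were added, and the home-grown parallel execution added to shorten the run time was hurting readability and maintainability.

f:id:spicycoffee:20190617111728p:plain — Structure of the old search batch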

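Overview of the new search batch

To resolve the problems listed above, the new search batch (hereafter fushigibana*3) was implemented to achieve the following.

Unifying the data flow

Earlier I said that, by its nature, the search batch has to collect and process data from many places, but Cookpad now has a DWH that consolidates all kinds of data across the organization. Because data has to be imported from the various sources into the DWH, there is an issue with how fresh the data is; but since search results were already updated daily in the old search batch, we limited fushigibana's data source to the DWH. This removed the dependencies on individual DBs and services and let us unify the data flow.

Moving off Rails and cookpad_all

fushigibana fetches data from the DWH using redshift-connector, applies processing such as word segmentation in Ruby to generate the search index, and uploads it to S3. It is implemented in plain Ruby (we chose Ruby in order to use things like the in-house word segmentation gem). In the process, we also separated the codebase from cookpad_all, so it now exists as a completely independent group of batches.

Splitting classes and running them in parallel

In fushigibana, search index generation is split up by what the data means in the service, which tables it accesses, and so on, so that each piece satisfies the "single responsibility principle". Each of the resulting classes generates a search index with a handful of fields. Finally, those indexes are joined to produce a search index containing all fields. This makes it possible to run the classes in parallel, which shortened the batch's execution time.

Also, a new field can now be added to the search index by implementing a new class without touching existing logic, so the batch as a whole satisfies the "open-closed principle".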
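The split-then-join structure described above can be sketched as follows. This is an illustration only: fushigibana is written in Ruby with SQL against the DWH, and the protocol `IndexGenerator`, the two generator types, the field names, and the sample data here are all hypothetical. The generators run one after another in this sketch; because they share no state, they could equally be run in parallel.

```swift
// Each generator produces a partial index: recipe ID -> a few fields.
typealias PartialIndex = [Int: [String: String]]

// Hypothetical interface; one conforming type per group of related fields.
protocol IndexGenerator {
    func generate() -> PartialIndex
}

struct TitleGenerator: IndexGenerator {
    func generate() -> PartialIndex {
        [1: ["title": "Curry"], 2: ["title": "Ramen"]]
    }
}

struct VideoFlagGenerator: IndexGenerator {
    func generate() -> PartialIndex {
        [1: ["has_video": "true"]]
    }
}

// Joins partial indexes on recipe ID so that each recipe ends up with
// every field produced by any generator.
func join(_ parts: [PartialIndex]) -> PartialIndex {
    var full: PartialIndex = [:]
    for part in parts {
        for (id, fields) in part {
            full[id, default: [:]].merge(fields) { _, new in new }
        }
    }
    return full
}

// Adding a field means adding a generator to this list; existing
// generators are left untouched.
let generators: [IndexGenerator] = [TitleGenerator(), VideoFlagGenerator()]
let index = join(generators.map { $0.generate() })
```

The open-closed property shows up in the last two lines: a new field is a new type appended to the list, not an edit to one giant method.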

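f:id:spicycoffee:20190617144349p:plain — Structure of fushigibana

Developing fushigibana and migrating to it

From here on, I'll describe how we actually developed fushigibana and how we rolled it out to production.

Development flow

Development of fushigibana broadly proceeded as follows.

Investigating the current state
Converting the logic to SQL
Verifying the new logic
Verifying search responses in the staging environment
Production operation on kuroko
Separating the codebase

Investigating the current state

Before starting on the overhaul, we first identified fields that were no longer in use, aiming to lighten the migration as much as we could. We grepped the code and went around asking knowledgeable people about fields that looked unused. This was squarely in the realm of archaeology, and it was fun to learn various stories about past services along the way. After this work, we ended up migrating the logic for 111 fields.

Converting the logic to SQL

Staring at the existing Rails logic, we rewrote it into SQL, piece by piece. Some of the existing logic had unclear behavior and some was simply buggy, so we fixed things as we went. In the end, 32 classes generate the 111 fields.

Verifying the new logic

We verified the validity of the new logic by comparing the search indexes generated by the old and new logic. A strict comparison was impossible, for reasons such as "the data source changed, so the data is updated at different times in the first place" and "we fixed bugs while rewriting the logic", but we eliminated defects in the new logic as far as we could.

Verifying search responses in the development environment

We sent search requests to Solr instances loaded with each search index and compared the responses. Besides manually checking behavior using the recipe service in the development environment, we ran a script over roughly the top 1,000 keywords by search count and compared the number and order of results for the search queries issued by several features, such as "sort by popularity", "sort by newest", and "filter by cooking time", selected by number of users and importance. A strict comparison was not possible here either, but we focused on the experiences that matter most to users and proceeded with cost-effectiveness in mind by "tolerating a certain amount of difference" and "aiming to identify the cause of each difference".

Production operation on the old execution environment

When going to production, rather than separating the codebase right away, we first ran the new logic on the system where the existing batch was running. This step was meant to ensure that, if something went wrong, "the problem lies in the new logic itself, not in the codebase separation".

Separating the codebase

After confirming in the step above that the new logic had no problems, we separated the codebase. In practice, some work came up, such as moving some of the logic in cookpad_all into in-house gems, so the code was not separated with the new logic's validity fully guaranteed. Even so, because it had already run without problems on the existing system, we could proceed with relatively little anxiety.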

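Ensuring safety during the migration

The recipe service affected by the search batch is used by a very large number of users, so we had to minimize the chance of defects during the migration. In this project we verified the system's safety at the following four points before rolling out.

1. Compare the search index generated by the new logic with the one generated by the old logic
2. Compare the responses from a Solr loaded with the new search index with the responses from the current Solr
3. After uploading the new search index to the production Solr, compare the current search results with the previous day's
4. Monitor inquiries from users

Of these, 1 and 2 are exactly "Verifying the new logic" and "Verifying search responses in the development environment" from the development flow above, so here I'll cover 3 and 4.

f:id:spicycoffee:20190617111720p:plain — Verification during the migration

Comparing search results with the previous day

Cookpad's Solr runs in a master-slave configuration: after the search index is uploaded to the master Solr, the slave Solrs, which serve user requests, replicate it (strictly speaking there is also a caching layer on top of this). Put the other way around, even if you upload a search index, users are not affected unless the slaves perform replication.

There was an existing test mechanism that uses this: after the search index is uploaded, for each of the top keywords by search count it compares the previous day's search results with the new ones by result count, and skips replication if the difference is large. This mechanism worked without changes even after the index generation logic changed, so we kept using it as is.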
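A count-based gate of this kind might look like the sketch below. The article does not say how "a large difference" is defined, so the relative-change rule, the 20% default threshold, and the function name `shouldReplicate` are assumptions made purely for illustration.

```swift
// Decides whether the slaves should replicate the newly uploaded index.
// `previous` and `current` map a search keyword to its result count.
// The threshold and the relative-difference rule are assumed values.
func shouldReplicate(previous: [String: Int],
                     current: [String: Int],
                     maxRelativeChange: Double = 0.2) -> Bool {
    for (keyword, before) in previous {
        // A keyword missing from the new results counts as zero hits.
        let after = current[keyword] ?? 0
        let change = abs(Double(after - before)) / Double(max(before, 1))
        if change > maxRelativeChange {
            return false   // too different from yesterday: hold replication
        }
    }
    return true
}

// Usage with made-up numbers: a 2% change passes, a 50% drop does not.
let smallDrift = shouldReplicate(previous: ["curry": 1000], current: ["curry": 980])
let largeDrop = shouldReplicate(previous: ["curry": 1000], current: ["curry": 500])
```

Because the check only looks at result counts per keyword, it does not depend on how the index was generated, which is why such a mechanism can survive a rewrite of the generation logic.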

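Monitoring user inquiries

No matter how much verification you do during development, you can never reduce the chance of a real defect to zero. It goes without saying, but we shared the rollout date with the infrastructure and support teams, and we carried out the rollout after backing up the search index to a slave Solr that receives no user traffic so that we could roll back quickly if the worst happened. After that, we regularly read through the inquiries coming in from users to check whether anything that looked like a defect caused by fushigibana was being reported.

Project retrospective

Results

Here is a summary of what we gained from overhauling the search batch as described above. Each of the "problems with the old search batch" listed at the beginning was resolved as follows:

It depends on multiple DBs and services
- Resolved by using the DWH as the data source
It lives in cookpad_all
- Resolved by implementing it in a separate repository
- As a result, we deleted 1,357 lines of code from cookpad_all
It is needlessly Rails
- Resolved by implementing it in plain Ruby
Its responsibility is too large
- Resolved by splitting index-generator into multiple classes
- Reshaping it to "run small pieces of work in parallel" also made it easier to add retries, which improved the stability of the batch as a whole
- For the same reason, we were able to move the batch execution platform to spot instances, which should lead to cost savings in the future
Its execution time is too long
- Resolved by running the split index-generator classes in parallel
- Specifically, the total went from 7.5 h to 4.5 h, a reduction of about 3 hours

Partly thanks to the careful, repeated verification phases, it has been running stably so far with no defects or user inquiries. On top of the above, there were other gains such as "easier development thanks to more readable code" and "a shared picture of the whole system through documentation", and the situation around the search batch improved greatly through this project.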

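Reflections

The biggest regret this time is that the project dragged on. To be fair, the estimates changed once we got started and the scope of the project widened, so some of it could not be helped, but in the verification phases in particular I think we could have looked for more cost-effective approaches.

For example, comparing search indexes and comparing Solr responses belong to fairly close layers, and skipping one of them might not have made much difference to the accuracy of the verification. The fact that "no defects have appeared" is welcome, but engineering resources are also an important asset for the organization, and going forward I want to stay conscious of "what it costs" when choosing processes and architectures.

Future work

This batch overhaul only covers "recipe search"; "Tsukurepo search", "keyword completion search", and others still run on (yet another) old system. The systems that generate those search indexes will need to be overhauled too, and when that happens we will need to think carefully about whether to put them on fushigibana or how else they should relate to it.

As for fushigibana itself, one issue is that schema registration in AWS Glue currently has to be done by hand from the AWS console. It is documented, but having this one task not be completable within the fushigibana repository is unfriendly to developers, and I think we should adopt a solution such as "implementing a script that calls the AWS Glue API according to the contents of the schema definition file".

Summary

This article described the overhaul of Cookpad's search batch system. "Investigate the current system, design a structure that solves the problems you find, and realize it as simply as technology allows" is obvious yet hard, and putting it into practice on a large system was very rewarding, the kind of work that makes you glad to be an engineer. The system's structure is now quite close to what counts as "normal" today, and we expect that to have a good effect on future development.

Cookpad has many systems on the scale of the search batch, and many of them are waiting to be reworked into better implementations. That of course takes a lot of resources, and we are recruiting engineers all year round. Engineers who want to take on large-scale systems, engineers who want to work on a service that supports many users, engineers who want to improve a service through technology: if you are even a little interested, please do apply.

Careers site

*1: https://techlife.cookpad.com/entry/2018/02/10/150709 and https://techlife.cookpad.com/entry/2018/12/07/121515

*2: Since 2017, Cookpad has been running the Odaiba project, which aims to improve the architecture of the recipe service, and this work was also meant to contribute to it

*3: Firing data into Solr → Solar Beam → Fushigibana (the Pokémon known in English as Venusaur)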

2019-06-14

Working with AWS AppSync on iOS

iOS AWS

Hi, this is Chris from Cookpad's Media Product Globalization department.

I'm going to discuss some pitfalls we've run into while working with AWS AppSync for iOS. This post is not a primer on AppSync, nor is it a general review of whether you should or should not use AppSync for your project. My goal is to point out some lessons we've learned so far that weren't obvious at first. My second disclaimer is that AppSync itself is under active development, so you can probably expect that some of the points I cover in this post will be altered in the future.

Background

My team has been working on a standalone iOS app for shooting, editing, and sharing 1-minute, top-down recipe videos called Cookpad Studio (here's a completed example video). At the time of this posting, our app is still in closed beta.

The shooting and editing parts are local to an iOS device.

f:id:christopher-trott:20190614105506p:plain — Video editor screens

But the sharing to the community part relies on a server backend to share data between users.

f:id:christopher-trott:20190614105609p:plain — Community screens using AWS AppSync

For the community part of the app, we decided to use AWS AppSync and various AWS backend technologies as an alternative to more established frameworks like Ruby on Rails.

Our AppSync setup is a bit different than the standard use case. AppSync is designed to be configured by app developers through the Amplify CLI. Since our team has dedicated backend engineers, we've opted to do most configuration and server development through the AWS component services directly (e.g. AWS Lambda, DynamoDB, etc.).

SDKs

AppSync on the iOS side is an amalgamation of a few different AWS SDKs. Luckily, all of them are open source and you can dive into their code when necessary. The three SDKs we're using so far are:

Authentication - The SDK that facilitates user authentication via Cognito.
Storage - The SDK that facilitates file uploads/downloads to/from S3.
API - The GraphQL client that facilitates fetching and mutating records in DynamoDB.

The first thing to understand about these SDKs is that they're all very different. They were written at different times by different teams with different technologies and have evolved with different goals in mind.

To give you an idea of what I mean by different, here are some specs for each SDK:

Authentication
- Objective-C & some Swift wrappers
- Uses AWSTask, a fork of Facebook's Bolts Framework, for async communication, alongside Cocoa conventions (e.g. delegates, closures, GCD).
Storage
- Objective-C
- Uses AWSTask alongside Cocoa conventions.
API
- Swift
- Uses a custom Promise implementation for async communication, alongside Cocoa conventions.
- Uses .graphqlconfig.yml for additional GraphQL configuration.

Authentication SDK

Singletons

I generally prefer to use initializer-based dependency injection over singletons. Singletons are often unavoidable, though, even when using only Apple's first-party SDKs.

I was pleased to find that code completion gave me a couple different initialization options for AWSMobileClient, the primary class for interfacing with the Cognito authentication APIs. The most complete of the initializers being:

- (instancetype)initWithRegionType:(AWSRegionType)regionType
                    identityPoolId:(NSString *)identityPoolId
                     unauthRoleArn:(nullable NSString *)unauthRoleArn
                       authRoleArn:(nullable NSString *)authRoleArn
           identityProviderManager:(nullable id<AWSIdentityProviderManager>)identityProviderManager;

I went down this path, discovering later that using this initializer leaves the AWSMobileClient instance in a very broken state.

AWSMobileClient is actually a Swift wrapper and subclass of the Objective-C _AWSMobileClient class. Inside you'll find some code that certainly stretches my understanding of subclassing rules across Swift and Objective-C:

public class AWSMobileClient: _AWSMobileClient {
    static var _sharedInstance: AWSMobileClient = AWSMobileClient(setDelegate: true)
    
    @objc override public class func sharedInstance() -> AWSMobileClient {
        return _sharedInstance
    }
        
    @objc public var isSignedIn: Bool {
        get {
            if (operateInLegacyMode) {
                return _AWSMobileClient.sharedInstance().isLoggedIn
            } else {
                return self.cachedLoginsMap.count > 0
            }
        }
    }
    
    // ...
}

Additionally, the initialize method that must be called by the client references itself and several other singletons:

_AWSMobileClient.sharedInstance()
DeviceOperations.sharedInstance
AWSInfo.default() - reads from awsconfiguration.json in the bundle.
AWSCognitoAuth.registerCognitoAuth(...)

Takeaway: For this SDK and the other AWS SDKs, you have to use the singletons.

Keychain credentials

The Authentication SDK uses the keychain APIs to store user credentials securely.

We changed server environments a few times during development. First, we had a prototype environment, then changed to a more long-term development environment, and finally to a production environment in parallel with the development environment. By environment, I mean the keys used to locate our app's resources (e.g. PoolId, Arn, ApiUrl, ApiKey, etc.).

A few of our team members had installed and run a release build of the app in the prototype environment at some point, thereby storing some Cognito tokens in their keychain. When we switched to the development environment, we started seeing deadlocks during our authentication bootstrapping process. The bootstrapping process happens on a cold launch and runs the required asynchronous AWSMobileClient initialization methods.

The debugging steps of deleting the app and reinstalling did not work because the keychain contents are retained by iOS across app installs for the same bundle ID.

Once we had determined that AWSMobileClient could not handle loading "bad" environment user credentials (user credentials created with different AWS configuration parameters), I had to create special builds for these devices that called AWSMobileClient.sharedInstance().signOut() immediately on launch.

We actually saw a similar deadlock in AWSMobileClient when running the app on the iOS simulator during development, which threw me off the trail a bit during debugging.

Takeaway: Be careful when changing environment configuration parameters.

Drop in Authentication UI

The Authentication SDK includes a drop-in UI. Because we wanted to ship our app to beta users as quickly as possible to start gathering feedback, I was particularly pleased that I wouldn't need to write a custom UI for authentication.

Unfortunately, we found a few dealbreakers that prevented us from using the drop-in UI.

First, the drop-in UI has no support for localization. Since our first market is Japan, we definitely needed the UI to support Japanese. The localization issue has appeared in other contexts as well, especially errors returned by the SDK. I would keep this point in mind if the product you're working on requires any other language besides English.

Second, I was planning on presenting the authentication view controller from our root view controller, an instance of UIViewController. I found that the entry point to the drop-in UI requires a UINavigationController:

+ (void)presentViewControllerWithNavigationController:(UINavigationController *)navigationController
                                        configuration:(nullable AWSAuthUIConfiguration *)configuration
                                    completionHandler:(AWSAuthUICompletionHandler)completionHandler;

This seemed like an odd requirement since the drop-in UI view controller seemed to be presented modally. Digging into the code, I came to the same conclusion as this GitHub Issue: the only API used is the UIViewController presentation API.

There's also this long-running GitHub Issue with feature requests for the drop-in UI.

Takeaway: Using the drop-in UI may not be feasible for your use case.

Is `initialize` an asynchronous task?

The signature of AWSMobileClient's required initialization method is:

public func initialize(_ completionHandler: @escaping (UserState?, Error?) -> Void)

From this signature, I would assume this function is asynchronous, and therefore anything that depends on the result of this call needs to wait until the completionHandler is called.

However, if we look at the implementation:

internal let initializationQueue = DispatchQueue(label: "awsmobileclient.credentials.fetch")

public func initialize(_ completionHandler: @escaping (UserState?, Error?) -> Void) {
    // Read awsconfiguration.json and set the credentials provider here
    initializationQueue.sync {
        // ... full implementation
    }
}

I wasn't sure what to expect when stepping through this code, but it looks like if initialize is called on the main thread, the implementation within the sync closure continues to be executed on the main thread. After the completion handler is called within initialize and that code runs, control flow returns to the end of initialize.

f:id:christopher-trott:20190614105742p:plain — Callstack during `AWSMobileClient.initialize`

Takeaway: You can probably assume that AWSMobileClient.sharedInstance().initialize(...) is synchronous. However, if you're paranoid about the implementation changing at some point, treat it in your calling code as asynchronous.
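To see why a completion-handler API built on `DispatchQueue.sync` behaves synchronously, here is a small stand-alone sketch. It does not use the AWS SDK; `initializeLike`, `EventLog`, and the queue label are hypothetical stand-ins that only mirror the shape of the implementation quoted above.

```swift
import Dispatch

// Records the order in which things happen.
final class EventLog {
    var events: [String] = []
}

let log = EventLog()
let initializationQueue = DispatchQueue(label: "sketch.initialization")

// Same shape as `initialize`: takes a completion handler, but does all of
// its work, including calling the handler, inside `sync`.
func initializeLike(_ completionHandler: @escaping (String) -> Void) {
    initializationQueue.sync {
        log.events.append("work inside sync")
        completionHandler("signedOut")
    }
    // `sync` blocks the caller until the closure above has finished,
    // so this line runs only after the completion handler has returned.
    log.events.append("end of initializeLike")
}

initializeLike { state in
    log.events.append("completion handler: \(state)")
}
log.events.append("after initializeLike returned")

// log.events is now, in order:
//   "work inside sync"
//   "completion handler: signedOut"
//   "end of initializeLike"
//   "after initializeLike returned"
```

The completion handler has already run by the time the call returns. Treating the call as asynchronous in your own code is still the safer habit, since that ordering is an implementation detail rather than a documented guarantee.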

Storage SDK

Initialization

Similar to our takeaway from the Authentication section above about singletons, I recommend being extra cautious about the setup of your AWSS3TransferUtility instance.

Internally, the AWSS3TransferUtility class maintains a static dictionary of instances and a default instance.

// AWSS3TransferUtility.m
static AWSSynchronizedMutableDictionary *_serviceClients = nil;
static AWSS3TransferUtility *_defaultS3TransferUtility = nil;

There are some directions in the API docs about how to register an instance with custom configuration options.

However, if you decide to use the default instance like I did, you need to set the service configuration in a different singleton before calling AWSS3TransferUtility.default() for the first time. (I only learned this by eventually finding my way to the implementation of AWSS3TransferUtility.default() after struggling for hours with various unauthorized errors at runtime when trying to perform uploads).

AWSServiceManager.default()!.defaultServiceConfiguration = AWSServiceConfiguration(region: .APNortheast1, credentialsProvider: AWSMobileClient.sharedInstance())
let transferUtility = AWSS3TransferUtility.default()

Takeaway: Register your own AWSS3TransferUtility. Or if you want to use the default, set an AWSServiceConfiguration in the AWSServiceManager singleton before calling AWSS3TransferUtility.default() for the first time.

AWSTask for upload & download operations

The Storage SDK uses AWSTask throughout. AWSTask is a fork of Facebook's Bolts Framework.

Tasks... make organization of complex asynchronous code more manageable.

The usage of the primary Storage SDK's APIs for uploading and downloading are shown in the API docs, but since I wanted to ensure all codepaths for errors were handled properly, I had to dig a little deeper to understand how these tasks work under the hood. I'll use multi-part uploading as an example, but this applies to all three scenarios (uploading, multi-part uploading, and downloading).

I've annotated the types so that you can see the identity of what's actually flowing around all these closures.

let expression = AWSS3TransferUtilityMultiPartUploadExpression()
expression.progressBlock = { (task: AWSS3TransferUtilityMultiPartUploadTask, progress: Progress) in
    DispatchQueue.main.async(execute: {
        // ...
    })
}

let completionHandler: AWSS3TransferUtilityMultiPartUploadCompletionHandlerBlock = { (task: AWSS3TransferUtilityMultiPartUploadTask, error: Error?) -> Void in
    DispatchQueue.main.async {
        // ...
    }
}

let taskQueuedHandler: (AWSTask<AWSS3TransferUtilityMultiPartUploadTask>) -> Any? = { (task: AWSTask<AWSS3TransferUtilityMultiPartUploadTask>) -> Any? in
    DispatchQueue.main.async {
        if let result = task.result {
            // An `AWSS3TransferUtilityMultiPartUploadTask` was queued successfully.
        } else if let error = task.error {
            // The `AWSS3TransferUtilityMultiPartUploadTask` was never created.       
        } else {
            // Not sure if this code path is even possible.        
        }
    }
    return nil
}

let task: AWSTask<AWSS3TransferUtilityMultiPartUploadTask> = transferUtility.uploadUsingMultiPart(fileURL: fileURL, bucket: bucketName, key: objectKey, contentType: contentType, expression: expression, completionHandler: completionHandler)
task.continueWith(block: taskQueuedHandler)

The overloaded use of the identifier Task in the types caused me some confusion at first. AWSS3TransferUtilityMultiPartUploadTask is not a subclass or in any way related to AWSTask as a concept.

Let's start at the bottom. The transferUtility.uploadUsingMultiPart(...) method takes some parameters, two closures, and returns an AWSTask<AWSS3TransferUtilityMultiPartUploadTask>: an AWSTask that will asynchronously return an AWSS3TransferUtilityMultiPartUploadTask? or an Error? to the block provided to continueWith.

The moment of understanding I had was realizing that just creating an AWSS3TransferUtilityMultiPartUploadTask is an asynchronous, fallible operation, with an error case that must be handled. That is why we've defined taskQueuedHandler above.

Keep in mind that taskQueuedHandler may be called on a background queue.

completionHandler will always get called if the if let result = task.result code path in taskQueuedHandler executes. completionHandler still has to handle both success and failure cases.

If, for example, you start a UIActivityIndicatorView as loading before calling uploadUsingMultiPart, but you don't handle the task.continueWith error, it's possible that the UIActivityIndicatorView will spin forever.

Takeaway: If you're expecting the result of an upload or download at some point in the future, you need to handle the error case in task.continueWith.

AWSTask for `get{*}Tasks`

Since AWSS3TransferUtility maintains its own database of tasks, even across app cold launches, you may need to retrieve these tasks. This use case is shown in the API docs.

let downloadTasks = transferUtility.getDownloadTasks().result
let uploadTasks = transferUtility.getUploadTasks().result
let multiPartUploadTasks = transferUtility.getMultiPartUploadTasks().result

Note that even though these getter functions return an AWSTask, they're not asynchronous and the result is available immediately. There's also no way for the returned AWSTask to contain an error.

Takeaway: Sometimes the AWS SDKs return AWSTasks for synchronous operations. Sometimes they return AWSTasks for operations that are not fallible. However, be careful relying on this behavior because the underlying implementation could always be changed in a future version without your knowledge.

API SDK

Because AWSAppSyncClient is built on top of ApolloClient, some of the below points are applicable to Apollo GraphQL as well.

Offline Mutations

One of the marketing points of AppSync is that mutations (i.e. POST, PUT, or DELETE in the REST world) can be triggered by a user while they're offline, and the mutations will be queued in local storage and relayed to the server when the user's device has connectivity again.

This is a feature set available in certain types of apps, including many of Apple's own stock apps like Reminders or Contacts.

However, this behavior does not always make sense for all types of mutations. Even when it does make sense, it often comes with an additional heavy burden of proper UX design. Handling errors. Handling conflicts. These are problems that even the most mature apps still struggle with.

In our app, we have a pretty straightforward createUser mutation (i.e. sign up). createUser is a particularly poor candidate for offline mutation support:

It has several server-side validation rules for form elements (e.g. unique username).
The app is logically partitioned to only allow registered users to access certain parts of the app.

Before learning that offline mutations were the default in AppSync and could not be turned off, I was struggling to understand why when simulating network errors, the completion block to my mutation was never getting called, even beyond the timeout duration.

When I realized this behavior was intentional, it took more time to figure out a workaround that didn't require the huge maintenance burden of subclassing or implementing manual timeout code throughout the app.

It turns out the workaround is as simple as using the underlying appSyncClient.apolloClient instance.

// Before
appSyncClient.perform(mutation: mutation, queue: .main, optimisticUpdate: nil, conflictResolutionBlock: nil) { (result, error) in
    // ...
}

// After
appSyncClient.apolloClient?.perform(mutation: mutation, queue: .main) { (result, error) in
    // ...
}

From my reading of the AWSAppSyncClient source, it's safe to force unwrap apolloClient at the moment. But certainly use caution in your particular use case.

With the above code, mutations attempted while offline will fail with an error after the default timeout (60 seconds) and call the completion block.

Takeaway: Use appSyncClient's underlying apolloClient directly to perform mutations that shouldn't be queued offline.

Errors

Overall, GraphQL is a welcome addition of structure compared to REST. However, I've found the error story to be a little disappointing.

When writing my first AppSync API request handler, I soon found the control flow for errors to be a little overwhelming. All layers of the stack have their own set of errors, and Swift's untyped errors don't help the situation.

Let's look at an example fetch request. I've set up and documented the completion handler.

appSyncClient.fetch(query: query) { (result: GraphQLResult<Query.Data>?, error: Error?) in
    
    // 1
    if let networkError = error as? AWSAppSyncClientError {
        // The first layer of error handling is a network stack error.
    
    // 2
    } else if let unknownError = error {
        // This case probably shouldn't happen, but I don't know the network stack
        // well enough to guarantee that.
    
    // 3
    } else if let data = result?.data {
        // This is sort of the happy path. We got the data we requested.
        // However, `result?.errors` may still contain errors!
        // It depends on your use case whether you want to ignore them if
        // `data` is non-null.
    
    // 4
    } else if let graphQLErrors = result?.errors, !graphQLErrors.isEmpty {
        // According to the GraphQL spec, graphQLErrors will be a non-empty list.
        // These errors are also more or less untyped.
    
    // 5
    } else {
        // Although logically we should have covered all the cases,
        // the compiler can't statically guarantee we have so we should throw
        // an `unknown` error from here.
    }
}

1. The network stack is provided by AWSAppSyncHTTPNetworkTransport and throws AWSAppSyncClientError. In the .requestFailed case, the Cocoa error can be extracted and the localizedDescription shown to the user. The other cases probably aren't that useful. Note that although AWSAppSyncClientError conforms to LocalizedError, the error messages are English only and usually add various codes that would probably be unideal to show users.
2. I haven't dug through the network stack enough to know whether there are other error types that can be thrown, but the presence of an error at this level of the stack probably means that result will be nil.
3. The GraphQL spec says that result can contain both data and errors. It's up to you to determine whether you need to handle this case, and if so, how to handle it. For many use cases though, getting data means success.
4. The GraphQL spec defines an error as a map with a message that's intended for developers, and optionally locations and path fields. As of the June 2018 spec, user fields should be contained within the extensions field. However, the AppSync spec was based on the October 2016 GraphQL spec, and therefore defines an errorType field in the root of the error map. errorType is a String type which makes it more readable to developers, but also more error prone.
5. All those nullable fields have left us with an else case.

I really wish errors were typed in GraphQL (and Swift too!).

Takeaway: Handling the results of a fetch or perform requires some knowledge about the various layers of the network stack. Make sure you've considered the possible errors at each layer and how you can help your user recover from them.
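One way to tame the five branches above is to funnel them into a single typed value before the rest of the app sees them. The following is a minimal sketch, not part of the AppSync SDK: `GraphQLErrorInfo`, `FetchOutcome`, and `classify` are hypothetical names, and the stand-in types only mirror the shape of the data, errors, and error values a completion handler receives.

```swift
import Foundation

// Hypothetical stand-in for one entry of a GraphQL `errors` list.
struct GraphQLErrorInfo {
    let message: String
    let errorType: String?
}

// Hypothetical typed outcome that flattens the five branches of the handler.
enum FetchOutcome<Payload> {
    case success(Payload, partialErrors: [GraphQLErrorInfo])
    case transportFailure(Error)
    case graphQLFailure([GraphQLErrorInfo])
    case unknown
}

// Classifies the (data, errors, error) triple in the same order as the handler.
func classify<Payload>(data: Payload?,
                       graphQLErrors: [GraphQLErrorInfo]?,
                       error: Error?) -> FetchOutcome<Payload> {
    // Branches 1 and 2: anything reported by the network stack.
    if let error = error {
        return .transportFailure(error)
    }
    // Branch 3: data arrived, possibly alongside partial errors.
    if let data = data {
        return .success(data, partialErrors: graphQLErrors ?? [])
    }
    // Branch 4: no data, but the server reported GraphQL errors.
    if let graphQLErrors = graphQLErrors, !graphQLErrors.isEmpty {
        return .graphQLFailure(graphQLErrors)
    }
    // Branch 5: nothing usable came back.
    return .unknown
}
```

With something like this in place, call sites can switch over one enum instead of repeating the if/else chain, and the decision about whether partial errors count as success lives in one spot.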

Equatable structs

The codegen utility included in AWS Amplify and part of Apollo's tooling does not support generating structs that conform to Equatable. Generated enums do conform to Equatable.

The way structs are laid out, all the struct's data is stored in a dictionary [String: Any?] (typealiased as Snapshot). Its typed properties are decoded from or encoded into that dictionary on the fly in a property's getter and setter, respectively.

Equatable could probably be generated the old fashioned way by comparing all properties. I'm unsure of whether this could introduce performance problems for deeply nested structs due to the lazy (and non-cached) decoding.

This was discussed in a (now closed) GitHub issue.

Takeaway: Code generated enums conform to Equatable. Code generated structs do not conform to Equatable. If you need Equatable structs, you'll have to write the == function yourself manually, generate it with a tool like Sourcery, or create wrapper structs.
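As a rough illustration of the hand-written option, here is a sketch using a stand-in struct that imitates the snapshot-backed layout described above. `MovieFields` and its `decode` helper are hypothetical and are not what the codegen actually emits; the point is only that == has to compare the typed properties, because a `[String: Any?]` dictionary cannot be compared directly.

```swift
// Hypothetical stand-in that mimics the generated layout: all data lives in a
// snapshot dictionary and typed properties are decoded on every access.
struct MovieFields {
    var snapshot: [String: Any?]

    init(id: Int, title: String, rating: Double?) {
        snapshot = ["id": id, "title": title, "rating": rating]
    }

    // Flattens the double optional from the dictionary lookup, then casts.
    private func decode<T>(_ key: String, as type: T.Type) -> T? {
        return (snapshot[key] ?? nil) as? T
    }

    var id: Int { return decode("id", as: Int.self)! }
    var title: String { return decode("title", as: String.self)! }
    var rating: Double? { return decode("rating", as: Double.self) }
}

// The "old fashioned" Equatable: compare every typed property.
// Each comparison decodes from the snapshot again, which is the possible
// performance concern for deeply nested structs.
extension MovieFields: Equatable {
    static func == (lhs: MovieFields, rhs: MovieFields) -> Bool {
        return lhs.id == rhs.id
            && lhs.title == rhs.title
            && lhs.rating == rhs.rating
    }
}
```

A wrapper struct would follow the same idea, just with the comparison living on the wrapper rather than in an extension of the generated type.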

Query watching

AWSAppSyncClient has a useful watch feature that allows you to receive updates to any resources fetched by the query you're watching throughout the lifetime of the watch. Experimenting with this feature, I've found a few conceptual points to keep in mind.

watch works by first adding a subscription to any changes to the store. Next, it makes a normal fetch with the same configurable cache policy options available to fetch. The results of this initial fetch are used to create a list of dependentKeys. When the cache notifies the GraphQLQueryWatcher that its contents have changed, the GraphQLQueryWatcher checks if any of the changed keys are contained in its dependentKeys, and if so, it fetches the query again (with cache policy .returnCacheDataElseFetch) then calls the closure registered in watch with the result.
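To make the role of dependentKeys concrete, here is a deliberately simplified toy model. `ToyStore` and `ToyWatcher` are hypothetical and are not the AppSync or Apollo classes; the toy fetch only reads from the in-memory store (there is no server), so it illustrates the key-matching logic rather than the real networking or cache policies.

```swift
// Toy model only: hypothetical types, not the AppSync/Apollo implementation.
final class ToyStore {
    private var records: [String: String] = [:]
    private var subscribers: [(Set<String>) -> Void] = []

    func subscribe(_ onChange: @escaping (Set<String>) -> Void) {
        subscribers.append(onChange)
    }

    // Returns nil on a cache miss for any requested key.
    func read(_ keys: [String]) -> [String: String]? {
        var found: [String: String] = [:]
        for key in keys {
            guard let value = records[key] else { return nil }
            found[key] = value
        }
        return found
    }

    // Writes records, then tells every subscriber which keys changed.
    func publish(_ changes: [String: String]) {
        records.merge(changes) { _, new in new }
        let changedKeys = Set(changes.keys)
        subscribers.forEach { $0(changedKeys) }
    }
}

final class ToyWatcher {
    private let store: ToyStore
    private let queryKeys: [String]
    private let handler: ([String: String]) -> Void
    private(set) var dependentKeys: Set<String> = []

    init(store: ToyStore, queryKeys: [String],
         handler: @escaping ([String: String]) -> Void) {
        self.store = store
        self.queryKeys = queryKeys
        self.handler = handler
        // Step 1: subscribe to store changes. Step 2: initial fetch.
        store.subscribe { [weak self] changed in self?.storeDidChange(changed) }
        fetch()
    }

    func refetch() { fetch() }

    private func fetch() {
        // On a miss, dependentKeys stays empty and the watcher is inert.
        guard let result = store.read(queryKeys) else { return }
        dependentKeys = Set(result.keys)
        handler(result)
    }

    private func storeDidChange(_ changed: Set<String>) {
        // Only react when a changed key is one this query depends on.
        if !dependentKeys.isDisjoint(with: changed) { fetch() }
    }
}
```

In this model, a watcher whose first fetch missed never reacts to later store changes until refetch() is called, because its set of dependent keys is still empty.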

Set up the cache key identifier on your store

As stated in the docs, you have to tell apolloClient how you uniquely identify your resources:

// Use something other than "id" if your GraphQL type is different
appSyncClient?.apolloClient?.cacheKeyForObject = { $0["id"] }

In their example, it says that a Post with id = 1 would be cached as Post:1. However, in my testing, only the id itself is used (i.e. 1). Currently, we have ids that are unique across our resources, but if you don't, you may need to investigate this more to ensure you don't have key collisions in the cache.
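If your ids are not unique across types, one option is to build the key from the type name and the id yourself. The sketch below assumes that each object handed to the closure includes a `__typename` field and an `id` field; `namespacedCacheKey` is a hypothetical helper name, and the wiring in the trailing comment is an untested assumption that mirrors the one-liner above.

```swift
// Builds a cache key namespaced by GraphQL type, e.g. "Post:1".
// Returns nil when either field is missing, so no explicit key is supplied.
func namespacedCacheKey(for object: [String: Any]) -> String? {
    guard let typename = object["__typename"] as? String,
          let id = object["id"] else {
        return nil
    }
    return "\(typename):\(id)"
}

// Assumed wiring (not verified against a specific SDK version):
// appSyncClient?.apolloClient?.cacheKeyForObject = { namespacedCacheKey(for: $0) }
```

Whether `__typename` is actually present depends on how your queries are generated, so it is worth logging a few keys during development before relying on this.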

A fetch must succeed before watching will work

Since dependentKeys are derived from the results of the first fetch (and are regenerated on subsequent fetches), this fetch has to be successful in order for the watch to respond to changes produced by other queries.


If you use watch, you have to allow your user to retry in case the initial fetch fails. Call GraphQLQueryWatcher.refetch(). Even if the same query is fetched from a different part of your app, this query must succeed at least once in order to receive changes.

Use a pessimistic cache policy

You essentially cannot (safely) use the .returnCacheDataDontFetch cache policy with watch.

Granted, it's a rare case to want to do so. But if you thought that the partial results from a different query in your app could be picked up by a watch query, this won't work. It has to be the exact same query and it has to have been fetched before with the exact same parameters from the server.

If you used .returnCacheDataDontFetch as the cache policy and the fetch resulted in a cache miss, you would have to call refetch() anyway to make a fetch to the server.

It's not straightforward to use watch with paging queries

It's common in GraphQL to use a Connection type to implement indefinite paging.

Let's look at the following GraphQL schema:

type MovieConnection {
  movies: [Movie!]! # contains a maximum of 10 items
  nextToken: String
}

type Query {
  getLatestMovies(nextToken: String): MovieConnection!
  getMovie(id: Int!): Movie!
}

For example, if you set up a watch for the first call to getLatestMovies(nextToken: nil), this watch will only respond to changes to the 10 Movie resources returned by the query. If you make a normal fetch request for the next page using nextToken, the watch you have set up will not observe changes in the Movie resources returned in the second request.

If you wanted to respond to changes to any Movie returned on any page, you'd have to do a watch for each page and add the GraphQLQueryWatcher to a collection. The logic in your result handlers would depend heavily on how you structured your data source since the result could be an add or an update.
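The bookkeeping for one watcher per page could look something like the sketch below. It is only an outline under assumptions: `CancellableWatch` is a hypothetical protocol standing in for whatever handle watch returns (assumed to expose a cancel() method), and `PagedWatchRegistry` is an invented name. How the result handlers merge pages into your data source is left out.

```swift
// Hypothetical protocol standing in for the handle returned by `watch`.
protocol CancellableWatch: AnyObject {
    func cancel()
}

// Keeps one watcher per page, keyed by the nextToken used to request it.
final class PagedWatchRegistry {
    private var watchers: [String: CancellableWatch] = [:]

    var count: Int { return watchers.count }

    // `pageKey` is the nextToken for the page; "" is used here for the first page.
    func register(_ watcher: CancellableWatch, forPage pageKey: String) {
        // Cancel a stale watcher if the same page is requested again.
        watchers[pageKey]?.cancel()
        watchers[pageKey] = watcher
    }

    // Call when the list is reloaded from the top or the screen goes away.
    func cancelAll() {
        watchers.values.forEach { $0.cancel() }
        watchers.removeAll()
    }
}
```

Keying by token keeps a repeated request for the same page from leaving two live watchers behind, which would otherwise deliver every change twice.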

It's not possible to watch resources outside a query

It's probably obvious from the interface to watch since the first parameter is of type GraphQLQuery, but you cannot watch for changes to an arbitrary key in the cache. For example, if there was a resource in your database keyed by id 12345, you can't simply register a watcher with a dependent key for this id.

Any connection between resources and queries must be resolved by the server

If you have two different queries that you know reference the same object, that relationship must be codified by the server.

Continuing with the getLatestMovies example in the previous section, imagine we received a list of 10 Movies and wanted to watch for granular changes in the Movie with id = 12345.

To accomplish this you might think you could simply call:

let watcher = appSyncClient.watch(query: GetMovieQuery(id: 12345), cachePolicy: .returnCacheDataDontFetch, queue: .main, resultHandler: { (result, error) in ... })

But this would not work! It would result in a cache miss and the watch would be inert until refetch() was called.

Although the Movie returned by GetMovieQuery(id: 12345) is already in the cache, the association between the query itself and the Movie resource can't be resolved by AppSync/Apollo until the server returns the result for the query and this result is cached too.

Conclusion

In this post, I outlined some development points to watch out for in the Authentication, Storage, and API SDKs of AWS AppSync. I hope the takeaways from this post are valuable for current and future users of AWS AppSync.