2018-06-18

Header Bidding 導入によるネットワーク広告改善の開発事情

こんにちは。メディアプロダクト開発部の我妻謙樹（id:kenju）です。サーバーサイドエンジニアとして、広告配信システムの開発・運用を担当しています。

cookpad における広告開発

2015年11月に、"クックパッドの広告エンジニアは何をやっているのか" というタイトルで、広告開発部の開発内容について紹介する記事が公開されていますが、それから 2 年余り経過し、広告配信システム周りの状況も大きく変化しました。はじめに、現在の cookpad における広告開発の概要について、軽くご紹介します。

まず、私が所属しているメディアプロダクト開発部では、広告配信システムに加え、動画配信サービスの開発も担当しています。過去には同じチームから、動画配信周りの技術について以下のような投稿もありますので、そちらもご覧ください。

広告配信システムの開発で担当しているサービスの一覧は、以下の通りです：

広告配信サーバーの開発 (Rails)
社内向け広告入稿システムの開発 (Rails)
広告ログ基盤の運用（Python, Kinesis Streams, DynamoDB, Lambda）
広告配信用 SDK の開発（各プラットフォームに準拠。WEB 向けは JavaScript）

プロジェクト別で言えば、以下がスコープです：

既存自社広告商品の改善
新広告商品の新規開発
ネットワーク広告商品の開発や改善

例えば、昨年末から今年はじめにかけて、とある新広告商品の開発に携わっていたのですが、その時のプロジェクトの一部についてスライドが公開されているので、そちらもご覧ください。

"広告配信サーバーと広告配信比率最適化問題"

今回は、その中でも「ネットワーク広告商品の開発や改善」における一プロジェクトについてご紹介します。

背景

cookpad では、以下の 2 種類の広告を配信しています。

自社広告 ¹
ネットワーク広告 ²

この内、ネットワーク広告においては、 Supply Side Platform （以下、SSP）各社と連携して複数のアドサーバー経由で広告を配信しているのですが、それらの広告は頭打ちになってきています。そのため、広告の収益改善に取り組む必要がありました。

安易に広告枠を増やすことはユーザー体験の低下やネットワーク負荷増加に繋がるため、避けなければなりません。したがって、現在配信されている広告の買付け額や配信フローを改善させる必要があります。

そこで、近年日本でも導入が進んできている Header Bidding と呼ばれる仕組みを導入することになりました。

About Header Bidding

Header Bidding とは、アドサーバーに広告のリクエストをする前に、SSP 各社に広告枠の最適な額を入札します。

仕組み的には、<head> タグ内（= Header）で事前に入札リクエスト（= Bidding）を行うことから、"Header Bidding" と呼ばれています。

Without Header Bidding

例えば、Header Bidding を経由しない、従来のパターンでの広告枠買付け方式を見てみましょう。ここで図の用語の定義は以下の通りとします：

Client ... 広告を表示させる側ここでは cookpad の本体サイト
Ad Server ... アドサーバー
SSP ... SSP各社
Floor ... フロアプライス ³
Bid ... 入札結果 ⁴
Winning bid ... 買付けに成功した入札価格

f:id:itiskj:20180613203016p:plain

既存のアドサーバーの入札ロジックは、基本的にウォーターフォール方式（⁵）で買付けが行われます。したがって、上記図のケースの場合、

広告枠に対してフロアプライスが $1.0
SSP α の入札結果が $0.8（フロアプライス以下）
SSP β の入札結果が $1.2（フロアプライス以上）

上から順番に問い合わせていった結果、最初にフロアプライス以上の価格で入札してきた「SSP β」の広告が表示されることになります。

ここで「SSP δ」の入札結果が、$2.0 であることに着目してください。もし、「SSP δ」入札結果を反映できていたら、その広告枠の価値は $2.0 になります。つまり、広告枠本来の価値は $2.0 といえます。しかし、ウォーターフォール形式の仕組み上の制約によって、$1.2 の入札結果が反映されてしまいました（差し引き $0.8 の機会損失ですね）。

これを解決するのが、Header Bidding です。

With Header Bidding

次に、入札サーバを挟む場合で見ていきましょう。なお、この場合は後述する Server-to-Server 方式で説明していきます。

f:id:itiskj:20180613203037p:plain

この場合、

Client -> Bid Server ...Header Bidding を入札サーバにリクエスト
Bid Server -> SSP ... 入札。このとき、各社への入札は 同じタイミングで入札 される
Bid Server -> Client ... 入札結果を返す。ここでは、「SSP δ」が $2.0 で買付けをすレスポンスを返してきたので、Winning bid が $2.0 になる
Client -> Ad Server ... 入札結果が返ってきてから広告をリクエストする。このとき、Winnig bid を伝えることで、「SSP δ」広告が返却されることになる ⁶

という順番で処理が実行されます。Header Bidding を利用しないケースと違って、一番買付け額が高い SSP の入札結果が反映されたことがわかります。

ポイントは、

アドサーバーにリクエストする前に入札を行うこと
SSP 各社へのリクエストが並行に行われること

です。これによって、本来失われてしまっていた入札価格を最適化することができます。

Client vs S2S Header Bidding

Header Bidding には

Client Header Bidding
Server-to-Server Header Bidding

の2種類の方式が存在します。

Client Header Biddingは、クライアント側で入札を行う形式です。技術的には、<script> タグ内で、入札先の SSP 一覧を指定して、それぞれに入札リクエストを行います。それらの入札リクエストの結果を待って、一番 eCPM の優れた入札結果を選択する形式です。

Server-to-Server Header Bidding は、サーバー側で入札を行う形式です。Client Header Bidding との違いは、入札サーバーに 1 回だけリクエストを送信すれば良い点です。また、入札ロジック（例：SSP 各社からの入札結果の待ち合わせ、タイムアウト処理、入札結果の比較）を入札サーバーが担ってくれるので、クライアント側の責務が大幅に削減されることです。

現在は、Server-to-Server 方式が主流です。

Header Bidding Services

なお、Transparent Ad Marketplace（以下、TAM）という、Amazon が提供する Header Bidding の広告サービスを採用しています。

設計・実装

基本的には、TAMの提供するドキュメントに沿って、<head> タグ入札サーバーにリクエストするスクリプトを埋め込めば、導入は完了します。

しかし、弊社の場合

独自の広告入稿・配信サーバーを介して自社広告・ネットワーク広告すべての広告を配信している
しかも、ページごとに配信される広告スロット（⁷）は静的ではなく動的に変化する

といった制約のため、スムーズな導入ができず改修が必要でした。

以下の図が、Header Bidding を行うまでの大幅な流れです。ここで、

ads ... 社内広告配信サーバ
display.js ... 広告表示用の JavaScript SDK
cookpad_ads-ruby ... display.js を埋め込むための Rails 用ヘルパーを定義した簡易な gem
apstag ... TAM の提供する Header Bidding 用ライブラリ
googletag ... DFP の提供するアドネットワーク用ライブラリ

だとします。なお、googletag の公式ドキュメントは、https://developers.google.com/doubleclick-gpt/reference からご覧になれます。

f:id:itiskj:20180613203104p:plain

Header Bidding を実行するまでの大まかな流れは、以下の通りです：

JavaScript SDK が、広告配信サーバーから表示すべき広告リクエストする
広告にネットワーク広告が含まれている場合、Header Biddingの一連の処理を開始する
まずは、apstag, googletag それぞれの初期化を行う（例：デフォルトのタイムアウト設定）
apstag を用いて TAM に Header Bidding リクエストを送る
入札結果をもとに、DFP に広告リクエストを送る
DFP から広告リクエストが返却されたら、広告を表示する

ポイントは、Header Bidding をリクエストしている間、

googletag.pubads().disableInitialLoad() で DFP リクエストを中断し、Header Bidding を行う
入札結果が返ってきたら、googletag.pubads().refresh([opt_slots...]) で広告のレンダリングフローを再開する

という点です。

結果

以上を持って、Header Bidding を導入するまでの一連の流れを説明してきました。具体的な数字はここでは伏せますが、今回の導入によってネットワーク広告の収益改善を実現することができました。

新たな課題

広告レンダリングフローのパフォーマンス悪化

しかし、ここで新たな課題も発生してしまいました。

それは、Header Bidding リクエストの分、ネットワーク広告が表示されるまでのレイテンシが増加してしまった、という点です。

広告が表示されるまでの一連のレンダリングプロセスを、以下の図に示しました。

（Processing, DOMContentLoaded, load は、ブラウザが HTML/CSS をパースしてレンダリングするまで一連のフローの一般的用語です。気になる方は、Ilya Grigorik(⁸) による"Measuring the Critical Rendering Path" をご覧ください）

f:id:itiskj:20180613203121p:plain

ぱっと見て気づくのは、社内広告配信サーバの ads へのリクエストから、Header Bidding 、そして DFPへのリクエストまですべてがシリアルに実行されていることです。今回 Header Bidding を導入したことによって、その分レイテンシが増加したのです、大体 150 ~ 400 (ms) と、かなり致命的なパフォーマンス低下になってしまいました。

広告レンダリングフローの可視化がされていない

筆者の肌感で「150 ~ 400ms」と説明しましたが、実はクライアント環境で実行されるまでの広告レンダリングフローは、今まで計測・可視化されていませんでした。

上記で挙げた広告レンダリングフローのパフォーマンスを改善したいものの、ボトルネックが正確にはどこになるのか、計測するまでわかりません。計測できないものは改善できないと言われるように、まずは計測・可視化のフローを導入しました。

ここで幸いなことに、Fluentd にログを流し、分析可能なデータウェアハウス（cookpad の場合は、現状 Redshift）にテーブルを構築するまでの仕組みはすでに存在していました。したがってクライアント側と多少のテーブル定義を書くだけで実現できました。

（補足：cookpad におけるデータ活用基盤については、"クックパッドのデータ活用基盤" をご覧ください。）

以下は、取得したログデータをもとに可視化してみた様子です。ログを先日から取得し始めたばかりなので、可視化のフローはまだ未着手です。社内で推奨されている BI ツールにダッシュボードを作り、定点観測できるところが直近のゴールです。

f:id:itiskj:20180613203133p:plain

広告レンダリングのクリティカルパスは任意のタイミングでロガーを仕込むだけですが、DFP の場合、googletag.events.SlotRenderEndedEventを利用すると、広告枠が表示されたタイミング、広告が "Viewable"（⁹）になったタイミングでイベントを取得できます。

対策と今後の展望

以上が、Header Bidding の導入から、新たに浮上した課題への対策の説明でした。直近ですと、以下に取り組んでいく予定です。

広告レンダリングフローの可視化フェーズ
広告レンダリングフローの最適化

広告レンダリングフローの最適化

「広告レンダリングフローの最適化」では、自社の広告配信サーバーへのリクエストのタイミングを、今より前倒しにする方針で設計及び PoC の実装を行っている段階です。

具体的に言うと、現在 HTML ファイルの <body> 下部で広告レンダリングフローを開始しているのですが、それを <head> タグの可能な限り早い段階で開始するように改善をする必要があります（過去の設計の都合上、広告配信サーバーへのリクエストは、各広告スロットの HTMLElement 要素がレンダーツリー¹⁰ に挿入され、実際に描画されるタイミングにブロックされている）。

f:id:itiskj:20180614112206p:plain

パフォーマンスの可視化及びレンダリングフローの最適化についても、また別の機会にご紹介したいと思います。

まとめ

アドテク関連のエンジニア目線での事例紹介や技術詳解はあまり事例が少ないため、この場で紹介させていただきました。技術的にチャレンジングな課題も多く、非常に面白い領域です。ぜひ、興味を持っていただけたら、Twitter などからご連絡ください。

また、メディアプロダクト開発部では、一緒に働いてくれるメンバーを募集しています。少しでも興味を持っていただけたら、以下をご覧ください。

自社広告 … 自社独自の営業チームが、直接広告主と契約を結び配信している広告。自社で配信されるクリエイティブを運用できるため、意図しない広告が配信されることがない。↩
ネットワーク広告 … 他社の広告配信会社が提供している広告配信サーバーを経由して、広告の買付け・配信を行う広告。各社が提供する <script> タグを HTML に埋め込み、返却された広告を <iframe> にレンダリングする形が一般的。↩
フロアプライス … 最低落札価格のこと。例えば、「フロアプライス $0.8」広告をリクエストしたとき、$0.8 以下の広告枠の買付けは行わない。↩
入札結果 …SSP 各社が、広告枠をいくらで買い付けるかを示す価格。これがフロアプライスより低い場合、広告枠に広告が表示されることはない。↩
ウォーターフォール方式 … SSP 各社定義した順番で一つ一入札していく方式。「滝」語義が表すとおり、上から順に問い合わせ得ていく様子からこの名前で呼ばれる。↩
アドサーバー側にどのように入札結果を伝えるかは、各アドサーバー側の仕様や実装に依存する。例えば TAM が DFP に対して Header Bidding を行う場合、Key/Value Targeting の仕組みを使っている。↩
広告スロット … 広告枠が表示される枠のこと。↩
Ilya Grigorik … Google のエンジニアで、“High Performance Browser Networking” の著者、と言えばわかる方も多いかもしれません。完全に蛇足ですが、尊敬しているエンジニアの1人で、彼の著作をきっかけ Web の裏側に興味を持ちました。↩
Viewable Impression … 広告が「ユーザーに見える状態」になったかどうかでインプレッションを測定している。詳細については、“Viewabiliity and Action View”を参考のこと。↩
Render Tree … https://developers.google.com/web/fundamentals/performance/critical-rendering-path/render-tree-construction ↩

2018-06-11

Service Mesh and Cookpad

This article is a translation of the original article which was published at the beginning of May. To make up for the backgroud of this article, Cookpad is mid-size technology company having 200+ product developes, 10+ teams, 90 million monthly average users. https://www.cookpadteam.com/

Hello, this is Taiki from developer productivity team. For this time, I would like to introduce about the knowledge obtained by building and using a service mesh at Cookpad.

For the service mesh itself, I think that you will have full experience with the following articles, announcements and tutorials:

Our goals

We introduced a service mesh mainly to solve operational problems such as troubleshooting, capacity planning, and keeping system reliability. In particular:

Reduction of management cost of services
Improvement of Observability*1 *2
Building a better fault isolation mechanism

As for the first one, there was a problem that it became difficult to grasp as to which service and which service was communicating, where the failure of a certain service propagated, as the scale expanded. I think that this problem should be solved by centrally managing information on where and where they are connected.

For the second one, we further digged the first one, which was a problem that we do not know the status of communication between one service and another service easily. For example, RPS, response time, number of success / failure status, timeout, status of circuit breaker, etc. In the case where two or more services refer to a certain backend service, resolution of metrics from the proxy or load balancer of the backend service was insufficient because they were not tagged by request origin services.

For the third one, it was an issue that "fault isolation configuration has not been successfully set". At that time, using the library in each application, setting of timeout, retry, circuit breaker were done. But to know what kind of setting, it is necessary to see application code separately. There is no listing and situation grasp and it was difficult to improve those settings continuously. Also, because the settings related to Fault Isolation should be improved continuously, it was better to be testable, and we wanted such a platform.

In order to solve more advanced problems, we also construct functions such as gRPC infrastructure construction, delegation of processing around distribution tracing, diversification of deployment method by traffic control, authentication authorization gateway, etc. in scope. This area will be discussed later.

Current status

The service mesh in the Cookpad uses Envoy as the data-plane and created our own control-plane. Although we initially considered installing Istio which is already implemented as a service mesh, nearly all applications in the Cookpad are operating on a container management service called AWS ECS, so the merit of cooperation with Kubernetes is limited. In consideration of what we wanted to realize and the complexity of Istio's software itself, we chose the path of our own control-plane which can be started small.

The control-plane part of the service mesh implemented this time consists of several components. I will explain the roles and action flow of each component:

A repository that centrally manages the configuration of the service mesh.
Using the gem named kumonos, the Envoy xDS API response JSON is generated
Place the generated response JSON on Amazon S3 and use it as an xDS API from Envoy

The reason why the setting is managed in the central repository is that,

we'd like to keep track of change history with reason and keep track of it later
we would like to be able to review changes in settings across organizations such as SRE team

Regarding load balancing, initally, I designed it by Internal ELB, but the infrastructure for gRPC application went also in the the requirement *3, we've prepared client-side load balancing by using SDS (Service Discovery Service) API *4. We are deploying a side-car container in the ECS task that performs health check for app container and registers connection destination information in SDS API.

f:id:aladhi:20180501141121p:plain

The configuration around the metrics is as follows:

Store all metrics to Prometheus
Send tagged metrics to statsd_exporter running on the ECS container host instance using dog_statsd sink*5
All metrics include application id via fixed-string tags to identify each node*6
Prometheus pulls metris using EC2 SD
To manage port for Prometheus, we use exporter_proxy between statsd_exporter and Prometheus
Vizualize metrics with Grafana and Vizceral

In case the application process runs directly on the EC2 instance without using ECS or Docker, the Envoy process is running as a daemon directly in the instance, but the architecture is almost the same. There is a reason for not setting pull directly from Prometheus to Envoy, because we still can not extract histogram metrics from Envoy's Prometheus compatible endpoint*7. As this will be improved in the future, we plan to eliminate stasd_exporter at that time.

f:id:aladhi:20180502132413p:plain

On Grafana, dashboards and Envoy's entire dashboard are prepared for each service, such as upstream RPS and timeout occurrence. We will also prepare a dashboard of the service x service dimension.

Per service dashboard:

f:id:aladhi:20180501175232p:plain

For example, circuit breaker related metrics when the upstream is down:

f:id:aladhi:20180502144146p:plain

Dashboard for envoys:

f:id:aladhi:20180501175222p:plain

The service configuration is visualized using Vizceral developed by Netflix. For implementation, we developed fork of promviz and promviz-front *8. As we are introducing it only for some services yet, the number of nodes currently displayed is small, but we provide the following dashboards.

Service configuration diagram for each region, RPS, error rate:

f:id:aladhi:20180501175213p:plain

Downstream / upstream of a specific service:

f:id:aladhi:20180501175217p:plain

As a subsystem of the service mesh, we deploy a gateway for accessing the gRPC server application in the staging environment from the developer machine in our offices*9. It is constructed by combining SDS API and Envoy with software that manages internal application called hako-console.

Gateway app (Envoy) sends xDS API request to gateway controller
The Gateway controller obtains the list of gRPC applications in the staging environment from hako-console and returns the Route Discovery Service / Cluster Discovery Service API response based on it
The Gateway app gets the actual connection destination from the SDS API based on the response
From the hand of the developer, the AWS ELB Network Load Balancer is referred to and the gateway app performs routing

f:id:aladhi:20180502132905p:plain

Results

The most remarkable in the introduction of service mesh was that it was able to suppress the influence of temporary disability. There are multiple cooperation parts between services with many traffic, and up to now, 200+ network-related trivial errors*10 have been constantly occurring in an hour*11, it decreased to about whether it could come out in one week or not with the proper retry setting by the service mesh.

Various metrics have come to be seen from the viewpoint of monitoring, but since we are introducing it only for some services and we have not reached full-scale utilization due to the introduction day, we expect to use it in the future. In terms of management, it became very easy to understand our system when the connection between services became visible, so we would like to prevent overlooking and missing consideration by introducing it to all services.

Future plan

Migrate to v2 API, transition to Istio

The xDS API has been using v1 because of its initial design situation and the requirement to use S3 as a delivery back end, but since the v1 API is deprecated, we plan to move this to v2. At the same time we are considering moving control-plane to Istio. Also, if we are going to make our own control-plane, we plane to build LDS/RDS/CDS/EDS API*12 using go-control-plane.

Replacing Reverse proxy

Up to now, Cookpad uses NGINX as reverse proxy, but considering replacing reverse proxy and edge proxy from NGINX to Envoy considering the difference in knowledge of internal implementation, gRPC correspondence, and acquisition metrics.

Traffic Control

As we move to client-side load balancing and replace reverse proxy, we will be able to freely change traffic by operating Envoy, so we will be able to realize canary deployment, traffic shifting and request shadowing.

Fault injection

It is a mechanism that deliberately injects delays and failures in a properly managed environment and tests whether the actual service group works properly. Envoy has various functions *13.

Perform distributed tracing on the data-plane layer

In Cookpad, AWS X-Ray is used as a distributed tracing system*14. Currently we implement the distributed tracing function as a library, but we are planning to move this to data-plane and realize it at the service mesh layer.

Authentication Authorization Gateway

This is to authenticate and authorize processing only at the front-most server receiving user's request, and the subsequent servers will use the results around. Previously, it was incompletely implemented as a library, but by shifting to data-plane, we can recieve the advantages of out of process model.

Wrapping up

We have introduced the current state and future plan of service mesh in Cookpad. Many functions can be easily realized already, and as more things can be done by the layer of service mesh in the future, it is highly recommended for every microservices system.

*1:https://blog.twitter.com/engineering/en_us/a/2013/observability-at-twitter.html

*2:https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

*3:Our gRPC applications already use this mechanism in a production environment

*4:Server-side load balancing which simply use Internal ELB (NLB or TCP mode CLB) has disadvantages in terms of performance due to unbalanced balancing and also it is not enough in terms of metrics that can be obtained

*5:https://www.envoyproxy.io/docs/envoy/v1.6.0/api-v2/config/metrics/v2/stats.proto#config-metrics-v2-dogstatsdsink . At first I implemented it as our-own extension, but later I sent a patch: https://github.com/envoyproxy/envoy/pull/2158

*6:This is another our work: https://github.com/envoyproxy/envoy/pull/2357

*7:https://github.com/envoyproxy/envoy/issues /1947

*8:For the convenience of delivering with NGINX and conforming to the service composition in the Cookpad

*9:Assuming access using client-side load balancing, we need a component to solve it.

*10:It's very small number comparing to the traffic.

*11:Retry is set up in some partes though.

*12:https://github.com/envoyproxy/data-plane-api/blob/5ea10b04a950260e1af0572aa244846b6599a38f/API_OVERVIEW.md#apis

*13:https://www.envoyproxy.io/docs/envoy/v1.6.0/configuration/http_filters/fault_filter.html

*14:http://techlife.cookpad.com/entry/2017/09/06/115710

2018-06-08

大きな Rails アプリケーションをなんとかしよう。まずは計測と可視化からはじめよう。

こんにちは、技術部開発基盤グループの id:hogelog です。

RubyKaigi 2018 楽しかったですね。僕はおそらく RubyKaigi 2010 以来の久しぶりの参加でした。ああいう場の楽しさを思い出し、また今回はスポンサーブースから RubyKaigi に参加するという学生の頃は知らなかった楽しみも新たに知り、RubyKaigi を満喫させていただきました。

さて今回はそんな RubyKaigi で取り戻した Ruby に対する感情と関係あるようなないような、最近自分が取り組んでいるお台場プロジェクトとプロジェクト内で実施している計測と可視化について紹介します。

お台場プロジェクトの発足

クックパッドの開発といえば数年前までは cookpad_all という一つのリポジトリの中に詰め込まれた巨大なモノリシック Rails アプリケーションを社内のエンジニアが寄ってたかって開発するというのが典型的な開発スタイルでした。世界でも類を見ない規模の巨大な Rails アプリケーションの開発であるため、もちろん多様な技術的困難が発生していましたが様々な技術を用いてアプリケーションをメンテナンスし Rails の良さを損なわず開発が進められるように努力していました。*1

しかしその後クックパッドでも徐々にモノリシックアプリケーション構成から Microservices 構成への移行が進んでいきました。 techlife.cookpad.com

そして気づけば cookpad_all は社内に数多く存在する他のアプリケーションと比較してずいぶんと古臭い、触ることが忌避されがちなアプリケーションの代表となっていました。 https://cookpad.com/ のバックエンドの大部分を支える重要なアプリケーションであるというのは変わっていないのに。そこで始まったのがお台場プロジェクトでした。お台場プロジェクトとはなんなのか。その全貌を語るのはまた別の機会としますが、実施することは端的に言えば cookpad_all というアプリケーションの実装の改善です。

お台場プロジェクトではレガシーなシステムの削除、未使用コードの削除、システム分割など様々なことをおこなっており、 id:riseshia が取り組んだ Ruby の lazy loading の仕組みを利用して未使用の gem を探す - クックパッド開発者ブログや RubyKaigi 2018 LT で発表した Find out potential dead codes from diff もその一環です。

以下ではお台場プロジェクトを進めるにあたって取り組んだ cookpad_all 関連メトリクスの計測について紹介します。

cookpad_all 関連メトリクスの計測

cookpad_all の開発における困難を改善するといってもどう改善されているのか記録し、可視化しなければなにもわかりません。

そこでお台場プロジェクト開始初期にまず cookpad_all に関するメトリクスを計測し、社内で稼働させている InfluxDB に記録し、Grafana でダッシュボードを作成しメトリクスを可視化できている状態を作りました。

f:id:hogelog:20180607161915p:plain

具体的には現在 cookpad_all では以下のようなメトリクスを可視化し、改善を進めながら経過を観測し続けています。

CI Duration
App Load Time (development / production)
Loaded File Count
Code Statistics
GemCollector Up-to-date Point
Dependent Gem Count

CI Duration

これは Jenkins で実行している CI にかかった時間の計測です。気をつけることとしては失敗した時の実行時間は不安定になることが多いので、成功した時の時間のみ記録していることです。上記の図で示すように 2017/7 〜 2018/6 現在に至るまで、長いものでは 10 分程かかっていたものが 7分程度まで実行時間が削減されています。

App Load Time

開発者が手元で bin/rails s した時にアプリケーションが動き出すまでの遅さはわかりやすく辛い箇所です。cookpad_all の各アプリケーションでは定期的に以下のようなスクリプトを実行しアプリケーションのロードにかかった時間を計測しています。

def profile_app_load_time
  Benchmark.measure do
    system("./bin/rails r '1;'") or raise "error"
  end
end

# Warming disk cache, ...
puts profile_app_load_time

3.times do
  result = profile_app_load_time
  puts result
  influx.write_point("cookpad_ci_app_load_time", tags: { app: app }, values: { load_time: result.real })
end

Loaded File Count

これはアプリケーションのロードが終わった時点での $LOADED_FEATURES の数です。この数字は依存 gem の追加や削除、大規模なコード削除などで大きく数字が動き、アプリケーションになにか大きな変更があったことの観測に役立っています。

Code Statistics

bundle exec rake stats *2 の数字を記録するものです。この数字も時々誰かがどこか外部で「クックパッドの巨大 Rails アプリケーション」の発表をする時に計測する程度で、定点観測はおこなわれていませんでした。

f:id:hogelog:20180607172035p:plain

大きなシステム分割などにより時々グッと下がっている以外にも、日常的なコード掃除などで地道ながらもコード削減が進んでいることがダッシュボードを見るだけでわかるようになりました。

Dependent Gem Count

依存している gem の数の記録です。数が増えれば増えるほど gem の依存関係が深くなり、新規 gem の導入や既存 gem の更新などが難しくなっていきます。

f:id:hogelog:20180607161805p:plain

このメトリクスは git のログを遡り 2011 年頃からの値を計測してみましたが、依存する gem 数はお台場プロジェクトが始まるまでは増える一方でありアプリケーションを小さくしていこうという開発の流れはほぼ存在していなかったことがわかります。

ちなみに一瞬依存 gem 数が400個を超えたところがあるのが目を引くかもしれないので説明しておくと、これは aws-sdk を v2 -> v3 にアップグレードし、その後で必要な aws-sdk-* のみに絞るよう修正したためです。

GemCollector Up-to-date Point

これはこの gem を使っているアプリケーションを探す - クックパッド開発者ブログで紹介した GemCollector で出している gem の最新度を記録しているものです。

f:id:hogelog:20180607161828p:plain

この値は相対的なものであるため、gem のバージョンアップに追従していかないとどんどんポイントが下がっていきます。対応をおこたっていくといどんどんアプリケーションがレガシーになっていく状況を把握するのに非常に便利なグラフになっています。

まとめ

クックパッドでは現在巨大モノリシック Rails アプリケーションに頼った開発から Microservices 構成のアプリケーション群を組み合わせて使ったサービス開発への急速な移行段階にあります。その中で最後に残されている巨大 Rails アプリケーションを改善していくためのメトリクス収集と可視化ダッシュボードについて紹介しました。

私達はそういうことを一緒にやっていく仲間をもっともっと求めています。定型文じゃなくて本当に求めています。採用への応募またはどんな会社なのか聞くために遊びに来たいみたいなお声がけ、お待ちしております。

*1:どんな技術を用いていたか詳しくは Ruby on Ales 2015 で @amatsuda が発表した The Recipe for the World's Largest Rails Monolith などで詳しく説明されています

*2:実際にはちょっと特殊なディレクトリ構成に対応するため cookpad:stats という独自タスクを定義しています