Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends


As we continue to push the boundaries of what is possible for pipeline throughput and data serving layers, new approaches and technologies continue to emerge to handle larger and larger workloads — from real-time processing and aggregation of user/behavioral data, to rule- and condition-based distribution of event and metric streams, to almost any data pipeline or lineage problem. These workloads are typical in most modern data platforms, and they are critical to all operational analytics systems, data storage systems, ML/DL and beyond. A common problem I have seen at many companies can be reduced to general data reliability issues, mainly caused by scaling and migrating processing components as a company expands and its teams grow. What starts as a few systems can quickly fan out into a series of independent components and serving layers that all need to scale, with zero downtime, to meet the demands of a world hungry for data. In this technical deep dive, a new mental model will be established, aimed at reshaping how one should build large-scale, interconnected services using Kafka, Google Protocol Buffers/gRPC, and Parquet/Delta Lake/Spark Structured Streaming. The material in this deep dive is based on lessons learned the hard way while building a large-scale, real-time insights platform at Twilio, where data integrity and stream fault tolerance are as important as the services our company provides.


Video Transcript

– Hey, thanks for coming to my Spark Summit session on building a streaming microservice architecture with Spark Structured Streaming and friends. My name is Scott Haines, I’m a Senior Principal Engineer at a company called Twilio. I want to talk a little bit about myself and my background, and then we’ll jump right into things. So, as I said, I work at a company called Twilio. If you’re not familiar with Twilio, we’re a large communications company that started about 11 or 12 years ago, doing things like SMS and phone calls, and we’ve grown a lot from there. I’ve been there about four years, and overall I’ve been working on streaming architectures for about 10 years, starting back at Yahoo. Some other things I think are kind of interesting: at Twilio I brought the voice and video insights architecture to the group through a project called Voice Insights about four years ago, I also lead the Spark office hours at Twilio, and I’ve just loved distributed systems my whole career. Okay, so the agenda for today. We’re going to take a look at, basically, the big picture: what does a streaming architecture look like? How is it really different from the different kinds of API-driven architectures people have seen before? And then we’re also taking a look at the actual technology that drives it: what are protocol buffers, why do I like them, and why do I think you should like them as well? And also, what is gRPC? What is RPC? What’s a protocol stream? And then we’re also going to figure out how this actually fits into the Spark ecosystem. So that’s the agenda for today, and we’re gonna just pop right into it.

First, let’s look at the big picture. So zooming all the way out, looking at what an architecture like this looks like — this is the streaming-first view, the streaming microservice architecture. We’ll kind of walk it from left to right, and think of it basically as a span.

Streaming Microservice Architecture

If you think of a span as a single event flowing from a client to a server, and then through the entire streaming backend, that span touches every server in this larger streaming architecture. But we’re going to really focus on the single client-to-service relationship and how data traverses the streaming system. So in the first box, we look at the gRPC client. What is a gRPC client, what is gRPC, and what is HTTP/2? What is protobuf? We’ll distill what the individual components are — the technology that actually drives this architecture — and carry that along as we traverse the whole talk, so we can really understand how this actually works. But this is a very top-level, high-level overview. Starting from the gRPC client: that client can run in an iOS SDK, it can run in JavaScript, it doesn’t really matter, it can be a client running on a Java server. The client itself connects and makes a remote procedure call to a gRPC server, and what it communicates with are protobuf messages. A protobuf message is passed from the client to the server in the same way you’d expect a normal method call in a client SDK. So it kind of encapsulates this client-to-server relationship without having to really think about where the server lives, which is really, really cool, plus it’s lightning fast. From there, once everything hits the server, things are published to Kafka. We’ll talk a little bit more about Kafka, but I think everybody who’s at Spark Summit also realizes and knows about Kafka and its importance. Once the protobuf is in Kafka on an individual topic, it gets picked up by Spark. Spark can natively interact with protobuf through something called ScalaPB, and ScalaPB allows you to convert your protobuf messages directly into a DataFrame or a Dataset within your actual Spark application. So you have this kind of entire end-to-end system getting everything into Spark. Once things are in your Spark application, it’s really easy to interact with anything either within the Spark ecosystem or outside of it, just by converting the individual DataFrames to Parquet and dropping them into something like HDFS or S3, or wrapping all of that with Delta. Alright, so we’re gonna take a look a little bit further: how does this actually expand? You start off with a single span-level architecture, and that span-level architecture gives you that client-to-server relationship, with servers passing an individual message to a topic in Kafka. Once something’s in Kafka, then things get really interesting. So take a look at this architecture: it’s basically just a reconfiguration of exactly what we were looking at before. Once things are in HDFS, like the bottom row, or in Kafka in the top row, there’s the ability to pass these messages — from the gRPC client to the gRPC server, to Kafka, to your Spark application — but you can also pass them back and forth. So you can go back to Hadoop, or you can take your Parquet data and read it back from Hadoop again, back into an application. And what we’ve done at Twilio is we’ve used this architecture to reliably pass messages between the individual streaming Spark applications, without actually needing to schedule a bunch of different batch jobs like you’d expect with something like Airflow, or even just a crontab, across different Spark applications. The cool thing, too, is that at the very end, once you’re ready to close the books on whatever the message is, that can then go back through gRPC, because gRPC enables bidirectional communication. So you also have the ability to peer in on the end result of a Spark job, and then pass it back down to a client once a specific job has run through from end to end. We’re going to take a look at what that actually looks like as we dive deeper into the individual components of this architecture, but at a high level, it’s literally a message from a client to a server.
That’s encapsulated through a remote procedure call, which basically gets rid of that edge between client and server. From there, when things go into Kafka, it’s really up to you what you want to do with them. But the end result is that you can really reliably pass data back and forth with compile-time guarantees, which is really nice, especially for a system that’s online and running 24/7, 365.

Okay, so now let’s talk about protocols — protocol buffers, aka protobuf. Why use them? If you’re not familiar with protocol buffers, maybe you’re only familiar with something like JSON or CSV. JSON and CSV are structured data, but they’re not strict: they can change and mutate types over time, and that can break your downstream systems with one bad upstream commit. The cool thing about protobuf is that it’s basically a compiled, language-agnostic message format. If you think of a common message format — in this case encapsulating what a user is; a user can be anything in your system, it’s really however you want to define a user — the big win here is that it’s a language-agnostic message format that you compile down into the language of your choice. So maybe your server-side language is C++ or Java, but your client libraries are written in Python or Node.js. This lets you use the exact same message everywhere. So you don’t have to worry about different APIs trying to ingest, say, a JSON API that might change, where any upstream change basically breaks everything downstream — one bad commit, one bad push, and the downstream systems break until you roll back. Protocol buffers are basically an idea that came out of Google, and the advantage is that you can inject the compiled, versioned artifact into any of your libraries as an asset or resource. Then it allows you to really do anything you want, without having to worry much about how that data changes, as long as you’re abiding by some of the guidelines. So I would recommend taking a closer look at protocol buffers — just go to developer.google.com and you can find the protocol buffers documentation there as well. That gives you a really good overview of how to actually use them. But really, the big win and the big takeaway is that you have the ability to compile down and have versioned data that can flow all the way through your system, across actual language boundaries — from Java to Node.js and back and forth, et cetera. So they’re really cool, and we’re going to take a little bit more of a look at them.

So, as I said before, you have a language-agnostic message type that can be compiled down into the language of your choice. The cool thing is that you have the ability to basically auto-generate the different construction methods, a kind of scaffolded class. Say you get something like a Java builder, like the one on the right — that’s a Java builder being used in Scala, with our val over there, so val user. Looking at that, it’s basically saying: I want to create a new user. Now I have a builder, and all of the methods that actually create an immutable structure are basically written for you. So there’s nothing else to do. I like to talk about this as: it’s okay to be lazy — just write your message, compile it down, and now everything is done for you. You save time, it’s accurate, and it can be versioned, which is really, really nice. On top of that, protocol buffers come with their own serialization and deserialization libraries. So you don’t need to worry about how that works in your language, because the actual libraries that interoperate with protobuf — in Node.js, Python, Java, Scala and so on — all have their own way of serializing and marshalling these objects, so you don’t have to care. You basically get that whole serialization system for free, it’s lightning fast and super optimized, which is wonderful.
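As a rough illustration of that point, here is a minimal Scala sketch of what working with a ScalaPB-generated message might look like. The User message, its fields, and the package name are hypothetical stand-ins for whatever your .proto actually defines; the construction, toByteArray, and parseFrom calls follow the general ScalaPB pattern rather than the talk’s exact code.

```scala
// Hypothetical message compiled by ScalaPB from a .proto definition such as:
//   message User { string user_id = 1; string name = 2; int64 created_at = 3; }
import com.example.protos.user.User // assumed generated package

object UserExample extends App {
  // Generated case classes are immutable; construction looks like a plain case class
  val user: User = User(
    userId    = "u-123",
    name      = "Ada",
    createdAt = System.currentTimeMillis()
  )

  // Serialization comes for free from the generated code...
  val bytes: Array[Byte] = user.toByteArray

  // ...and so does deserialization, on any service holding the same compiled schema
  val roundTripped: User = User.parseFrom(bytes)
  assert(roundTripped == user)
}
```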

Okay, now we’re going to look at gRPC and what a protocol stream is. This is going to piggyback on everything we’ve seen so far. It uses protocol buffers, and it’s basically the first look at the client-to-server gRPC relationship.

So gRPC — what is it? gRPC builds on protobuf; the server contract is defined in protobuf as well. If you think about RPC, the idea of RPC is a remote procedure call.

gRPC

What does that mean? Well, if you’re a client and you have an SDK, and a method of that SDK is going to do something where IO is actually involved, it becomes this almost invisible boundary for how the client sends a message to the server. We’ll look at an example like ad tracking, which is something I used to do back in the day at Yahoo. Back then we used JSON for everything; now we’re going to look at how to do it with gRPC. For example, if you have a gRPC client running in, say, a JavaScript SDK, and you’re tracking how customers interact with ads — say for growth marketing or something like that — then the gRPC client doesn’t need to worry about how to compose an object for the server, other than creating the protobuf and sending it to the server through the actual gRPC client. So you don’t have to worry about which HTTP library you’re using, whether it’s going to work across different versions of JavaScript, whether people are using the right library version, or whether it will keep working indefinitely — all of that is encapsulated in the overall gRPC process. So gRPC stands for — I like to think of it as Google Remote Procedure Call, since it was created at Google. And it’s used to power a lot of the services that people actually use nowadays, from TensorFlow to CockroachDB, and across the board to Envoy as well, which uses HTTP/2 and protobuf as a way of doing a reliable proxy. But it’s really, really cool because if you compile it down, you don’t have to worry about writing scaffolding for a lot of your different server-side or client-side libraries, given that you can actually still compile it down like you would your protocol buffer to create your scaffold — so your interfaces in Java, and vice versa. The other really cool thing, too, as I talked about a little bit before, is that you have the ability to do bidirectional streaming. This doesn’t necessarily come out of the box by default, but there’s pluggable support for it. So if you think about your server pushing down to your client, then your client responding back and pushing back to the server, and having this whole entire conversation — that’s one of the things that gRPC actually powers. So it takes a lot of the effort out of: how do you do channel-based communication? For example, say you’re running an application where a customer has logged in and they want to get, say, stock data. Now your server can just push that down to the people who are listening to a specific stock — I don’t know, say it’s some NASDAQ stock of some sort — and then they’re able to look at those stocks as they update in real time, just by subscribing to a channel for the stock tickers they’re actually tracking, which is kind of nice. The server side does one thing and then broadcasts it across all of the individual peers within that channel. So that’s the whole bidirectional part. And given that that’s streaming, and you can connect it to your backend, which is also streaming, running Spark jobs that are streaming — you have this kind of end-to-end stream. That really opens up the boundaries to a lot more things that were very complicated in the past, and just adds a framework on top of that.

So, as I said before, we’re going to take a look at an actual example of gRPC. The gRPC example I want to show today is basically just AdTracking. We’ll look at what the message looks like — that’s going to be our interface between the client and the server. We’ll look at the actual server code that receives these messages, acknowledges them to Kafka, and then sends a response back to the gRPC client. It’s basically an example to show how little effort it really takes to have a robust, strongly-typed ecosystem. So think of this as a very smart approach to data engineering: making sure that everything you expect to enter your system actually enters the system with a well-defined, versioned data schema.

gRPC defines messages, just like we saw before in the example with the user — every basic unit is called a message. So: message AdImpression. We need an adId, and that should be a string. Then maybe there’s a contextId — like what page, or what category of page, the ad was actually displayed on — and a userId or sessionId, so you can trace this back to an individual. GDPR first: make sure the thing identifying the user isn’t something like my social security number or anything like that, but something that makes sense for identifying a user in your world — it could be an IDFA or something else. But you’re tying all of this information about some notion of a user’s context back to a person seeing an ad at a specific time. That’s basically what an ad impression is — so, just four fields. And then you have something like a normal response. So there’s a status code; the individual codes can be like typical HTTP status codes, or it can be a status code you create for your own server that allows you to identify different kinds of problems or different issues. So you have this kind of relay back and forth. Plus also a message: if you want a human-readable message to respond to your client, telling them why something didn’t work or telling them that something did work, all of that is possible as well.
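To make that concrete, here is a simplified sketch of roughly what ScalaPB would generate from such a message definition. The field names and types are inferred from the description above, and the real generated code carries much more plumbing (companion objects, field descriptors, toByteArray/parseFrom, and so on), so treat this purely as an illustration.

```scala
// Simplified stand-ins for the ScalaPB-generated message classes.
// A real generated AdImpression extends scalapb.GeneratedMessage and also
// provides serialization helpers, default instances, and field descriptors.
final case class AdImpression(
  adId:      String = "",   // which ad was shown
  contextId: String = "",   // page / category the ad was rendered on
  userId:    String = "",   // opaque user or session identifier (GDPR-friendly)
  timestamp: Long   = 0L    // when the impression happened
)

final case class TrackResponse(
  statusCode: Int    = 0,   // HTTP-style or service-specific status code
  message:    String = ""   // human-readable explanation for the client
)
```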

Okay, so we have our AdImpression, and I still have that on the right side of the screen. As I said, you basically write your protobuf, and then you also have the ability to define your service as protobuf. That’s where everything comes into play. What we’re talking about here is basically a server definition, or a service definition. If we have a ClickTrack service that takes an AdImpression, then really all we’re doing is saying — it looks just like the message above. So: service ClickTrackService. Now we have a named service. Our rpc method is going to be AdTrack; the AdTrack method takes an AdImpression and it returns a response. If you think about it, basically, in this definition there’s really nothing that can be misinterpreted: the client creates an AdImpression and adds the adId, contextId, userId, and timestamp, and the server can also validate it and say, this isn’t right, because you’ve added the wrong data — maybe it’s an empty string or something. And your response is always going to be exactly the same response for the compiled version of your client code, for your server code, and for the protobuf at a specific version. So if you consider versioning, like people do across the board — say Maven versioning, or any kind of versioning in Artifactory or something else — then if you’re making a non-breaking change, great, everything will always work. If you’re making a breaking change, then coordinate that with your downstreams, so that when an upstream changes, nothing breaks down the stream. But if you’re following really strict API-style best practices, then no matter what you do, everything should be backwards compatible anyway — protobuf works backwards compatible. So potentially, if you’re trying something new and you’re canary testing, say, a new field in your AdImpression, other services downstream that have an old version of the protocol buffer for your AdImpression wouldn’t even know that a new field type had been added. Anybody else interoperating with that AdImpression would see something called an unknown field, and they wouldn’t have to do anything with it, which is really great. So it allows you to basically opt in, if you want to, to specific messages and fields that are potentially being added, and it’s a really nice way to not break a service that’s in production just because you want to try something new — say you’re working on a rollout that’s opt-in with feature flags. So all of this stuff basically comes out of the box if you’re following protobuf and gRPC standards.
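For a sense of what this buys the caller, here is a hedged sketch of the client side in Scala, assuming the service was compiled with Akka gRPC (as the server in the next section is). The generated class names, package, and endpoint are assumptions for illustration; what matters is that the remote call reads like a local async method call.

```scala
import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.grpc.GrpcClientSettings
// Assumed Akka-gRPC / ScalaPB generated artifacts from the AdImpression .proto
import com.example.protos.ads.{AdImpression, TrackResponse, ClickTrackServiceClient}

object AdTrackClientExample extends App {
  implicit val system: ActorSystem = ActorSystem("ad-track-client")
  import system.dispatcher

  // Where the gRPC server lives is the only wiring the client needs to know about
  val settings = GrpcClientSettings.connectToServiceAt("localhost", 9443).withTls(false)
  val client   = ClickTrackServiceClient(settings)

  // The remote procedure call reads like a plain local async method call
  val response: Future[TrackResponse] = client.adTrack(
    AdImpression(
      adId      = "ad-42",
      contextId = "sports/homepage",
      userId    = "session-123",
      timestamp = System.currentTimeMillis()
    )
  )

  response.foreach(r => println(s"server responded: ${r.statusCode} ${r.message}"))
}
```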

Cool. So now we’ve got our interface — what the message looks like, what the AdImpression and the response look like — and we have a service, which is basically our AdTrack service. When we compile the gRPC definition — I’m compiling it with Akka gRPC for Scala — all that’s doing is taking the individual protocol buffer definitions, the .proto file we looked at before, and once it’s compiled, it scaffolds the interface for my ClickTrack service. So let’s look at the class itself. We have the ClickTrackService implementation, which basically takes an Akka implementation — I won’t go into that, because this isn’t a talk about Akka — and it extends the ClickTrackService, which is the RPC contract we defined. So when we extend ClickTrackService, it basically scaffolds this method for us, and all we have to do at that point is fill in the blanks for AdTrack. So we look at the overridden definition of AdTrack. What we’re given is our gRPC data, which is our AdImpression. We know that’s exactly what we should be accepting as input to this AdTrack, because we defined it ourselves in the rpc definition. Then everything that happens is wrapped in a Future of the response. It’s kind of a promise of a future: unless everything fails, you’re going to get a response back to your client. It’s basically an asynchronous contract. So what we’re looking at right now, if we go through the code, is basically just a promise of a response. Before, we took a look at the high-level big picture of the architecture: client to server, server to Kafka — that’s all this is showing. And it’s also a dummy Kafka client, just to show you how it would work. So where we have KafkaService.publish taking a KafkaRecord, with a Try which is a Boolean, it’s always just going to return true. This would be connected to a real Kafka client and you’d be doing much more on your end to do this, but that wouldn’t fit in the example. So Kafka.publish: we’re sending a KafkaRecord, which has a topic of ads.click.stream. We’re sending basically just binary data to Kafka: we’ve got a topic, and we’ve also got a binary data payload. So everything’s binary, it’s lightweight. The nice thing is that, as we saw before, because protobuf comes with its own serialization library, it’s very quick to serialize to a byte array. So we get basically binary data in, we have the AdTrack data, and we can do what we want prior to actually sending the stuff to Kafka. So say, for example, you want to add a server-side-only field to the actual ad impression. You could do that — say you’re stamping a timestamp of the time that you received that message. With all said and done, once you basically take that AdImpression record and call toByteArray on it, you now have a ByteArray — binary data that is versioned and compiled to a specific version of your protobuf — on a specific topic, which is ads.click.stream. So if this is a successful response and we’re able to publish this to Kafka, then a 200 OK comes back, which would be like a normal HTTP code for your client, encapsulating the fact that you have published the AdImpression that you wanted to record. That’s fairly lightweight: I think all in all, this is maybe 80 lines of Scala. It could, of course, be a lot larger, like in a normal production use case with validations across the board and everything else, but hopefully this gives a quick tidbit of what it would actually take to implement the service and go run it.
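Below is a hedged reconstruction, in Scala, of the kind of service implementation described here. The topic name, the dummy KafkaService, and the shape of the handler follow the talk’s description, but the generated trait, package names, and exact signatures are assumptions, not the talk’s verbatim code.

```scala
import scala.concurrent.Future
import scala.util.{Success, Try}
// Assumed ScalaPB-generated messages and Akka-gRPC generated service trait
import com.example.protos.ads.{AdImpression, TrackResponse, ClickTrackService}

// Minimal stand-in for a Kafka producer wrapper; a real implementation would
// hand the record to an actual Kafka producer rather than always succeeding.
final case class KafkaRecord(topic: String, payload: Array[Byte])

object KafkaService {
  def publish(record: KafkaRecord): Try[Boolean] = Success(true)
}

class ClickTrackServiceImpl extends ClickTrackService {
  private val topic = "ads.click.stream"

  override def adTrack(in: AdImpression): Future[TrackResponse] = {
    // Example of server-side-only enrichment: stamp the receive time
    val received = in.copy(timestamp = System.currentTimeMillis())

    // toByteArray comes from the generated protobuf code: versioned, compact binary
    val record = KafkaRecord(topic, received.toByteArray)

    KafkaService.publish(record) match {
      case Success(true) =>
        Future.successful(TrackResponse(statusCode = 200, message = "impression recorded"))
      case _ =>
        Future.successful(TrackResponse(statusCode = 500, message = "failed to publish impression"))
    }
  }
}
```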

So on the right, you have the same screen as before, but now this is going to be a bit of the basic protocol stream. If you think about the client-to-server relationship, it’s basically a pipe of AdImpression data: that data flows directly from the client to the server, and from the server, everything basically flows into Kafka. At that point, everything that comes in goes into your streaming Spark — basically the entry point to your streaming data lineage pipeline, or whatever you have at your company. But the benefit is that no random, raw garbage data is going into Kafka. A lot of people have probably struggled in the past with garbage-in, garbage-out systems. So the real advantage is basically putting gRPC in front of your data lineage pipeline, or any reliable data pipeline at your company, because it allows you to have an actual contract for the data from the client to the server, and from the server to Kafka. And then, given that protobuf gives you the ability to validate on the server side what garbage is trying to enter the system, if nothing bad gets into the system, downstream consumers don’t need to be defensive, as long as you’re abiding by good best practices. That takes a lot of the defensive programming out of the actual path, and makes everything downstream much simpler, because you don’t have to pre-process the data and drop garbage records before you can do anything with what’s actually in Kafka, which speeds up your entire downstream system. The other benefit is that you can use this for anything you can think of. So in this abstract use case, it can all feed into real-time personalization and predictive analytics for what ad to show next, based on how people have actually interacted with it, really in real time. So you can create really whatever you want — it’s all at your fingertips.

So these are basically the components we’ve looked at so far within the gRPC architecture. We talked about the service, we talked about AdTrack and the tracked ad. All this is doing is running a remote procedure call on the server — the gRPC server, the ClickTrackService implementation, which has the AdTrack method for the tracked ad. It’s reading a binary message, binary data coming into the server side, so it has a small footprint; it’s not like a giant bloated JSON payload, and all of it can be validated very quickly because we know exactly what fields to expect. Then it can go into our ads.click.stream topic. The ads.click.stream itself is still just protobuf, it’s still binary. So it’s no different from what we observed going from the gRPC client to the gRPC server. We’re not really changing anything, and that’s really the point: when there’s no mutation, things can run very fast. So at that point, from the gRPC server to Kafka, nothing changes again. It’s just binary, validated, structured data, which is really, really good as the start of your whole data lineage pipeline. And especially in Spark, it’s nice not to have to worry about being overly defensive with your streaming data, because streaming systems can go down at any time.

Building the Protocol Stream:

So now we’re going to go a bit further and talk about how we wrap a nice shell around this whole idea. Given that this is Spark Summit, given that people really like streaming, and given that Spark Structured Streaming does the job very well, let’s talk about Structured Streaming and protobuf, because it really is the icing on the cake for this whole architecture. As I said before, once everything is in Kafka, we have a topic destined for an individual purpose, and it’s a topic of protocol buffers — so it’s a protocol stream. Hence: structured protocol streams.

With Structured Streaming and protobuf itself, the benefit is that you can use everything Spark has already baked into its libraries through the expression encoders, and take a protocol buffer — in this case compiled through ScalaPB, as a Scala example. ScalaPB is a compiler library that takes your protocol buffer definitions and generates case classes.
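For readers who want to try this, a typical ScalaPB setup looks roughly like the following sbt configuration. The plugin coordinates follow ScalaPB’s documented install pattern, but the versions and the snippet as a whole are illustrative rather than taken from the talk.

```scala
// project/plugins.sbt  (versions are illustrative)
addSbtPlugin("com.thesamet" % "sbt-protoc" % "1.0.6")
libraryDependencies += "com.thesamet.scalapb" %% "compilerplugin" % "0.11.13"

// build.sbt — compile every .proto under src/main/protobuf into Scala case classes
Compile / PB.targets := Seq(
  scalapb.gen() -> (Compile / sourceManaged).value / "scalapb"
)

// runtime support required by the generated code
libraryDependencies +=
  "com.thesamet.scalapb" %% "scalapb-runtime" % scalapb.compiler.Version.scalapbVersion % "protobuf"
```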

Structured Streaming with Protobuf

Given that those case classes extend what’s called Product in Scala, and Spark’s encoders work with Products, you can implicitly bring in an encoder for any Product type just by importing from your Spark session — spark.implicits, basically. Or you can explicitly generate that implicit instead of importing it, so you don’t have to worry about doing it at runtime, because you know exactly what you’re expecting. So what I have here, with the encoders and expressions, is an implicit val adImpressionEncoder, which is an Encoder of AdImpression, and all it’s doing is calling Encoders.product — the product encoder that ships for free with Spark. That gives you native interoperability between protobuf and Apache Spark through ScalaPB, and it also allows you to directly process the data that came from Kafka — from the client, to the server, to Kafka, and now into the actual Spark application. So from here you can do aggregations for individual users, you can prepare machine learning features and send them through a machine learning pipeline, and then everything can roll back out to your ad server, which could itself also be a gRPC server. So it really connects the dots across the whole streaming system, in a way where there’s an expectation of how everything is going to work.
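A minimal sketch of that encoder wiring might look like the following, assuming the ScalaPB-generated AdImpression case class from the earlier sketch. Note that depending on your ScalaPB options, the generated class can carry extra fields (for example an unknown-field set) that the reflective product encoder cannot handle, in which case ScalaPB’s sparksql-scalapb module is the usual fallback.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import com.example.protos.ads.AdImpression // assumed ScalaPB-generated case class

object AdImpressionEncoders {
  // ScalaPB case classes extend Product, so Spark's product encoder applies directly
  implicit val adImpressionEncoder: Encoder[AdImpression] = Encoders.product[AdImpression]
}
```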

So, as I said before, native is better. There’s a strict, native Kafka-to-DataFrame conversion, with no intermediate type conversions required. If you’ve ever worked with a type that isn’t supported by Spark out of the box, a lot of the time what you end up doing is writing a transform method that takes the binary data from Kafka, and then plugging in your own serializer and deserializer combination. From there, you either have to drop down to the RDD API, or you need to write your own complex expression encoder. Back in 2018, I did a Spark Summit talk about streaming trend discovery, and in that codebase there’s an example of how to actually do that — how to basically build your own native protobuf encoder for Spark — but in this case we’re just looking at ScalaPB because it’s much easier. So on the right, I basically have an example of what a Structured Streaming job looks like: one that takes this AdTrack message and runs it through an actual machine learning model. If you look at the right, we have our query, which is our online query, and that’s basically our Kafka data input stream. We load the data, we convert that data into an AdImpression, and then we join on the adId with something called contentWithCategories. So contentWithCategories would be, fair to say, an ETL process that, based on an adId, pulls in more information about what’s encapsulated within that ad. From there, we have all the information we would need to transform this data with our serialized pipeline. There are going to be a lot of other talks over the past few days of Spark Summit that cover how to actually create a pipeline, how to load and save pipeline models, and how to fit, say, a logistic regression model, a linear regression model, et cetera. So this use case is really just showing what it looks like from, like, the data engineering or machine learning engineering point of view, once those models and pipelines have been serialized off and you’re bringing them in. So it’s all in-stream: join on adId, transform to basically run our pipeline transformation, and then predict. In the case of ad serving, we want to predict the next best ad for this individual user. And then from there, we’re just going to write this out as Parquet and dump it into HDFS or Delta, and allow another system to pick that up, which it can also do in real time. So at this point, once everything hits Kafka, once it’s been processed, and once a predicted label has been applied to that data, it can get picked back up — you have this kind of full end-to-end cycle, which is basically serving your ads. And the other nice thing, too, is that given there’s really nothing that should change, as long as best practices are enforced, you can actually rest at night knowing that the pipelines are safe. And that’s, I think, one of the biggest things that people worry about, especially from a data engineering point of view: a small mutation creates a butterfly effect that can take down an entire ad server. And to speak on ads — ads are money. So that’s something you really don’t want to do. If you put the right safety precautions on the whole system, you don’t have to worry so much about that, and it’s really good for everybody who’s in the downstream path as well.
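A hedged Scala sketch of the kind of streaming query described above might look like the following. The contentWithCategories dataset, model path, column names, and Kafka/sink options are stand-ins; the overall shape (Kafka source, protobuf decode via ScalaPB, join, ML pipeline transform, Parquet sink) follows the talk’s description.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import com.example.protos.ads.AdImpression // assumed ScalaPB-generated case class

object AdImpressionStream {
  def run(spark: SparkSession, contentWithCategories: DataFrame, modelPath: String): Unit = {
    implicit val adImpressionEncoder = Encoders.product[AdImpression]

    // Previously trained and serialized pipeline (feature transforms + model)
    val pipeline: PipelineModel = PipelineModel.load(modelPath)

    val impressions = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // illustrative
      .option("subscribe", "ads.click.stream")
      .load()
      // Kafka delivers the protobuf as binary in the value column; ScalaPB decodes it
      .map(row => AdImpression.parseFrom(row.getAs[Array[Byte]]("value")))

    val enriched  = impressions.join(contentWithCategories, Seq("adId")) // pull in ad metadata
    val predicted = pipeline.transform(enriched)                          // adds prediction columns

    predicted.writeStream
      .format("parquet")
      .option("path", "/data/ads/predicted")            // illustrative sink path
      .option("checkpointLocation", "/checkpoints/ads") // required for streaming sinks
      .start()
      .awaitTermination()
  }
}
```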

Another benefit of structured protobuf streams is that you can also use the protobuf definitions for validation. You can take a step back and say: if you have a protobuf for, say, your input format, and even a protobuf for your expected output format, you can use those two protobuf objects to validate against a strict output type. Delta already does this kind of versioning of the underlying struct types stored in the Delta Lake, but if you’re not using Delta and you’re using something like plain HDFS, this becomes even more important, because it’s not as easy to change the format of your Parquet data after the fact. What my team at Twilio found is that if we have a definition that can be written out as, say, a DDL — as part of a DataFrame schema — then it’s really easy to read the data back and validate it. In the top right, I basically have this method: def validateOutput(df: DataFrame) returning a Boolean, checking that the DataFrame’s schema DDL equals our report struct type DDL. That struct type DDL is basically pinned for every released version of the actual Spark code, and from there, if anything has mutated, it’s invalid. Once it’s invalid, either the records are invalid, or — most likely — I’m actually pushing that information into, say, Datadog or another monitoring service to let us know that something has broken in the pipeline. But in fact, it’s even better to do this at the gRPC level, so whatever is invalid never actually gets into the pipeline in the first place; otherwise you have to be defensive about it. But once you have a running Spark application that’s only exchanging Parquet data, then the Parquet itself can be converted back into your DataFrame and can still be validated against the protobuf, or against the struct type itself. So I think the really big key is: as long as you know exactly what the inputs and outputs to your system are, it’s easy to create something that from the outside looks very complicated, but is very solid from an operational point of view once it’s up and running — as long as you stick to strict best practices. And often it’s easier to write wrapper libraries that enforce some of those strict data policies, which is the really important stuff. Okay, so now we’re going to look at an actual use case, a concrete example of a Close of Books job running within my team — the Voice Insights team at Twilio. What we do is we have a job that runs at basically the end of every day and re-validates that all of the data that came through our streaming pipelines is actually accurate. And the nice thing about this is that we have this known notion of a Close of Books, and all of this data can actually be written out into a kind of final output stream, which can then be reused to run machine learning jobs and everything else against. So if we take a look at the method on the top right, we have a run method; the run method basically creates a streaming query. And this is taking our call summary data. I work on the Voice Insights team; we have CDRs, which are call summaries, and that call summary itself is basically just loaded from our Parquet stream. The Parquet stream itself is then compiled down — it’s deduplicated — and we push that back into a final stream back into HDFS. So it’s just an example of how to do that, and it gives you an idea of how we’ve actually wrapped this into our data engineering pipelines. I thought it would be kind of interesting to show this after the AdTrack and some of the other example use cases, to show what a real-world use case looks like. And it’s actually fairly trivial and fairly simple. So, what we’ve looked at today is the end-to-end gRPC pipeline to Kafka, to Spark, and then final output to Parquet and other Hadoop storage, and I really believe that this pipeline works really well, especially as things get more and more complicated.
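A minimal sketch of the schema check described above might look like this. The expected DDL and its fields are illustrative stand-ins; the point is the comparison of the DataFrame’s schema DDL against a pinned, versioned struct type.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

object OutputValidation {
  // Pinned, versioned schema for the final report output. In practice this could be
  // derived from the protobuf definition; the fields here are illustrative only.
  val reportStructType: StructType = StructType.fromDDL(
    "adId STRING, userId STRING, predictedLabel DOUBLE, updatedAt TIMESTAMP"
  )

  // True only when the DataFrame's schema matches the expected DDL exactly; a mismatch
  // would be reported to monitoring (e.g. Datadog) rather than written out.
  def validateOutput(df: DataFrame): Boolean =
    df.schema.toDDL == reportStructType.toDDL
}
```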

Streaming Microservice Architecture

So a lot of people will start, as you probably would, with a single server or a single service, two or three Kafka topics, maybe one Spark application that’s handling all the data. As you add more and more to this kind of system, you can follow the same idea: basically view everything as a span across time — from your client’s method request to the server, from the server to a Kafka topic, from the Kafka topic to a Spark application or many Spark applications like the ones we looked at before, and finally into a source-of-truth store, like Hadoop or Delta Lake running on top of Hadoop, in Parquet format. That gives you a reliable channel to keep expanding, and to build a much more complex data lineage pipeline, or an entire data lineage service at your company, while each piece stays as simple as what we just looked at. So now we’re going to take a look at just a quick recap.

So for our recap: we looked at protobuf. To recap, protobuf is just a language-agnostic way to organize data by creating a common message format. It has compile-time guarantees, which means that once it’s compiled, it can’t mutate away from the version it was compiled at. It also brings lightning-fast serialization and deserialization to the table, which is super useful for sending data over the wire from your gRPC client to your gRPC server, and then from the gRPC server to Kafka as a binary message. gRPC, as I said before, is also language-agnostic. It has the same kind of properties as protobuf: you can create a common format that you compile down into the language of your choice. It’s also super low latency, because it’s pushing binary data over HTTP/2, and it has compile-time guarantees as well. So your API won’t change for the version you’ve compiled against, which is the same thing protobuf gives you: compile-time guarantees on your message format, and compile-time guarantees on the actual server interface layer. Plus, it’s a super smart framework and it has a lot of momentum. We also looked at Kafka. I didn’t go too deep into Kafka, because everybody at Spark Summit, I promise, knows what Kafka is, but as a super quick recap: it’s highly available, and it has a native Spark connector, so there’s basically no effort to create a connector for producing to or consuming from Kafka. And you can use it to pass records to one or more downstream services, which is really great if you want to test different things, like A/B testing. Last, we took a look at Structured Streaming, and this has been out for a long time now, but it’s a nice way of reliably handling data en masse. The nice thing is that you can do protobuf-to-Dataset and DataFrame conversions natively by using ScalaPB, or a library of your choice, or you can choose to write your own. It’s also really nice because everything can go end to end as protobuf, and from there, the protobuf-to-DataFrame-to-Parquet path is also native in Spark. So it allows you to take everything that you’ve worked hard on and reliably store it, so that you don’t have a corrupt data lake or data warehouse for future processing of your valuable data assets.

And that’s it. Thank you for coming out to Spark Summit this year. I’m excited that you got to learn a little bit today about gRPC, protobuf, Kafka, and Structured Streaming, and to find out a little about the things my team has been working on over the last four years to create pipelines like the ones I presented today.

About Scott Haines

Twilio

Scott Haines is a full stack engineer with a current focus on real-time analytics and intelligence systems. He works at Twilio, primarily as a senior software engineer on the Voice Insights team, where he drove the adoption of Spark and streaming pipeline architectures, and helped architect and build out large-scale stream and batch processing platforms.

In addition to his role on the Voice Insights team, he is also a software architect for the machine learning platform at Twilio, helping shape the future of experimentation, training, and safe integration of models.

Scott currently runs the company-wide Spark Office Hours, providing mentorship, tutorials, workshops, and training for engineers and teams across Twilio. Prior to Twilio, he worked at Yahoo writing backend Java APIs for games, as well as a real-time game ranking/rating engine (built on Storm) providing personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo working on Flurry Analytics, where he wrote the alerting and notification system for mobile.