Getting Started with Delta Lake

Making Apache Spark Better with Delta Lake

Michael Armbrust, Principal Software Engineer, Databricks
Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.

Series details

This session is part of the Getting Started with Delta Lake series with Denny Lee and the Delta Lake team.

Session abstract

Join Michael Armbrust, head of the Delta Lake engineering team, to learn how his team built upon Apache Spark to bring ACID transactions and other data reliability technologies from the data warehouse world to cloud data lakes.

Apache Spark is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to quality, reliable data. This webinar covers the use of Delta Lake in a Spark environment to improve data reliability.

Topic areas include:

  • The role of Apache Spark in big data processing
  • Using data lakes as an important part of your data architecture
  • Data lake reliability challenges
  • How Delta Lake helps provide reliable data for Spark processing
  • Specific improvements that Delta Lake adds
  • The ease of adopting Delta Lake for powering your data lake

What you'll need:
Sign up for Community Edition here to access the workshop presentation materials and sample notebooks.

Video transcript

– (Denny) Hi, everybody. Welcome to today's webinar, Making Apache Spark Better with Delta Lake.

Before we begin today's presentation, we want to go over a few housekeeping items to make sure you have the best experience. Please note that your audio connection will be muted for the webinar, for everyone's comfort. If you have any questions or concerns, please raise them in the Q&A panel or in the chat panel. We encourage you to use this time to ask as many questions as you'd like and clear up any doubts you may have about today's topic. Our main presenter today is Michael Armbrust, the original creator of Spark SQL and Structured Streaming, and one of the main creators of Delta Lake. He's a principal engineer at Databricks, so without any further delay, Michael, take it away. – (Michael) Thank you, Denny. I'm super excited today to talk about how to make Apache Spark better by using Delta Lake. However, before I jump in, I want to start by talking about this concept of a data lake, why so many people are excited about it, and why there are a lot of challenges when they try to set these things up.

The promise of the data lake

First of all, what is a data lake, and what do I mean by that? So the promise of a data lake is basically this: organizations have a lot of data. It might be carefully curated customer data in an OLTP system. It might be raw clickstreams coming from your web servers, or it might be unstructured data coming from a bunch of sensors. And the promise of the data lake is that you can take all of that data and dump it into the lake. That's really powerful when you compare it to a traditional database, because with a traditional database you have to come up with a schema and do the cleaning up front. That's typically called schema-on-write, and what a data lake allows you to do is forgo that process and just collect everything first, because sometimes you don't know why data is valuable until much later, and if you haven't stored it, you've already lost it. So with a data lake, it's just a bunch of files in a file system. It could be S3 or HDFS or Azure Blob Storage; you can dump everything there and then come back and look at it later. And the idea is that once you've collected it all, you can then get insights from it. You can do data science and machine learning. You can build powerful tools for your business, like recommendation engines or fraud detection algorithms. You can even do crazy things like cure cancer using genomics and DNA sequencing. However, I've seen this story many, many times, and typically what happens is unfortunately the data at the beginning is garbage. And so the data that you store in your data lake is garbage, and as a result, you get garbage out from these kind of more advanced processes that you try to do at the end. And why does that happen? Why is it so difficult to get quality and reliability out of these data lakes? And what does this kinda typical project look like? So I wanna walk you through a story that I've seen kind of happen over and over again at many organizations, when they sit down and try to extract insights from their data.

The evolution of a cutting-edge data lake

It typically looks something like this, and this is what would be considered cutting edge today, but before Delta Lake. So a really common pattern is you have a stream of events. They're coming into some system like Apache Kafka, and your mission is to do two things. You need to do streaming analytics, so you can know in real time what's happening in your business. You also want to do AI and reporting, where you can look at a longer period of time, do longitudinal analysis, actually look at the history and trends, and make predictions about the future. So how are we going to do that? Kind of step number one: I sit down at my computer, and I know that Spark has nice APIs for reading from Apache Kafka. You can use DataFrames, Datasets, and SQL and Spark SQL to process that and do aggregations and time windows and all kinds of things, and come up with your streaming analytics. So we start with that, and right off the bat it works pretty well, but that brings us to a challenge, which is historical queries.

Challenge #1: Historical queries?

Kafka is great for getting kind of real-time analysis, but it can only store a day or a week of data. You don't want to be storing years and years of data in Kafka. So we have to solve that problem. Real time is great for what's happening at this very moment, but it's not so good for looking at historical trends. So I've been reading a lot of blogs, and a pretty common pattern that shows up here is this thing called the lambda architecture, which, as far as I can tell, is basically that you just do everything twice. You have one real-time thing that's doing an approximation and giving you what's happening at this very moment, and you have another pipeline, which is maybe more curated. It runs a little bit more slowly, but it's archiving all of that data into your data lake. So that's kind of step number one. So if we want to solve this historical-query problem, we also set up the lambda architecture on top of vanilla Apache Spark, and once I have all of the data in the data lake, the idea is that now I can run Spark SQL queries over it, and now you can do AI and reporting. It's a little bit of extra work, a little bit of extra coordination, but fortunately Spark has a unified API for batch and streaming. And so it's possible to do it and we get it set up, but that brings us to challenge number two.

Challenge #2: Messy data?

Like I said before, data in the real world is often messy. Some team upstream from you changes the schema without telling you, and now you have problems. So a pattern that I see here is that you need to add validations. You need to write extra Spark SQL programs that check to make sure your assumptions about the data are correct. And if they're wrong, they send off an email so you can correct it. Of course, because we did the lambda architecture, we have to do the validations in two different places, but again, that's something we can do. We can use Spark to do it. So now we've set up validations to handle the messy data. That, unfortunately, brings us to challenge number three, which is mistakes and failures. Those validations are great, but sometimes you forget to put one in place, or there's a bug in your code, or, even harder, the code just crashes in the middle because you're running on spot instances on EC2 and they die, and so on, and now you have to worry about, "How do I clean that up?" The real problem with using these kinds of distributed systems and distributed file systems is that if a job crashes in the middle, it leaves partial, garbage results behind that need to be cleaned up. And so you're kind of forced to do all of the reasoning about correctness yourself. The system isn't giving you a lot of help here. And so a pretty common pattern is people, rather than working on an entire table at a time, because if something goes wrong, we need to recompute the entire table. They'll instead break it up into partitions. So you have a different folder. Each folder stores a day or an hour or a week, whatever kind of granularity makes sense for your use case. And you can build a lot of, kinda scripting around it, so that it's easy for me to do recomputation. So if one of those partitions gets corrupted for any reason, whether it was a mistake in my code or just a job failure, I can just delete that entire directory and reprocess that data for that one partition from scratch. And so kind of by building this partitioning and reprocessing engine, now I can handle these mistakes and failures. There was a little bit of extra code to write, but now I can kind of sleep safe and sound knowing that this is gonna work.

Challenge #4: Updates?

And that brings us to challenge number four, updates. It's very difficult to do point updates. It's very easy to add data, but it's very hard to change data in this data lake and to do it correctly. You might need to do this because of GDPR. You may need to do retention. You may need to do anonymization or other things, or you may just have errors in the data. So now you have to end up writing a whole other class of Spark jobs that do updates and merges. This can be very difficult, and typically, because it's so difficult, what I see people do is, rather than doing individual updates, which would be very cheap, any time they need to do something, whenever they get a set of these requests, say once a month, they'll actually copy the entire table, dropping the people who have asked to be forgotten under the GDPR requirements. They can do it, but it's yet another Spark job to run, and it's very expensive. And there's kind of a subtlety here that makes it extra difficult, which is if you modify a table while somebody is reading it, generating a report, they're gonna see inconsistent results and that report will be wrong. So you'll need to be very careful to schedule this to avoid any conflicts, when you're performing those modifications. But these are all problems that people can solve. You do this at night and you run your reports during the day or something. And so now we've got a mechanism for doing updates. However, the problem here is this has become really complicated. And what that means is you're wasting a lot of time and money solving systems problems rather than doing what you really want to be doing, which is extracting value from your data. And the way I look at this is these are all distractions of the data lake that prevent you from actually accomplishing your job at hand.

Data lake distractions

And to summarize what I think these are: a big one is no atomicity. When you run a distributed computation, if the job fails in the middle, you still have some partial results out there. It's not all or nothing. Atomicity means that when a job runs, it either completely finishes correctly or, if anything goes wrong, it completely rolls back and nothing happens. So you're no longer leaving your data in a corrupt state that requires you to tediously build tools to do manual recovery. Another key problem is that there's no quality enforcement. It's up to you, in every job, to manually check the quality of the incoming data against all of your assumptions. There's no help from the system, like invariants in a traditional database, where you can say, "No, this column is required," or "This must be this type of schema." All of that is left up to you as the programmer to handle. And then finally, there's no control for consistency or isolation. That means you can really only do one correct operation to any data lake table at a time, and it makes it very hard to mix streaming and batch operations while people are reading from it. And these are all things that you kind of, you would expect from your data storage system. You would want to be able to do these things, and people should always be able to see a consistent snapshot automatically.

So let's take a step back and look at what this process looks like with Delta Lake.

Data lake challenges

And the idea of Delta Lake is that we take this relatively complicated architecture, where a lot of the correctness and other concerns were left up to you manually writing Spark programs, and we change it so that you're thinking only about the data flow, where you bring in all of the data from your organization and flow it through, continually improving the quality until it's ready for consumption.

The hallmarks of Delta Lake

The hallmarks of this architecture: first of all, Delta Lake brings full ACID transactions to Apache Spark. What that means is that every Spark job that runs will either complete in its entirety or not at all. People who are reading and writing at the same time are guaranteed consistent snapshots. When something is written out, it is definitely written out and it won't be lost. Those are the properties of ACID. This allows you to focus on your actual data flow rather than thinking about all of these extra systems problems and solving the same thing over and over again. Another key aspect of Delta Lake is that it's based on open standards and it's open source. It's fully Apache licensed; there are no silly Commons Clauses or anything like that. You can take it and use it for whatever application you want, completely for free. And personally, that would be really important to me if I were storing massive amounts of data, right? Data has a lot of gravity. There's a lot of inertia when you collect a lot of data and I wouldn't want to put it in some black box where it's very difficult for me to extract it. And this means that you can store that mass amount of data without worrying about lock-in. So both is it open source, but it's also based on open standards. So I'll talk about this in more detail later in the talk, but underneath the covers, Delta is actually storing your data in parquet. So you can read it with other engines and there's kind of a growing community around Delta Lake building this native support in there. But worst case scenario, if you decide you want to leave from Delta Lake all you need to do is delete the transaction log and it just becomes a normal parquet table. And then finally, Delta Lake is deeply powered by Apache Spark. And so what this means is if you've got existing Spark jobs, whether they're streaming or batch, you can easily convert those to getting all kinds of benefits of Delta without having to rewrite those programs from scratch. And I'm gonna talk exactly about what that looks like later in the talk. But now I want to take this picture and simplify it a little to talk about some of the other hallmarks I see of the Delta Lake architecture, and where I've seen people be very successful. So first of all, I wanna kind of zone in on this idea of data quality levels. These are not fundamental things of Delta Lake. I think these are things that people use a variety of systems, but I've seen people very successful with this pattern, alongside the features of Delta.

Delta Lake

So these are just kind of general classes of data quality, and the idea here is that as you bring data into the lake, rather than trying to make it perfect all at once, you're going to incrementally improve the quality of your data until it's ready for consumption. And I'll talk about why I think that's actually a really powerful pattern that can help you be more productive. So you start with your bronze level data. This is a dumping ground for raw data. It's still on fire, and I actually think that's a good thing, because the core idea here is that if you capture everything without doing a lot of munging or parsing, there's no way you can have bugs in your parsing and munging code. You keep everything from the very beginning, and you can often actually keep a year's worth of retention. I'll talk in a moment about why I think that's really important, but it means you can collect everything. You don't have to spend a bunch of time ahead of time deciding which data is going to be valuable and which data isn't. You can figure that out as you do your analysis. Moving on from bronze, we go on to the silver level of data. This is data that is not yet ready for consumption. It's not a report that you're gonna give to your CEO, but I've already done some cleanup. I filtered out one particular event type. I've parsed some JSON and given it a better schema or maybe I've joined and augmented different data sets. I kinda got all the information I want in one place. And you might ask, if this data isn't ready for consumption, why am I creating a table, taking the time to materialize it? And there's actually a couple of different reasons for that. One is oftentimes these intermediate results are useful to multiple people in your organizations. And so by creating these silver level tables where you've taken your domain knowledge and cleaned the data up, you're allowing them to benefit from that kind of automatically without having to do that work themselves. But a more interesting and kind of more subtle point here is it also can really help with debugging. When there's a bug in my final report, being able to query those intermediate results is very powerful 'cause I can actually see what data produced those bad results and see where in the pipeline it made sense. And this is a good reason to have multiple hops in your pipeline. And then finally, we move on to kind of the gold class of data. This is clean data. It's ready for consumption at business-level aggregates, and actually talk about kind of how things are running and how things are working, and this is almost ready for a report. And here you start using a variety of different engines. So like I said, Delta Lake already works very well with Spark, and there's also a lot of interest in adding support for Presto and others, and so you can do your kind of streaming analytics and AI and reporting on it as well.

Now I want to talk about how people actually move data through the Delta Lake, through these different quality classes. And one of the patterns that I see over and over again is that streaming is actually a very powerful concept here. Before I go deeper into streaming, I want to correct some misconceptions that I often hear. One thing people typically think when they hear streaming is that it has to be really fast. It has to be really complicated, because you want it to be really fast. And Spark does actually support that mode of operation. There's continuous processing, where you continuously pull new data off of the server, holding onto those cores, and it supports millisecond latencies, but that's actually not the only application where streaming makes sense. Streaming to me is really about incremental computation. It's about a query that I want to run continuously as new data arrives. Rather than thinking about this as a bunch of discrete jobs and putting all of the management of those discrete jobs onto me or onto some workflow engine, streaming takes that away. You write a query once. You say, "I want to read from the bronze table, I'm gonna do these operations, I'm gonna write to the silver table," and you just run it continuously. And you don't have to think about the kind of complicated bits of what data is new, what data has already been processed. How do I process that data and commit it downstream transactionally? How do I checkpoint my state, so that if the job crashes and restarts, I don't lose my place in the stream? Structured streaming takes care of all of these concerns for you. And so, rather than being more complicated, I think it can actually simplify your data architecture. And streaming in Apache Spark actually has this really nice kind of cost-latency tradeoff that you can tune. So at the far end, you could use continuous processing mode. You can kind of hold onto those cores for streaming persistently, and you can get millisecond latency. In the middle zone, you can use micro-batch. And the nice thing about micro-batch is now you can have many streams on the cluster and they're time-multiplexing those cores. So you run a really quick job and then you give up that core and then someone else comes in and runs it. And with this, you can get seconds to minutes latency. This is kind of a sweet spot for many people, 'cause it's very hard to tell if one of your reports is up to date within the last minute, but you do care if it's up to date within the last hour. And then finally, there's also this thing called trigger once mode in Structured Streaming. So if you have a job where data only arrives once a day or once a week or once a month, it doesn't make any sense to have that cluster up and running all the time, especially if you're running in the cloud where you can give it up and stop paying for it. And Structured Streaming actually has a feature for this use case as well. And it's called trigger once where basically rather than run the job continuously, anytime new data arrives, you boot it up. You say trigger once. It reads any new data that has arrived, processes it, commits a downstream transaction and shuts down. And so this can give you the benefits of streaming, kind of the ease of coordination, without any of the costs that are traditionally associated with an always running cluster. Now, of course, streams are not the only way to move data through a Delta Lake. Batch jobs are very important as well. Like I mentioned before, you may have GDPR or kind of these corrections that you need to make. You may have changed data capture coming from some other system where you've got a set of updates coming from your operational store, and you just want to reflect that within your Delta Lake and for this, we have UPSERTS. And of course, we also support just standard insert, delete, and those kinds of commands as well. And so the really nice thing about Delta Lake is it supports both of these paradigms, and you can use the right tool for the right job. And so, you can kind of seamlessly mix streaming and batch without worrying about correctness or coordination.
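
To make the incremental-computation idea a bit more concrete, here is a rough PySpark sketch of one continuously running query that moves data from a hypothetical bronze Delta table to a silver one. The paths, column names, and the JSON-parsing step are assumptions for illustration only, not code from the talk; the trigger-once variant noted at the end corresponds to the mode described above.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

    # Hypothetical schema for the parsed events.
    schema = (StructType()
              .add("user_id", StringType())
              .add("event_type", StringType())
              .add("ts", TimestampType()))

    # Read the bronze Delta table as a stream: the query starts from a snapshot
    # of the table and then tails the transaction log for newly added files.
    bronze = spark.readStream.format("delta").load("/tmp/delta/bronze")

    # Clean the raw records up a bit: parse the JSON payload, keep one event type.
    silver = (bronze
              .select(from_json(col("value"), schema).alias("event"))
              .select("event.*")
              .where(col("event_type") == "click"))

    # Write continuously (micro-batch) into the silver Delta table.
    query = (silver.writeStream
             .format("delta")
             .option("checkpointLocation", "/tmp/delta/_checkpoints/silver")
             .start("/tmp/delta/silver"))

    # For data that only arrives occasionally, add .trigger(once=True) before
    # .start(...) so the query processes whatever is new and then shuts down.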

The final pattern here that I want to talk about is this idea of recomputation. You have this early table that keeps all of your raw results, and you keep long retention on it, so years of raw data. And when you use streaming to move data between the different nodes of your Delta Lake data graph, it's very easy for you to do recomputation. You might want to recompute because there was a bug in your code, or you might want to recompute because there's something new that you've decided you want to extract. And the really nice thing here is that, because of the way streaming works, this is very simple. To give you a mental model of how Structured Streaming works in Apache Spark, our model is basically that a streaming query should always return the same result as a batch query over the same amount of data. What that means is, when you start a new stream against a Delta table, it starts by taking a snapshot of the table at the moment the stream starts. You do this backfill operation, processing all of the data in that snapshot, breaking it up into nice little chunks, checkpointing your state along the way and committing it downstream. And when you get to the end of the snapshot, we switch to tailing the transaction log and only processing new data that has arrived since the query started. And what this means is you get the same result as if you had run the query at the end anyway, but with significantly less work than running it from scratch over and over and over again. So if you want to do recomputation under this model, all you need to do is clear out the downstream table, create a new checkpoint, and start it over. And it will automatically process from the beginning of time and catch up to where we are today.
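
As a small hedged sketch of what that recomputation looks like operationally (reusing the illustrative paths from the earlier snippet): drop or redirect the downstream table, point the stream at a fresh checkpoint directory, and the query backfills the bronze history automatically before continuing on new data.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("recompute-silver").getOrCreate()

    # After fixing the bug (or adding the new extraction logic), use a brand-new
    # checkpoint location and a cleared-out (or new) downstream path. Because the
    # stream begins from a snapshot of the bronze table, it reprocesses all of the
    # retained history and then keeps running on newly arriving data.
    query = (spark.readStream.format("delta").load("/tmp/delta/bronze")
             # ... corrected transformation logic would go here ...
             .writeStream
             .format("delta")
             .option("checkpointLocation", "/tmp/delta/_checkpoints/silver_v2")
             .start("/tmp/delta/silver_v2"))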

So that's actually a pretty powerful pattern for correcting mistakes and doing other things. Now that we've gone over this at a high level, I want to talk about some specific use cases where Delta Lake reduces cost and eases management when using Apache Spark on these data lakes. And for Delta Lake, I want to give a little bit of history first.

Used by 1000s of organizations around the world

So Delta Lake is actually about two years old. We've had it inside of Databricks for the last two years. It was a proprietary solution, and we had some of our largest customers using it. So I'm going to talk in particular about Comcast, but also Riot Games, and Jam City, and Nvidia, a bunch of big names that you know. They've been using it for many years. At Spark Summit about two months ago, we decided to open source it, so that everybody, even people running Spark elsewhere, could get the power of Delta Lake. So I want to talk about one particular use case that I think is really cool, and that's Comcast. Their problem is that they have set-top boxes all over the world, and in order to understand how people interact with their programming, they need to sessionize this information. You watch this TV show, you change the channel, you go over here, you go back to another TV show. And with this they can create better content by understanding how people consume it. And as you can imagine, Comcast has many subscribers, so there's petabytes of data. And before Delta Lake, they were running this on top of Apache Spark. And the problem was the Spark job to do this sessionization was so big that the Spark job, the Spark scheduler would just tip over. And so, rather than run one job, what they actually had to do was they had to take this one job, partition it by user ID. So they kind of take the user ID, they hash it, they mod it by, I think, by 10. So they break it into kind of 10 different jobs, and then they run each of those jobs independently. And that means that there's 10x, the overhead, in terms of coordination. You need to make sure those are all running. You need to pay for all of those instances. You need to handle failures and 10 times as many jobs, and that's pretty complicated. And the really cool story about switching this to Delta was they were able to switch a bunch of these kinds of manual processes to streaming. And they were able to dramatically reduce their costs by bringing this down into one job, running on 1/10 of the hardware. So they're now computing the same thing, but with 10x less overhead and 10x less cost. And so that's a pretty kind of powerful thing here that what Delta's scalable metadata can really bring to Apache Spark. And I'm gonna talk later in the talk exactly how that all works.

But before I get into that, I want to show you exactly how easy it is to get started with Delta Lake if you're already using Apache Spark.

Getting started with Delta using Spark APIs

So getting started is trivial. It's published on Spark Packages. All you need to do to install Delta Lake on your Spark cluster is use Spark Packages. If you're using PySpark, you can just do --packages and then Delta. If you're using the Spark shell, same thing. If you're building a Java or Scala jar and you want to depend on Delta, all you need to do is add a Maven dependency, and then changing your code is equally simple. If you're using the Spark SQL or DataFrame readers and writers, all you need to do is change the data source from parquet or JSON or CSV or whatever you're using today to delta, and everything else should work the same. The only difference is that now everything will be scalable and transactional, which, as we saw before, can be very powerful.
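
As a rough illustration of what that looks like in practice, here is a minimal PySpark sketch. The package coordinates and version in the comment are assumptions (check delta.io for the release matching your Spark build), and the paths are made up; the only code change relative to a parquet pipeline is the format string.

    # Launch PySpark with the Delta Lake package (coordinates are illustrative):
    #   pyspark --packages io.delta:delta-core_2.11:0.3.0

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-quickstart").getOrCreate()

    # Before: df.write.format("parquet").save("/tmp/events")
    # After: the same call, with the format switched to "delta".
    events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
    events.write.format("delta").mode("append").save("/tmp/delta/events")

    # Reading works the same way; again, only the format string changes.
    df = spark.read.format("delta").load("/tmp/delta/events")
    df.show(5)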

Data quality

So everything I've talked about so far has mostly been about these kinds of systems problems of correctness. If my job crashes, I don't want it to corrupt the table. If two people write to the table at the same time, I want them both to see consistent snapshots, but data quality is actually more than that. You can write code that runs correctly, but there can be a bug in your code and you get the wrong answer. And so this is why we're extending the notion of data quality to allow you to declaratively talk about your quality constraints. This is work coming over the next quarter or so, but the idea here is that we allow you, in a single place, to specify the layout and constraints of your Delta Lake. First, we can specify some important things like where the data is stored. You can optionally turn on strict schema checking. Delta Lake has two different modes here, and I often see people use both of them as they move through their data quality journey. In the earlier tables, you'll let the schema be inferred, where maybe you just read in a bunch of JSON and put it exactly as it is into Delta Lake. We have nice tools here where we will automatically perform safe schema migrations. So if you're writing data into Delta Lake, you can flip on the merge schema flag, and it will just automatically add new columns that appear in the data to the table, so that you can just capture everything without spending a bunch of time writing DDL. We, of course, also support kinda standard strict schema checking where you say, create table with the schema, reject any data that doesn't match that schema, and you can use alter table to change the schema of a table. And often I see this used kind of down the road in kind of the gold level tables where you really want strict enforcement of what's going in there. And then finally, you can register tables in the Hive Metastore. That support is coming soon, and also put human readable descriptions, so people coming to this table can see things, like this data comes from this source and it's parsed in this way, and it's owned by this team. These kind of extra human information that you can use to understand what data will get you the answers you want. And then finally, the feature that I'm most excited about is this notion of expectations. An expectation allows you to take your notion of data quality and actually encode it into the system. So you can say things like, for example, here, I said, I expect that this table is going to have a valid timestamp. And I can say what it means to be a valid timestamp for me and for my organization. So, I expect that the timestamp is there and I expect that it happened after 2012 because my organization started in 2012, and so if you see data from, say, 1970 due to a date parsing error, we know that's incorrect and we want to reject it. So this is very familiar to those of you who know traditional databases. This sounds a lot like an invariant, where you could say not null or other things on a table, but there's kind of a subtle difference here. So the idea of invariants is that you can say things about tables, and if one of those invariants is violated, the transaction will be aborted, will automatically fail. And I think the problem with big data, why invariants alone are not enough, is if you stop processing every single time you see something unexpected, especially in those earlier bronze tables, you're never going to process anything. And that can really hurt your agility. And so the cool thing about expectations is we actually have a notion of tuneable severity. So we do support this kind of fail stop, which you might want to use on a table that your finance department is consuming 'cause you don't want them to ever see anything that is incorrect. But we also have these kinds of weaker things where you can just monitor how many records are valid and how many are failing to parse and alert at some threshold. Or even more powerful, we have this notion of data quarantining where you can say any record that doesn't meet my expectations, don't fail the pipeline, but also don't let it go through. Just quarantine it over here in another table, so I can come and look at it later and decide what I need to do to kind of remediate that situation.
So this allows you to continue processing, but without kind of corrupting downstream results with this invalid record. So like I said, this is a feature that we’re actively working on now. Stay tuned to GitHub for more work on it. But I think this kind of fundamentally changes the way that you think about data quality with Apache Spark and with your data lake.
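
The expectations feature described above was still being designed at the time of this talk, so no code is shown for it here, but the two schema modes are available. Below is a minimal hedged sketch with an illustrative table path: strict schema checking is the default behavior (a mismatched write is rejected), and the mergeSchema option opts a particular write into automatic, safe schema evolution.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-modes").getOrCreate()

    base = spark.createDataFrame([(1, "click")], ["user_id", "event_type"])
    base.write.format("delta").mode("append").save("/tmp/delta/events")

    # A new column appears upstream. By default Delta rejects this write because
    # the incoming schema does not match the table (strict schema checking).
    evolved = spark.createDataFrame([(2, "view", "mobile")],
                                    ["user_id", "event_type", "device"])

    # Opting in to schema evolution: the new column is added to the table schema.
    (evolved.write.format("delta")
            .mode("append")
            .option("mergeSchema", "true")
            .save("/tmp/delta/events"))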

So now that I've covered, at a high level, what Delta is and why you should care, I want to get into the nitty-gritty details of how Delta actually works. Because it sounds almost too good to be true that we can bring these full ACID transactions to a distributed system like Apache Spark and still maintain good performance.

Delta on disk

First, let's start by looking at what a Delta table looks like when it's actually stored on disk. To start with, it looks like a data lake, and this should look pretty familiar. It's just a directory stored in your file system, on S3, HDFS, Azure Blob Storage, or ADLS. It's just a directory with a bunch of parquet files in it. And there's one extra, very important piece, which is that we also store this transaction log. Inside of the transaction log there are different table versions. I'll talk about those table versions in a moment, but note that we also still store the data in partition directories. However, that's actually mostly there for debugging. There are also modes of Delta where we can work with the storage system directly in the most optimal way. For example, on S3 they recommend that if you're going to be writing a lot of data regularly, rather than creating date partitions, which create hot spots of temporal locality, you instead randomly hash partition, and because of the power of Delta's metadata, we can do that as well. And then finally, standard data files, which are just normal encoded parquet that can be read by any system out there.
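
As a rough sketch, the layout of a Delta table directory might look something like the following (directory and file names here are purely illustrative):

    /data/events/                          # the table is just a directory
      _delta_log/                          # the transaction log
        00000000000000000000.json          # table version 0
        00000000000000000001.json          # table version 1
        ...
      date=2019-07-01/                     # optional partition directories
        part-00000-....snappy.parquet      # ordinary parquet data files
        part-00001-....snappy.parquet
      date=2019-07-02/
        part-00000-....snappy.parquet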

Table = result of a set of actions

So what's actually in those table versions? How do we think about what the current state of a table is? Each table version has a set of actions that apply to the table and change it in some way, and the current state of the table, at this moment, is the sum of all of those actions. So what kind of actions am I talking about? For one example, we can change the metadata. We can say, this is the name of the table, this is the schema of the table, you can add a column to the table or something, you can set the partitioning of the table. So one action you can take is change the metadata. The other actions are add a file and remove a file. So we write out a parquet file, and then to actually make it visible in the table, it needs to also be added to the transaction log. And I'll talk about why that kind of extra level of indirection is a really powerful trick in a moment. And another kind of detail here is when we add files into Delta, we can keep a lot of optional statistics about them. So in some versions we can actually keep the min and max value for every column, which we can use to do data skipping or quickly compute aggregate values over the table. And then finally you can also remove data from the table by removing the file. And again, this is kind of a lazy operation. This level of indirection is really powerful. When we remove a file from the table, we don't necessarily delete that data immediately, allowing us to do other cool things like time travel. And so the result here of taking all these things is you end up with the current metadata, a list of files, and then also some details, like a list of transactions that have committed, the protocol version for that.
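
A simplified, hedged sketch of what those actions look like inside the commit files follows. The field values are made up, many fields are omitted, and the lines are wrapped for readability; the authoritative definition is the Delta Lake transaction protocol specification.

    # 00000000000000000000.json -- each line in a commit file is one JSON action
    {"metaData": {"id": "...", "schemaString": "...", "partitionColumns": ["date"]}}
    {"add": {"path": "date=2019-07-01/part-00000.snappy.parquet", "size": 12345,
             "partitionValues": {"date": "2019-07-01"}, "dataChange": true,
             "stats": "{\"numRecords\": 1000, ...}"}}

    # 00000000000000000001.json -- a later commit that compacts the table
    {"remove": {"path": "date=2019-07-01/part-00000.snappy.parquet", "dataChange": false}}
    {"add": {"path": "date=2019-07-01/part-00002.snappy.parquet", "size": 23456,
             "partitionValues": {"date": "2019-07-01"}, "dataChange": false}}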

Achieving atomicity

So how does this let us get ACID? How do we actually get these nice transactional properties of databases? One detail here is that when we create these table versions, we store them as ordered, atomic units called commits. I talked about this before: we create version zero of the table by creating the file 0.json. The idea here is that when Delta constructs that file on the file system, we use the underlying atomic primitives. On S3, in order to guarantee atomicity, all you have to do is upload to the system. The way they do that is you start an upload saying, I expect to upload this many bytes, and unless you actually successfully upload that many bytes, S3 won't accept the write. So you're guaranteed to get either the whole file or none of the file. On other systems like Azure or HDFS, what we'll do is we'll create a temporary file with the whole contents and then we'll do an atomic rename, so that the entire file is created or not. So then you can kind of have successive versions. So version one, we added these two files or sorry, in version zero, we added these two files. In version one, we removed them and put in a third. So for example, you could be doing compaction here where you atomically take those two files and compact them into one larger file.

Ensuring serializability

Now, another important detail is that we want atomicity for each of these commits, but we also want serializability. We want everyone to agree on the order of changes to this table, so that we can correctly do things like merge in change data capture and other things that need this property. And in order to agree on these changes even when there are multiple writers, we need this property called mutual exclusion. If two people try to create the same version of a Delta table, only one of them can succeed. To make that clear, user one can write version zero of the table, and user two can write version one, but if they both try to write version two, one of them can succeed, and the other one has to get an error message saying, sorry, your transaction didn't go through.

Solving conflicts optimistically

Now you might say, wait a second, if it fails any time two people do something at once, that sounds like I'm going to waste a lot of time and a lot of work. That sounds like a lot of complexity for me. Fortunately, this is where we use a third cool trick called optimistic concurrency. The idea of optimistic concurrency is that when you perform an operation on the table, you just optimistically assume that it's going to work. And if you do get a conflict, you check whether that conflict matters to you. If it doesn't, you're allowed to optimistically try again. And in most cases, it turns out the actual transactions don't overlap, and you can automatically remediate this. So to give you a concrete example, let's say there are two users, and these users are both streaming into the same table. When they both begin their streaming write, they start by reading the version of the table at that moment. They both read in version zero. They read in the schema of the table. So they make sure that the data that they're appending has the correct format. And then they write some data files out for the contents of the stream that are gonna be recorded in this batch. And they record what was read and what was written from the table. Now they both try to commit, and in this case, user one wins the race and user two loses. But what user two will do is they'll check to see if anything has changed. And because the only thing they read about the schema, of the table with the schema and the schema has not changed, they're allowed to automatically try again. And this is all kind of hidden from you as the developer. This all happens automatically under the covers. So they'll both try to commit, and they'll both succeed.

Handling massive metadata

Now, the last trick we have here is that tables can have massive amounts of metadata. And those of you who have tried to put millions of partitions into the Hive Metastore are probably familiar with this problem. Once you get to those data sizes, the metadata itself can actually be the thing that brings the system down. So we have a trick for this: we actually already have a distributed processing system capable of handling massive amounts of data, so we'll just use Spark. We take the transaction log, with its actions, and we read it with Spark. We can encode it as parquet in a checkpoint. A checkpoint is basically the entire state of the table at some version. So when you're reading the transaction log, rather than reading the entire transaction log, you can start from a checkpoint and then only apply any subsequent changes that happened after it. And then this itself can be processed with Spark. So when you come to a massive table that has millions of files, and you ask the question like, "How many records were added yesterday?" What we'll do is we'll run two different Spark jobs. The first one queries the metadata and says, "Which files are relevant to yesterday?" And it'll get back that list of files, and then you'll run another Spark job that actually processes them and does the count. And by doing this in two phases, we can drastically reduce the amount of data that needs to be processed. We'll only look at the files that are relevant to the query, and we'll use Spark to do that filtering.
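
From the user's point of view none of that machinery is visible. A query like the hedged sketch below (assuming an illustrative path and a table that has a date column) simply benefits from it: the metadata pass narrows the file list using partition values and statistics before any data files are read.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata-skipping").getOrCreate()

    # "How many records were added yesterday?" -- Delta first uses the
    # (checkpointed) transaction log, processed with Spark, to find only the
    # files whose partition values / column statistics can match the predicate,
    # and only then runs the job that actually counts the matching records.
    count = (spark.read.format("delta").load("/data/events")   # illustrative path
                  .where("date = '2019-07-01'")
                  .count())
    print(count)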

Roadmap

Before we wrap up and go to questions, I want to talk about the roadmap. Like I said before, this project has been around for several years and was only recently open sourced. We have a really exciting roadmap for the rest of the year. Basically, our goal is for the open source Delta Lake project to be fully API compatible with what's available in Databricks, so our roadmap for the rest of the quarter is basically to open source a lot of the cool features. A couple of weeks ago we actually released version 0.2.0, which added support for reading from S3 and reading from Azure Blob Storage and Azure Data Lake. This month, we're planning to do a 0.3.0 release. That will add Scala APIs for UPDATE, DELETE, MERGE, and VACUUM, with Python APIs following shortly after. Then for the rest of this quarter, we have a couple of things planned. We want to add full DDL support, that is CREATE TABLE and ALTER TABLE. We also want to give you the ability to store Delta tables in the Hive Metastore, which I think is very important for data discovery in different organizations. And we want to take those DML commands from before, UPDATE, DELETE, and MERGE, and actually hook them into the Spark SQL parser, so you can use standard SQL to do those operations as well. And then moving forward kind of, let us know what you want. So if you're interested in doing more, I recommend you to check out our website at delta.io, and it has kind of a high level overview of the project. There's a quick start guide on how you can get started, and it also has links to GitHub where you can watch the progress and see what our roadmap is, and submit your own issues on where you think the project should be going. So I definitely encourage you to do that, but with that, I think we'll move over to questions. So let me just pull those up and see what we got.

Okay. The first question is, will the materials and the recording be available afterwards? For that one, I'm hoping Denny can let us know. Denny, are you here? – (Denny) Not a problem at all. Yes, just as a quick housekeeping note, for everyone who signed up for the webinar, we will also be sending out the slides and the recording. It takes about 12 to 24 hours for the process to complete, so you should receive that email later today or early tomorrow.

– (Michael) Awesome, thank you very much. Yes, it should be available, so you can watch this again. The video is also up on YouTube. So please stay tuned for more Delta Lake content. Moving on to the other questions. The first one is, does Delta Lake add performance overhead? That's a really interesting question, and I want to break it down. First of all, Delta Lake is designed to be a high-throughput system. So for each operation, there's a little bit of overhead in performing it, basically because rather than just writing out the files, we need to write out the files and also write out the transaction log. So that adds a couple of seconds to your Spark job. Now, the important thing here is we designed Delta to be massively parallel and very high throughput. So you get a couple of seconds added to your Spark job, but that is mostly independent of the size of your Spark job. So what Delta Lake is really, really good at is ingesting trillions of records of data or petabytes of data or gigabytes of data. What Delta is not good at is inserting individual records. If you run one Spark job, one record per Spark job, there'll be a lot of overhead. So kind of the trick here is you want to use Delta in the places where Spark makes the most sense, which are relatively large jobs spread out across lots of machines. And in those cases, the overhead is negligible.

The next question is, given its ACID properties, will my system be highly available? That's actually a good question that I want to unpack a little bit. So Delta was designed specifically to take advantage of the cloud and to leverage its nice properties. To me, there are a couple of nice properties of the cloud. One is that the cloud is very stable: you can put massive amounts of data into S3 and process it at will. It's usually quite highly available, so you can always read data from S3, no matter where you are. And if you really, really care, there are even things like replication, where you can replicate data across multiple regions, and Delta plays very nicely with that. So reads from a Delta table should be very highly available, because it's just the availability of the underlying storage system. Now, those of you who are familiar with the CAP theorem might be saying, "But wait a second." So for writes, when we think about consistency, availability, and partition tolerance, Delta chooses consistency. So we will, if you cannot talk to kind of the central coordinator, depending on whether you're on S3, that might be kind of your own service that you're running on Azure. They've taken kind of the consistency approach (indistinct) we use an atomic operation there. The system will pause. But the nice thing here is because of that kind of optimistic concurrency mechanism, that doesn't necessarily mean you lose that whole job that you might've been running for hours. It just means you'll have to wait until you're able to talk to that service. So I would say in terms of reads, very highly available, in terms of writes, we choose consistency, but in general, that actually still works out pretty well.

Next up: do you retain data at all of the levels? Well, I think I want to clarify the idea behind bronze, silver, and gold. Not everybody keeps the raw data around. Not everybody keeps all of the data. You might have a retention requirement that says you're only allowed to keep two years of data. So I think it's up to you to decide which data makes sense to hold on to. The only thing I'd say is that I think the advantage of data lakes, and of how Delta applies to them in general, is that you have the ability to keep the raw data and as much of it as you want. So there's no technical limitation that keeps you from keeping all of your data, and as a result, many of the organizations I work with actually do keep everything they are legally allowed to keep for a very long time, and only delete it when they have to get rid of it.

The next question is, how do you write the logic? Can we write the logic in Scala? So Delta Lake plugs into all of the existing APIs of Apache Spark, and that means you can use any of them. If you're a Scala programmer, you can use Scala. If you're a Java programmer, that works too. We also have bindings in Python, and if you're an analyst and you don't want to program at all, we also support pure SQL. So the idea here is really that the underlying engine is written in Scala, and Delta is also written in Scala, but your logic can be written in whatever language you're most comfortable with. This is another one of those cases where I think you want the right tool for the right job. Personally, I do a lot of my stuff in Scala, but when I need to make graphs, I switch over to Python. But Delta still gives me the ability to filter through massive amounts of data, bring it down to something that will fit in pandas, and then I do some plotting.

The next question is, is Presto a part of Delta Lake, or only Spark? That's a good question, and this is evolving quickly right now, so there are two different answers. I'll tell you where we are and where we're going. Right now, there's a feature inside of Databricks, which we are committed to open sourcing, that allows you, as the Delta writer, to write out these things called manifest files, which allow you to query a Delta table in a consistent way from Presto or Athena or any other Presto-based system. However, we are working with Starburst, the company behind Presto, on building a native connector soon. We also have active interest from the Hive community and the Scalding community, so there's a lot of interest in building connectors. So today, the core of Delta is built on Spark, but I think the really powerful thing is that it's open source and open standards, which means anybody can integrate with it. And as a project, we're committed to growing that ecosystem and working with anybody. So if you're a committer on one of those projects, please join our mailing list, join our Slack channel, check it out, and let us know how we can help you build these additional connectors.

Next question: can we try Delta Lake with Databricks Community Edition? Yes, you can. Delta Lake is in Community Edition, so check it out. Everything should be there. Let us know what you think.

The next question is, can Delta tables be queried with Hive? Yes, so basically the same answer as for Presto. There's active interest in the community in building that support. It doesn't exist today, but it's definitely something we'd like to build. Next question: how does Delta Lake handle slowly changing dimensions, from raw to gold?

Yeah, great, that's a good question, and there's actually a blog post about this; if you google slowly changing dimensions and Delta, it will walk you through all of the details. But I think the real answer here is that with the MERGE operator, plus the power of Spark, it's actually pretty easy to build all of the different types of slowly changing dimensions. And the one magic thing that Delta adds to Spark that makes this possible is those transactions. Modifying a table would be very dangerous without transactions, and Delta makes that possible, and thus enables this type of use case.
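
As a hedged sketch of that pattern, here is a type-1-style overwrite using the DeltaTable merge builder (this Python API shipped in Delta Lake releases after this talk; the table path and column names are made up for illustration):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.appName("scd-merge").getOrCreate()

    # A new snapshot of the dimension arriving from an upstream system.
    updates = spark.createDataFrame(
        [(42, "Alice", "Berlin")], ["customer_id", "name", "city"])

    customers = DeltaTable.forPath(spark, "/data/customers")  # illustrative path

    (customers.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()      # overwrite the existing row (SCD type 1)
        .whenNotMatchedInsertAll()   # insert brand-new customers
        .execute())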

Next one: we mostly work with Azure. We'd like to know whether Delta Lake behaves any differently when running on Azure Event Hubs instead of Kafka. Yeah, I'll answer that question a little more generally. I talked about one of the powerful things about Delta being integrated with Spark, and a big reason for that is that I use Spark as the narrow waist of the big data ecosystem. There are Spark connectors for almost every big data system in the world. So if Spark can read it, it works with Delta Lake. Event Hubs, in particular, has both a native connector that plugs in through the Spark data source API and also a Kafka-compatible API that works with Spark's Kafka support. So you can easily read from Event Hubs and do all of the stuff I talked about today using Event Hubs instead of Kafka. And that applies to any system that Spark can read from.

And in general, just to answer the Azure part a little more: Delta is fully supported on Azure, including ADLS. We recently improved support for ADLS Gen2. So that's available for you to download, and it's also part of Azure Databricks.

Yeah, so the next question is, what exactly will the Scala API for the UPDATE DML command look like? Will it look like Spark SQL, where you pass in a string to do the update? And the answer is, we're going to support both. If you actually go to the GitHub repository, I believe that code has already been merged, so you can see the Scala API; and if not, there's a design doc on the ticket for adding UPDATE that talks through the details. But the idea here is that there will both be a Scala function called update that you can use programmatically, without having to do string interpolation, and there will also be a SQL way to do it. So you can construct a SQL string and pass that in. So again, you use whichever language is most natural for the toolkit you already work with, and Delta should work with that.

The next question is, does Delta Lake work with HDFS? Yes, it fully works with HDFS. HDFS has all of the primitives we need, so you don't need anything extra; like I was talking about, HDFS supports an atomic rename that fails if the destination already exists. As long as you're running a new enough version of HDFS, and that's not even that new, it should just work automatically. And if you look at the getting started guide in the documentation at delta.io, it lists all of the different storage systems that we support and the details of what you need to do to set them up.

Next question: are updates and deletes at the individual row or record level? There are two answers. Yes, Delta allows you to do fine-grained, individual-row updates. So you don't necessarily have to do your updates or deletes at the partition level. It does matter whether they are at the partition level, though: if you do a delete at the partition level, for example, it's much more efficient, because we can do it purely in the metadata; we don't have to do any manual modification. But if they're not at the partition level, if you do a fine-grained, single-row update or delete, what we'll do is find the relevant parquet files, rewrite them, commit the adds and removes so that the operation happens, and then it's transactional. So it does support it, but it does involve rewriting individual files. So what I'd say here is, Delta is definitely not designed to be an OLTP system. You should not use it if you have lots of individual row updates, but we do support that fine granularity use case.
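
A minimal hedged sketch of both cases with the DeltaTable API (this Python API arrived in Delta Lake releases after this talk; the path and predicates are made up for illustration):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.appName("delete-demo").getOrCreate()

    events = DeltaTable.forPath(spark, "/data/events")   # illustrative path

    # Partition-level delete: cheap, because whole files can be dropped via the
    # transaction log without rewriting data.
    events.delete("date < '2018-01-01'")

    # Row-level delete: supported, but the files containing matching rows are
    # rewritten and the add/remove actions are committed as one transaction.
    events.delete("user_id = 'user-123'")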

Do you know when the Scala APIs for Delta Lake will be available? Well, there are a couple of answers. Reading from and writing to Delta Lake, with both streaming and batch, already works in Scala. That's available today. If you're specifically talking about UPDATE, DELETE, and MERGE, I believe most of that code has already been put into the repository. So if you download it and build it yourself, it's there. We're hoping for a July release. Hopefully this month there will be the next release, which will include those extra Scala APIs.

Let's see.

Yeah, so the next question is about data quality. Can we have any other fields for validation purposes, in addition to the timestamp? Yes, so the expectations we talked about before are just general SQL expressions. Any expectation that you can encode in SQL is allowed. It could be, in that example, a very simple comparison operation against some specific date, but it could be anything you want. It could even be a UDF that checks the quality of the data. So the important thing here is that we just allow you to put those in as properties of your data flow, rather than manual validations that you have to remember to do yourself. That way they're enforced globally for anyone who uses the system.

Does Delta Lake support merging with a DataFrame instead of a temp table? Yes, so once the Scala and Python APIs are available, you'll be able to pass in a DataFrame. Today inside Databricks, where only the SQL DML is available, you do need to register a temp table in that case. But like I said, stay tuned for the end of the month. We'll have a release with the Scala APIs, and then you'll be able to pass in the DataFrame yourself.

I've seen this question a couple of times, so I'll just answer it once. We support both ADLS Gen1 and Gen2, although Gen2 is going to be faster, because we have some extra optimizations there.

The next one is, in the checkpointing example, does the Spark job compute the Delta Lake checkpoint internally, or does it need to be handwritten? That's a good question. When you're using streaming to read from or write to a Delta table, or both, if you're just using it between two different Delta tables, the checkpointing is done by Structured Streaming. So you don't have to do any extra work to construct the checkpoint; it's built into the engine. The way Structured Streaming works in Spark is that for every source and sink there's a contract that allows us to do that automatic checkpointing. The source needs to be able to say, I'm processing the data from here to here, and that notion of a place in the stream, which we call an offset, needs to be serializable. We store that in the checkpoint. We basically use the checkpoint as a write-ahead log. So we say, batch number 10 is going to be this data. Then we attempt to process batch number 10, then we write it to the sink, and the guarantee here is the sink must be idempotent. So it must only accept batch number 10 once, and if we try to write it twice due to a failure, it must reject that and kind of just skip over it. And by putting all of these kind of constraints together, you actually get exactly once processing with automatic checkpointing without you needing to do any extra work.
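
Concretely, the only thing you supply is where to store that write-ahead checkpoint; everything else described above happens inside the engine. A hedged sketch with illustrative paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

    query = (spark.readStream.format("delta").load("/data/bronze")
             .writeStream
             .format("delta")
             # Structured Streaming records the offsets (Delta table versions) it
             # has processed here; on restart the query resumes exactly where it
             # left off, and the idempotent Delta sink prevents duplicate batches.
             .option("checkpointLocation", "/data/_checkpoints/silver")
             .start("/data/silver"))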

Great question. Why not use polyglot persistence and use an RDBMS to store the ACID transactions? That's a great question, and we actually tried that. In fact, one of the early versions of Delta used MySQL, and the problem there is that MySQL is a single machine, so just getting the list of files out for a large table can become the bottleneck, whereas when you store the metadata in a form that Spark itself can natively process, you can leverage Spark to do that processing. So there's nothing stopping you from implementing the Delta transaction protocol on top of another storage system. In fact, there's a pretty long conversation on the GitHub repository right now, a back and forth about what it would take to build a FoundationDB version of Delta, and that's certainly possible, but in our initial scalability tests we found Spark was the fastest way to do it, at least of the systems we tested, and that's why we decided to do it this way.

Another question: does this mean we don't need DataFrames and can do all transformations in Delta Lake instead? I would say no. I think you could use only UPDATE, DELETE, and MERGE without using any actual DataFrame code. You can use pure SQL, but really, I think it's about the right tool for the right job. Delta Lake is deeply integrated with Spark DataFrames, and personally, I find that to be a very powerful tool for doing transformations. It's kind of like SQL++, because you have all of these relational concepts, but embedded in a full programming language. I actually think that can be a very productive way to write your data pipelines.

How does Delta Lake handle newer versions of Spark? Yes, so Delta Lake requires Spark 2.4.3, which is a very recent version, because there were bugs in earlier versions of Spark that prevented data sources from plugging in correctly. In general, we're working on Spark compatibility. One of our core projects this quarter is to make sure that everything in Delta plugs in nicely through public, stable APIs of Spark, so that we can work with multiple versions going forward.

One question: does Delta Lake support ORC? Yes, that's a good question that I get a lot. Again, there is a discussion on GitHub about adding that support. I'd encourage you to go check it out and vote on that issue if it's something that's important to you. There are two answers to this question. One is that the Delta Lake transaction protocol, the stuff in the transaction log, actually does support specifying the format of the data that's stored. So it could actually be used with any file format: txt, JSON, CSV. That's built into the protocol already. Today, we don't expose that as an option. When you're creating a Delta table, we only do parquet. And the reason for that is pretty simple. I just think less tuning knobs is generally better, but for something like ORC, if there's a good reason why your organization can't switch, I think that support would be really, really easy to add and that's something that we're discussing in the community. So please go over to GitHub, find that issue, and fill it in. And then I'm going to take one final question since we're getting close to time. And the question here is, what is the difference between the Delta Lake that's included with Databricks versus the open source version? And that's a question I get a lot. And I think, the way to think about this is I'd like to kind of talk about what my philosophy is behind open source. And that is that I think APIs in general need to be open. So any program you can run correctly inside of Databricks should also work in open source. Now that's not entirely true today because Delta Lake is only, the open source version of Delta Lake is only two months old. And so what we're doing is we are working hard to open source all of the different APIs that exist. So update, delete, merge, history, all of those kinds of things that you can do inside of Databricks will also be available in the open source version. Managed Delta Lake is the version that we provide. It's gonna be easier to set up. It's gonna integrate with all of the other pieces of Databricks. So we do caching, we have a kind of significantly faster version of Spark, and so that runs much faster, but in terms of capabilities, our goal is for there to be kind of complete feature parity here 'cause we're kinda committed to making this open source project successful. I think open APIs is the correct way to do that. So with that, I think we'll end it. Thank you very much for joining me today. And please check out the website, join the mailing list…

Advanced: Diving into Delta Lake

Dive into the internals of Delta Lake, a popular open source technology enabling ACID transactions, time travel, schema enforcement and more on top of your data lakes.

Watch now