Deploy and Serve Model from Azure Databricks onto Azure Machine Learning

Download Slides

We demonstrate how to use Azure Machine Learning (AML) to deploy a PySpark-based multi-class classification model trained on Azure Databricks onto Azure Kubernetes Service (AKS), and to serve the model as a web service. The demo covers the end-to-end development cycle, from training the model to consuming it in a web application.

Current solutions for detecting the semantic types of tabular data rely mainly on dictionary/vocabulary, regular-expression, and rule-based lookups to identify semantic types. However, these solutions are (1) not robust to dirty and complex data, and (2) unable to generalize to diverse data types. We turn this into a machine learning problem by training a multi-class classifier that automatically predicts the semantic type of tabular data. We chose Azure Databricks to perform the featurization and model training using PySpark SQL and its machine learning libraries. To speed up featurization, we register the featurization functions as PySpark user-defined functions (UDFs) and distribute the computation across them. For model training, we pick Random Forests as the classification algorithm and optimize the model hyperparameters using PySpark MLlib.

Model Deployment using Azure Machine Learning: Azure Machine Learning provides reusable and scalable capabilities to manage the lifecycle of machine learning models. We developed the end-to-end deployment pipeline on Azure Machine Learning, including model preparation, compute initialization, model registration, and web service deployment.

Serving as a Web Service on Azure Kubernetes Service: Azure Kubernetes Service provides fast response and autoscaling capabilities for serving a model as a web service, together with security and authorization. We customized the AKS cluster with a PySpark runtime to support PySpark-based featurization and model scoring. Our model and scoring service are deployed onto the AKS cluster and served as HTTPS endpoints with both key-based and token-based authentication.

Watch more Spark + AI sessions here

Try Databricks for free

Video Transcript

– Hi everyone, welcome to Deploy and Serve Model from Azure Databricks onto Azure Machine Learning.

Deploy and Serve Model from Azure Databricks onto Azure Machine Learning

I am Reema Kuvadia, a software engineer on the AI Platform team at Microsoft. – Hi everyone, I am Tao Li, a senior applied scientist at Microsoft. – We have divided our talk into three modules: model training and experimentation, model deployment, and model consumption in an Azure website deployment. The first two modules will be covered by Tao, and the last one will be covered by me.

Before we get into the main agenda, I just want to give a quick overview of all the Azure resources we are using and why. First, Azure Databricks: in Azure Databricks we use a Jupyter-style notebook to run the PySpark code, with a cluster attached to do the processing required to train the model. Second, Azure Blob Storage: once the model is trained, we store it in Azure Blob Storage; you could even use Azure Data Lake. Third, Azure Machine Learning: in Azure Machine Learning we prepare the model for deployment. Fourth, Azure Kubernetes Service: in Azure Kubernetes Service we create the endpoint for the model, which will be consumed by the Azure web app. And finally, Azure Web Service: here we deploy a web application in which we consume the Kubernetes endpoint. All of the above resources can be deployed with just one click using an ARM template. Let me give you a quick demo of that. This demo shows how to deploy Azure resources using an ARM template. I have written a simple PowerShell script to deploy all the Azure resources we need for this demo, like the Databricks workspace, storage account, machine learning workspace, etc. I have added my ARM templates to the template folder, and I got these templates from the GitHub repository called Azure Quickstart Templates.

To further customize the script, I added a parameters.json file, where I can add the names I want, the location where I want to deploy the Azure resources, and manage the access policies for each resource's security.

To deploy the Azure resources, simply call the templates in the PowerShell script. For example, to deploy the Databricks workspace, I call the Databricks template and pass the required parameters. As you see here, it just needs the location, resource group, and so on, which are all passed in through the Databricks parameters. Now all that is left is to run the script. To run this script, navigate to the file location in PowerShell and then run the script. This will start deploying the Azure resources, in the resource group it is going to create, in the desired subscription and location. For this demo, I have already run the script earlier, so the resources are already deployed in my resource group.
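The demo above drives the deployment from a PowerShell script. As an alternative, hedged sketch of the same idea, an ARM template can also be deployed from Python with the Azure SDK; the subscription ID, resource group, template path, and parameter names below are placeholders, not the exact values used in the demo.

```python
import json

from azure.identity import AzureCliCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate with the Azure CLI credential and target a subscription (placeholder ID).
credential = AzureCliCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# Load one of the Quickstart-style ARM templates (hypothetical path).
with open("templates/databricks-workspace.json") as f:
    template = json.load(f)

# Deploy the template into an existing resource group; parameter names are illustrative.
poller = client.deployments.begin_create_or_update(
    "my-demo-rg",
    "databricks-deployment",
    {
        "properties": {
            "template": template,
            "parameters": {"workspaceName": {"value": "my-databricks-ws"}},
            "mode": "Incremental",
        }
    },
)
poller.result()  # block until the deployment completes
```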

To check whether the script ran successfully and whether all the Azure resources were created, you can inspect them here; but if there is a failure, you can go to the Deployments tab and click here, and it will show you the resource that failed and the error. Here it says that this Key Vault name is already in use. So what I did was change the Key Vault name and re-run the script, which deployed the Key Vault into this resource group. It is that simple. So that's it from me for now. Let me hand it over to Tao to talk more about model training and experimentation.

Problem Introduction

– Okay, now let me walk through the problem this project tries to solve. We want to solve a machine learning problem that identifies the semantic type of a given piece of data, namely column data. This problem is critical in the data enrichment space, for solving problems like data cleaning, normalization, and data matching. Right now, most solutions rely on dictionaries or other lookups. As a result, all of these solutions are not robust to dirty and complex data; more importantly, they cannot generalize to our data. So in our project, we want to formulate a machine learning solution to make sure the model can learn from the data and make the right prediction. Here are a few examples. The first example is a name, so we want the machine learning model to capture all the information, extract the best semantic meaning from the data, and predict, based on the values, that it is a name. Similarly, for the second and third columns, we want to make the right prediction: location for the column of location values, and date for the last column. Okay, now let's talk about the model end-to-end flow. In the model flow, as I mentioned before, we treat this as a machine learning problem. We start with data gathering, and then we do the model training, which happens on Databricks. In this stage, we do the feature engineering and model training using PySpark machine learning. Then, once we have the model, we move to another stage: we publish the model into Azure Blob Storage, where it can be consumed at a later stage. Once this is done, we move to the next stage, which is model deployment. First come the definitions: we define the model environment and its dependencies. Another important piece is the scoring script, which is used as the entry function for the model to make predictions. Once all of this is in place, we go to the final stage of model deployment, which is to first register the model into Azure Machine Learning, then create a model image, and then deploy it onto Azure Kubernetes Service as a web service. Once all this is done, we move to the serving and consumption stage: the model is served as a web service on Azure, and the application can consume the model using the REST API endpoint; this becomes the application stage.

Model Architecture and Training

Okay, now let me talk about some details of the model architecture and training. In this stage, we use Random Forests for multi-class classification. This is our architecture. We start from data gathering, using Excel data, public web tables, and some research tables from papers. We also leverage some customer data. All of this data is processed into a kind of tabular data, with a header and a number of example values.

Building and Training a Random Forest Multi-Class Classification Model

We do feature engineering to generate features for the header and the values. In addition, we also generate labels from the headers, to derive a label for the semantic category. Okay, once all of this is done, for each column we have one feature vector plus one label, which is the semantic label. So we do the model training on Azure, and the optimized model can then be consumed in production with Azure Machine Learning. In this talk, I will only cover a few details. In the featurization stage, we leverage Spark data frames so the embeddings are easy to look up. Of course, we also leverage Spark SQL, to make sure the featurization code can run as UDF functions for faster computation. Okay, now let's talk about the other component, which is model training. For modeling, we leverage Spark MLlib for the model experimentation; in our case, we use a Random Forest as our model. What's more, we also leverage MLflow for the model logging and selection. Now, let's move on to the second demo, which is training the model on Azure Databricks. In this session, I will demo the model training on Azure Databricks. Azure Databricks provides us an easy-to-use interface to manage resources on Azure. It also provides cluster management, which enables us to easily create, configure, and manage clusters. For example, I can specify the Databricks runtime version when creating clusters. It also allows us to easily add new packages or libraries that are needed for the experimentation. What's more, it provides an easy-to-use notebook for the model training and experimentation. So in this (mumbles), we need to specify all the libraries that we need for the featurization and model training. Secondly, we need to link to our Azure Blob Storage using our storage account and key, together with all the configuration parameters we need in this experiment. Now, we can load all the pre-trained embeddings into memory, which is very efficient to use in the featurization part. After that, we need to define all the UDF functions for the featurization. These are all the functions related to featurizing the tabular data under study. We then use lambda expressions to wrap the featurization functions, together with the return type, into each UDF. All the UDF functions are registered into the system to make sure the featurization can run in parallel, for even more efficiency. Then here is the featurization: basically we just call the UDF functions by using the withColumn function, calling each UDF to compute the features. Once the featurization is done, we go to the model training part, which trains a Random Forest model using the pipeline.fit function. Once the model has been trained, we save the model to local storage by using the model.write().overwrite().save() function. Okay, once the model is trained and saved, we continue with the model evaluation and measurement by generating the precision numbers per label, and the weighted precision and recall numbers overall. After all this is done, we publish the model into Azure Blob Storage for the downstream pipeline to consume and deploy. What's more, Databricks also provides functions we can use, like MLeap, which allows us to package the model into a JAR package. In the model training and selection part, MLflow is also used to do hyperparameter sweeping and logging. Yeah, so thank you.
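As a rough illustration of the featurization and training flow just described, here is a minimal PySpark sketch; the input path, feature UDFs, and column names are made up for illustration and are not taken from the actual notebook.

```python
# Minimal sketch of UDF-based featurization plus Random Forest training (hypothetical columns/paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()
# One row per column of tabular data: a header, a sample value, and its semantic type label.
df = spark.read.parquet("/mnt/data/columns")

# Register simple featurization functions as UDFs so they run in parallel on the cluster.
digit_ratio_udf = udf(lambda v: sum(c.isdigit() for c in v) / max(len(v), 1), DoubleType())
value_len_udf = udf(lambda v: float(len(v)), DoubleType())

featurized = (df
    .withColumn("digit_ratio", digit_ratio_udf(col("sample_value")))
    .withColumn("value_length", value_len_udf(col("sample_value"))))

# Assemble the feature vector, index the semantic-type label, and train a Random Forest.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["digit_ratio", "value_length"], outputCol="features"),
    StringIndexer(inputCol="semantic_type", outputCol="label"),
    RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100),
])
model = pipeline.fit(featurized)

# Persist the trained pipeline so it can be published to Azure Blob Storage afterwards.
model.write().overwrite().save("/dbfs/models/semantic_type_rf")
```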

– Hi everyone, in this session I will continue with the model deployment.

Model Deployment

Now, let's look at what the model deployment is. In this model deployment part, as we mentioned, the model is trained on Azure Databricks and then published to Azure Blob Storage.

So, to recap the model training on Azure Databricks: the model is published and saved into Azure Blob Storage. Once this is done, the model is deployed as a service on Azure, running on Azure Kubernetes Service. There are a few prerequisites: the first is an Azure Machine Learning workspace, the other is an Azure Kubernetes Service cluster; plus we need to leverage the SDKs for Azure Machine Learning and Azure Storage. The first step before model deployment is model registration, that is, registering the model into the workspace, to make sure the model can be stored, tracked, and versioned. Okay, once the model is registered, we can go to the preparation stage: define a scoring script, which you have already seen referred to as score.py. This is used to load the model when the deployment service is deployed to Azure Kubernetes Service. Secondly, this function also needs to handle the data coming from the endpoint, feed it into the model through featurization and prediction, and then return the result as the response to the endpoint. Plus, we also need to define the AML environment, which includes the software dependencies and the library dependencies. Once this has been defined, we can go to the second stage, the model deployment. In this stage, there are two important parts. One is creating an image: basically just configure the entry script and environment, and then configure the runtime, which here is spark-py, to make sure the cluster, or the image, can run within the Spark runtime. Plus, it also provides the flexibility to make even more detailed configuration, like the CPU and memory settings. Once this image is defined and created, we can deploy the image as a web service onto the Azure Kubernetes cluster, and then get an endpoint. Now, once the model has been deployed and the endpoint is ready, we can easily consume the model by using either the SDK or the endpoint as a REST API service.
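To make the registration, image creation, and AKS deployment steps concrete, here is a hedged sketch using the classic (v1) azureml-core SDK; the workspace, model, cluster, and service names are assumptions, not the exact ones from the demo.

```python
# Hedged sketch of the deployment flow: register model -> build Spark image -> deploy to AKS.
from azureml.core import Workspace
from azureml.core.model import Model
from azureml.core.image import ContainerImage
from azureml.core.compute import AksCompute
from azureml.core.webservice import AksWebservice, Webservice

ws = Workspace.from_config()

# 1. Register the trained PySpark model so it is stored, tracked, and versioned.
model = Model.register(workspace=ws,
                       model_path="models/semantic_type_rf",
                       model_name="semantic-type-rf",
                       description="Random Forest semantic type classifier")

# 2. Configure the container image: entry script, Spark runtime, and conda dependencies.
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="spark-py",
                                                  conda_file="env.yml")

# 3. Deploy the image to an existing AKS cluster with CPU/memory settings and auth enabled.
aks_target = AksCompute(workspace=ws, name="aks-cluster")
aks_config = AksWebservice.deploy_configuration(cpu_cores=2, memory_gb=4, auth_enabled=True)

service = Webservice.deploy_from_model(workspace=ws,
                                       name="semantic-type-service",
                                       deployment_config=aks_config,
                                       deployment_target=aks_target,
                                       models=[model],
                                       image_config=image_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```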

Scoring File (score.py)

A very important file is called the score file, or score.py. It is the entry function used to receive data, do the featurization and prediction, and return the result from the endpoint back to the client. This file contains two important functions. The first one is called the init function. This function loads the model as a global parameter, and we can also use it to define some other parameters. Of course, this function runs only once, namely when you deploy the model into the Docker image or the cluster.

Here is an example.

If you look here, this function defines a Spark session and loads the machine learning model. Plus, the word embeddings are loaded into memory for lookup. So this makes all the models, global parameters, and in-memory resources available to load and fully use. Once this is done, another important function is called whenever we receive some data: this is called the run function. This function mainly receives data and makes predictions for our callers. The input and output of all the interfaces follow the JSON format, for serialization and also deserialization. Here is an example. This function just receives the data in JSON format, collects the data, does some data featurization, and makes a prediction. Once all the featurization is done, we select the result and send it back to the endpoint. Okay, now let's go into more detail about how to deploy the model. Okay, in this session, I will demo how to do the model deployment using Azure Machine Learning. It provides us an easy-to-use portal to access the notebooks, the compute, and also all the models and endpoints. Here, under Compute, you can access the compute instance, which is used to execute the scripts, and you can also use this to manage your clusters. In this case, we have one Kubernetes cluster used to deploy our machine learning models. You can also use this portal to manage, optimize, and monitor all your pipelines. What's important, you have the notebook, which allows you to do the model deployment. For this pipeline, we created one folder, and the only thing needed here is a single notebook, which has all the logic to do the model preparation and also the model deployment. So firstly, let's set up all the prerequisites by importing all the libraries from the Azure Machine Learning SDK, and let's also initialize a workspace to persist all the configuration and our models. Now let's go to the first step, which prepares the deployment. Here, we just need to connect to Azure Blob Storage and download the model and also the embedding file. Let's look at the model download, which just uses the datastore.download function. Similarly, we download the embedding file as well. Then, in the second step, we register these two artifacts into the Azure Machine Learning workspace. So let me take a look at the models. Here, if you look at the Models tab, you can access all the models that have been registered into the AML workspace.
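To tie the init/run structure described above together, here is a minimal, hypothetical sketch of such a scoring script; the model name, payload shape, and featurization are placeholders rather than the exact code shown in the demo.

```python
# Hypothetical score.py sketch: init() loads globals once, run() handles each request.
import json

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from azureml.core.model import Model


def init():
    # Runs once when the service starts: create the Spark session and load the
    # registered model as global objects (embeddings could be loaded here too).
    global spark, model
    spark = SparkSession.builder.appName("semantic-type-scoring").getOrCreate()
    model_path = Model.get_model_path("semantic-type-rf")  # assumed registered model name
    model = PipelineModel.load(model_path)


def run(raw_data):
    # Runs on every request: parse JSON, featurize, score, and return a JSON result.
    try:
        data = json.loads(raw_data)["data"]       # e.g. a list of rows with header/values
        df = spark.createDataFrame(data)          # featurization UDFs would be applied here
        predictions = model.transform(df)
        result = [row["prediction"] for row in predictions.collect()]
        return json.dumps({"result": result})
    except Exception as e:
        return json.dumps({"error": str(e)})
```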

Okay, once the model is registered, you can go to the next cell to deploy the model as a web service. Okay, now we get a ComputeTarget. Here, we select the aks_cluster we are going to use; the function is given here. Finally, let's specify some other Kubernetes configuration, such as how many CPUs to allocate, plus 4 GB of memory. It also gives you the option to use tags to configure the compute and the cluster. Once we finish the configuration, we go to the image creation part, using a container image. The image is then created with the execution script score.py, the runtime, spark-py, and the conda_file, which is env.yml. Now, before we do the deployment, let's first look at what the yml file is. The yml file is just a single file that contains all the dependencies, with the specified version of Python, and PySpark with the (mumbles), and then all the packages that need to be installed on the cluster. Okay, another important piece needed in order to do the deployment is the entry script, score.py. Let me take a look at this function. As I mentioned, it basically has two important functions. The init function basically just sets up all the global parameters. In this case, we define the Spark environment, the model, plus the word embedding, and also some other global parameters we want to use. So for Spark, we define our Spark session, to be used for each model prediction. And when we do the model loading, we use the Azure ML SDK to get the model path, and then load the model using Spark's PipelineModel.load. Similarly, we do the same thing for the embedding, but use a different function, pickle.load, to load the array of embeddings from a file into memory. Once this is done, we also have similar functions here used to do the featurization. And together, we also have another important function, which is the run function. This function just needs to receive the data from the endpoint, which is JSON data, and then do the featurization, which parses the data from the JSON payload. Now, once the featurization is done, we call createDataFrame to turn the data into a Spark data frame, and then continue with the model scoring and prediction. After the prediction, we may do some post-processing to adjust the predictions, for example to reject predictions with low probability. After all of that, we return a result, which is serialized into JSON format, and this result is returned by the web service. We can also handle some exceptions by using the exception handler here. – Thank you Tao for getting the model ready for us to consume. Now let us see how we can consume this model in our web application.

Model Consumption and Website Deployment

To consume the model, we first need to register it in the machine learning workspace. Here is the code snippet to register the model; we need the path, the name of the model, the description of the model, and the workspace. Once the registration is done, we can create the endpoint. We have some dependencies, which we declare in the environment config file. I have added a snippet of how this looks. In the next demo, I will walk you through how to register the model, initialize the workspace, and use the yml file.
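As a hedged illustration of how such an environment config file can be produced, here is a small sketch that generates an env.yml with the AML SDK's CondaDependencies helper; the package list is illustrative only.

```python
# Sketch: generate the env.yml dependency file programmatically (illustrative packages).
from azureml.core.conda_dependencies import CondaDependencies

conda_deps = CondaDependencies.create(
    conda_packages=["numpy", "pandas"],
    pip_packages=["azureml-defaults"],
)

# Write the dependencies out as the env.yml file referenced by the image configuration.
with open("env.yml", "w") as f:
    f.write(conda_deps.serialize_to_string())
```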

Once we deploy the web application, this is what our web app looks like.

Application Demo

To test our end-to-end model, if you give an input,

the model can detect what type the column is. As you can see in the example, the first column, name, it detects as a name, and the third column it detects as a company name. To complete this end-to-end flow, let me give a quick demo. In the Azure Machine Learning workspace, I have created a script called model-register-and-deploy. I first initialize the workspace. Then I register the model. This is the same code snippet you saw in the slide, where I need the model path, the model name, the description, and the workspace we initialized in the previous step. Before going to the next step, we need two things: the scoring script, and the environment config file, that is, the yml file, where we declare all the dependencies. Back in the script, the third step is to create the endpoint. To create this endpoint, we will need three things: the scoring script, which will be our execution script; the runtime, which will be PySpark; and the environment config yml file, which will be our conda_file. When you run this script, you will see that it is creating an image. Once this image is successfully created, you navigate to the Endpoints tab in the machine learning workspace, and if you click on any of these endpoints, you will be able to see all the details of that endpoint.

Here you will see the REST endpoint. Copy it and keep it ready for the next step.
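Before wiring the endpoint into the web app, a quick way to sanity-check it is to call it directly. Here is a hypothetical sketch in Python; the URL, key, and payload shape are placeholders and should be replaced with the values shown on the endpoint page.

```python
# Sketch: call the AKS scoring endpoint directly with key-based authentication (placeholders).
import json
import requests

scoring_uri = "https://<aks-endpoint>/api/v1/service/semantic-type-service/score"
api_key = "<primary-key-from-the-endpoint-page>"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",  # key-based auth; token-based works similarly
}
# Hypothetical payload: one column with its header and a few sample values.
payload = {"data": [{"header": "Full Name", "values": ["Reema Kuvadia", "Tao Li"]}]}

response = requests.post(scoring_uri, data=json.dumps(payload), headers=headers)
print(response.json())  # e.g. {"result": ["Name"]}
```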

I have created a simple .NET web application. In my web config file, I copied the URL, the REST endpoint URL, and pasted it here. Once your application is ready, all you have to do is go to your solution, right-click, and click Publish. Click Publish again; since we have already deployed an App Service Plan, you can use it, or you can create a new one by just clicking on Azure and the App Service window. Here you can select your subscription and resource group, and if it already has an App Service you can select it, or you can create a new App Service Plan. Since we already have one, I will go ahead and select one of the App Services. Then all you have to do is publish. Once you publish, it gives you a website URL. And when you click on that, it will open a website, which will look exactly like this.

To check whether the endpoint is configured correctly, I will show you how to verify it. Click start, manually enter the input, and type my name.

And also Tao's name. If you look here, it should show my name and the full name.

Let's take another example,

where I change the input from names to companies.

Let's see what we get. It is the company name. This shows that my end-to-end flow, from model creation to model consumption, is working accurately. That was my demo, thank you. In closing, I would like to say that we have achieved end-to-end model training and model consumption using Spark APIs and Microsoft resources, as well as third-party platforms like Databricks.

I have attached all the links for your reference. Hope you enjoyed our talk.

Watch more Spark + AI sessions here

Try Databricks for free
About Reema Kuvadia

Microsoft

Reema holds a Master's degree in Computer Science from George Washington University and has over 4 years of experience as a software engineer, with expertise in full-stack development and a passion for learning something new every day, from new languages to new technologies. She currently works on the AI Platform team at Microsoft, using Microsoft's proprietary technologies (Cosmos, Azure Machine Learning) and industry-leading open-source technologies (Spark, Databricks). She looks forward to growing her career as a data engineer.

About Tao Li

Microsoft

A full-stack data engineer and machine learning scientist with 8+ years of experience as an end-to-end applied scientist and data engineer across Microsoft's Bing Data Mining, Bing Predicts, and C+E Business AI areas. Technical areas range from data mining and machine learning to market intelligence, using Microsoft proprietary technologies (Cosmos, Azure Machine Learning) and industry-leading open-source technologies (Spark, Databricks).