Geospatial Analytics at Scale: Analyzing Human Movement Patterns During a Pandemic

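Do you want to generate meaningful insights from your geolocation data? Have you tried running those queries at petabyte scale? Join this talk to learn how to scale Esri's geospatial expertise with Databricks.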

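In light of the 2020 global coronavirus pandemic, we will look at how to analyze movement data and determine the impact of human movement during this period. In our talk we will demonstrate several key technical concepts: using geographic indexing for dimensionality reduction, leveraging Delta Lake for geospatial query performance, and using a human movement index to quantify the risk posed by human movement.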

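By the end of this session, you will have a better understanding of how to gain insight into human movement at scale, a repeatable pattern that is highly applicable across industries.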

Video Transcript

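- All right, welcome to Spark Summit. First of all, I'd like to welcome everyone to this session. A quick introduction: my name is Jim Young. I'm the business development lead and partner lead for Esri's commercial sector. I'm based in Portland, Oregon, and with me today is Joel McCune, who represents our GeoAI team, which builds solutions that combine geography and artificial intelligence. Today we're going to talk about how we combined the power of Databricks and Esri to build a COVID risk index based on human movement. So let's get started.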

Geography Matters

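We all know that location data is everywhere. Whether you're building a simple application that just finds coffee on a map or you want to do something more sophisticated, like keeping airplanes in the air, geography matters. More and more data scientists like you are starting to see the value of applying a geographic perspective to their analysis, and that's great, because geography really brings context and understanding to your data, whether that's by visualizing it on a map, by building location data to enrich your explanatory variables, or even just by bringing in contextual data to help you put your data in perspective. All of that is geography, and it's an additional lens.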

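So although you don't need a map to take advantage of geography, the map is a powerful metaphor for understanding, because our brains evolved to make sense of complex two- and three-dimensional data sets, both spatially and physically. Working with data this way is very natural. So whether you're trying to understand the impact of weather on sales, figure out where to put the next cell tower based on RF propagation, or even plan your logistics, maps help you understand and decide. As Joel likes to say, the map is the original infographic.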

Geography Matters (During a Pandemic)

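- So when we think about geography, particularly in the context of a pandemic, the crisis we're facing right now is COVID, and COVID doesn't spread unless there is person-to-person contact. Jim is in Portland right now and I'm in Olympia, Washington; unless we meet face to face, the disease can't spread between us. That's obviously why social distancing works. But ultimately, if we want to understand how risky an area is, whether from the perspective of going there or of where people are coming from, we need to understand a few factors. We've been studying this and examining the existing social distancing metrics, and one of the challenging things is that many of them have been normalized. That means they tell us how well people are doing at social distancing in, say, west central Kansas compared to New York City. But from a risk perspective, those are two completely different places, because the population density and the number of interactions are different. So when we want to quantify risk from a pandemic perspective, we need to consider a few different things. The risk index we calculate has to account for volume, meaning the number of interactions, and also for the distance someone travels to have those interactions, because the farther the distance, the greater the chance of connecting two geographies that would otherwise be unrelated.

- So as we started thinking about this risk index, our research showed that current social distancing metrics don't adequately account for geography. In many cases they are normalized by population, which removes the volume Joel was describing; it takes population density out of the equation. The same goes for distance: the farther a person travels, the greater the risk. So if I go to downtown Portland, that's one level of risk. If I travel to São Paulo and interact with a population that is distinct from my own, that connectivity represents a much higher level of risk. Most models also don't cover the idea of significant group clusters. So we've been focused on building a risk index that considers distance and the volume of people moving.

- So when we talk about these movement risk factors, there are two things we want to consider.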

Quantifying Risk

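Those two things are the distance over which interactions happen and the volume of interactions that occur. What we have been investigating as a way to quantify this is human movement data.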

Put more simply, we can just call this cell phone tracking data. Everybody has a cell phone, and what you can do is get data that tracks where those devices are. The way this happens, particularly with the Veraset data we're using here, is background app tracking. For example, when you install a weather app, it asks whether you will allow the app to track your location. That location tracking is what Veraset uses. As you can imagine, this doesn't represent everyone. Currently we get about 8% market penetration, and that varies somewhat depending on where you look and the time frame. So the best we can say is that it's a representative sample, but as a representative sample it conveys a great deal of information. Even though we're seeing less than 10% market penetration, these are individual records of cell phone locations, and that is a lot of data, on the order of billions of records per day. With this in mind, we wanted to understand where people are coming from and where they're going to, and we wanted to put it on a map so that we could understand it. But with billions and billions of records per day, and with the analysis starting from the beginning of March, you can imagine: hundreds of billions of records. There was really no other way to do it than using Spark in a scaled environment, in this case Databricks.

- Okay, so this is essentially the general workflow, the approach that we took. The top three items here are powered by Databricks and our data toolkit, which again sits natively in the cluster. The JAR that gets loaded there is similar to our open source engine that powers things like AWS Athena and Presto, but it has been enhanced to be even more performant inside of Databricks. The bottom three are powered by ArcGIS. Essentially, we start from raw data, we apply a hexagon-based index to generalize the data, we build up the summarized origin-destination pairs per hexagon, and then we bring that much smaller data set into Esri, where we append our demographic data, visualize it, and ultimately publish an interactive dashboard.

- So when we look at this panel data, the raw data, each of these records is, as we were saying before, really nothing more than when it happened, a unique identifier, the location, and finally how accurate that location is.
Ultimately, there is a margin of error in how precisely you know where a device is, so this gives us a starting point. There is still a lot of work to do to understand the relationships in this data before we can get to our index. Since there are so many records, the first step is to understand the data in some generalizable form, and for this we used a hexagon index. It lets us group records by an area roughly two thirds the size of a New York City block, to give you a rough idea, and from there we can understand relationships based on origin and destination. What we refer to as the origin is where the device, and by proxy a person, resides during the nighttime hours. Everywhere they go outside the nighttime hours then becomes a trip. We also examine how fast the device is moving, because we don't want to be looking at people driving down the interstate. What we want is the location of people who are relatively static, because that's when people are at rest and have the potential to interact with other humans somewhere other than their home.

- Okay, so let's get to the good stuff. Here's what is really happening inside Databricks in our workflow to build up that risk index. Essentially, we take the raw data and filter it by significant dwells, where a device is seen multiple times in a given location. We bin those into the hexagons Joel described, and we're doing this at level nine, which is about a city block.
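The speed check and hexagon grouping described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Spark code used in the talk. The `Ping` record mirrors the four fields mentioned for the raw data, but its field names are hypothetical, as are the `max_speed_mps` and `max_accuracy_m` thresholds and the `size_m` cell size. The flat hexagon grid in `hex_cell` is a simple stand-in for whatever hexagonal index the speakers actually used, and `local_xy_m` assumes a small study area where a flat projection is acceptable.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0


@dataclass(frozen=True)
class Ping:
    device_id: str     # unique device identifier
    ts: float          # observation time, seconds since epoch
    lat: float         # degrees
    lon: float         # degrees
    accuracy_m: float  # reported horizontal accuracy, meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stationary_pings(pings, max_speed_mps=1.0, max_accuracy_m=100.0):
    """Keep pings that moved slowly relative to the same device's previous ping."""
    kept, previous = [], {}
    usable = (p for p in pings if p.accuracy_m <= max_accuracy_m)
    for p in sorted(usable, key=lambda p: (p.device_id, p.ts)):
        prev = previous.get(p.device_id)
        previous[p.device_id] = p
        if prev is None or p.ts <= prev.ts:
            continue  # no earlier ping to measure speed against
        speed = haversine_m(prev.lat, prev.lon, p.lat, p.lon) / (p.ts - prev.ts)
        if speed <= max_speed_mps:
            kept.append(p)
    return kept


def local_xy_m(lat, lon, lat0, lon0):
    """Equirectangular approximation: meters east/north of a reference point."""
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def hex_cell(x_m, y_m, size_m):
    """Axial (q, r) id of the pointy-top hexagon containing planar point (x, y)."""
    q = (math.sqrt(3) / 3 * x_m - y_m / 3) / size_m
    r = (2 / 3 * y_m) / size_m
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Cube rounding: recompute the coordinate with the largest rounding error.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)
```

Grouping by cell id rather than by raw coordinates is what shrinks the data: many nearby pings collapse into a single key that can be counted and joined.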
We take those hexagons and build up an origin cell, based on where the device sleeps, and a destination cell, and we total those up so that for each hexagon we have a cumulative trip count per origin-destination pair, for all permutations. From there we calculate the risk index, which is simply the number of trips times the distance. Finally, we output this much smaller, reduced, cleaned data set as a processed table for use by the GIS. Now, let's take a look at the actual notebook.
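As a rough sketch of the aggregation just described, the following plain-Python example infers a home (origin) cell per device from nighttime dwells, counts daytime dwells away from home per origin-destination pair, and multiplies each count by a distance. It is an illustration under stated assumptions, not the notebook from the talk: the nighttime window in `NIGHT_HOURS`, the rule that every daytime dwell away from home counts as one trip, the `(device_id, local_hour, cell)` input shape, and the caller-supplied `distance_km` function are all hypothetical choices.

```python
from collections import Counter, defaultdict

# Assumption: "nighttime" means 22:00-05:59 local time; the talk gives no hours.
NIGHT_HOURS = frozenset({22, 23, 0, 1, 2, 3, 4, 5})


def home_cells(dwells):
    """Map each device to the cell where it dwells most often at night."""
    night_counts = defaultdict(Counter)
    for device_id, local_hour, cell in dwells:
        if local_hour in NIGHT_HOURS:
            night_counts[device_id][cell] += 1
    return {dev: cells.most_common(1)[0][0] for dev, cells in night_counts.items()}


def od_trip_counts(dwells):
    """Count daytime dwells away from home per (origin, destination) cell pair."""
    dwells = list(dwells)  # iterated twice
    homes = home_cells(dwells)
    trips = Counter()
    for device_id, local_hour, cell in dwells:
        origin = homes.get(device_id)
        if origin is None or local_hour in NIGHT_HOURS or cell == origin:
            continue  # no known home, a nighttime record, or still at home
        trips[(origin, cell)] += 1
    return trips


def risk_index(trips, distance_km):
    """Risk per origin-destination pair: number of trips times distance."""
    return {pair: count * distance_km(*pair) for pair, count in trips.items()}
```

The result has one row per origin-destination pair rather than one row per ping, which is the kind of reduced table that can be handed to a GIS for enrichment and mapping.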

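Once again, this is multidimensional data in the interactive dashboard. This is Detroit, for example. I can see the areas of risk, I can see the contributions from the different Tapestry segments, and I can switch back and forth between several cities. Next I'll go to Boston, where we see a very different pattern. Every city is unique. There are some clusters here, and here is a large cluster that could be at risk. Let's jump to New York. Here we see a story about Manhattan and these different contributors, but what's interesting is what happens if we filter on these High Rise Renters. I can filter the dashboard to show different segments. In the northern part of the city we see these High Rise Renters, and there's a large cluster here. I may want to think about who these people are and, as a policy maker or decision maker, how I might message to them. This little infographic shows who those High Rise Renters are. We see a median age of 32, and we see that they are relatively low income, much of which goes to rent. We see many single parents. We can explore who the people occupying that cell are and then, as I said, make decisions about policies or messaging to reduce that risk.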

Next Wave: High Rise Renters Tapestry

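Now, coming back to the workflow, what I want to say is that this analysis would not have been possible without this combination, this magic combination of distributed processing in Databricks on the front end and the enrichment and visualization capabilities of the GIS. It lets us take very raw, very large data, bring some meaning and understanding to it, and then share it easily within the community in a consumable form. The same workflow, from raw data to these visualizations, can be applied across a large number of industries, whether it's telecom data or looking at mobility data, as Joel has done for retail openings and site selection. It's really like the weather patterns of humans that you're analyzing. I have to say, I love the collaborative capabilities of Databricks and being able to build these notebooks.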

But to be honest, Joel spent most of the time in the notebooks. Joel, how was your experience?

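- Finally, the more important point I want to emphasize, and this is what Jim was alluding to, since he always likes to tease me about it: I didn't get into this field because I was interested in big data. What really interests me is that I'm a geographer. What I always want to emphasize is that if I can do this, you can too, and the reason is that I have a degree in parks, recreation, and tourism. I found geography almost by accident, and then I drifted toward big data analytics, because ultimately what I am more than anything is a geographer. I had a problem to solve. I needed to understand where people were coming from, where they were going, and at what volume, so that we could quantify risk. I was able to frame a geographic problem, and I needed to solve it. The only way to do that was to use a scalable architecture, combined with Esri's technology, to distill it into something meaningful that we could then put into a GIS and add more context to. Because ultimately, the name of the game is being able to take data and get information out of it, and that really starts with understanding the problem in the first place. So what I really want to emphasize is that I was able to do this, and I'm not a Databricks expert. I will freely concede that. I started doing this about five weeks ago, and ultimately I was able to put it all together and get something up and running. I didn't do it alone. Jim obviously helped me a lot. But this really was an idea that came to fruition because I had a need, reached out, and found the right technologies and, ultimately, the people to help me get over the humps. So the combination of these two really is greater than the sum of its parts, because we have scalability together with the context of geography to understand this problem in a very meaningful way. So, with that, thank you so much for your time.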

About Joel McCune

ESRI

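Joel specializes in using geography to find answers, in particular deriving actionable information from geographic data. Almost all data has some geographic correlation. Defining geography in the right context to uncover the right geographic correlations, however, is somewhat more challenging. Joel has spent most of his career working with Geographic Information Systems (GIS) to mine information from these geographic relationships. As the scale of data has grown, the demands on technology have grown exponentially, and the technology has had to evolve. That brought Joel into the world of big data, where he continues to apply geography, only at a much larger scale.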

About Jim Young

ESRI

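Jim Young is a business development lead at Esri, focused on big data and AI. He works with technology companies and developers to explore the use of location-aware APIs and spatial analytics in their products and applications. His passion is the intersection of the physical and the digital, with a focus on computer vision, sensor networks, and location services. A pioneer in mobile social networking, Jim founded the location-based Jambo Networks before joining Esri. He holds a master's degree in Geographic Information Systems from the University of Cambridge and a bachelor's degree in history and economics from Southern Methodist University.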