Advanced Natural Language Processing with Apache Spark NLP


NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. It’s the most widely used NLP library in the enterprise today. You’ll edit and extend a set of executable Python notebooks by implementing these common NLP tasks: named entity recognition, sentiment analysis, spell checking and correction, document classification, and multilingual and multi-domain support. The discussion of each NLP task includes the latest advances in deep learning used to tackle it, including the prebuilt use of BERT embeddings within Spark NLP, using tuned embeddings, and ‘post-BERT’ research results like XLNet, ALBERT, and RoBERTa. Spark NLP builds on the Apache Spark and TensorFlow ecosystems, and as such it’s the only open-source NLP library that can natively scale to use any Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. You’ll run the notebooks locally on your laptop, but we’ll explain and show a complete case study and benchmarks on how to scale an NLP pipeline for both training and inference.


Video Transcript

– [David] Hi everyone, my name is David Talby, and I’m here with my friend Alex Thomas to talk to you about Spark NLP: what the library is today and how it can help you with your next natural language processing project.

We’re going to talk through three things. First, we’re going to introduce the library: what it does well and where it stands today, as of mid-2020. Then we’ll go through some code examples and some basic design concepts of the library, to show you how you get basic things done with it. And then I’ll hand it off to Alex, who will go through some notebooks, showing live code for primary use cases of the library.

So Spark NLP was created three years ago in order to provide the open-source community with access to state-of-the-art NLP for Python, Java, and Scala. Those three languages are all fully supported, with the same APIs; nowadays Python is the most widely used version. In terms of what we wanted to improve, it was really three things: accuracy, scalability, and speed. In terms of accuracy, deep learning happened to NLP, and a lot of the libraries that we all know and love became quite outdated in terms of the accuracy that’s achievable. The goal in Spark NLP is to make sure that we take today’s latest research and make it production grade, scalable, and trainable. For scalability and speed, the library is based on Apache Spark, so it’s the only open-source NLP library that natively scales to any Apache Spark cluster, and a whole lot of work has gone into optimizing it both on single machines and on clusters.

Spark NLP became, early in 2019, the most widely used NLP library in the enterprise, and really one of the most widely used data science libraries in practitioner enterprise settings. If you come from the Fortune 5000, most likely this library is being used in production today. And we see it in all environments: cloud, on premise, large clusters, single machines, small containers, and everything in between.

Just a few weeks ago, the 2020 edition of that survey was published, and Spark NLP is still the most widely used NLP library in the enterprise, and still by a wide margin.

The library is in production use by all of these companies, and really thousands more; we don’t know the full list, but we do know that it’s been used heavily for full production use cases, from single machines to very, very large clusters. It’s really for companies and teams that want state-of-the-art accuracy in a production-grade setting.

In terms of accuracy, one of the things we always have to clarify is what “state of the art” means, because it’s not a marketing term, it’s an academic term. State of the art means: if you look at the academic benchmarks, at peer-reviewed results, what’s the best level of accuracy that has been reached? There are a number of sites online that track these papers and the progress on each benchmark, and they’re definitely worth following. So we are very open about the accuracy we can achieve, and of course it’s open source, so it’s all reproducible. We publish public metrics, and we publish benchmarks on pretrained models with the latest releases; in every release, we just try to go in and improve accuracy. The fun thing about claiming state-of-the-art accuracy is that you have to keep earning it, because in NLP every few weeks there’s a new paper that improves the state of the art, and we like to take those results and productize them. So in 2020, in release 2.4, we completely redesigned what we do with BERT and the deep learning classes for NER, and it definitely does better now; it was best in class before, and now it’s even a little higher. We’ve also worked to improve the accuracy of the Spark OCR library for optical character recognition, and the related entity linking and entity resolution. In Spark NLP 2.5, we added built-in support, open source and free, for XLNet. We’ve published a new contextual spell checker that at the time seemed to be by far the most accurate automated spell checker, available in the Python, Java, and Scala ecosystems. And we’ve also basically rebuilt sentiment analysis, improving accuracy using a new deep learning architecture. So those are the kinds of things you’d expect to see in each of those major releases, which come usually every six to eight weeks; beyond that, we release every one to two weeks.

In terms of scalability, as I mentioned, Spark NLP is really designed to scale to a cluster. There’s a lot of use on single machines, but also a lot of people use it on Databricks, on AWS, on Azure, and on in-house clusters, and by default you don’t need any code changes to scale it to a Spark cluster. We’ve tested this; this specific benchmark is also public and was done using AWS EMR. The reason this works so well is that we benefit from Apache Spark, which deals with all of the really hard problems of distributed computing: how you optimize bandwidth use on a cluster, how you minimize just sending data around, how you optimize caching and use multiple cores effectively, how you minimize shuffling. So all of the immense amount of work that the Apache Spark ecosystem does, we almost get for free, although of course we’ve had to do a lot of profiling work with the Spark community and with Databricks tooling to get it where it is. Of course, a distributed system is not magic, so the actual speedup you get depends on the use case. If you’re doing just inference, you’re likely to get very good scaling; but if you’re training, as Alex will show, training by its nature will see smaller speedups.

Another thing we focus a lot on is just speed, because our goal is also to provide you with the fastest open-source NLP library, particularly for inference. As part of this, we do a lot of work with Intel and with Nvidia to make sure that we have builds that take advantage of the latest chips and the latest hardware, and really to provide a library that’s tuned to the latest compute platforms. This is something that TensorFlow and PyTorch do, but other NLP libraries in general do not. The benchmark that you see here on the right was training to the same level of accuracy: what we did was train from scratch a named entity recognizer for the French language. We compared both the latest Intel chips and the latest Nvidia chips on AWS, and in this specific case the Intel hardware was actually both 90% faster and 46% cheaper than the Tesla P100. But of course it depends on the use case, and we keep doing optimizations on both.

A lot of this work is not just data science; it’s engineering, profiling, and productization, really making it easy to use these algorithms on a regular basis. The library is definitely production grade: it’s now been used in production for a third year, with multiple Fortune 500 public case studies. It’s very active; we had 26 new releases in 2018 and 30 releases in 2019.

So basically, at least every two weeks there’s a new release, and of course you can see the releases on GitHub, with the substantial new features and functionality added in each one. There’s an active community to help you get started; if you hit issues with configuration or out-of-memory errors, you can just ask and get responses. And everything is under a permissive open-source license, so you’re welcome to use it for free, do whatever you want with it, and use it commercially as well.

In terms of out-of-the-box functionality, this slide shows what you get; this is what the library comes with. It’s a complete NLP library, in the sense that you shouldn’t have to integrate with the other existing NLP libraries that are already out there. The functionality starts with the basics: tokenization, splitting text into sentences, lemmatization, stemming, and part-of-speech tagging. Then we have the more advanced features: spell checking and correction, sentiment analysis, named entity recognition, chunking, and text matching. Everything is trainable; even the part-of-speech tagger and the lemmatizer, and definitely things like sentiment analysis and entity recognition, can be trained. So you can either use the out-of-the-box pretrained models, which many people do, or, if you have a domain-specific use case or you want to support a new language, you can train your own. The pretrained models are really one of the areas where the library is strongest, covering the cases where you don’t need to train your own. This enabled us to support, very quickly, 20 languages (in the last release we went from six to 20) and, as you can see, over 90 pretrained models and over 70 pretrained pipelines. A pipeline packages a complete task: take raw text, tokenize it, lemmatize it, clean it up, split it, find things in it (say, for example, find entities), and then output them. You should be able to do that, as you’ll see, in really just a few lines of code, and the pipeline brings everything it needs. Our goal is similar with new languages and with transformers: make sure everything is there. So this not only simplifies your life and simplifies your code, it also enables us to scale. You can’t really scale on a cluster if, for example, only the training piece or only the transformers scaled, while even simple things like stemming and lemmatization still had to run elsewhere. So everything is within one Spark pipeline, and everything scales.

So the library has APIs for Java, Python, and Scala. As I said, we optimize for the latest Intel and Nvidia chips, and we make sure you’re running the fastest on the latest hardware. We officially make sure everything works, on every release, on Databricks in the cloud as well as on AWS, and you can assume those are always supported. But really, people run it on everything from prebuilt to custom-built Spark clusters, on just about any configuration, and it’s running in production just about everywhere; we support Windows servers as well. This is really just the nature of what we test on every release.

Okay, so that’s the overview, and that’s what the library claims. Now we’ll see how we actually get things done with it. Getting started is really simple, and one of the things we spent a whole lot of time on is making sure that in really three to five lines of code you can get actual end-to-end tasks done, with state-of-the-art results, transformers, and deep learning architectures. Here’s one example of doing sentiment analysis. In the first line you import sparknlp and initialize it with sparknlp.start(). Then you load the pretrained model: you import the PretrainedPipeline class, and create a pipeline from the sentiment-analysis pipeline with ‘en’ for the English language. Loading a pipeline loads whichever models and annotators are needed to do the work. Then all you need to do is call it on a sentence. In this case we call pipeline.annotate(‘Harry Potter is a great movie’), and we get a result object, which is just a simple Python dictionary. If you look at its sentiment entry, you see that the result is positive. It’s really just as simple as that: you do not need to know anything about deep learning, transformers, TensorFlow, or Spark to be up and running with it and get things done.
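For reference, the whole example looks roughly like this (a sketch; ‘analyze_sentiment’ is one of the published English pipeline names, which may differ from the exact name on the slide):

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()  # starts (or attaches to) a Spark session

    # Load a pretrained sentiment-analysis pipeline for English
    pipeline = PretrainedPipeline('analyze_sentiment', lang='en')

    # Annotate a single sentence; the result is a plain Python dict
    result = pipeline.annotate('Harry Potter is a great movie')
    print(result['sentiment'])  # e.g. ['positive']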

Here’s another example: named entity recognition. With NER, the goal, as you see in the example here, is to mark entities in text. So if you have ‘Harry and Ron met in Hogsmeade’, you’d want the model to know that Harry and Ron are people and that Hogsmeade, the location, is a place; and, importantly, to do that without a dictionary. Harry could be a person, but it could be other things: it could be a company, it could be a location. So the model should reason: these would be people, and Hogsmeade needs to be marked as a location, even though of course it’s a fictional place most people won’t know, because when you have two people and you say ‘met in’ afterwards, only either a place or a time will come, and this is not a time, so it has to be a place; it’s also capitalized. Even if you haven’t read the books, you’d still know that this is a place, and you’d want your NER model to be able to generalize and catch those things. The way we do this, really using BERT and state-of-the-art transformers: pipeline equals PretrainedPipeline with the entity-recognition pipeline in English (and we have a few other languages with complete pretrained pipelines), then pipeline.annotate on the raw sentence. When you look at the results, you see that, token by token, the person tokens are marked as B-PER, meaning person; the ‘B’ marks the first token of a recognized entity, so for ‘Harry Potter’ you’d have two tokens and you’d know which one is the first and which is the second. You also see the Hogsmeade token recognized as a location. This is really all you need to do to activate pretrained NER. And here’s an example in Scala, really just to show that it looks exactly the same. This is a spell-checking example, and the only difference you really see in Scala is that instead of writing ‘pipeline =’ you write ‘val pipeline =’, and the printing is different, but it’s the same capabilities. We load the spell-checking pipeline in English and annotate ‘Harry Potter is a great movie’ with ‘great’ and ‘movie’ misspelled. When you read the result in the spell column, you see the tokens, and that ‘great’ and ‘movie’ have been automatically corrected. And that’s also exactly three lines of code, and something you get out of the box.
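A sketch of both calls in Python (the pipeline names ‘recognize_entities_dl’ and ‘check_spelling’ are published English pipelines, used here as stand-ins for the exact names on the slides):

    from sparknlp.pretrained import PretrainedPipeline

    # Pretrained NER, with BIO-tagged output per token
    ner_pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
    ner_result = ner_pipeline.annotate('Harry and Ron met in Hogsmeade')
    print(ner_result['ner'])  # e.g. ['B-PER', 'O', 'B-PER', 'O', 'O', 'B-LOC']

    # Pretrained spell checking and correction
    spell_pipeline = PretrainedPipeline('check_spelling', lang='en')
    spell_result = spell_pipeline.annotate('Harry Potter is a graet muvie')
    print(spell_result['checked'])  # corrected tokens (key name varies by version)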

So those are three examples of the 70-plus pretrained pipelines that come with the library. Here’s what’s happening once you call this.

You really don’t need to know anything about Apache Spark, as you’ve seen, to use the library. When you call start, it starts a Spark session for you, and of course you can configure it if you want. When you call PretrainedPipeline, you initialize it, and it loads, say, the English version of the explain_document_dl pipeline, along with everything that’s required by it. If it’s stored locally, it loads it from the local machine; if not, it tries to download it from S3, and it does all of that for you. Everything is loaded locally, both as local files and in memory, so we can use them. Spark NLP actually comes with its own in-memory database based on RocksDB, and a lot of work went into it: for example, if you’re loading a two-gigabyte embeddings file and then need to share it across many threads, or across multiple steps in the pipeline, the embeddings will only be loaded once, and they can be read-accessed concurrently. So we’ve done a lot of work to optimize the use of large embeddings and models for you. Also under the hood, we initialize TensorFlow in the same JVM; we make sure there’s no inter-process communication, and there was a lot of memory sharing and optimization work to make that happen. We initialize the TensorFlow graph that’s required, whether locally or on the cluster, depending on how the session was started. Then, when you actually call annotate, we run the pipeline; we call that the inference. Meaning, we activate each step in the pipeline one by one and call the right algorithms, and for the algorithms that actually use deep learning, we call the under-the-hood TensorFlow implementation, and it uses the transformers as needed.

In terms of NER, the way this works in this specific example: we use a CNN for the character-level features, a bidirectional LSTM, and a CRF at the end, and of course we use BERT to compute the embeddings as needed. And then the result object is just a local Python dictionary, so we can read it and it’s easy to work with. So all of this is happening, and the main point is that you really don’t need to know about any of it; it’s just what we take care of.

Okay, the next thing I’d like to go through is really just the few key concepts that you need to know if you want to use the library on a regular basis. The first one is a pipeline. A pipeline is really just a set of transformations, a set of steps that you apply to text. For example, in this case we take the text and form a document: the document assembler outputs a document column. We split it into sentences using the sentence detector, and the output is a sentence column. We tokenize the sentences, meaning we split the sentences into words, using the tokenizer, and we get a token column as a result. Then, for example, if you want sentiment, we call the sentiment analyzer, and the result is a sentiment column. The way this works when you use it on a data frame (you start with a Spark data frame, or you have an existing data frame) is that columns like document, sentence, token, and sentiment are actually added to the data frame. So at the end you can keep only the result if you want, but you can also see all the intermediate steps. The other nice thing is that because this runs on a Spark data frame, it’s trivially easy to parallelize: different documents can be on different machines on the cluster, or on different threads, and you can scale this way. So that’s the concept of a pipeline. The next concept is the concept of an annotator. Each block in the previous slide is an annotator: the sentence detector is an annotator, the sentiment analyzer is an annotator. An annotator is an object, an instance of the class that implements the algorithm, which you activate. So in this example you instantiate the sentiment-detector class, you say that it takes sentences as input and outputs the sentiment score, and you can also configure it: dictionaries, models, embeddings, and all the other hyperparameters. So if you want to ask what algorithms are available, the question really is what annotators there are.
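A minimal sketch of those two concepts (the Vivekn sentiment model stands in for the slide’s sentiment analyzer; the setInputCols/setOutputCol wiring between annotators is the part to notice):

    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, ViveknSentimentModel

    document = DocumentAssembler().setInputCol('text').setOutputCol('document')
    sentence = SentenceDetector().setInputCols(['document']).setOutputCol('sentence')
    token = Tokenizer().setInputCols(['sentence']).setOutputCol('token')
    sentiment = ViveknSentimentModel.pretrained() \
        .setInputCols(['document', 'token']).setOutputCol('sentiment')

    # Each stage is an annotator; together they form a pipeline that appends
    # document, sentence, token, and sentiment columns to the data frame.
    pipeline = Pipeline(stages=[document, sentence, token, sentiment])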

The next concept that you need to be familiar with is the concept of a resource.

A resource is basically any external file that you need in order to run an algorithm, an annotator. So pretrained models, embeddings, dictionaries, transformers, pretrained pipelines, rules: everything that needs to be loaded is a resource. And Spark NLP does a lot of resource management for you: it splits resources across the cluster, splits them across threads, shares them in an in-memory data store, and makes sure they’re efficiently accessible. All of that is taken care of for you. You just load the pipeline with the built-in NER, and it loads the embeddings and models that the pipeline needs, and it just works.

The last concept is a pretrained pipeline, which really just puts it all together. With a pretrained pipeline, someone has defined the pipeline; the pipeline includes a set of annotators and downloads certain resources, but all of that is predefined. You just define the object, and you can save it and persist it; one of the nice things about Spark NLP pipelines is that they’re also serializable and distributable. So really, you create the pipeline in code and save it, and then you have a pretrained pipeline for the next time. So those are the core concepts. Here’s just one example of how it all works together, when you want to train your own custom named entity recognizer with BERT; this is really all the code that’s required to do that. If you look here: you import the libraries and initialize, so spark equals sparknlp.start(). You load the training data; for NER we read the CoNLL format, which is basically a standard format for entity training data, where you have a token and then, for each specific token, whether it’s, for example, a person or a location. You load the BERT embeddings, so bert equals BertEmbeddings.pretrained(), and we load the pretrained embeddings. Then we initialize the NER deep learning tagger, nerTagger equals NerDLApproach, which is our deep learning named entity recognizer, and you can see we can configure it. Then what’s left is super simple: we create a pipeline. This is an example of really creating your own custom pipeline, so pipeline equals a new Pipeline with two stages, bert and nerTagger. Basically we say: with this pipeline, here’s what you do. You take data, you build the embeddings from it using the pretrained embeddings you loaded, and then you apply the NER using those embeddings. So we’ve defined the pipeline. Once you have a pipeline, it has two interesting methods: fit, to train, and transform, when you want to do inference. So here we just call pipeline.fit on the training data, and you train your custom model. This will run locally, and on GPUs; with BERT embeddings, everything is already done for you. This notebook, as well as the others that Alex will show, is available online publicly, with no limitation and with community support, so you’re more than welcome to try it yourself.
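Roughly, the code on that slide looks like this (a sketch; the BERT model name, file path, and epoch count are illustrative placeholders):

    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BertEmbeddings, NerDLApproach
    from sparknlp.training import CoNLL

    spark = sparknlp.start()

    # Read CoNLL-formatted training data (token plus per-token entity label)
    training_data = CoNLL().readDataset(spark, './eng.train')

    # Pretrained BERT embeddings as input features
    bert = BertEmbeddings.pretrained('bert_base_cased', 'en') \
        .setInputCols(['sentence', 'token']) \
        .setOutputCol('embeddings')

    # Deep learning NER tagger trained on those embeddings
    ner_tagger = NerDLApproach() \
        .setInputCols(['sentence', 'token', 'embeddings']) \
        .setLabelColumn('label') \
        .setOutputCol('ner') \
        .setMaxEpochs(1)

    pipeline = Pipeline(stages=[bert, ner_tagger])
    model = pipeline.fit(training_data)  # fit trains; transform runs inference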

Okay, that’s enough slides for today. In the next section I’ll hand it over to Alex, and we’ll see some live Python notebooks and live examples of using the library. All of the notebooks we’ll show are publicly available on the nlp.johnsnowlabs.com website, enabling you to start quickly and get the benefit of this library, which is really free, completely open source, and built for the community to enjoy.

– [Alex] Thanks, David. Hi, I’m Alex Thomas. I’m a principal data scientist at Wisecube, and I’m going to be going over some notebooks so we can see some actual runnable code for Spark NLP. You can find all these notebooks in the spark-nlp-workshop repo, and we’ll share a link to that at the end of the presentation.

So the first notebook we’ll be going into is one covering Spark NLP basics.

So in this first notebook, we’re going over Spark NLP basics. We see in this first cell that we’re installing some stuff. The reason we’re using Google Colab, which is where this notebook runs, is that it lets us avoid the installation issues of having people install things on different machines to get Spark and Spark NLP running; oftentimes, setting up Spark or Java can take a couple of tries. With Google Colab, we just install some of these basic things in this cell: we set up Java, and we install Spark and Spark NLP with pip. So it’s super easy to do.
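The setup cell looks roughly like this (a sketch; the pinned versions are illustrative, and the notebook installs whatever was current at the time):

    # Install Java 8, PySpark, and Spark NLP inside the Colab runtime
    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
    !pip install -q pyspark==2.4.4 spark-nlp

    import os
    os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'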

So let’s go ahead and just run that. I’ve actually run all these notebooks beforehand, because some of the cells take a while, so I’ll be skipping over some of the lengthier ones.

So when you have Spark set up, you’re going to want to start your Spark session. David talked about how sparknlp has a start method that does all this cool stuff in the background for you. So we run that, and we get our Spark NLP version and Apache Spark version. What’s going on behind the scenes is that we’re just building a Spark session like this: in this case, we’re putting the master on the local machine and telling it to use all the cores, and we’re giving it 16 gigs of driver memory. We’re also setting the serializer. This serializer is useful in a number of ways; one of the important ones is that it decreases the memory footprint of the objects in your Spark data frames. This is really important in NLP tasks, as NLP takes text, which can often be a large amount of data in itself, and produces a large number of objects.

We also have an option here for if a GPU is available.
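Under the hood, sparknlp.start() builds a session roughly like this (a sketch; the exact package coordinates track the release, and start(gpu=True) swaps in the GPU build):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName('Spark NLP') \
        .master('local[*]') \
        .config('spark.driver.memory', '16G') \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .config('spark.kryoserializer.buffer.max', '1000M') \
        .config('spark.jars.packages',
                'com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0') \
        .getOrCreate()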

So now that we have Spark running, we have a bunch of different pipelines that we can use. David talked about a couple of these before, and you see here we have a list, and this list is growing.

So we’re going to start with explain_document_ml. This is one that’s going to do a lot of basic text processing, but it doesn’t use any of the heavier deep learning techniques. The text we’re going to look at is this one: you can see we have some spelling mistakes in here that we’re going to catch, and we have some entities you might want to look up. So let’s go ahead and see what we have here. In explain_document_ml, we have the document assembler. What that does is take a text and create a document representation; an annotation library, which is what Spark NLP is, needs a document representation. Once we have a document representation, we pass it to each stage, and each stage looks for its input. So, for example, the sentence detector just takes the raw document. Then take something like the lemmatizer: the lemmatizer needs to look up a token, which means the text needs to already be tokenized, split into individual words. So the tokenizer needs to go before the lemmatizer. The stemmer, on the other hand, also takes the tokenizer’s output as input, so the order of those two, the lemmatizer and the stemmer, doesn’t really matter. So let’s see what comes out of this pipeline. We load it, and what it’s going to do is go download it; this is a pretty small pipeline because it’s relatively simple, and since I’ve run this before, it’s cached. Then we annotate that document.
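Loading and running that pipeline is just (a sketch; the example text with its deliberate misspellings is illustrative, not the notebook’s exact text):

    from sparknlp.pretrained import PretrainedPipeline

    pipeline = PretrainedPipeline('explain_document_ml', lang='en')
    result = pipeline.annotate('Peter is a very god boy. he knos a lot.')

    # Keys include the document, sentences, tokens, spell-checked tokens,
    # lemmas, stems, and part-of-speech tags
    print(list(result.keys()))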

So among the items we get in there: we have ‘checked’, that’s the spell checker; we have the document, the sentiment identified with it, the tokens, and the sentences. So here are the sentences it found. It looks like it found all the sentences and didn’t falsely identify anything.

It is interesting that one of the sentences begins with a lowercase character, but that seems more like an error in the original text.

So here we see a list of the tokens.

So we also find the parts of speech. For parts of speech, if you remember your schooling: we have nouns, which are things; we have verbs, which are actions; and adjectives, which describe nouns. In NLP there’s actually a much wider set than is normally used in education or even in linguistics, and that’s because it’s useful to break them up into more categories. These particular parts of speech are defined by the University of Pennsylvania treebank. So this NNP means a proper noun.

This is a verb in the third person singular, so in English that’s normally with an -s ending, like ‘he knows’, ‘she sees’. Here we have an adverb and adjectives. So you have a number of different sorts of parts of speech here.

Now, let’s look at some of the other things. With parts of speech we’re sort of classifying a token, but with the lemmas, stems, and spell checker, we want to take the token and return some sort of normalized version. Lemmatization and stemming are different in that lemmatization will take a word and return its dictionary form. So, for example, ‘cats’, when you look that up in the dictionary, will direct you to ‘cat’; similarly, something like ‘oxen’ will direct you to ‘ox’. English has a lot of irregular plurals, as other languages do. Stemming, on the other hand, will remove suffixes, which in English will often result in a word, but not necessarily. And then we have a spell checker here. So let’s go through and look at these examples. We see ‘Peter’ is essentially treated the same by all of them. The stemmer has a lowercasing feature. You see ‘is’ here: it’s returned unchanged by the spell checker, the lemmatizer resolves it to ‘be’, that’s the dictionary form, and the stemmer just turns it into an odd-looking stem.

On the other hand, if we go down here and look at something like ‘very’, we see this ‘veri’ here for ‘very’; that’s the stem form of that word. So let’s go down.

So let’s pull some of this data into a pandas data frame so that we can visualize it a little more neatly. So here we have ‘Peter’, and it’s a proper noun; no corrections are needed.

Let’s see. So we have ‘broths’. That’s marked as a noun; the S there means that it’s a plural noun. Both the lemmatizer and stemmer handled it correctly. Here we see the lemmatizer did nothing to ‘however’, which makes sense, and the stemmer sort of messed it up by removing the ‘er’. Now, the trade-offs between lemmatizers and stemmers: a lemmatizer is a dictionary, so you need to have all your mappings there, while a stemmer is generally algorithmic, although there are some variations. So a stemmer can potentially remove the plural from a word you make up, if it has a normal English plural; a lemmatizer, on the other hand, will not know what to do with that and will just return the word. So those are the trade-offs of stemmers and lemmatizers.

Now we have the deep learning equivalent of this pipeline, explain_document_dl.

So it’s going to have a document assembler, sentence detector, and tokenizer. All those are similar, but the NER model is using GloVe, which is a type of embedding.

We’ll get more into that in a couple of notebooks, when we go over embeddings more. It has a lot of similar outputs, but let’s see what else it has: it also has these embeddings we can get out of it. So let’s go ahead and see the output. Here we see the named-entity labels. In this case, named entity recognition is, as David was talking about, trying to classify the tokens by whatever entities they are. So in this case ‘Peter’ is at the beginning of a person named-entity segment in the text. So let’s go down and look for one that has multiple tokens. Here we see ‘Lucas’ followed by a couple of surnames: ‘Lucas’ is marked as the beginning of an organization, so although it identified it as a name, it misidentified which kind, and then it marks the following tokens as being internal to it.

Let’s see if there are any other ones. Yes. So yeah, we’ll go more into NER in a notebook dedicated to that subject. But let’s look at the pretrained spell checker, because on top of having these pretrained pipelines, there are also pretrained models. In this case, this pipeline has a different spell checker.

So here we see the misspelled ‘person’ is corrected, and ‘interesting’ is corrected.

So if we have a number of documents, we’re going to want to be able to pass those through the annotator as well. In this case it’s a small number, so when we pass them through, we get the same results as when we passed in a single text above, but it’s done for each one of these documents. Of course, if we had a larger number of them, we’d want to actually use Spark to do that, as opposed to using these local convenience methods.

So in this case, we have the actual annotations being returned instead of the simple Python dictionaries. So here we see Annotation objects: word embeddings, tokens, and so on.
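Both calls look like this (a sketch; annotate returns plain dicts, while fullAnnotate returns Annotation objects carrying offsets and metadata):

    from sparknlp.pretrained import PretrainedPipeline

    pipeline = PretrainedPipeline('explain_document_dl', lang='en')
    texts = ['Peter Parker is a nice guy and lives in New York.',
             'The Mona Lisa is held at the Louvre in Paris.']

    # annotate: one plain dict of output lists per input text
    dict_results = pipeline.annotate(texts)

    # fullAnnotate: Annotation objects with annotator type, start/end
    # character offsets, result, and metadata
    full_results = pipeline.fullAnnotate(texts)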

So we also have the entities here. If we remember back when we were looking at the named entities found by explain_document_dl, we saw that there were tokens that were marked as parts of the same entity, but we want them combined, so that the algorithms consuming this output know that, in this case, ‘Peter Parker’ is a single entity. So let’s go back and look at this input text: ‘Peter Parker is a nice guy and lives in New York.’ Here we have two named entities, a person and a place; both are multi-token entities. So we’re going to want to make sure that it works out that way, that they’re both chunked together. A ‘chunk’ is a term in NLP for a phrase that is identified separately, perhaps by your model, but that in consuming you actually want gathered together. So, for example, maybe you want noun phrases: in this case, ‘a nice guy’ is actually the noun phrase. ‘Guy’ is the noun, but the determiner and the adjective here belong to the chunk too. There are chunkers for parts of speech, but also chunkers for named entities, which is what we’re looking at here. And here we see that ‘Peter Parker’ and ‘New York’ are both correctly identified and chunked for their respective categories: Peter Parker is a person, and New York is a location.
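In Spark NLP, grouping the B-/I-tagged tokens into entity chunks is done by the NerConverter annotator; a minimal sketch:

    from sparknlp.annotator import NerConverter

    ner_converter = NerConverter() \
        .setInputCols(['sentence', 'token', 'ner']) \
        .setOutputCol('entities')
    # Turns e.g. ['B-PER', 'I-PER', ...] into a single 'Peter Parker' chunk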

So, similar to what we did before, let’s look at all the different tokens. We see that they all have the same sentence ID, because there’s only one sentence here, and we have the start and end positions. This is a normal thing you see in annotation-based libraries: you’re given the start and end so that you don’t need to actually modify the text to know where an annotation is found. We can see the part of speech and the named entities found: here we see ‘Peter’ is marked as beginning-person, and ‘Parker’ is marked as interior-person.

So let’s look at the match-chunks pipeline. This is an example of part-of-speech chunking, but here you can give it a regular expression like this, one that will pull out particular sequences of parts of speech. So here we have ‘The book has many chapters.’ We’re looking for an optional determiner, followed by zero or more adjectives, and then one or more nouns. So let’s see what the sentence has: here we have a determiner and a noun, so this should be found by it. And we also have ‘many chapters’ here.
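That pattern corresponds to a Chunker configured roughly like this (a sketch; the regex is over Penn treebank tags, and the pos column is assumed to come from a tagger earlier in the pipeline):

    from sparknlp.annotator import Chunker

    chunker = Chunker() \
        .setInputCols(['sentence', 'pos']) \
        .setOutputCol('chunk') \
        .setRegexParsers(['<DT>?<JJ>*<NN>+'])  # optional determiner, adjectives, nouns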

Depending on the part-of-speech tag set you use, ‘many’ could be considered a determiner or a quantifier.

So let’s see. In this case it’s labeled as an adjective, so we should get ‘many chapters’. So let’s see the chunks we got: we got ‘the book’.

What’s interesting is the reason why we didn’t get ‘many chapters’: if you recall from above, plural nouns are marked as NNS. So ‘chapters’ would technically not be an NN, it would be an NNS. Looking here at another one, we have ‘the little yellow dog’: we have a determiner, adjectives, and ‘dog’, and we also have a determiner and noun, so ‘the little yellow dog’ and ‘the cat’ should be the two chunks we find. And sure enough, we see that those are the ones it finds. We also want to be able to extract dates; that’s often a very useful thing in a number of different fields. So in this case: ‘I saw him yesterday and he told me that he will visit us next week.’ The output finds two dates: ‘yesterday’ is May 26th at the time of this recording, and ‘next week’ resolves to June 3rd, again at the time of this recording. What’s cool is that this finds relative dates and gives you an absolute one.
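The notebook uses a pretrained date-matching pipeline for this; a sketch (‘match_datetime’ is one of the published English pipelines):

    from sparknlp.pretrained import PretrainedPipeline

    date_pipeline = PretrainedPipeline('match_datetime', lang='en')
    result = date_pipeline.annotate(
        'I saw him yesterday and he told me that he will visit us next week')
    print(result['date'])  # relative mentions resolved to absolute dates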

So: ‘I saw him yesterday and he told me that he will visit us next week.’ It similarly found the same dates; this is the full annotation.

And then we also have sentiment analysis. This is another very popular NLP task. Here we see one similar to the example David showed: ‘The movie I watched today was not a good one’, sentiment negative, and that makes sense. Let’s go ahead and change that and run it.

And oh, okay, I guess the notebook terminated or something? Oh, okay. Yeah. So it’s positive. It overwrote that previous output up above. So we see that when we include the ‘not’, it goes negative: ‘good’ is a signal for something positive, and when you put a ‘not’ before it, it inverts that.

We’ll go more into that in some of what’s coming up next. The next notebook we’re going to be looking at is another one that looks at pretrained pipelines, but this time we’re going to be focusing on the named entity recognition task. This is a super important task in NLP, so it’s worth a deeper look. In this notebook we do the installation similar to the previous one; all these Google Colab notebooks run in separate environments. So let’s just go past that. Here we’re doing the same sort of startup as what we did before, but we’re going to be using the recognize_entities_dl pretrained model. This is a bit of a bigger model than previously. So let’s look at this text: ‘The Mona Lisa is a 16th century oil painting created by Leonardo. It’s held at the Louvre in Paris.’ Here we see ‘Mona Lisa’ is a named entity.

You could arguably say ‘the Mona Lisa’ is a named entity because it’s referring to the painting as opposed to the person. Oftentimes in NLP you’ll run across these judgment calls as to what is correct. ‘Leonardo’ is obviously a person. ‘Paris’ is a location, and the Louvre, again, can be a location, because that’s where the painting is held, but you could also argue that it’s an organization, as in the museum itself.

So let’s see what’s actually extracted from it. Here we see the tokens are all extracted, and we have the different named entities here: it’s identified the Louvre as a location, we have ‘Mona Lisa’, and we have ‘Leonardo’. So it’s found all those entities, and it chunks ‘Mona Lisa’ together. So let’s try a bit of a larger one.

So here we have the larger one; I won’t bother reading it.

So it’s about Sebastian Thrun, and here we see that it’s extracted a number of entities.

So one of the important things when doing named entity recognition, because it’s a classifier, is that you need to take in these signals. The first one, as David mentioned, is about not using a dictionary. That often creates a difficulty, because when you start out on a project you may not have everything identified. So, for example, imagine we were starting a project and, instead of people and places, we were looking for pharmaceuticals. If you were to run this named entity recognizer over medical documents where you’re looking for mentions of drugs, you wouldn’t really find many, and if you did, they would almost certainly be false positives, as persons, locations, and organizations don’t overlap with drugs. So there are a couple of different techniques you can use to solve this. One is you can create a dataset, have it labeled, and train your own model; that’s a bit intensive in the beginning. You can also use a simpler dictionary approach to find some drugs, then train a model on that, and iterate using human labelers and some dictionary methods until you have a good enough model to run.

So this named entity recognition model is really easy to use: you just download it, it’s using the GloVe embeddings, and you can sort of kick the tires of Spark NLP with only a few lines of code. But if you have a specific project with specific needs, you’re going to want to be able to train your own model, and that’s what we’re going to cover in the next notebook. So in this third notebook, we’re going to go over how to train your own NER model with different embeddings. So let’s go ahead and switch over to that notebook. In this notebook we’re going to be looking at named entity recognition, primarily using BERT embeddings, but we’ll also look at some GloVe embeddings. So here we see, again, similar installation instructions.

So this is another example of the library being started up. First, we need to pull the data; this is just a simple download from GitHub. So let’s look at what some of this data looks like. In this data format, we have the token, the part of speech, as well as the chunk itself, the idea being whether this is in a noun phrase. We’re not going to be doing any sort of training on that; what we’re really looking at is this last column, the named-entity label that the token is identified with. So ‘EU’ is an organization.

‘rejects’ is not a named entity. Here it’s identified ‘German’ as a miscellaneous type. Arguably, you could say that’s because it’s an adjective; it’s a little bit complicated. You could argue maybe it’s an organization, because it’s referring to the German government. ‘British’ here, similarly, as an adjective, is found to be a miscellaneous type. And we see that this is how the data is structured. Really, what we’re doing is taking these tokens and learning to classify the tokens by these labels.

So let’s go ahead and download our dataset. Here we have our dataset; it’s put in the format necessary for us to train.
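Spark NLP ships a reader for this format; a sketch (the file name is a placeholder for whatever the notebook downloads, and spark is the session from sparknlp.start()):

    from sparknlp.training import CoNLL

    # Produces a data frame with document, sentence, token, pos, and label columns
    training_data = CoNLL().readDataset(spark, './eng.train')
    training_data.select('token.result', 'label.result').show(3, truncate=60)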

So let’s talk a little bit about what BERT embeddings are, as that’s the features that we’re going to be using for our NER model.

So the BERT embedding technique was invented at Google. In the early days of word2vec, neural networks produced a single vector representation for a word. That means that ‘bank’, as in an institution where you deposit and withdraw money, and ‘bank’, as in a river bank, are both given the same vector. So if I had a model trained on general text, the vector for ‘bank’ is more likely to represent the common sense of it, the institution; but if I want to analyze a document that’s talking about geography, most of the uses are going to be river bank. So we really want a way of getting embeddings that uses context. BERT does that by using transformers. Essentially, what it’s doing is learning a contextual token representation, so that when I turn ‘bank’ into a vector, it’s not just the word ‘bank’ I’m using, it’s also the context. That gives us really good models, and BERT and its successors like XLNet are responsible for improved scores on a number of tasks. The downside is that it’s no longer a simple lookup. When I get a word with GloVe or word2vec, I take the token, look it up in my embeddings, and I have my vector. With BERT, I need to run the neural network that produces the output, taking in the word and its context.

This creates an increase in the resources required to use it. And when you’re training it, because it’s learning this context, it’s also a more complex model. So there’s always this trade-off in NLP, as in all machine learning: when you want a more complex representation of your features, it’s going to be more expensive. With BERT, there are different layers that you can pool this context from. If you use the first layer, it’s the fastest, but it’s often the simplest of the layers. A common intuition about neural networks, especially with deep learning, is that each of the layers represents a higher-order representation of your features. So when you start out at the first layer, it’s a very simple representation of your words; but a much later layer, since BERT is trained against the task of finding missing words (masked words, as they’re referred to in the paper), gets more and more specialized to that task. So if you want to use one of these higher-level features, you’re likely going to need to retrain some of the BERT model on your particular dataset; that’s generally called fine-tuning in the literature.
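In Spark NLP 2.x, the layer to pool from is exposed on the BERT annotator; a sketch (the model name is illustrative; 0 selects the first layer and -2 the second-to-last, which the notebook compares later):

    from sparknlp.annotator import BertEmbeddings

    bert = BertEmbeddings.pretrained('bert_base_cased', 'en') \
        .setInputCols(['sentence', 'token']) \
        .setOutputCol('embeddings') \
        .setPoolingLayer(0)  # try -2 for the second-to-last layer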

So let’s go ahead and get our BERT embeddings. So here we have our data: we have our document, we have the labels for all of the named entities, and now we’ve added our BERT embeddings.

So we’re going to go ahead and write out the embeddings that we have; we only do 1,000 of them, so things can run a little bit quicker. If we pull these embeddings in, we get these vectors, and we’re going to want to pass in these vectors to train, because these are the inputs to our NER model.

Here we have the NER tagger. It accepts these embeddings, and there are a number of parameters that we can set: we can limit the number of epochs, because the NER model itself is also a neural network, and set the batch size, as well as a number of other, logging-related parameters. We can set the learning rate and the decay coefficient, dropout, batch size; a lot of these are techniques used to improve neural network training. I’m not going to actually train it here, because it takes a full minute to train even these thousand examples. Once we train this model, we want to pull out the actual NER model, which would be the second element of our pipeline, and we’re going to want to save that, because we’re going to reuse it later. So here we have our predictions; let’s go ahead and see what they look like.
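The tagger with the parameters mentioned looks roughly like this (a sketch; the values are illustrative rather than the notebook’s exact settings):

    from sparknlp.annotator import NerDLApproach

    ner_tagger = (NerDLApproach()
        .setInputCols(['sentence', 'token', 'embeddings'])
        .setLabelColumn('label')
        .setOutputCol('ner')
        .setMaxEpochs(1)    # passes over the training set
        .setBatchSize(32)
        .setLr(0.001)       # learning rate
        .setPo(0.005)       # learning-rate decay coefficient
        .setDropout(0.5)
        .setVerbose(1))     # logging verbosity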

So here we see, let’s look at this one, it’s nice and short: ‘London’ is labeled location, and we predicted location. That’s great. And some other ones here are labeled organization, while our model predicts them as being the beginning of one; both here seem odd, and I don’t know why the label would have it as an organization.

So let’s go ahead and look at the schema that’s output. This is the full schema that’s output when you’ve annotated everything. It is, of course, quite an intimidating set of objects and features in our data frame, which is why it’s often important to use the Finisher transformer, which can extract the particular elements you’re interested in. In this case, we’ll just be selecting out the elements ourselves.

So here the ground truth says ‘Leicestershire’ is an organization and marks it as such, which I’m a little bit suspicious of, while we don’t mark it at all. So there’s one sort of error for our model. Here we have ‘London’ marked correctly. ‘West Indian all-rounder Phil Simmons.’

So it looks like we only trained on 1,000 examples, so I’m not surprised that we don’t have a great model. Let’s go ahead and load our model: we refer back to that model we saved before, load it up, and use it to predict, and again we see the same results.

So again, we can also put this in pandas, so you can have a cleaner look at the data. Let’s go ahead and use one of the deeper layers. We’re going to be using the second-to-last layer from the BERT embeddings; so instead of the first layer, which is the one used by default, we’re using the second-to-last. So again, we call fit on it, which I did beforehand. And we see here, okay, this looks a little bit more accurate; it’s actually very accurate. We do see some errors here: here it marks ‘West Indian’ as an organization, which I think might actually be more accurate, because I believe this is referring to a sports team.

So we’ve looked at two different ways you can use BERT embeddings. Now let’s go ahead and try this with the GloVe embeddings. The GloVe embeddings we don’t have to train; we just have the embeddings, which are pre-built. So we download that model, we just need to train the new NER model, and we save that aside.

So we do that. And it looks like it didn’t do badly; it missed ‘Indian’ here in ‘West Indian’, similarly wrong to the first layer of the BERT embeddings.

So again, let’s take a nicer, cleaner look at some of these. In notebooks, pandas is often a better way to look at data frames; the downside is that you do need to actually pull the data locally to do that.

So let’s go ahead and build our prediction pipeline. Here we have our document assembler, sentence detector, tokenizer, BERT embeddings, the NER model we trained before, and our converter that’s going to build the entity chunks. So here we pass in something empty, and it returns nothing; that’s a good sanity check. Then we have the sentence that we tried earlier, ‘Peter Parker is a nice guy and lives in New York.’ And let’s see what it finds.
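Assembled, that prediction pipeline looks roughly like this (a sketch; the model name and saved-model path are placeholders):

    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (SentenceDetector, Tokenizer, BertEmbeddings,
                                    NerDLModel, NerConverter)

    document = DocumentAssembler().setInputCol('text').setOutputCol('document')
    sentence = SentenceDetector().setInputCols(['document']).setOutputCol('sentence')
    token = Tokenizer().setInputCols(['sentence']).setOutputCol('token')
    bert = BertEmbeddings.pretrained('bert_base_cased', 'en') \
        .setInputCols(['sentence', 'token']).setOutputCol('embeddings')
    ner = NerDLModel.load('./ner_bert_model') \
        .setInputCols(['sentence', 'token', 'embeddings']).setOutputCol('ner')
    converter = NerConverter() \
        .setInputCols(['sentence', 'token', 'ner']).setOutputCol('entities')

    pred_pipeline = Pipeline(stages=[document, sentence, token,
                                     bert, ner, converter])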

So in this case, it didn’t find anything. Let’s go ahead and try with the GloVe model. It looks like GloVe was a little more successful: it found ‘Peter Parker’ and identified it as a person. So now let’s take our named entity recognition model that we built before and run it on the Mona Lisa text.

And this is sort of similar to the results we saw before. Training on 1,000 examples, a thousand documents, is not going to give you a great NER model. But as you can see, it’s really easy to set up this training with Spark NLP. We essentially just downloaded the CoNLL dataset, trained the NER model on it with the BERT embeddings, and tried a couple of different embedding techniques. You can also potentially use Spark NLP to help fine-tune your BERT embeddings, though that will also depend on what sort of data you’re using: if you’re using general text, the default ones should work fine, but if you’re using specialized text like clinical text, you’re probably going to want to fine-tune your BERT.

Okay, so let’s go over to the next and last notebook. The next notebook we’re going over is about document classification. We looked previously at some text processing as well as named entity recognition, which are frequently used tasks, but they’re often used as inputs to other tasks. Text classification is a great example of a document-wide process, as opposed to one that scans through the document. So let’s go ahead and switch over to the notebook. In this notebook we’re going to be building this classifier; we go through our normal setup, and here we see a link to a blog post where this is discussed more in depth.

So here we do our normal setup, and we’re going to load a different dataset this time. This one is a dataset built more around classification: it’s a news category dataset.

So here we see an example of what’s in there. We see all of these ones are business examples, and we have just the text. So the idea is that I give you an article as text, and I want to be told what news category this belongs in. We have 120,000 examples, and we have the data split between four categories: science and technology, world, sports, and business. So now that we have our dataset loaded, let’s go ahead

and look at our test dataset. We see it’s much smaller, but similarly evenly split. This is good for looking at an example of a library, but realistically you would never get something even close to this balanced. In every actual example of this kind of classification I’ve seen, it’s usually been like one category is half of them, and then there are a hundred little categories. But this is a much cleaner one to look at as an example.

So if we want to split the dataset ourselves, we can go ahead and do that, if we wanted to leave this test dataset as a holdout for the end and perhaps do some hyperparameter tuning on a validation set.

So let’s get into it. We’re going to build our pipeline. So we have the pieces we’re familiar with from before: the document assembler, the tokenizer, and the normalizer. The normalizer is something that does cleanup on the text. Stemming and lemmatization will reduce the number of forms of the words; spell checking will remove very rare misspelled words so that we can group them together with the correct, more frequently occurring tokens; and stop-word removers essentially just remove words that are so common as to be useless to us. But a normalizer is built to use regexes and some other techniques to remove odds and ends, perhaps left over from HTML scraping.

So then, once we have our lemmas, we’ll want to get our GloVe embeddings for them.

So once we have GloVe embeddings, each sentence will be represented as a list of vectors. But if we want to classify the whole document, we’re actually going to want to find some way of aggregating all these vectors together. So we’re going to use a simple averaging technique. There are, of course, more sophisticated techniques out there, especially with neural networks, that you could use; you could build a recurrent neural network to classify it. But averaging is often a very good first attempt for text classification, and if it doesn’t work, you’ll know what the next steps are.

So once we’ve averaged them, we’re going to run that into the classifier, and this classifier trains for three epochs; it’s going to predict which of the news categories the document is in. So we go ahead and train on our data. In this case it took four minutes: a bit too long for a talk such as this, but not too bad when you’re actually running it on 120,000 training examples.
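Pieced together, the training pipeline looks roughly like this (a sketch; the label column name ‘category’ and the pretrained model names are assumptions about the notebook’s setup):

    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (Tokenizer, Normalizer, StopWordsCleaner,
                                    LemmatizerModel, WordEmbeddingsModel,
                                    SentenceEmbeddings, ClassifierDLApproach)

    document = DocumentAssembler().setInputCol('text').setOutputCol('document')
    token = Tokenizer().setInputCols(['document']).setOutputCol('token')
    normalizer = Normalizer().setInputCols(['token']).setOutputCol('normalized')
    stopwords = StopWordsCleaner().setInputCols(['normalized']).setOutputCol('clean')
    lemma = LemmatizerModel.pretrained().setInputCols(['clean']).setOutputCol('lemma')
    glove = WordEmbeddingsModel.pretrained('glove_100d') \
        .setInputCols(['document', 'lemma']).setOutputCol('embeddings')
    doc_embeddings = SentenceEmbeddings() \
        .setInputCols(['document', 'embeddings']) \
        .setOutputCol('sentence_embeddings') \
        .setPoolingStrategy('AVERAGE')  # average word vectors into one document vector
    classifier = ClassifierDLApproach() \
        .setInputCols(['sentence_embeddings']) \
        .setOutputCol('class') \
        .setLabelColumn('category') \
        .setMaxEpochs(3)

    clf_pipeline = Pipeline(stages=[document, token, normalizer, stopwords,
                                    lemma, glove, doc_embeddings, classifier])
    model = clf_pipeline.fit(train_df)  # train_df has text and category columns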

So here we see the log for the training, and now we can use the model to predict on the test dataset.

So let’s go ahead and see what our predictions are. This looks pretty good; the first one is business. Let’s see, there’s one mistake here with science and tech: ‘Wiltshire police warn about phishing after its fraud squad...’ and so on. So it looks like this article might be about this type of scammer, and here we’ve identified it as world; apparently it’s supposed to be science and technology.

Often with document classification, as we talked about with named entity recognition, it can be a bit of a judgment call which category something falls into. So if our world category, or let’s say there was a crime category, or our world category contains a lot of crime documents, then maybe this would belong in there. But it seems like it’s focusing on phishing, so science and technology seems appropriate.

So let’s go ahead and look at the actual predictions and look at some metrics. It looks like for each one of these categories we did pretty well on recall; sports got really good precision and recall, so we did really well on that one. Looking at the previous misclassifications from above, I’d imagine that business, science and technology, and world have some clashes between them, whereas sports probably has a very different vocabulary. And if you consider what this averaging is doing: when you have word2vec, or GloVe, or even BERT embeddings, the words that you’re turning into vectors are being converted into points in some space, and the idea is that in that space the distances are supposed to be similar to the semantic distances between the words, so that two words that are very different from each other should be very far apart in the space.

So if you average a bunch of word vectors in that space, what you’re getting is something that’s supposed to be like a semantic middle ground between all those words. That’s the idea of what the averaging technique is doing.

And so that’s why something like sports, which has a very different vocabulary, is likely making vectors that, when averaged, sit in a very different part of our vector space,

whereas the other ones might collide some. So overall, this isn’t a bad model, considering how even the split is. I would say I think we could probably do better if we iterated more, perhaps built a more complex model.

So let’s look at what happens if we use sentence embeddings. In this case, we’re going to be predicting each sentence. We’re going to be doing the same technique as we did before, where we averaged across documents, but now we average across each sentence and predict the document’s label for the sentences in it. Then our centroid, the middle-ground point made from the vectors, is going to be more controlled, so that for a given sentence we don’t have as many things to average out; because the more things you average out, the more you lose specific information about your document.

So let’s go ahead and fit that model.

So let’s go ahead. Okay, so here, let’s see where the predictions go once we train that model.

So here we have these predictions. Here’s an example prediction: ‘Fearing the fate of Italy, the center-right government has threatened to be merciless...’ This looks more like a world one, and we’ve classified it as a business one. So, a couple of techniques you can use when classifying: if you look at the time before deep learning, the common technique was to just convert your document into a bag of words and use that to predict. That’s not much different from averaging all the vectors; the differences are that you’re baking in some information about the words when you use word2vec versus bag of words, and you also don’t have to deal with the humongous number of features that you’d have to deal with in the bag-of-words method.

In a task like this specifically, my next step would probably be to find some keywords that I think are more useful, so that I can remove some of the words that these categories might share in common. Let’s go ahead and look at saving our model. So we have our different stages here: we have our assembler, our universal sentence encoder, and then our classifier.

So once we have that model, we can go ahead and save it, and then we can of course load it to be used again. And let’s go ahead and do something live; we’re going to go ahead and just annotate something here.

So let’s go ahead and build ourselves another light model.

So we’re going to take this one here as a whole. Let me, let me see, if I go back a little bit. Okay, let me, let’s see, what is this? It looks like it’s only ‘Christopher Silver’.

Oh, okay. I think I know what the issue is. I need to build another pipeline. Okay. No worries. I will go in.

Okay. Oh, no, no, I just needed to copy this piece right here.

I probably grabbed the wrong piece of code when I was preparing the live example, which is of course the danger of live examples. Okay. (laughs)

So now that we’ve trained our own model, we can extract it to save it. We look at our pipeline, we want to take out our classifier model, and we’re going to want to save that. And now that we’ve saved it, we can load it. From here on, when you download the notebook, you can feel free to go and play around and try to do classification on your own. Maybe try BERT embeddings and see if they actually improve the classification, or even try some classic techniques.
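Extracting, saving, and reloading the trained classifier stage looks roughly like this (a sketch; the path and stage index are illustrative, and ‘model’ is the fitted PipelineModel from earlier):

    from sparknlp.annotator import ClassifierDLModel

    # The fitted classifier is the last stage of the fitted pipeline
    classifier_model = model.stages[-1]
    classifier_model.write().overwrite().save('./classifier_dl_model')

    # Reload it later and drop it into a new pipeline
    loaded_classifier = ClassifierDLModel.load('./classifier_dl_model')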

So that’s it for the notebooks. Now that we’ve seen some notebooks, you can of course use Spark NLP yourself. There’s the free version that’s out there, and that’s what we’ve been demoing, but there are also Spark NLP for Healthcare and Spark OCR. Healthcare is a very difficult domain for NLP, so there are lots of specialized models and annotators that have been built specifically for it by John Snow Labs. And Spark OCR is a number of annotators used for turning images into text; if you’re working in any sort of legacy industry, or anything that’s going to send you PDFs instead of normal text files, OCR is probably something you’re going to need to use at some point. There are a number of great use cases from companies that have already used Spark NLP; here we see Kaiser Permanente and Roche, and you can feel free to click through to those and learn more about those example use cases. What’s great about Spark NLP is that it’s really easy to get started with: there’s a free version, but it’s also production grade, and there are not many NLP libraries that can claim both. So that’s why I encourage you to go ahead and join the Spark NLP community. You can join the Slack channel and go to the GitHub; the developers are super responsive and really knowledgeable about both the library and NLP and deep learning. So it’s a really great community, and a great place to learn more about Spark NLP. And I also encourage you to go through the Spark NLP workshops.

About David Talby

John Snow Labs

David Talby is a chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP – the world’s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

About Alexander Thomas

Wisecube AI

Alex Thomas is a principal data scientist at Wisecube. He's used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical data. Alex is also the author of Natural Language Processing with Spark NLP.