Being able to quickly and accurately partition the legal space into coherent, useful categories is crucial to conducting legal research. In service of this goal, various lists and taxonomies of legal topics have been created. Recently, Ravel reworked our own legal topics system to give ourselves the flexibility to expand the topic taxonomy more easily and to provide a clearer, more immediately useful service for users.
Document classification is a common task that usually relies on at least some amount of human-sourced information to assign labels to documents. In our case, we have a predefined set of labels corresponding to legal topics that we must assign to opinions. We allow potentially many labels per document, making this an instance of multi-class, multi-label classification.
For our particular use case, we use a Binary Relevance approach, where a classifier for each individual label/topic is trained independently and subsequently applied to texts. This setup allows for multi-label predictions while still using straightforward linear models. While this is a fairly standard approach, we make use of an interesting annotation framework to maximize the effectiveness of our annotators’ time and reduce the time needed to roll out new topics into our system.
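To make the Binary Relevance idea concrete, here is a minimal sketch in plain Python. It is not Ravel's implementation: the perceptron stands in for whatever linear model is actually used, and the names (`Perceptron`, `train_binary_relevance`, `predict_labels`) and the four toy training documents are assumptions made up for illustration.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


class Perceptron:
    """A minimal linear binary classifier (illustrative stand-in only)."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, feats):
        return self.bias + sum(self.weights.get(w, 0.0) * c for w, c in feats.items())

    def predict(self, feats):
        return self.score(feats) > 0

    def fit(self, examples, epochs=10):
        # Classic perceptron rule: update the weights only on mistakes.
        for _ in range(epochs):
            for feats, is_positive in examples:
                if self.predict(feats) != is_positive:
                    step = 1.0 if is_positive else -1.0
                    for w, c in feats.items():
                        self.weights[w] = self.weights.get(w, 0.0) + step * c
                    self.bias += step


def train_binary_relevance(docs, labels):
    """Train one independent yes/no classifier per label.

    docs is a list of (text, set_of_labels) pairs.
    """
    models = {}
    for label in labels:
        examples = [(featurize(text), label in doc_labels) for text, doc_labels in docs]
        model = Perceptron()
        model.fit(examples)
        models[label] = model
    return models


def predict_labels(models, text):
    """A document receives every label whose classifier says yes."""
    feats = featurize(text)
    return {label for label, model in models.items() if model.predict(feats)}


# Toy training data, invented for this example.
docs = [
    ("copyright trademark infringement", {"Intellectual Property"}),
    ("wrongful termination employee wages", {"Employment Law"}),
    ("employee copyright ownership dispute", {"Intellectual Property", "Employment Law"}),
    ("contract breach damages", set()),
]
models = train_binary_relevance(docs, ["Intellectual Property", "Employment Law"])
print(predict_labels(models, "employee copyright ownership dispute"))
```

Because each label has its own classifier, a document can receive zero, one, or several topics, and adding a topic means training one more classifier rather than retraining the others.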
Prior to this work, the topic model in use at Ravel was based on a fixed Latent Dirichlet Allocation (LDA) model, summarized in Fig. 1. After selecting an arbitrary value for the number of LDA clusters, the trained model was manually examined and human-readable labels were assigned to the various clusters as appropriate. New documents were assigned an LDA cluster and simply inherited that cluster’s topic label.
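The labeling step of that older scheme amounts to a lookup table. The sketch below assumes the LDA cluster id for a document has already been computed elsewhere; the cluster ids, the labels attached to them, and the function name `topic_for_cluster` are hypothetical.

```python
# Hypothetical hand-assigned labels. Clusters that looked incoherent
# on manual inspection simply have no entry.
CLUSTER_LABELS = {
    0: "Intellectual Property",
    3: "Employment Law",
}


def topic_for_cluster(cluster_id):
    """Return the single label a document inherits from its cluster, or None."""
    return CLUSTER_LABELS.get(cluster_id)


print(topic_for_cluster(0))
print(topic_for_cluster(1))
```

A lookup like this can return at most one label per document, and returns nothing at all for any document that lands in an unlabeled cluster.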
Because many of the clusters represented no particular coherent topic, only a subset of opinions received a label under this scheme, and each opinion could receive at most one. Neither property is ideal for our purposes. Furthermore, adding new topics would require retraining the model entirely and reanalyzing each new cluster, with no assurance that existing clusters would survive the retraining. This last point is the most problematic for a system that will eventually expand to several hundred topics.
Typical document classification systems involve getting top-level labels for exemplar documents from annotators and allowing a featurization routine to extract useful features from those documents, which are then used to make predictions on new, unseen data. In the original Ravel topic system, there was a single feature (namely the LDA cluster that the document fell into) and a hard-coded mapping from that feature to labels. Because there were no top-level labels on the documents, the actual learning potential of the model was minimal; the model itself was unsupervised.
As a pilot study for a new model (shown in Fig. 2), our annotation team provided several hundred top-level labels for documents in our corpus. Then, rather than using LDA clusters as features, a standard lexical featurization was conducted, where the features for a document became the counts of the words used in that document. The intuition is that if documents we label as “Intellectual Property” frequently contain references to things like ‘copyright’ or ‘trademark’, we can learn to associate those words with the topic of “Intellectual Property”.
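A lexical featurization of this kind can be sketched in a few lines. The tokenizer here (lowercasing plus a simple regular expression) and the function name `word_counts` are assumptions for illustration; the post does not say how the text was actually tokenized.

```python
import re
from collections import Counter


def word_counts(text):
    """Map each lowercased word in the text to the number of times it occurs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


doc = "The copyright holder alleged that the trademark and the copyright were infringed."
features = word_counts(doc)
print(features["copyright"], features["trademark"], features["patent"])
```

Each distinct word becomes one feature, so a classifier trained on labeled documents can learn a weight per word for each topic.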
While this process is a well-used one, the particular details of our situation complicate it. For one thing, the taxonomy of topics into which we are classifying is rather large and will fluctuate and expand over time. As such, we place a high value on our ability to quickly and reliably create new topics. However, adding a new label under standard document classification practices involves actually obtaining a set of documents that should have that new label: the more the better, with a few dozen as a reasonable minimum. This is not an easy task if, for example, the topic is highly specific; we may not have access to a large enough set of relevant documents to support the new topic.
Instead of requiring that we assemble a new set of documents for every new topic that we wish to support, our solution allows our annotation team to simply describe the nature of those documents. This is done by having the annotators associate important phrases from documents with topic labels rather than providing top-level labels for the entire document itself, as shown in Fig. 3. While the existence of a labeled document set is helpful in that it provides a straightforward source for those important phrases, this process can actually be run entirely independently of the corpus: annotators can opt to supply their own key phrases in lieu of obtaining a corresponding set of documents. In our case, a smaller number of whole documents were labeled after the fact in order to provide an evaluation set.
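One way to picture phrase-level annotation is as a table from topics to weighted key phrases, which can score a document directly. Everything specific in this sketch is an assumption: the phrases, the weights, the threshold value, the naive substring matching, and the names `TOPIC_PHRASES`, `topic_scores` and `assign_topics` are illustrative rather than taken from Ravel's system.

```python
# Hypothetical annotations: topic -> {key phrase: weight}.
TOPIC_PHRASES = {
    "Intellectual Property": {"copyright": 1.0, "trademark": 1.0, "fair use": 2.0},
    "Employment Law": {"wrongful termination": 2.0, "overtime": 1.0, "employee": 0.5},
}

# Assumed cutoff: a topic is assigned when its matched weights reach this total.
THRESHOLD = 1.5


def topic_scores(text):
    """Sum the weights of each topic's key phrases that appear in the text."""
    lowered = text.lower()
    return {
        topic: sum(weight for phrase, weight in phrases.items() if phrase in lowered)
        for topic, phrases in TOPIC_PHRASES.items()
    }


def assign_topics(text, threshold=THRESHOLD):
    """Return every topic whose score meets the threshold."""
    return {topic for topic, score in topic_scores(text).items() if score >= threshold}


opinion = "The employee claimed fair use of the copyright after a wrongful termination."
print(topic_scores(opinion))
print(assign_topics(opinion))
```

Adding a topic in this sketch means adding one more entry to the table, with no labeled documents required up front.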
Structuring the model in this way provides a few benefits. First, these new annotations directly provide the feature weightings that all topic models need to function, rather than relying on the learning algorithm to both select important phrases and learn their weightings. Second, the annotation task is now very low-intensity and can be done quickly and cheaply (the interface used by the annotators is shown in Fig. 4). Finally, building the model around key terms and phrases relates the entire system to the well-known idea of Boolean search; in some ways, each topic is a sort of archived search result, where our annotators have provided a tuned search query that returns a particular set of topically relevant documents.
The most obvious changes made by this work are that a much higher percentage of the Ravel corpus is now assigned topics, and most labeled documents now receive multiple topics. This means that finding interesting cases at the intersection of topics is easier than ever. For instance, browsing case law by sets of topics like “Intellectual Property” combined with “Employment Law” is now possible.
To see one example of the new topics system in action, check out the new Law Firms Analytics dashboards (on the left) to see firm-by-firm breakdowns of commonly litigated topics.