Finding trending Meetup topics using Streaming Data, Named Entity Recognition and Zeppelin Notebooks — a tale of a super enthusiastic working group during the pandemic times
We started out as a working group from bigdata.ro. The team consisted of Valentina Crisan, Ovidiu Podariu, Maria Catana, Cristian Stanciulescu, Edwin Brinza and me, Andrei Deusteanu. Our main purpose was to learn and practice Spark Structured Streaming, Machine Learning and Kafka. We designed the entire use case and then built the architecture from scratch.
Since Meetup.com provides data through a real-time API, we used it as our main data source. We did not use the data for commercial purposes, just for testing.
This is a learning case story. We did not really know from the beginning what would be possible or not. Looking back, some of the steps could have been done better. But, hey, that’s how life works in general.
The problems we tried to solve:
- Allow meetup organizers to identify trending topics related to their meetup. We computed Trending Topics based on the description of the events matching the tags of interest to us. We did this using the John Snow Labs Spark NLP library for extracting entities.
- Determine which Meetup events attract the most responses within our region. Therefore we monitored the RSVPs for meetups based on certain tags, related to our domain of interest — Big Data.
For this we developed 2 sets of visualizations:
- Trending Keywords
- RSVPs Distribution
The first 2 elements are common to both sets of visualizations. This is the part that reads data from the Meetup.com API and saves it in 2 Kafka topics.
1. The Stream Reader script fetches data on Yes RSVPs, filtered by certain tags, from the Meetup Stream API. It then selects the relevant columns and saves this data into the rsvps_filtered_stream Kafka topic.
2. For each RSVP, the Stream Reader script then fetches the event data, but only if the event_id does not already exist in the events.idx file. This way we make sure that we read event data only once. The setup for the Stream Reader script is described in Install Kafka and fetch RSVPs.
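The "fetch each event only once" check can be sketched in plain Python as follows. This is a minimal illustration, not the actual Stream Reader script: it assumes events.idx holds one event id per line, and `fetch_event` is a hypothetical callable standing in for the Meetup API request.

```python
from pathlib import Path
import tempfile


def load_seen_ids(index_path):
    """Return the set of event ids already recorded in the index file."""
    if not index_path.exists():
        return set()
    lines = index_path.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def fetch_event_once(event_id, index_path, seen, fetch_event):
    """Fetch event data only if the id is not in the index yet.

    `fetch_event` is a hypothetical stand-in for the real API call.
    Returns the event, or None when the id was already processed.
    """
    if event_id in seen:
        return None
    event = fetch_event(event_id)
    # Record the id so later RSVPs for the same event are skipped.
    with index_path.open("a") as index_file:
        index_file.write(event_id + "\n")
    seen.add(event_id)
    return event


# Usage with a fake fetcher and a temporary index file (assumed layout).
index_path = Path(tempfile.mkdtemp()) / "events.idx"
seen = load_seen_ids(index_path)
calls = []


def fake_fetch(event_id):
    calls.append(event_id)
    return {"event_id": event_id, "description": "placeholder"}


for rsvp_event_id in ["e1", "e2", "e1"]:
    fetch_event_once(rsvp_event_id, index_path, seen, fake_fetch)
```

Keeping the ids in a file as well as in memory means a restarted script would still skip events it has already fetched.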
Trending Keywords
3. The Spark ML — NER Annotator reads data from the Kafka topic events and then applies a Named Entity Recognition Pipeline with Spark NLP. Finally it saves the annotated data in the Kafka topic TOPIC_KEYWORDS. The Notebook with the code can be found here.
4. Using KSQL we create 2 consecutive streams to transform the data and finally 1 table that will be used by Spark for the visualization. In Big Data architectures, SQL engines only build logical objects that assign metadata to the physical layer objects. In our case these were the streams we built on top of the topics. We link the data from TOPIC_KEYWORDS to a new stream, called KEYWORDS, via KSQL. Then, using a Create as Select, we create a new stream, EXPLODED_KEYWORDS, which explodes the data, since all of the keywords were in an array. Now we have 1 row for each keyword. Next, we count the occurrences of each keyword and save the result into a table, KEYWORDS_COUNTED. The steps to set up the streams and the table, together with the KSQL code, can be found here: Kafka - Detailed Architecture.
5. Finally, we use the Vegas library to produce the visualizations on Trending Keywords. The Notebook describing all steps can be found here.
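The explode-then-count logic of step 4 can be illustrated in plain Python. This is only a sketch of the transformation, not the KSQL code we ran; the record layout (`event_id`, `keywords`) is an assumption made for the example.

```python
from collections import Counter


def explode_keywords(records):
    """Turn one record holding a keyword array into one row per keyword."""
    for record in records:
        for keyword in record["keywords"]:
            yield {"event_id": record["event_id"], "keyword": keyword}


def count_keywords(exploded_rows):
    """Count how many times each keyword occurs across all rows."""
    return Counter(row["keyword"] for row in exploded_rows)


# Hypothetical records, shaped like the annotated events (assumed fields).
records = [
    {"event_id": "e1", "keywords": ["Kafka", "Spark"]},
    {"event_id": "e2", "keywords": ["Spark", "Zeppelin"]},
]

exploded = list(explode_keywords(records))  # 1 row per keyword
counts = count_keywords(exploded)           # keyword -> occurrences
```

Here `exploded` plays the role of the EXPLODED_KEYWORDS stream and `counts` the role of the KEYWORDS_COUNTED table.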
Detailed Explanation of the NER Pipeline
In order to annotate the data, we need to transform it into a certain format, from text to numbers, and then back to text.
- We first use a DocumentAssembler to turn the text into a Document type.
- Then, we break the document into sentences using a SentenceDetector.
- After this we separate the text into smaller units by finding the boundaries of words using a Tokenizer.
- Next we remove HTML tags and numerical tokens from the text using a Normalizer.
- After the preparation and cleaning of the text we need to transform it into a numerical format, vectors. We use an English pre-trained WordEmbeddingsModel.
- Next comes the actual keyword extraction using an English NerDLModel Annotator. NerDL stands for Named Entity Recognition Deep Learning.
- Further on we need to transform the numbers back into human-readable text. For this we use a NerConverter and save the results in a new column called entities.
- Before applying the model to our data, we need to run an empty training step. We use the fit method on an empty DataFrame because the model is pretrained.
- Then we apply the pipeline to our data and select only the fields that we’re interested in.
- Finally we write the data to the Kafka topic TOPIC_KEYWORDS.
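To make the last conversion step more concrete, here is a plain-Python sketch of the idea behind it: grouping per-token tags into whole entities. It is not Spark NLP code. It assumes the NER model labels each token with IOB-style tags (B- starts an entity, I- continues it, O is outside any entity), and the tokens and tags in the example are made up.

```python
def merge_iob_tags(tokens, tags):
    """Group tokens tagged B-X / I-X into (entity_text, label) chunks.

    An I- tag that does not continue an open entity of the same label
    is ignored, which is a strict reading of the IOB scheme.
    """
    entities = []
    current_tokens = []
    current_label = None
    for token, tag in zip(tokens, tags):
        continues_entity = tag.startswith("I-") and tag[2:] == current_label
        if continues_entity:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
        if tag.startswith("B-"):
            current_tokens, current_label = [token], tag[2:]
        else:
            current_tokens, current_label = [], None
    # Close an entity that runs to the end of the text.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Made-up tokens and tags for illustration only.
tokens = ["Join", "our", "Apache", "Kafka", "meetup", "in", "Bucharest"]
tags = ["O", "O", "B-ORG", "I-ORG", "O", "O", "B-LOC"]
entities = merge_iob_tags(tokens, tags)
```

In this example the two tokens tagged B-ORG and I-ORG end up as a single entity, which is the form we then used as a keyword.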
RSVPs Distribution
3. Using KSQL we aggregate and join data from the 2 topics to create 1 stream, RSVPS_JOINED_DATA, and subsequently 1 table, RSVPS_FINAL_TABLE, containing all RSVP counts. The KSQL operations and their code can be found here: Kafka - Detailed Architecture.
4. Finally, we use the Vegas library to produce visualizations on the distribution of RSVPs around the world and in Romania. The Zeppelin notebook can be found here.
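The join-then-aggregate idea from step 3 can be sketched in plain Python. This is an illustration under assumptions, not our KSQL: the field names (`event_id`, `country`, `city`, `event_name`) are hypothetical, and the join is shown as a simple inner join on event_id.

```python
from collections import Counter


def join_rsvps_with_events(rsvps, events):
    """Inner-join RSVPs with event data on event_id."""
    events_by_id = {event["event_id"]: event for event in events}
    joined = []
    for rsvp in rsvps:
        event = events_by_id.get(rsvp["event_id"])
        if event is None:
            continue  # no matching event, so the RSVP is dropped
        joined.append({**rsvp, "event_name": event["event_name"]})
    return joined


def count_rsvps(joined_rows, key):
    """Count RSVPs grouped by one column, e.g. country or event_name."""
    return Counter(row[key] for row in joined_rows)


# Hypothetical sample data (assumed field names).
rsvps = [
    {"event_id": "e1", "country": "ro", "city": "Bucharest"},
    {"event_id": "e1", "country": "ro", "city": "Cluj-Napoca"},
    {"event_id": "e2", "country": "us", "city": "New York"},
    {"event_id": "e9", "country": "de", "city": "Berlin"},
]
events = [
    {"event_id": "e1", "event_name": "Spark working group"},
    {"event_id": "e2", "event_name": "Kafka night"},
]

joined = join_rsvps_with_events(rsvps, events)
by_country = count_rsvps(joined, "country")
by_event = count_rsvps(joined, "event_name")
```

Counting the joined rows by different keys gives the kind of breakdowns shown in the charts: by country, by city or by event.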
We used a machine from Hetzner Cloud with the following specs: CPU: Intel Xeon E3-1275v5 (4 cores/8 threads), Storage: 2×480 GB SSD (RAID 0), RAM: 64 GB.
RSVPs Distribution
These visualizations are based on data collected between the 8th of May, 22:15 UTC and the 4th of June, 11:23 UTC.
Worldwide — Top Countries by Number of RSVPs
Worldwide — Top Cities by Number of RSVPs
Worldwide — Top Events by Number of RSVPs
Romania — Top Cities in Romania by Number of RSVPs
Romania — Top Meetup Events
Romania — RSVPs Distribution
Europe — RSVPs Distribution
Trending Keywords
Worldwide
Romania
This visualization is based on almost 1 week of data from the start of August. The reason for this is detailed in the Issues encountered section, point 5.
Issues encountered
All of these issues are mentioned in the published Notebooks as well.
1. Visualizing data directly from the stream using the Helium Zeppelin add-on and the Vegas library did not work. We had to spill the data to disk, then build DataFrames on top of the files and finally do the visualizations.
2. Spark NLP did not work for us in a Spark standalone local cluster installation (with the local file system). Standalone local cluster means that the Spark Cluster Manager and the Workers all run on the same physical machine. Such a setup does not need distributed storage such as HDFS. The workaround for us was to configure Zeppelin to use local Spark, local[*], a non-distributed single-JVM deployment mode available in Zeppelin.
3. The Vegas plug-in could not be enabled initially. Running the recommendation from GitHub, %dep z.load("org.vegas-viz:vegas_2.11:{vegas-version}"), always raised an error. The workaround was to add all the dependencies manually in /opt/spark/jars. These dependencies can be found by launching the Spark shell with the Vegas library: /opt/spark/bin/spark-shell --packages org.vegas-viz:vegas-spark_2.11:0.3.11
4. The Helium Zeppelin add-on did not work and could not be enabled. It too raised an error when we tried to enable it from the Zeppelin GUI in our configuration. We did not manage to solve this issue. That’s why we used only Vegas, although it does not support map visualizations. In the end we got a bit creative: we exported the data and loaded it into Grafana for the map visualizations.
5. The default retention policy for Kafka is 7 days. This means that data older than 1 week is deleted. For some of the topics we changed this setting, but for others we forgot to do so and therefore lost the data. This affected our visualization for the Trending Keywords in Romania.
- In the world of Big Data you need clarity around the questions you’re trying to answer before building the data architecture, and then you need to follow the plan through to make sure you’re still working towards those questions. Otherwise, you might end up with something that can’t do what you actually need. It sounds like a pretty general statement, and a pretty “DOH, OBVIOUSLY” one. Once we saw the visualizations, we realized that we had not created the Kafka objects to match the per-country keyword distribution visualization we initially planned: for example, we created the count aggregation across all countries in the KEYWORDS_COUNTED table. Combined with the mistake of forgetting to change the Kafka retention period from the default 7 days, this meant that by the time we noticed, we had lost the historical data as well. Major learning point.
- Data should be filtered before the ML/NLP process. We should have removed some keywords that don’t make sense on their own, such as “de” and “da”. To get more relevant insights, several rounds of data cleaning and keyword extraction might be needed.
- After seeing the final visualizations, we should probably also have filtered out some of the obvious words. For example, of course Zoom was the highest scoring keyword, since by June everybody was running only online meetups, mainly on Zoom.
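Looking back, the two fixes described above (filtering out unhelpful keywords and counting per country rather than across all countries) could look roughly like this in plain Python. It is a sketch of what we would do differently, not code we ran. The stop list and the field names (`country`, `keyword`) are assumptions, and lowercasing the keywords is a design choice made for the example.

```python
from collections import Counter

# Hypothetical stop list: filler words plus keywords that are too obvious.
STOP_KEYWORDS = {"de", "da", "zoom"}


def count_keywords_per_country(rows, stop_keywords=STOP_KEYWORDS):
    """Count keywords per (country, keyword), skipping stop keywords.

    Keywords are lowercased so that "Kafka" and "kafka" count together.
    """
    counts = Counter()
    for row in rows:
        keyword = row["keyword"].strip().lower()
        if not keyword or keyword in stop_keywords:
            continue
        counts[(row["country"], keyword)] += 1
    return counts


# Hypothetical exploded keyword rows (assumed field names).
rows = [
    {"country": "ro", "keyword": "Kafka"},
    {"country": "ro", "keyword": "de"},
    {"country": "ro", "keyword": "kafka"},
    {"country": "us", "keyword": "Zoom"},
    {"country": "us", "keyword": "Spark"},
]

per_country = count_keywords_per_country(rows)
```

Keeping the country in the aggregation key is what makes a per-country view possible later, without going back to the raw data.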
This study group was a great way for us to learn about an end-to-end solution that uses Kafka to ingest streaming data, Spark to process it and Zeppelin for visualizations. We recommend this experience for anyone interested in learning Big Data technologies together with other passionate people, in a casual and friendly environment.
This article originally appeared on https://bigdata.ro/2020/08/09/spark-working-group/