AI-Powered Topic Mining for the Coal Mining Museum

The system streamlines the research process, saving the museum's researchers significant time and effort.

TIMELINE: 3 WEEKS

COUNTRY: POLAND
TEAM: SOFTWARE DEVELOPER & SENIOR AI ENGINEER

About

Our team developed an AI-powered system to analyze a vast collection of historical documents from various coal mines, enabling researchers at the Coal Mining Museum to efficiently extract valuable insights and identify potential research topics.

Result in Short: The system streamlines the research process, saving the museum's researchers significant time and effort. In just three weeks, we developed a fully functioning solution leveraging AI.

The museum's digitization department sought an efficient solution to process these documents. The goal was to allow researchers to browse generated research topics and easily access the underlying source materials without the need for labor-intensive manual review.

In response, the museum issued a public tender for a proof-of-concept system that would leverage AI to automate the extraction of valuable information from these documents.

Photos titled “Historic Coal Mine Guido” by Mariusz Ciszewski, published under the CC BY-ND 2.0 license.

The Client & The App

The client for this project is the Coal Mining Museum, a significant public institution in Poland, employing approximately 400 people. Dedicated to researching and preserving coal mining history, the museum holds an extensive collection of historical documents from various coal mines across Poland. The museum received funding to digitize this vast collection.

Our Role & Involvement

01

Proof-of-Concept Development Under a Tight Deadline

Our team was tasked with the complete development of the proof-of-concept system within a tight deadline of three weeks. With considerable autonomy over the approach and technologies, we designed a custom solution that employed AI algorithms, including clustering and large language models (LLMs), to process the museum’s historical documents.

02

Integration with Snowflake and User-Friendly Interface

We integrated the system with Snowflake, a cloud-based data platform, to ensure robust and scalable data processing. To make the system user-friendly, we developed an intuitive interface using Streamlit, enabling researchers to effortlessly interact with the system.

03

Overcoming Challenges to Deliver Meaningful Themes

The key challenge was to identify meaningful themes across multiple documents. The system needed to handle documents in seven languages, ensuring all output was delivered in Polish. We developed a solution that met all the requirements by leveraging AI algorithms like clustering and LLMs to extract coherent themes.

System Requirements and Approach

The system was designed to meet the following requirements:

Multi-language

The system could process documents in multiple languages, including Polish, German, Russian, Chinese, and others, ensuring broad usability.

Topic identification

The AI needed to identify and group content into meaningful themes, ensuring each topic was derived from at least four different documents.

Polish output

All results, including topic descriptions and reports, were required to be in Polish, regardless of the input document's language.

Report generation

The system would generate an Excel report with detailed descriptions of the topics and references to the source materials.

The System Workflow

The system’s workflow can be broken down into the following key steps:

1

Document Ingestion

Various historical documents are uploaded to the Snowflake platform. Snowflake functions as a data warehouse, providing integration with Large Language Models and enabling the transformation of unstructured document data into structured data that can be easily analyzed.

2

Data Preprocessing & Vectorization

Once ingested, the documents undergo preprocessing to extract relevant text. Using advanced Natural Language Processing (NLP) techniques, this content is transformed into embedding vectors (mathematical representations of the semantic meaning of the text), enabling efficient semantic analysis.
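To illustrate what "semantic analysis" on embedding vectors can look like, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors and their names are invented for illustration; the case study does not say which embedding model or similarity measure the system uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (assumed values); real embedding models
# typically produce vectors with hundreds of dimensions.
ventilation_pl = [0.9, 0.1, 0.0]   # e.g. a Polish passage about ventilation
ventilation_de = [0.8, 0.2, 0.1]   # e.g. a German passage on the same subject
payroll = [0.0, 0.1, 0.9]          # e.g. a passage about an unrelated subject

sim_same_topic = cosine_similarity(ventilation_pl, ventilation_de)
sim_other_topic = cosine_similarity(ventilation_pl, payroll)
```

In this toy data, the two passages on the same subject score close to 1 even though they are in different languages, while the unrelated passage scores close to 0.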

3

Topic Clustering

Clustering algorithms group these vectors based on semantic similarity, forming clusters that represent potential research topics. Each cluster must consist of content derived from at least four different documents, as per the project requirements.
An additional challenge was that the number of clusters (topics) in the document set was not known beforehand. We therefore used DBSCAN, which determines the number of clusters from the data and can detect noise.
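The clustering step can be sketched as follows. This is a small, dependency-free DBSCAN written for illustration only, followed by a filter for the "at least four different documents" rule. The cosine distance metric, the eps and min_samples values, the toy vectors, and the function names are all assumptions; a production system would more likely rely on a library implementation.

```python
import math

NOISE = -1  # label for points that belong to no cluster

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; NOISE marks outliers."""
    n = len(vectors)
    # Each point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:  # core point: keep expanding
                queue.extend(neighbors[j])
        cluster += 1
    return labels

def topics_with_enough_sources(labels, doc_ids, min_docs=4):
    """Keep only clusters whose chunks come from at least min_docs documents."""
    docs_per_cluster = {}
    for label, doc_id in zip(labels, doc_ids):
        if label != NOISE:
            docs_per_cluster.setdefault(label, set()).add(doc_id)
    return {c: d for c, d in docs_per_cluster.items() if len(d) >= min_docs}

# Toy 2-dimensional vectors: one group from four documents, one group from
# only two documents, and a single outlier.
vectors = [
    (1.0, 0.0), (1.0, 0.05), (0.95, 0.1), (1.0, 0.1), (0.9, 0.02),
    (0.0, 1.0), (0.05, 1.0), (0.1, 1.0), (0.02, 0.9),
    (1.0, 1.0),
]
doc_ids = ["doc1", "doc2", "doc3", "doc4", "doc1",
           "doc5", "doc5", "doc6", "doc6",
           "doc7"]

labels = dbscan(vectors, eps=0.05, min_samples=3)
topics = topics_with_enough_sources(labels, doc_ids)
```

Here DBSCAN finds two clusters without being told how many to look for and marks the outlier as noise. The second cluster is then discarded because its chunks come from only two documents.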

4

Topic Refinement & Description

Large Language Models (LLMs) analyze the content within each cluster to generate a research title and a concise description of how the topic relates to its bibliography. Each identified topic is supported by a bibliography of at least four documents to ensure its relevance and reliability. Citations are given at page level: for each document, the system identifies which pages are relevant to the main research question. Additionally, each source is further described, highlighting how it is relevant to the overarching research topic.
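As a rough illustration of this step, the sketch below assembles a prompt for one cluster and collects the page-level citations per document. The Passage type, the function names, the prompt wording, and the placeholder passages are all hypothetical; the case study does not describe the actual prompts or which LLM is called.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    document: str  # source document title or identifier
    page: int      # page the passage was extracted from
    text: str      # extracted text, in its original language

def build_topic_prompt(passages):
    """Assemble one prompt asking for a Polish title, description, and source notes."""
    lines = [
        "You are assisting historians of coal mining.",
        "The passages below were grouped together as one candidate research topic.",
        "Answer in Polish, whatever the language of the passages.",
        "Return: (1) a research title, (2) a short description of the topic,",
        "(3) for each source document, the relevant page numbers and one sentence",
        "explaining how that document supports the topic.",
        "",
    ]
    for p in passages:
        lines.append(f"[{p.document}, page {p.page}] {p.text}")
    return "\n".join(lines)

def pages_by_document(passages):
    """Collect the sorted page numbers cited for each source document."""
    cited = {}
    for p in passages:
        cited.setdefault(p.document, set()).add(p.page)
    return {doc: sorted(pages) for doc, pages in cited.items()}

# Placeholder passages, invented for illustration.
passages = [
    Passage("Document A", 14, "placeholder text"),
    Passage("Document A", 12, "placeholder text"),
    Passage("Document B", 3, "placeholder text"),
]

prompt = build_topic_prompt(passages)
citations = pages_by_document(passages)
```

Keeping the document and page on every passage means the citations can be built from the cluster itself rather than from whatever the model returns.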

5

Report Generation

Two types of outputs are generated:

  • A user-friendly web application built with Streamlit, enabling researchers to browse and interact with the identified topics and explore associated documents.
  • A comprehensive Excel report containing the research titles and references to the source materials, including a description of each source, which supports further analysis in downstream processing systems.
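The report structure can be sketched as one row per topic and source pair. The column names and the sample topic below are assumptions, and the sketch writes CSV with the Python standard library as a stand-in; the actual system produces an Excel file, which would need a spreadsheet library.

```python
import csv
import io

# Assumed column layout; the case study does not specify the real one.
FIELDS = ["topic_title", "topic_description", "document", "pages",
          "source_description"]

def report_rows(topics):
    """Flatten topics into one row per (topic, source document) pair."""
    rows = []
    for topic in topics:
        for source in topic["sources"]:
            rows.append({
                "topic_title": topic["title"],
                "topic_description": topic["description"],
                "document": source["document"],
                "pages": ", ".join(str(p) for p in source["pages"]),
                "source_description": source["description"],
            })
    return rows

def write_report(rows, stream):
    """Write the rows, with a header line, to any text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Placeholder topic, invented for illustration.
topics = [{
    "title": "Placeholder title",
    "description": "Placeholder description",
    "sources": [
        {"document": "Document A", "pages": [12, 14],
         "description": "Placeholder source note"},
        {"document": "Document B", "pages": [3],
         "description": "Placeholder source note"},
    ],
}]

rows = report_rows(topics)
buffer = io.StringIO()
write_report(rows, buffer)
```

A flat layout like this repeats the topic title on every row, which makes the file easy to filter and to load into downstream tools.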

The system automatically analyzes incoming data. The researchers can add new documents as they become available. When a new source document is introduced, the system identifies potential topics, enabling it to adapt dynamically to evolving information.

Results

The implemented system successfully automates the process of identifying potential research topics from the museum’s historical documents. The AI algorithms efficiently extract and cluster relevant information, presenting it as curated topics, each accompanied by summaries and references to the source materials.
