AI-Powered Topic Mining for the Coal Mining Museum
The system streamlines the research process, saving the museum's researchers significant time and effort.
TIMELINE: 3 WEEKS
COUNTRY: POLAND
TEAM: SOFTWARE DEVELOPER & SENIOR AI ENGINEER
Our team developed an AI-powered system to analyze a vast collection of historical documents from various coal mines, enabling researchers at the Coal Mining Museum to efficiently extract valuable insights and identify potential research topics.
Result in Short: The system streamlines the research process, saving the museum's researchers significant time and effort. In just three weeks, we developed a fully functioning solution leveraging AI.
The museum's digitization department sought an efficient solution to process these documents. The goal was to allow researchers to browse generated research topics and easily access the underlying source materials without the need for labor-intensive manual review.
In response, the museum issued a public tender for a proof-of-concept system that would leverage AI to automate the extraction of valuable information from these documents.
Photos titled “Historic Coal Mine Guido” by Mariusz Ciszewski, published under the CC BY-ND 2.0 license.
The client for this project is the Coal Mining Museum, a significant public institution in Poland, employing approximately 400 people. Dedicated to researching and preserving coal mining history, the museum holds an extensive collection of historical documents from various coal mines across Poland. The museum received funding to digitize this vast collection.
01
Our team was tasked with the complete development of the proof-of-concept system within a tight deadline of three weeks. With considerable autonomy over the approach and technologies, we designed a custom solution that employed AI algorithms, including clustering and large language models (LLMs), to process the museum’s historical documents.
02
We integrated the system with Snowflake, a cloud-based data platform, to ensure robust and scalable data processing. To make the system user-friendly, we developed an intuitive interface using Streamlit, enabling researchers to effortlessly interact with the system.
03
The key challenge was to identify meaningful themes across multiple documents. The system needed to handle documents in seven languages, ensuring all output was delivered in Polish. We developed a solution that met all the requirements by leveraging AI algorithms like clustering and LLMs to extract coherent themes.
The system was designed to meet the following requirements:
The system must process documents in multiple languages, including Polish, German, Russian, Chinese, and others.
The AI must identify and group content into meaningful themes, with each topic derived from at least four different documents.
All results, including topic descriptions and reports, must be in Polish, regardless of the input document's language.
The system must generate an Excel report with detailed descriptions of the topics and references to the source materials.
The system’s workflow can be broken down into the following key steps:
Various historical documents are uploaded to the Snowflake platform. Snowflake functions as a data warehouse, providing integration with Large Language Models and enabling the transformation of unstructured document data into structured data that can be easily analyzed.
Once ingested, the documents undergo preprocessing to extract relevant text. Using advanced Natural Language Processing (NLP) techniques, this content is transformed into embedding vectors - mathematical representations of the semantic meaning of the text - enabling efficient semantic analysis.
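To illustrate what "semantic similarity" between embedding vectors means in practice, here is a minimal sketch using cosine similarity. The three-dimensional vectors and their names are hypothetical, chosen only for readability, and the case study does not state which embedding model or similarity measure was actually used.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real models produce hundreds of dimensions.
shaft_safety = [0.9, 0.1, 0.0]
shaft_inspection = [0.8, 0.2, 0.1]
payroll_ledger = [0.0, 0.1, 0.9]

similar = cosine_similarity(shaft_safety, shaft_inspection)
different = cosine_similarity(shaft_safety, payroll_ledger)
```

Texts about related subjects end up with vectors pointing in similar directions, so `similar` is much closer to 1.0 than `different`.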
Clustering algorithms group these vectors based on semantic similarity, forming clusters that represent potential research topics. Each cluster must consist of content derived from at least four different documents, as per the project requirements. An additional challenge was that the number of clusters (topics) in the document set was not known in advance. We therefore used DBSCAN, a density-based algorithm that determines the number of clusters from the data and labels outlying points as noise.
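The clustering step can be sketched as follows. This is a compact, pure-Python illustration of DBSCAN followed by the "at least four different documents" filter, not the project's actual code. The cosine distance metric, the `eps` and `min_samples` values, the toy vectors, and the document IDs are all assumptions made for the example. A production system would more likely rely on a library implementation such as scikit-learn's `DBSCAN`.

```python
import math
from collections import defaultdict

NOISE = -1


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; NOISE marks unclustered points."""
    n = len(vectors)
    # Neighbourhood of each point (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
        cluster += 1
    return labels


def topics_with_enough_documents(labels, doc_ids, min_documents=4):
    """Keep only clusters whose chunks come from enough distinct documents."""
    docs_per_cluster = defaultdict(set)
    for label, doc_id in zip(labels, doc_ids):
        if label != NOISE:
            docs_per_cluster[label].add(doc_id)
    return {c: d for c, d in docs_per_cluster.items() if len(d) >= min_documents}


# Hypothetical 2-D chunk embeddings and the document each chunk came from.
vectors = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.97, 0.02], [0.99, 0.08],
    [0.0, 1.0], [0.05, 0.99], [0.1, 0.98], [0.02, 0.97],
    [-1.0, -1.0],
]
doc_ids = ["d1", "d2", "d3", "d4", "d1", "d5", "d5", "d6", "d6", "d7"]

labels = dbscan(vectors, eps=0.05, min_samples=3)
topics = topics_with_enough_documents(labels, doc_ids)
```

In this toy data DBSCAN finds two clusters and one noise point without being told how many clusters to expect. The second cluster is then discarded because its chunks come from only two documents, so a single topic survives.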
Large Language Models (LLMs) analyze the content within each cluster to generate a research title and a concise description of how the topic relates to its bibliography. Each identified topic is supported by a bibliography of at least four documents to ensure its relevance and reliability. Citations are given at page level: for each document, the system identifies which pages are relevant to the main research question. Additionally, each source is described individually, highlighting how it is relevant to the overarching research topic.
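The page-level bibliography described above could be assembled roughly as shown below. This is a sketch under assumptions: the chunk structure with `document` and `page` keys and the file names are hypothetical, and the LLM-generated titles and per-source descriptions are left out.

```python
from collections import defaultdict


def build_bibliography(chunks):
    """Group a topic's chunks by document and list the cited pages in order."""
    pages_by_doc = defaultdict(set)
    for chunk in chunks:
        pages_by_doc[chunk["document"]].add(chunk["page"])
    return [
        {"document": doc, "pages": sorted(pages)}
        for doc, pages in sorted(pages_by_doc.items())
    ]


# Hypothetical chunks belonging to one topic cluster.
topic_chunks = [
    {"document": "mine_report_1923.pdf", "page": 4},
    {"document": "mine_report_1923.pdf", "page": 2},
    {"document": "inspection_log.pdf", "page": 11},
    {"document": "mine_report_1923.pdf", "page": 4},
]

bibliography = build_bibliography(topic_chunks)
```

Collecting pages into a set removes duplicates when several chunks come from the same page, so each document appears once with a sorted list of relevant pages.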
Two types of outputs are generated:
The system automatically analyzes incoming data. The researchers can add new documents as they become available. When a new source document is introduced, the system identifies potential topics, enabling it to adapt dynamically to evolving information.
The implemented system successfully automates the process of identifying potential research topics from the museum’s historical documents. The AI algorithms efficiently extract and cluster relevant information, presenting it as curated topics, each accompanied by summaries and references to the source materials.