PDF Search Engine for UN agencies and NGOs

Yashaswini Joshi
5 min read · Dec 15, 2020

Written by Kjunwonl, Shivika K Bisen and Yashaswini Joshi

Demo of PDF Search Engine

In today's growing world of data, we have seen a surplus of tools that can grab the information most relevant to a user's query. Google has been the dominant search engine and accelerates learning for many people around the world. But while the leading industries keep developing these tools and enhancing their retrieval models, many others still need to catch up, and one sector in particular is the non-profit sector. The information that many UN agencies and NGOs publish is often in disarray and unorganized, which makes retrieving relevant information on a specific topic nearly impossible unless the exact location of a report is already known or a standardized search engine is built for this purpose.

Our solution is innovative in the following ways. First, our dataset has equitable representation of UN, non-profit, and humanitarian aid organizations. For a given query, Google Search returned top links that were mostly biased towards UN reports, with few or no NGO reports, and it also retrieved media articles; our dataset ensures every result is a reliable report from the organization itself. Second, our aim is to give NGO program managers and policymakers a search engine they can use to design policies. Google Search falls short here because its top results are biased towards annual reports, whereas our corpus also contains program reports, not just annual reports. Program reports give more detail on how a program is designed, while annual reports lean towards marketing. Finally, our search engine provides a BERT-based summary for each retrieved report.

We used a baseline model built on word2vec and cosine similarity, and found that our BM25L model retrieved more accurate results than the word2vec baseline. We also tried more complex models, such as a deep neural network using LSTMs and RoBERTa models via Haystack, but found these computationally expensive as well as inaccurate.

Data

The data collected for this project consists of PDF reports from three different UN partner websites: the International Red Cross Federation, the International Water Association, and UNICEF.

Metadata Extraction:

We experimented with various PDF parsing libraries (AWS Textract, PDFMiner, PyPDF2) and finally chose Apache Tika to extract metadata and text from the PDFs. The extracted metadata includes the date, title, and content text. Apache Tika proved to be a high-performance parser.
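As a rough illustration, here is a minimal sketch of calling Apache Tika from Python via the tika bindings. The file path and the exact metadata keys are assumptions; the fields Tika returns vary by PDF.

```python
# Minimal sketch of the metadata extraction step using the tika Python
# bindings. The file path below is hypothetical.
from tika import parser

parsed = parser.from_file("reports/example_program_report.pdf")

metadata = parsed.get("metadata", {})   # dict of PDF metadata fields
content = parsed.get("content", "")     # full extracted text of the PDF

# Title and creation date, when present in the PDF metadata
# (key names vary from document to document)
title = metadata.get("title") or metadata.get("dc:title")
date = metadata.get("Creation-Date")

print(title, date)
print(content[:500] if content else "No text extracted")
```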

Data Preprocessing:

We developed code to clean the text content using pandas, NumPy, regex, and NLTK. The cleaning steps convert text to lowercase, tokenize paragraphs into sentences, tokenize sentences into words, remove special characters, remove punctuation, remove stop words, and drop NaN values.
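A minimal sketch of these cleaning steps is shown below; the example DataFrame and column names are illustrative, not our actual pipeline.

```python
# Preprocessing sketch: lowercase, tokenize, strip punctuation/special
# characters, remove stop words, and drop NaN rows.
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase, tokenize, strip special characters, and remove stop words."""
    text = text.lower()
    sentences = sent_tokenize(text)                            # paragraphs -> sentences
    tokens = [w for s in sentences for w in word_tokenize(s)]  # sentences -> words
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]     # drop punctuation/specials
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return " ".join(tokens)

# Hypothetical DataFrame with a "content" column produced by Tika
df = pd.DataFrame({"content": ["Clean water programmes improve health.", None]})
df = df.dropna(subset=["content"])                 # remove NaN values
df["clean_content"] = df["content"].apply(clean_text)
print(df["clean_content"].tolist())
```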

BERT Summarization:
We developed code for summarization with Bidirectional Encoder Representations from Transformers (BERT), a Transformer-based deep learning technique that captures the context of the text.
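The post does not name a specific summarization library; one common choice for extractive BERT summarization is the bert-extractive-summarizer package, which the sketch below assumes.

```python
# Extractive BERT summarization sketch, assuming the
# bert-extractive-summarizer package (pip install bert-extractive-summarizer).
from summarizer import Summarizer

report_text = """Long report text extracted from a PDF goes here.
It can span many paragraphs about a programme, its budget, and its outcomes."""

model = Summarizer()                      # loads a pretrained BERT model
summary = model(report_text, ratio=0.2)   # keep roughly 20% of the sentences
print(summary)
```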

Methods

Word2Vec based cosine similarity:

We chose word2vec as our baseline model to measure the similarity between a user query and the documents to be retrieved; it is a vector space model. We used gensim's word2vec model to build word embeddings for each word in a document's summary, and we wrote a cosine similarity function. Combining the two, we built a ranking function that compares the query against each of our documents. The ranking function returns the ten documents with the highest similarity and retrieves the summary, title, and URL associated with each of them from our data frame.
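Here is a minimal sketch of this ranking approach using gensim. The toy summaries, query, and hyperparameters are illustrative assumptions, not our production settings.

```python
# Word2vec + cosine-similarity ranking sketch over a toy corpus of summaries.
import numpy as np
from gensim.models import Word2Vec

summaries = [
    "water sanitation programme rural communities",
    "emergency relief flood response red cross",
    "child education nutrition unicef report",
]
tokenized = [s.split() for s in summaries]

w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, seed=42)

def embed(tokens):
    """Average the word vectors of all in-vocabulary tokens."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def rank(query, top_k=10):
    """Return (document index, similarity) pairs sorted by similarity."""
    q = embed(query.split())
    scores = [(i, cosine(q, embed(toks))) for i, toks in enumerate(tokenized)]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]

print(rank("flood emergency response"))
```

In the full system, the top-ranked indices are used to look up the summary, title, and URL columns of the data frame.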

BM25L:

Our team also implemented a BM25L model. We noticed that many ranking models are biased towards shorter documents, which may cover only a single topic, whereas for our project we wanted a model that returns the documents most relevant to the user's query. Our BM25L function uses the default hyperparameters (k1 = 1.5, b = 0.75, delta = 0.5) to compute the BM25L term weightings.
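A minimal sketch of BM25L retrieval is shown below, assuming the rank_bm25 package (the post does not name a specific implementation); the toy corpus and query are illustrative.

```python
# BM25L retrieval sketch with the rank_bm25 package (pip install rank-bm25).
# The default hyperparameters match those mentioned above.
from rank_bm25 import BM25L

corpus = [
    "water sanitation programme rural communities",
    "emergency relief flood response red cross",
    "child education nutrition unicef report",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25L(tokenized_corpus, k1=1.5, b=0.75, delta=0.5)

query = "flood emergency response".split()
scores = bm25.get_scores(query)              # one relevance score per document
top_docs = bm25.get_top_n(query, corpus, n=2)
print(scores, top_docs)
```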

Evaluation and Results

To build our ground truth, we created a list of 30 queries. We annotated the true relevance of the documents retrieved for each query and then calculated precision, recall, and F1 score from the annotated and predicted relevance. We chose the NDCG score to evaluate our model because it is one of the widely used TREC metrics and is designed for non-binary notions of relevance. For each of the 30 queries we annotated the 5 predicted relevant documents (150 in total), and we annotated the baseline model's results in the same way to compare them. Our ground-truth annotation used the relevance scale below, followed by a sketch of how the metrics can be computed:

0: Not relevant

1: Somewhat relevant

2: Very relevant
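As a rough illustration, the metrics can be computed with scikit-learn as sketched below; the relevance labels, model scores, and decision threshold are made-up examples, not our actual annotations.

```python
# Evaluation sketch: NDCG and F1 for one query with 5 retrieved documents.
import numpy as np
from sklearn.metrics import f1_score, ndcg_score

# Annotated relevance (0/1/2) for the 5 documents retrieved for one query
true_relevance = np.asarray([[2, 2, 1, 0, 0]])
# Ranking scores produced by the model for the same 5 documents
model_scores = np.asarray([[0.9, 0.4, 0.7, 0.2, 0.1]])

ndcg = ndcg_score(true_relevance, model_scores, k=5)

# For precision/recall/F1, collapse graded relevance to binary (relevant if > 0)
y_true = (true_relevance[0] > 0).astype(int)
y_pred = (model_scores[0] > 0.5).astype(int)   # hypothetical decision threshold
f1 = f1_score(y_true, y_pred)

print(f"NDCG@5: {ndcg:.3f}, F1: {f1:.3f}")
```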

NDCG scores for 30 queries
F1 Scores for 30 queries

Overall, BM25L performed better than the baseline model in terms of both NDCG and F1 score.

We hypothesize that our baseline model (based on word2vec cosine similarity) performed better on lengthy queries that require contextual understanding. However, the word2vec baseline is computationally expensive and takes much longer to retrieve reports than our BM25L model, which alleviates that problem by quickly and accurately retrieving documents relevant to a user's query. That said, with our limited document collection the model can only return a finite number of relevant documents for any query. If this project continues to grow and builds a large enough database of PDF reports from NGOs and UN agencies, this model could become a genuinely helpful tool for workers frustrated with the current state of search.

A demo of the PDF search engine can be found at Link.

Next steps

We tried other models, such as a deep neural network that uses LSTMs, a RoBERTa model, and Haystack. We found that these were computationally expensive and inaccurate, and that they require more data. As a next step, we would like to collect more documents so we can implement deep learning models and UN cluster classification. Every model we implemented used the summaries of the PDFs as the input feature; potential future work includes further feature engineering and deriving more metadata, such as the author name.

The whole project can be found at https://github.com/yashajoshi/PDF-Search-Engine-for-UN-agencies-and-NGOs-

