Text Extraction & Organization using Vision API – Challenges and Solutions 


Emproto was commissioned by an executive agency of the European Commission to build a platform that extracts essential information, such as ingredients and nutrition values, from food wrappers. The objective of this exercise was to analyze the health quotient of packaged food products available in the European Union.

Brief given to Emproto:

Field workers from the agency would visit supermarkets and photograph the wrappers of the food products on sale. The images would then be uploaded to the platform, which had to process them and extract the data. The extracted data had to be organized in a tabular format, then checked and approved by an admin.


The application uses the Vision API provided by Firebase for text extraction. The extracted information is presented in tabular form so that it is easy to review and ready for further processing.


API calls: Retrofit, Android library (https://square.github.io/retrofit/)
Text extraction: Vision API, Firebase
Image capture: Camera2 package (https://developer.android.com/reference/android/hardware/camera2/package-summary)


Selecting the right tool: Tesseract OCR vs Vision API

We considered Tesseract OCR and Vision API. Tesseract OCR returns only plain text output, and its accuracy on our samples was lower than we expected.

We chose Vision API for the following reasons:

  • Tesseract OCR only gives us the extracted text. Vision API exposes classes that give access to further details of the result, such as the position of each text block
  • For multiple languages, Tesseract requires trained data for each language
  • Tesseract works well with clear, clean images. Vision API copes better with imperfect samples that have creases, folds, glare, etc.

Organizing Vision API text into table format:

Vision API returns accurate results as text. We needed to convert that text into tabular form for further manipulation. To do so, we used getBoundingBox(), which returns a Rect object describing the position of each text block.

rectTemp1 = textBlockTemp.getBoundingBox();
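
One way to turn those positions into a table is to group blocks whose boxes sit on the same horizontal band into a row, then order each row from left to right. The following is only a minimal sketch of that idea, not the code used in the app. TextBox and TableBuilder are hypothetical names, and TextBox is a plain-Java stand-in for the Rect coordinates so the example runs without Android.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a recognized text block: its text plus the
// left/top/right/bottom pixel coordinates a bounding box would report.
final class TextBox {
    final String text;
    final int left, top, right, bottom;

    TextBox(String text, int left, int top, int right, int bottom) {
        this.text = text;
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    int centerY() {
        return (top + bottom) / 2;
    }
}

final class TableBuilder {
    // Groups boxes into rows: a box joins the current row when its vertical
    // centre lies within the vertical extent of that row's first box.
    // Cells inside each row are then ordered left to right.
    static List<List<String>> toRows(List<TextBox> boxes) {
        List<TextBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt(TextBox::centerY));

        List<List<TextBox>> rows = new ArrayList<>();
        for (TextBox box : sorted) {
            TextBox anchor = rows.isEmpty() ? null : rows.get(rows.size() - 1).get(0);
            if (anchor != null && box.centerY() >= anchor.top && box.centerY() <= anchor.bottom) {
                rows.get(rows.size() - 1).add(box);
            } else {
                List<TextBox> row = new ArrayList<>();
                row.add(box);
                rows.add(row);
            }
        }

        List<List<String>> table = new ArrayList<>();
        for (List<TextBox> row : rows) {
            row.sort(Comparator.comparingInt((TextBox b) -> b.left));
            List<String> cells = new ArrayList<>();
            for (TextBox b : row) {
                cells.add(b.text);
            }
            table.add(cells);
        }
        return table;
    }
}
```

The row test here is deliberately simple. Skewed or curved wrappers would likely need a tolerance or a deskewing step before grouping.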

Multiple language support: 

This application is intended for use across Europe and therefore needs to support multiple languages.

In Tesseract, supporting multiple languages requires trained data for each of those languages to be stored at a path on the device.

tessBaseApi.init(DATA_PATH, "eng");

DATA_PATH is the path where all the trained data is stored.

For multiple languages, the language codes are joined with a plus sign, using Tesseract's three-letter codes: tessBaseApi.init(DATA_PATH, "eng+tam");
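
To illustrate the bookkeeping this implies, the sketch below builds the language argument from a list of codes and reports which trained data files are missing before init is called. It is a plain-Java sketch, not part of the app. TessLanguages and its methods are hypothetical names, and the layout (one "<code>.traineddata" file per language inside a "tessdata" folder under the data path) is the usual Tesseract convention, assumed here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class TessLanguages {
    // Joins language codes with '+', the separator Tesseract expects
    // when several languages are loaded at once.
    static String languageArgument(List<String> codes) {
        return String.join("+", codes);
    }

    // Returns the codes whose "<code>.traineddata" file is not present in
    // the "tessdata" folder under dataPath (assumed layout).
    static List<String> missingLanguages(File dataPath, List<String> codes) {
        File tessdata = new File(dataPath, "tessdata");
        List<String> missing = new ArrayList<>();
        for (String code : codes) {
            if (!new File(tessdata, code + ".traineddata").isFile()) {
                missing.add(code);
            }
        }
        return missing;
    }
}
```

For example, languageArgument called with the codes eng, deu and fra yields "eng+deu+fra". Every language added this way means one more data file to ship or download, which is the overhead described above.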

In Vision API, the method 

recognizer = vision.getCloudTextRecognizer(options); 

allows us to recognize text in both Latin and non-Latin scripts.

Memory management issue:

We used the RecyclerView class to keep memory usage down, but still ran into an OutOfMemoryError: when the app tried to render more than 50 images at a time, it ran out of memory and crashed. To handle this, we show only 10 images at first and load 5 more each time the user scrolls.
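
The batching logic can be kept separate from the view code. The sketch below is a minimal plain-Java illustration of the "10 first, then 5 per scroll" rule, not the app's actual implementation. ImagePager is a hypothetical name, and the adapter that would call it is left out so the example runs without Android.

```java
import java.util.List;

final class ImagePager<T> {
    static final int INITIAL_BATCH = 10; // items shown when the screen opens
    static final int SCROLL_BATCH = 5;   // items added each time the user scrolls to the end

    private final List<T> all;
    private int visibleCount;

    ImagePager(List<T> all) {
        this.all = all;
        this.visibleCount = Math.min(INITIAL_BATCH, all.size());
    }

    // Items the list should currently display.
    List<T> visible() {
        return all.subList(0, visibleCount);
    }

    // Call when the user scrolls to the end; returns how many items were added
    // (0 once everything is already visible).
    int loadMore() {
        int next = Math.min(visibleCount + SCROLL_BATCH, all.size());
        int added = next - visibleCount;
        visibleCount = next;
        return added;
    }
}
```

With 23 images, the pager exposes 10, then 15, 20 and finally 23 items. The value returned by loadMore() tells the caller how many new items to announce to the list.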


The accuracy of the results opens up a few exciting opportunities going forward. We could fully automate the workflow, eliminating the need for manual approval. We could also use the data to build a crowdsourced food recommendation engine that guides people in their dietary habits.