Spring Boot Tesseract OCR in Kotlin

Originally Posted On: https://thepurushoths.medium.com/spring-boot-tesseract-ocr-in-kotlin-with-multi-stage-docker-515cdd13af37

To start, let me give you a brief introduction to Tesseract OCR. Tesseract is an open-source optical character recognition (OCR) engine that recognises text in images. It was originally developed at Hewlett-Packard in the 1980s, was released as open source in 2005, and its development was subsequently sponsored by Google. Tesseract is widely used in the industry because it is free and generally accurate on clean, well-scanned documents.

I. Context

In this article, we will learn about extracting text from PDFs and images and setting up a Docker environment to perform OCR with the Tesseract library.
Tesseract supports other use cases such as text localisation, character recognition, converting scanned documents to searchable PDFs, etc.

II. Challenges and approaches

We first implemented the OCR feature with a single-stage Dockerfile, but that increased both the image size and the build time. To solve this, we adopted a multi-stage Docker build and JFrog Artifactory.

III. Docker setup

A multi-stage Dockerfile uses several build stages within a single Dockerfile to separate the build environment from the runtime environment, which keeps the final image small. We will compile and install the Leptonica and Tesseract libraries in the first stage. That stage also pulls in compilers and development packages that are not required at runtime and would only increase the size of the production image. So the final stage copies just the required libraries from the build stage.

.NET equivalent

For teams working in .NET, IronOCR can simplify this significantly. It bundles Tesseract and all dependencies into a single NuGet package, eliminating the need for multi-stage Docker builds to manage Leptonica, libtiff, libwebp, and trained data files.

    FROM mcr.microsoft.com/dotnet/aspnet:8.0
    WORKDIR /app
    COPY . .
    ENTRYPOINT ["dotnet", "YourApp.dll"]

No COPY statements for individual .so files, no LD_LIBRARY_PATH configuration, and no separate build stage for OCR dependencies.

Note: The .NET equivalent is included for conceptual comparison.

a. Docker build stage for Tesseract & Leptonica

In the build environment (stage one), we install Tesseract and Leptonica. The Leptonica library needs some dependency libraries, such as libtiff, libwebp, libpng, OpenJPEG, etc.
We also download the trained data files (eng and osd) that Tesseract needs to perform OCR.

Create a Dockerfile and paste the following two code blocks in the same Dockerfile.

    # Stage one: compile Leptonica and Tesseract from source
    FROM amazoncorretto:11 AS build

    # Build tools plus the image libraries that Leptonica links against
    RUN yum update -y && \
        yum -y -q install wget && \
        yum install -y gcc gcc-c++ autoconf automake make pkgconfig libtool gzip tar && \
        yum install -y zlib-devel libtiff-devel libwebp-devel libpng-devel openjpeg2-devel libjpeg-turbo-devel giflib-devel && \
        yum clean all && \
        rm -rf /var/cache/yum

    # Leptonica 1.82.0
    RUN wget -q https://github.com/DanBloomberg/leptonica/archive/refs/tags/1.82.0.tar.gz \
        && tar -zxvf 1.82.0.tar.gz -C /opt \
        && rm -f 1.82.0.tar.gz
    WORKDIR /opt/leptonica-1.82.0
    RUN ./autogen.sh
    RUN ./configure
    RUN make && make install

    # Tesseract 5.2.0
    RUN wget -q https://github.com/tesseract-ocr/tesseract/archive/5.2.0.tar.gz \
        && tar -zxvf 5.2.0.tar.gz -C /opt \
        && rm -f 5.2.0.tar.gz
    WORKDIR /opt/tesseract-5.2.0
    RUN ./autogen.sh
    RUN ./configure
    RUN make && make install

    # Trained data: English, plus orientation and script detection (osd)
    RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P /opt/
    RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/osd.traineddata -P /opt/

b. Docker final stage

Copy only the required libraries and the trained data from the build stage into the final stage of the Dockerfile.

    # Final stage: runtime image with only the shared libraries and trained data
    FROM amazoncorretto:11
    WORKDIR /opt

    ARG LD_LIBRARY_PATH=/usr/local/lib
    ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}
    ENV PKG_CONFIG_PATH ${LD_LIBRARY_PATH}/pkgconfig
    ARG TESSDATA_PREFIX=/usr/local/share/tessdata
    ENV TESSDATA_PREFIX ${TESSDATA_PREFIX}

    # Tesseract and Leptonica, built in stage one
    COPY --from=build /usr/local/lib/libtesseract.so.5.0.2 ${LD_LIBRARY_PATH}/
    COPY --from=build /usr/local/lib/liblept.so.5.0.4 ${LD_LIBRARY_PATH}/

    # Image and runtime libraries they depend on
    COPY --from=build /lib64/libjpeg.so.62.3.0 ${LD_LIBRARY_PATH}/
    COPY --from=build /lib64/libtiff.so.5.2.0 ${LD_LIBRARY_PATH}/
    COPY --from=build /lib64/libwebp.so.4.0.2 ${LD_LIBRARY_PATH}/
    COPY --from=build /lib64/libopenjp2.so.2.4.0 ${LD_LIBRARY_PATH}/
    COPY --from=build /lib64/libgomp.so.1.0.0 ${LD_LIBRARY_PATH}/
    COPY --from=build /lib64/libjbig.so.2.0 ${LD_LIBRARY_PATH}/

    # Trained data files
    COPY --from=build /opt/*.traineddata ${TESSDATA_PREFIX}/

    # Register the library directory with the dynamic linker
    RUN echo ${LD_LIBRARY_PATH} >> /etc/ld.so.conf
    RUN ldconfig

    WORKDIR /app
    COPY ./src/main/resources/static/tesseract.png tesseract.png
    COPY ./src/main/resources/static/tesseract.pdf tesseract.pdf
    COPY ./build/libs/tesseract-ocr-0.0.1.jar tesseract-ocr.jar
    EXPOSE 8080
    CMD ["java", "-jar", "tesseract-ocr.jar"]

Now we have reduced the size of the Docker image.
But the build time has increased, because Tesseract and Leptonica are compiled from source in the build stage.

Note: To reduce the build time, you can use any artifact repository (e.g. JFrog Artifactory) and create a separate pipeline that pushes the Tesseract and Leptonica binaries, the trained data set, and their dependency libraries as a one-time task. Later builds can download those binaries and data sets into the Docker image instead of compiling them, which shortens the Docker build.

IV. OCR operation

Add the Tess4J library, a Java wrapper for Tesseract, to the dependencies in the build.gradle.kts file.

    import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

    plugins {
        id("org.springframework.boot") version "2.7.12"
        id("io.spring.dependency-management") version "1.0.15.RELEASE"
        kotlin("jvm") version "1.6.21"
        kotlin("plugin.spring") version "1.6.21"
    }

    group = "com.example"
    version = "0.0.1"
    java.sourceCompatibility = JavaVersion.VERSION_11

    repositories {
        mavenCentral()
    }

    dependencies {
        implementation("org.springframework.boot:spring-boot-starter-web")
        implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
        implementation("org.jetbrains.kotlin:kotlin-reflect")
        testImplementation("org.springframework.boot:spring-boot-starter-test")
        // Tess4J: JNA-based Java wrapper for Tesseract
        implementation("net.sourceforge.tess4j:tess4j:5.4.0")
    }

    tasks.withType<KotlinCompile> {
        kotlinOptions {
            jvmTarget = "11"
        }
    }

Create the document type enum.

    package com.example.ocr.pdf.enum

    enum class DocumentType {
        PDF,
        PNG
    }
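If the document type has to be derived from an uploaded file rather than passed in directly, a small lookup by file extension is one option. The sketch below is only an illustration: `documentTypeOf` is a hypothetical helper that is not part of the original project, and the enum is repeated so that the snippet compiles on its own.

```kotlin
// Mirrors the DocumentType enum from the article so the sketch is self-contained.
enum class DocumentType { PDF, PNG }

// Hypothetical helper (not in the original project): picks a DocumentType
// from the extension of a file name, ignoring case.
// Returns null when there is no extension or the enum does not cover it.
fun documentTypeOf(fileName: String): DocumentType? {
    // Text after the last dot, or "" when the name contains no dot at all
    val extension = fileName.substringAfterLast('.', missingDelimiterValue = "")
    return DocumentType.values().firstOrNull { it.name.equals(extension, ignoreCase = true) }
}
```

For example, `documentTypeOf("tesseract.pdf")` yields `DocumentType.PDF`, while `documentTypeOf("notes.txt")` yields `null`, which the caller can turn into a validation error.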

In the following code, we are performing OCR operations for both PDF and images.
Text extraction is one of the use cases for the Tesseract library.

    package com.example.ocr.pdf

    import com.example.ocr.pdf.enum.DocumentType
    import net.sourceforge.tess4j.Tesseract
    import net.sourceforge.tess4j.TesseractException
    import org.springframework.stereotype.Service
    import java.io.File

    @Service
    class OCRService {

        // Runs OCR on the bundled sample file for the given document type
        fun getContent(documentType: DocumentType): String {
            val tesseract = Tesseract()
            try {
                val filePath = getFilePath(documentType)
                val image = File(filePath)
                // Directory holding the *.traineddata files, set in the Dockerfile
                tesseract.setDatapath(System.getenv(TESSDATA_PREFIX))
                tesseract.setLanguage("eng")
                // Note: Tesseract's hOCR parameter is spelled "tessedit_create_hocr",
                // so this key is not expected to switch the output to hOCR markup
                tesseract.setVariable("tessedit_create_horc", "1")
                // 1 = automatic page segmentation with orientation and script detection
                tesseract.setPageSegMode(1)
                // 1 = LSTM neural network engine only
                tesseract.setOcrEngineMode(1)
                println("Document type ===> ${documentType.name}")
                return tesseract.doOCR(image)
            } catch (e: TesseractException) {
                throw Exception(e)
            }
        }

        // Maps the document type to the sample file copied into the image
        fun getFilePath(documentType: DocumentType): String {
            return when (documentType) {
                DocumentType.PDF -> PDF_FILE
                DocumentType.PNG -> PNG_FILE
            }
        }

        companion object {
            const val TESSDATA_PREFIX = "TESSDATA_PREFIX"
            const val PDF_FILE = "tesseract.pdf"
            const val PNG_FILE = "tesseract.png"
        }
    }
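The service reads the trained data location from the `TESSDATA_PREFIX` environment variable, which is only set inside the Docker image. When the application runs elsewhere, that lookup can come back empty. A minimal sketch of a guard is shown below; `resolveTessdataDir` is a hypothetical helper that is not part of the original project, and it assumes the trained data file is named `<language>.traineddata`, as in the Dockerfile above.

```kotlin
import java.io.File

// Hypothetical helper (not in the original project): resolves the tessdata
// directory from an environment map and verifies that the trained data file
// for the requested language is present before OCR is attempted.
fun resolveTessdataDir(
    env: Map<String, String> = System.getenv(),
    language: String = "eng"
): File {
    // Fail early with a readable message when the variable is not set
    val prefix = env["TESSDATA_PREFIX"]
        ?: throw IllegalStateException("TESSDATA_PREFIX is not set")
    val dir = File(prefix)
    val trainedData = File(dir, "$language.traineddata")
    // check() throws IllegalStateException when the file is missing
    check(trainedData.isFile) { "Missing trained data file: ${trainedData.path}" }
    return dir
}
```

In the service, the result could be passed on with `tesseract.setDatapath(resolveTessdataDir().path)`, so a misconfigured environment fails with a clear message rather than inside the native library.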

V. GitHub source code

https://github.com/thepurushoths/tesseract-ocr

If there are any other optimized approaches, then let me know in the comments. I will learn with you.

Thanks!
