Create a Dockerfile and paste the following two code blocks into it, one after the other.
```dockerfile
FROM amazoncorretto:11 as build

# Build tools and image-format headers needed to compile Leptonica and Tesseract
RUN yum update -y && \
    yum -y -q install wget && \
    yum install -y gcc gcc-c++ autoconf automake make pkgconfig libtool gzip tar && \
    yum install -y zlib-devel libtiff-devel libwebp-devel libpng-devel openjpeg2-devel libjpeg-turbo-devel giflib-devel && \
    yum clean all && \
    rm -rf /var/cache/yum

# Build and install Leptonica 1.82.0
RUN wget -q https://github.com/DanBloomberg/leptonica/archive/refs/tags/1.82.0.tar.gz \
    && tar -zxvf 1.82.0.tar.gz -C /opt \
    && rm -f 1.82.0.tar.gz
WORKDIR /opt/leptonica-1.82.0
RUN ./autogen.sh
RUN ./configure
RUN make && make install

# Build and install Tesseract 5.2.0
RUN wget -q https://github.com/tesseract-ocr/tesseract/archive/5.2.0.tar.gz \
    && tar -zxvf 5.2.0.tar.gz -C /opt \
    && rm -f 5.2.0.tar.gz
WORKDIR /opt/tesseract-5.2.0
RUN ./autogen.sh
RUN ./configure
RUN make && make install

# Download the English and OSD trained data
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P /opt/
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/osd.traineddata -P /opt/
```

b. Docker final stage
Copy only the required libraries from the build stage into the final stage of the Dockerfile.
```dockerfile
FROM amazoncorretto:11
WORKDIR /opt

ARG LD_LIBRARY_PATH=/usr/local/lib
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}
ENV PKG_CONFIG_PATH ${LD_LIBRARY_PATH}/pkgconfig
ARG TESSDATA_PREFIX=/usr/local/share/tessdata
ENV TESSDATA_PREFIX ${TESSDATA_PREFIX}

# Tesseract and Leptonica, compiled in the build stage
COPY --from=build /usr/local/lib/libtesseract.so.5.0.2 ${LD_LIBRARY_PATH}/
COPY --from=build /usr/local/lib/liblept.so.5.0.4 ${LD_LIBRARY_PATH}/

# Shared libraries they depend on
COPY --from=build /lib64/libjpeg.so.62.3.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libtiff.so.5.2.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libwebp.so.4.0.2 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libopenjp2.so.2.4.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libgomp.so.1.0.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libjbig.so.2.0 ${LD_LIBRARY_PATH}/

# Trained data
COPY --from=build /opt/*.traineddata ${TESSDATA_PREFIX}/

# Register the library directory with the dynamic linker
RUN echo ${LD_LIBRARY_PATH} >> /etc/ld.so.conf
RUN ldconfig

WORKDIR /app
COPY ./src/main/resources/static/tesseract.png tesseract.png
COPY ./src/main/resources/static/tesseract.pdf tesseract.pdf
COPY ./build/libs/tesseract-ocr-0.0.1.jar tesseract-ocr.jar

EXPOSE 8080
CMD ["java", "-jar", "tesseract-ocr.jar"]
```

Now we have optimised the size (space complexity) of the Docker image.
But the build time (time complexity) has increased, because every build compiles the Tesseract and Leptonica binaries from source.
Note: To optimise the time complexity, you can use any artifact repository (e.g. JFrog Artifactory) and create a separate pipeline that pushes the Tesseract and Leptonica binaries, the trained data set, and their dependency libraries as a one-time task. Later builds can download those binaries and data sets into the Docker image instead of compiling them, which reduces the Docker build time.
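Whichever way the trained data reaches the image, a missing file only shows up when the first OCR call fails. The following is a minimal sketch of a check that could run at startup. The helper `missingTrainedData` and its `main` function are hypothetical and not part of the original project. It assumes the `TESSDATA_PREFIX` variable set in the final stage and the `eng` and `osd` data files downloaded in the build stage.

```kotlin
import java.io.File

// Hypothetical helper (not part of the original project): returns the names of
// the "<lang>.traineddata" files that are missing from the given directory.
fun missingTrainedData(tessdataDir: File, languages: List<String>): List<String> =
    languages
        .map { "$it.traineddata" }
        .filterNot { File(tessdataDir, it).isFile }

fun main() {
    // TESSDATA_PREFIX is set through ENV in the final stage of the Dockerfile.
    val prefix = System.getenv("TESSDATA_PREFIX")
    if (prefix == null) {
        println("TESSDATA_PREFIX is not set")
        return
    }
    // "eng" and "osd" are the two data sets downloaded in the build stage.
    val missing = missingTrainedData(File(prefix), listOf("eng", "osd"))
    if (missing.isEmpty()) {
        println("All trained data files found in $prefix")
    } else {
        println("Missing in $prefix: ${missing.joinToString()}")
    }
}
```

Running a check like this inside the container gives a clear message when a `COPY` or download step was skipped, instead of an error at the first OCR request.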
IV. OCR operation
Add the Tess4J library (a Java wrapper for Tesseract) to the dependencies in the build.gradle.kts file.
```kotlin
import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

plugins {
    id("org.springframework.boot") version "2.7.12"
    id("io.spring.dependency-management") version "1.0.15.RELEASE"
    kotlin("jvm") version "1.6.21"
    kotlin("plugin.spring") version "1.6.21"
}

group = "com.example"
version = "0.0.1"
java.sourceCompatibility = JavaVersion.VERSION_11

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
    implementation("org.jetbrains.kotlin:kotlin-reflect")
    testImplementation("org.springframework.boot:spring-boot-starter-test")
    // Tess4J: Java wrapper for the native Tesseract library
    implementation("net.sourceforge.tess4j:tess4j:5.4.0")
}

tasks.withType<KotlinCompile> {
    kotlinOptions {
        jvmTarget = "11"
    }
}
```

Create the document type enum.
```kotlin
package com.example.ocr.pdf.enum

enum class DocumentType {
    PDF,
    PNG
}
```

In the following code, we are performing OCR operations for both PDF and images.
Text extraction is one of the use cases for the Tesseract library.
```kotlin
package com.example.ocr.pdf

import com.example.ocr.pdf.enum.DocumentType
import net.sourceforge.tess4j.Tesseract
import net.sourceforge.tess4j.TesseractException
import org.springframework.stereotype.Service
import java.io.File

@Service
class OCRService {

    fun getContent(documentType: DocumentType): String {
        val tesseract = Tesseract()
        try {
            val filePath = getFilePath(documentType)
            val image = File(filePath)
            // Trained data location, set through ENV in the Dockerfile
            tesseract.setDatapath(System.getenv(TESSDATA_PREFIX))
            tesseract.setLanguage("eng")
            // Spelling kept from the original; Tesseract's hOCR output
            // variable is named "tessedit_create_hocr"
            tesseract.setVariable("tessedit_create_horc", "1")
            tesseract.setPageSegMode(1)   // 1 = automatic page segmentation with OSD
            tesseract.setOcrEngineMode(1) // 1 = LSTM neural network engine only
            println("Document type ===> ${documentType.name}")
            return tesseract.doOCR(image)
        } catch (e: TesseractException) {
            throw Exception(e)
        }
    }

    // Both sample files are copied into the working directory (/app) by the Dockerfile
    fun getFilePath(documentType: DocumentType): String {
        return when (documentType) {
            DocumentType.PDF -> PDF_FILE
            DocumentType.PNG -> PNG_FILE
        }
    }

    companion object {
        const val TESSDATA_PREFIX = "TESSDATA_PREFIX"
        const val PDF_FILE = "tesseract.pdf"
        const val PNG_FILE = "tesseract.png"
    }
}
```

V. GitHub source code
https://github.com/thepurushoths/tesseract-ocr
If you know of a more optimised approach, let me know in the comments. I will learn with you.
Thanks!
