Project Final File
ON
by
DEVANSHU VERMA 1902300100029
PANKAJ 1902300100065
Under the Supervision
of
Date:
Place: DGI, GN
ABSTRACT
The popularity of machine learning technologies has increased exponentially in the past ten
years. In many fields of study, possible use cases for machine learning are tested continuously.
Thus, the basics of developing software that uses machine learning are becoming more critical
by the day.
This report describes the analysis, design, and implementation of the proposed application
(the Snap_Seeker Android application). In this application, users can take a photo of an object
with their mobile phone and run it through the object detector to obtain keywords related to the
image; the application then searches the web for information about those keywords and presents
the results to the end user.
Snap_Seeker is an Android application designed to convert images containing text into digital
text that can be edited, copied, and shared. The application uses optical character recognition
(OCR) technology to extract the text from images and convert it into machine-readable format.
The Snap_Seeker application allows users to capture images of text using their smartphone
camera or select an image from their photo library. Once the image is captured or selected, the
application processes the image using OCR to extract the text.
Users can then edit the text, copy it to the clipboard, or share it through various means such
as email or social media. The application supports multiple languages and fonts, making it
suitable for a wide range of users.
In addition, the Snap_Seeker application includes features such as image cropping, rotation,
and resizing, which can be useful for improving the quality of the image before processing.
The application is user-friendly and easy to use, making it suitable for users of all levels of
technical proficiency.
Overall, Snap_Seeker is a useful application for anyone who needs to extract text from images.
ACKNOWLEDGEMENT
It is always a difficult task to acknowledge all those who have been of tremendous help in an
academic project of this nature and magnitude; nevertheless, we have made a sincere attempt,
through this project report, to express our gratitude to all those who have contributed to the
successful completion of this project.
We are extremely grateful to Prof. K.K. Saini, Director, DGI-GN, for his valuable input, for the
opportunity to develop this project, and for making all the resources available to us to ensure
the success of this project.
We would also like to thank Prof. Bipin Pandey, H.O.D., DGI-GN, for his ideas, advice,
knowledge, and above all his support throughout this project.
We also take this opportunity to express our deepest gratitude to our project guide and
faculty member, Prof. Saranya Raj, Associate Professor, DGI-GN, for the constant support,
guidance, and encouragement, and for being a constant source of inspiration for us.
Last but not the least, we would like to thank our parents and all our friends, who have always
been a strong support during the entire course of this project; without their co-operation
the completion of this project would not have been possible.
LIST OF FIGURES
9 Binarization 16
10 Formula used 16
11 Skew Correction. 19
12 Feature Upscaling. 21
13 Text Recognition 22
14 Postprocessing 22
17 ML Kit architecture 25
23 Resulted App 31
25 References 33
LIST OF TABLES
Table No CONTENT Page No.
1 Dependency used in the android application 45
CHAPTER-1
INTRODUCTION: MACHINE LEARNING
When a computer program can improve the way that the tasks are performed by using its previous experience,
we can then say that the computer is learning. This scenario is entirely different from one where a program
performs a task based on existing rules, logic, and information added by a programmer. For example, a chess
game in which the programmer implements a winning strategy for the program illustrates this perfectly. The
machine learning approach differs from that because the program does not have any clue how to win the game at first.
The programmer only defines a set of legal moves, and the rest is for the computer to learn. By learning from
its mistakes, the computer starts to win eventually, but this requires plenty of repetition.
Machine learning is an iterative process for the computer (Figure 1), meaning that it takes some time to
complete, just like the process for humans to learn to ride a bicycle. The process requires much previous
experience, trial, and error to achieve mastery in a specific task.
Machine learning simplifies many everyday tasks that would be tedious for a programmer to code
traditionally. Typical candidates are tasks that would otherwise require an endless number of lines
of code because they do not have a well-defined set of rules: the rules are either non-existent or
too abstract for the programmer to follow. Information processing also takes time, and in real-time
applications, letting the computer make assumptions based on previous experience can reduce the
processing time significantly. Computers were the first to use machine learning, but running
machine learning directly on mobile devices has become increasingly popular. Modern mobile
devices offer enough computing power to perform many of these tasks to a degree comparable with
modern computers. Any application that does one of the following is using machine learning in
some way:
● Speech recognition
● Gesture recognition
When applying more advanced technologies and algorithms in a mobile environment, one of
the challenges is the limited computational power of the mobile hardware. As inference is
computationally expensive, it is crucial that operations are optimised for mobile devices. By
using the mobile version of TensorFlow (TF) namely TensorFlow Mobile (TFM) and the
updated mobile framework TensorFlow Lite (TFL), developers can use pre-trained models on
mobile devices for inference with optimisation for mobile hardware [3].
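As an illustration of how such a model is used on a device, the sketch below loads a TensorFlow Lite model and runs a single inference in Python; the model file name and input shape are placeholders for this example, not the model used in this project.

import numpy as np
import tensorflow as tf

# Load a converted .tflite model and run one inference (placeholder model file).
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input of the expected shape (assuming a float32 model) and read the prediction.
dummy_input = np.zeros(input_details[0]['shape'], dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]['index'])
print(prediction.shape)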
Figure 2. Different learning methods
In supervised learning, if the predicted output is a discrete value, such algorithms are part of
classification algorithms. However, when the predicted output is a continuous value, such
algorithms fall under the umbrella of regression algorithms. These differences must be taken
into consideration when training the model.
Multiple algorithms are associated with supervised learning, such as K-nearest
neighbours, Naive Bayes, decision trees, linear regression, logistic regression, support
vector machines, and random forests.
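To make the distinction concrete, the toy sketch below uses scikit-learn (not a library used in this project) with made-up data: the classifier predicts a discrete label, while the regressor predicts a continuous value.

from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data, assumed purely for illustration.
X = [[1], [2], [3], [4], [5], [6]]
y_class = [0, 0, 0, 1, 1, 1]                # discrete labels -> classification
y_value = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]    # continuous targets -> regression

clf = LogisticRegression().fit(X, y_class)  # classification algorithm
reg = LinearRegression().fit(X, y_value)    # regression algorithm
print(clf.predict([[2.5]]), reg.predict([[2.5]]))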
1.2.2 Unsupervised learning
On the contrary, unsupervised learning does not learn under supervision. The model learns
based on the data that gets fed to it and discovers hidden patterns in the data (Figure 4). This
type is useful when there are enormous amounts of data, and the patterns we are looking for
are unknown. Unsupervised learning algorithms can, in general, provide useful insights about
the given data, such as confirming what we might already know or, in some cases, predicting
what is going to happen next. A suitable example of using unsupervised learning could be customer
segmentation. When vast amounts of customer data are available that captures all customer
features, unsupervised learning algorithms could cluster different kinds of customers. This
type of information could be precious to the marketing team.
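As a small illustration of the customer-segmentation idea (the data below is invented for this example, and scikit-learn is used only for demonstration), a clustering algorithm such as k-means can group customers without any labels.

import numpy as np
from sklearn.cluster import KMeans

# Invented customer features: [annual spend, visits per month].
customers = np.array([[200, 2], [220, 3], [1500, 12], [1400, 10], [60, 1], [80, 1]])

# Cluster the customers into three segments without any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # the segment assigned to each customer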
Model training can be done from scratch and can be a tedious, time-consuming task. Training
from scratch can, however, yield more reliable results if done correctly. One of the
prerequisites is that the problem needs to be well-defined. When the machine learning
problem is well-defined, the following conditions are satisfied: We have the right problem,
the right data, and the right criteria for success. Sometimes this is the only way to approach a
particular issue because application data requirements can be precise.
An alternative way is to use an existing model if a suitable model is available and retrain it
based on application needs. Using the existing model can save valuable time and resources at
the possible cost of model accuracy. That is why the model used for the basis should be
selected very carefully.
In both cases, the model requires a useful and extensive dataset, meaning that it is qualitative
but also quantitative. In the end, when training the model, it is all about following the best
practices available, such as using a representative dataset, splitting the data into training,
validation, and test sets, and evaluating the model against clear success criteria.
In many cases, it makes sense to run machine learning models on the device itself instead of
storing them in the cloud. The most significant advantage of an on-device model is that the model
is always available without the need to send information back and forth (Figure 5). This locality
is also what makes using the model convenient.
However, the Cloud model is not useless, because the Cloud model can be updated seamlessly,
without users ever noticing the difference. The Cloud model is the way to go if the model file
is enormous and would drastically increase the application size (Figure 6).
Today, many applications rely heavily on using machine learning frameworks for specific
tasks, but still use traditional solutions for most of their functionality. Machine learning is not
a replacement for conventional ways of developing software but an extension of it. In many
cases, the machine learning approach can simplify the convoluted logic of specific tasks.
● Google Maps
● Facebook
● Snapchat
● Netflix
● Tinder
● Uber
There are many other popular applications not listed here that use machine learning functionality.
Using machine learning functionality in popular applications is a smart move because, firstly,
it can massively improve the user experience and, secondly, it provides information with actual
business value to the developer(s) or the company.
OCR stands for Optical Character Recognition. An OCR model is a type of machine learning
model that is designed to recognize text characters in images or scanned documents and
convert them into machine-readable text. The goal of OCR is to automate the process of data
entry and document digitization by extracting text from images, which can then be processed
and analysed by computers. OCR models typically use techniques from computer vision and
natural language processing to recognize text characters and convert them into digital text.
OCR models can be trained on large datasets of images and corresponding text to improve
their accuracy and recognize a wide range of fonts and languages.
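As a small, self-contained illustration of the idea (this uses the open-source Tesseract engine through the pytesseract wrapper, not the ML Kit approach used later in this project, and assumes Tesseract is installed on the machine; the file name is a placeholder):

from PIL import Image
import pytesseract

# Extract machine-readable text from an image file.
image = Image.open('scanned_page.png')
text = pytesseract.image_to_string(image)
print(text)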
1.6 Objective of our project
The first goal of this thesis is to create an instruction manual for the development process of
the application (the Snap_Seeker Android application), describing which machine learning framework
was used and including information about machine learning methods, models, general Android
applications, techniques, suitability, and more.
Secondly, to fully understand the basic concepts of machine learning development, an "object
classification" example demonstrating the development process is an essential factor. The project
uses a machine learning framework, which simplifies the development process while also showcasing
the required setup and dependencies when using such a framework.
The objective of a text-from-image recognition project is to develop an algorithm or model that
can accurately identify and extract text from images. This is typically done using a
combination of computer vision techniques and natural language processing (NLP) methods.
The extracted text can be used for a variety of purposes, such as:
Improving accessibility: Text-from-image recognition can help make images more accessible, for
example to users who rely on screen readers.
Enhancing searchability: Extracted text can be used to improve the searchability of images by
allowing users to search for specific keywords or phrases within the image.
Automated data entry: Text-from-image recognition can automate data entry by extracting text
from documents such as forms and receipts.
Translation: Extracted text can be used for translation purposes, such as translating foreign-language
text in images into the user's language.
Overall, the objective of text-from-image recognition is to make the information contained in
images easy to access, search, and reuse.
CHAPTER – 2
SURVEY OF TECHNOLOGIES
2.1 Software requirements
● Anaconda
Anaconda is a distribution of the Python and R programming languages for scientific
computing that aims to simplify package management and deployment. The
distribution includes data-science packages suitable for Windows, Linux, and macOS.
We used Anaconda as the environment for our Python installation and trained our
model in it. It managed all the packages such as OpenCV, NumPy, and Keras, as well as the
CPU distribution for fast execution.
● Jupyter Notebook
The Jupyter Notebook is an open-source web application that you can use to create and
share documents that contain live code, equations, visualisations, and text.
Jupyter Notebook offered a lot of tools during the development of the project.
We were able to plot graphs and execute code in a single file, which reduces the time
needed to experiment and iterate.
● OpenCV
OpenCV (Open-Source Computer Vision) library is a widely-used open-source software
library for developing computer vision and machine learning applications. In the field of
optical character recognition (OCR), OpenCV provides a range of image processing and
computer vision functions that are essential for the development of OCR systems.
OpenCV can be used for a variety of OCR tasks, such as image preprocessing, text detection,
character recognition, and post-processing. It includes a variety of algorithms and
methods
that can be used for these tasks, including edge detection, morphological operations,
connected component analysis, template matching, machine learning-based OCR,
and deep learning-based OCR.
● Matplotlib
Matplotlib is a low-level graph plotting library in Python that serves as a visualisation
utility. In our project we needed to plot the output of our training data to evaluate the
model's performance.
● Android
● iOS
The usage of TensorFlow Lite on Android and iOS platforms is very straightforward.
Both platforms require the framework as a dependency. The only real difference is that on
iOS, the project must use the CocoaPods dependency manager.
● Android Studio
Android Studio is an IDE for Android application development, and we used this
IDE for our project development. Android Studio gives us compatibility with the
Android SDK, emulator, and build tools needed to develop and test the application.
● Computer
We used an HP laptop with an Intel i5 processor for the development of our application. The
Android Studio IDE was installed on this machine.
● GPU
We used an Nvidia GPU for training our deep learning model, because the training
involved a lot of data, and large datasets require large computational power.
● Mobile Phone
● Data Cable
We needed a data cable to run, test, and use the application on an Android mobile phone.
CHAPTER – 3
WORKING OF THE APPLICATION
In this project we build an Android application. The application has a machine
learning model implemented inside it, which is loaded when the application starts. Through
the application we open the mobile phone camera and take a photo of the image that we want
to read text from.
The image from the camera is then compressed and sent to the ML model which is integrated
in the application. The application shows the recognised text, which we can edit according to
our needs.
The next step of the application is to gather information from the image. For this purpose, we
also support images from the gallery, in addition to the camera, and convert them into
recognised text after processing.
The pipeline starts with an input image, which is processed through the following stages (an
illustrative sketch of how these stages chain together follows the list):
1. Image Pre-processing: This stage involves image enhancement techniques such as
noise removal, contrast adjustment, and edge detection. The output of this stage is a
cleaned-up image that is easier for the later stages to work with.
2. Segmentation: In this stage, individual characters or words are
separated from the pre-processed image. This stage may involve techniques such as
connected component analysis or projection-based splitting.
3. Feature Extraction: This stage involves extracting relevant features from the
segmented characters. Features such as edges, corners, and texture are commonly
used.
4. Character Recognition: In this stage, the feature vectors generated in the previous
stage are used to recognize the characters. This stage involves techniques such as
template matching and neural-network classifiers.
5. Postprocessing: This stage involves refining the recognized characters to improve the
accuracy of the final text.
6. Output Text Generation: In this stage, the recognized and refined characters are
combined to generate the output text. This output text can be in a digital format such
as plain text.
7. Verification: In this stage, the accuracy of the output text is verified, for example by
checking words against a dictionary or by reviewing confidence scores.
9. Output: This stage involves presenting the final output to the user. The output can be
displayed on screen, edited, copied, or shared.
10. In the accompanying diagram, each stage of the OCR model is represented by a colourful
image, making it easier to follow the flow from input image to output text.
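The sketch below is a purely illustrative outline of how these stages could be chained in Python; the function names and bodies are placeholders and do not reproduce the project's actual implementation.

import cv2

def preprocess(image):
    # Stage 1: grayscale conversion plus adaptive thresholding as a simple clean-up.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)

def recognise(binary_image):
    # Stages 2-4 (segmentation, feature extraction, character recognition) are
    # performed by the OCR model; a fixed string stands in for its output here.
    return "  recognised   text  "

def postprocess(text):
    # Stage 5: a trivial clean-up step, e.g. normalising whitespace.
    return " ".join(text.split())

image = cv2.imread('input.jpg')                # placeholder input image
print(postprocess(recognise(preprocess(image))))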
CHAPTER - 4
IMPLEMENTATION OF APPLICATION
Another crucial pre-processing step is image de-skewing, which involves correcting the
image orientation by aligning the text regions along a horizontal or vertical axis. This helps to
improve the accuracy of the OCR algorithm by reducing any skew or distortion in the text
regions that can make it difficult for the algorithm to recognize the characters.
Overall, image pre-processing is an essential step in OCR that significantly impacts the
accuracy and speed of character recognition. By enhancing the image quality and removing
any noise or unwanted artifacts, image preprocessing can help to ensure that the OCR
algorithm can accurately recognize and extract the text from the image.
The main objective of the pre-processing phase is to make it as easy as possible for the OCR system to
distinguish a character or word from the background. Some of the most basic and important pre-processing
techniques are:
1) Binarization
2) Skew Correction
3) Noise Removal
Before discussing these techniques, let's understand how an OCR system sees an image. An image
is stored as a matrix of pixel values (a 2D array if the image is grayscale or binary, a 3D array
if the image is coloured). Each cell in the matrix is called a pixel, and it can store an 8-bit
integer, which means a pixel value lies in the range 0–255.
1) Binarization: Binarization converts a coloured or grayscale image into an image which consists
of only black and white pixels (black pixel value = 0 and white pixel value = 255). As a basic
rule, this can be done by fixing a threshold (usually the middle of the pixel range 0–255). If
the pixel value is greater than the threshold, it is treated as a white pixel; otherwise it is
treated as a black pixel.
Binarization conditions. Source: Image by author
But this strategy may not always give us the desired results. In cases where the lighting
conditions are not uniform across the image, a single fixed threshold fails (Figure 9).
Figure 9. Binarization
So, the crucial part of binarization is determining the threshold, and this can be done in
different ways. One option is to compute a local threshold C(i,j) from the extreme pixel values
of a region using the formula shown in Figure 10, where Imax is the maximum pixel value in the
region, Imin is the minimum pixel value, and E is a constant. Source: Reference [2]
C(i,j) is the threshold for a defined size of locality in the image (like a 10x10 size
part). Using this strategy we’ll have different threshold values for different parts of
the image, depending on the surrounding lighting conditions but the transition is not
that smooth.
→ Otsu's Binarization: This method computes a single threshold for the whole image based on
the image's overall characteristics (lighting conditions, contrast, sharpness, etc.), and that
threshold is used for binarizing the image (an OpenCV sketch is shown after the
adaptive-thresholding example below).
→ Adaptive Thresholding: This method computes a threshold for each small part of the
image depending on the characteristics of its locality and neighbours, i.e. there is no
single fixed threshold for the whole image; every small part of the image has a
different threshold depending upon its locality, which also gives a smooth transition.
import cv2

# Read the image in grayscale, then apply Gaussian adaptive thresholding.
img = cv2.imread('document.png', cv2.IMREAD_GRAYSCALE)
imgf = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                             cv2.THRESH_BINARY, 11, 2)  # imgf contains the binary image
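For comparison, Otsu's global threshold from the first point above can be obtained in OpenCV as follows; this is a sketch assuming a grayscale input image with a placeholder file name.

import cv2

# Otsu's method picks one global threshold from the image histogram.
gray = cv2.imread('document.png', cv2.IMREAD_GRAYSCALE)
threshold_value, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(threshold_value)   # the single threshold chosen for the whole image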
2) Skew Correction: While scanning or photographing a document, the resulting image is sometimes
skewed (aligned at a certain angle with the horizontal). While extracting the
information from the scanned image, detecting and correcting the skew is crucial. Some of the
methods used for skew correction are:
→ Projection profile method
→ Topline method
→ Scanline method
However, the projection profile method is the simplest, easiest and most widely used. It works
as follows:
• Take the binary image and project it horizontally (taking the sum of pixels along the rows
of the image matrix) to get a histogram of pixels along the height of the image, i.e. the
projection profile.
• Now the image is rotated at various angles (at a small interval of angles
called delta) and the difference between the histogram peaks is calculated
(variance can also be used as one of the metrics). The angle at which the
difference between peaks (or the variance) is maximum is the skew angle.
• After finding the Skew angle, we can correct the skewness by rotating
the image through an angle equal to the skew angle in the opposite
direction of skew.
Correcting skew using the Projection Profile method. Source: Reference [1]
import sys
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image as im
from scipy import ndimage as inter

input_file = sys.argv[1]
img = im.open(input_file)

# Convert the input image to a binary (0/1) array
wd, ht = img.size
pix = np.array(img.convert('1').getdata(), np.uint8)
bin_img = 1 - (pix.reshape((ht, wd)) / 255.0)
plt.imshow(bin_img, cmap='gray')
plt.savefig('binary.png')

def find_score(arr, angle):
    # Rotate the image, take the horizontal projection profile and score it
    data = inter.rotate(arr, angle, reshape=False, order=0)
    hist = np.sum(data, axis=1)
    score = np.sum((hist[1:] - hist[:-1]) ** 2)
    return hist, score

# Try rotation angles between -limit and +limit degrees in steps of delta
delta = 1
limit = 5
angles = np.arange(-limit, limit + delta, delta)
scores = []
for angle in angles:
    hist, score = find_score(bin_img, angle)
    scores.append(score)

best_score = max(scores)
best_angle = angles[scores.index(best_score)]
print('Best angle: {}'.format(best_angle))

# Correct the skew by rotating through the best angle
data = inter.rotate(bin_img, best_angle, reshape=False, order=0)
img = im.fromarray((255 * data).astype("uint8")).convert("RGB")
img.save('skew_corrected.png')
3) Noise Removal: The aim of noise removal is to remove small dots and patches that have a higher
intensity than the rest of the image. Noise removal can be performed directly on the coloured
image, for example as follows:
import numpy as np
import cv2
from matplotlib import pyplot as plt

# Read the image from the folder where it is stored
img = cv2.imread('bear.png')

# Denoise the image and save the result into dst
dst = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)

# Plot the source and destination images side by side
plt.subplot(121), plt.imshow(img)
plt.subplot(122), plt.imshow(dst)
plt.show()
4) Thinning and Skeletonization: This is an optional pre-processing task, and whether it is
needed depends on the kind of text being recognised.
→ If we are using the OCR system for printed text, there is no need to perform this
task, because printed text always has a uniform stroke width.
→ If we are using the OCR system for handwritten text, this task has to be
performed, since different writers have different styles of writing and hence
different stroke widths. So, to make the width of the strokes uniform, we have to perform
thinning and skeletonization, as sketched in the code below.
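The snippet below is a minimal, assumed sketch of erosion-based thinning with OpenCV; the input file name, kernel size, and iteration count are illustrative values only.

import cv2
import numpy as np

# Assumed sketch: thin the strokes of a binary (text = white) image by erosion.
img = cv2.imread('handwritten.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
kernel = np.ones((3, 3), np.uint8)            # structuring element
thinned = cv2.erode(binary, kernel, iterations=1)
cv2.imwrite('thinned.png', thinned)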
In the above code, the thinning of the image depends upon the kernel size and the number of
iterations. In this chapter, we have seen some of the basic and most widely used
pre-processing techniques, which give us a basic idea of what happens inside an
OCR system.
4.2 Text-Detection
In general, an object detection problem in computer vision refers to the task of detecting object positions in
images. The output of such algorithms is a list of bounding boxes corresponding to the positions of each
object detected in the image.
Using the same extracted features as in object detection, the set of (N, N) spatial features, we
now need to upscale those back to the image dimensions (H, W). Remember that N is smaller
than H or W.
Feature upscaling:
Especially if our (N, N) matrices are much smaller than the image, basic upsampling would
bring no value. Rather than just interpolating our matrices to (H, W), architectures have
different tricks to learn those upscaling operations such as transposed convolutions. We then
obtain fine-grained features for each pixel of the image.
Binary classification:
Using the many features of these pixels, a few operations are performed to determine their
category. In our case, we will only determine whether the pixel belongs to a word, which
produces a result like the “segmentation” illustration from the previous figure.
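As an illustration only (not the exact architecture used by ML Kit's detector), a transposed convolution in Keras can learn such an upscaling and feed a per-pixel text/no-text classifier; the shapes and layer sizes below are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

# Coarse (N, N) feature maps are upscaled with learned transposed convolutions,
# then a 1x1 convolution scores each pixel as "text" or "not text".
inputs = tf.keras.Input(shape=(32, 32, 64))                       # assumed (N, N) features
x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(inputs)
x = layers.Conv2DTranspose(16, 3, strides=2, padding='same', activation='relu')(x)
outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)            # per-pixel text score
model = tf.keras.Model(inputs, outputs)
model.summary()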
4.3 Text-Recognition
The Text Recognizer segments text into blocks, lines, elements and symbols. Roughly speaking:
• a Block is a contiguous set of text lines, such as a paragraph or column,
• a Line is a contiguous set of words on the same axis, and
• an Element is a contiguous set of alphanumeric characters (a "word") on the same axis
in most Latin languages, or a character in others, and
• a Symbol is a single alphanumeric character on the same axis in most Latin
languages, or a character in others.
The image below highlights examples of each of these in descending order. The first
highlighted block, in cyan, is a Block of text. The second set of highlighted blocks, in blue,
are Lines of text. Finally, the third set of highlighted blocks, in dark blue, are Words.
Figure 13. Text Recognition
For all detected blocks, lines, elements and symbols, the API returns the bounding boxes,
corner points, rotation information, confidence score, recognized languages and recognized
text.
4.4 Postprocessing
Post-processing is an important step in OCR that involves analysing and correcting errors in
the recognised text output. OCR algorithms may make mistakes in character recognition due
to various factors such as image quality, font styles, and language complexities. Post-processing
techniques help to correct these errors and improve the accuracy of the final text
output.
Post processing typically involves comparing the recognized text output against a dictionary
of known words and analysing the context and grammar of the text to identify and correct
errors. It can also involve using predefined rules or machine learning algorithms to correct
specific types of errors.
The goal of post processing is to ensure that the recognized text output is accurate and reliable
for use in various applications such as text search, transcription, and translation. By applying
post processing techniques, the accuracy of the final text output can be significantly improved,
making it more suitable for use in real-world scenarios.
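A toy sketch of dictionary-based correction is shown below; the vocabulary and misspelt words are invented for illustration, and this is not the post-processing used by ML Kit or by this project.

import difflib

# Snap each recognised word to the closest vocabulary entry when a close match exists.
VOCAB = ["optical", "character", "recognition", "machine", "learning"]

def correct(word, cutoff=0.8):
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print([correct(w) for w in ["0ptical", "charactor", "recogn1tion"]])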
CHAPTER - 5
ABOUT THE MODEL
The Convolutional Neural Network analyses the image and sends the important features it detects
to the recurrent part of the network. The recurrent part analyses these features in order, taking
previous information into consideration in order to work out which links between these features
influence the output.
To understand a bit more about how a CRNN works in some tasks, let’s take Handwritten Text
Recognition as an example.
Let's imagine we have images containing words, and we want to train the neural network to tell
us which word is in the image.
Firstly, we would want our Neural Network to be able to extract important features for
different letters, such as loops from “g” or “l”, or even circles from “a” or “o”. For this, we can
use a Convolutional Neural Network. As explained earlier, CNN uses filters in order to extract
the important features (we saw how different filters have different effects on the initial image).
Of course, these filters will detect in practice more abstract features that we can’t really
understand, but intuitively we can think of simpler features, such as the ones mentioned
earlier.
Then, we would want to analyse these features. Let’s take a look as to why we can’t decide
what a letter is based solely on its own features. In the image below, we see that the letter is
either "a" (from "aux") or "o" (from "for").
The difference is made by the way the letter is linked to the other letters. So, we would need to
know information from previous places in the image in order to determine the letter. Sounds
familiar? This is where the RNN part comes in. It recursively analyses the information
extracted by the CNN, where the input for each cell might be the features detected in a specific
slice of the image, as represented below, with only 10 slices (less than we would use in real
models):
Figure 16: Illustration of CNN
We don’t feed the RNN the image itself, as shown in the above image, but rather the features
extracted from that “slice”.
We might also see that processing the image forward is as important as processing the image
backward, so we can add a layer of cells that process the features in the other way, taking into
consideration both of them when computing the output. Or even vertically, depending on the
task at hand.
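To make the CNN-plus-RNN idea concrete, the Keras sketch below builds a tiny CRNN for fixed-size word images; the input size, layer sizes, and character-set size are assumptions for illustration, not the model used by ML Kit or this project.

import tensorflow as tf
from tensorflow.keras import layers

# CNN layers extract features, the width axis becomes the sequence of "slices",
# a bidirectional LSTM reads the slices, and a softmax predicts a character per step.
inputs = tf.keras.Input(shape=(32, 128, 1))                     # height x width x channels
x = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)                    # -> (16, 64, 32)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 1))(x)                    # -> (8, 64, 64)
x = layers.Permute((2, 1, 3))(x)                                # width becomes the time axis
x = layers.Reshape((64, 8 * 64))(x)                             # one feature vector per slice
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.Dense(80, activation='softmax')(x)             # assumed character-set size
model = tf.keras.Model(inputs, outputs)
model.summary()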
models/research/attention_ocr
Open the file named 'common_flags.py' and specify where you want to log your training, then
run the training script from your terminal.
Figure 17: ML Kit architecture
Machine Learning Kit (ML Kit) is a mobile SDK provided by Google that allows
developers to easily integrate machine learning features into their Android and iOS apps.
ML Kit offers a range of pre-built APIs and models for tasks like text recognition, face
detection, image labelling, and language identification, making it easy for developers to
add machine learning capabilities to their apps without needing to develop custom
models or algorithms. The architecture consists of three main components:
• Base API: The base API provides common functionality like camera input, model
management, and results output for all ML Kit APIs. It also handles tasks like
image scaling, rotation, and colour conversion.
• Vision API: The Vision API includes a range of pre-built models and algorithms
for tasks like face detection, text recognition, image labelling, and barcode
scanning. It also provides APIs for custom model integration and cloud-based
model hosting.
• Natural Language Processing (NLP) API: The NLP API includes pre-built
models and algorithms for tasks like language identification, sentiment analysis,
and entity recognition. It also provides APIs for custom model integration and
cloud-based model hosting.
Developers can choose to use any combination of these components based on their
app's requirements.
Overall, the ML Kit architecture is designed to be easy to use and integrate into mobile
apps while providing developers with the flexibility to choose the specific machine
learning features they need.
CHAPTER – 6
ANDROID DEVELOPMENT
Android implementation uses the ML Kit library as its machine learning framework.
The first steps of development on the Android platform were to create a new project with
Android Studio from an empty template and to define the application layout, along with adding
the required dependencies.
• We explored how text detection works: it uses the ML Kit API to process the captured
image and convert it into recognised text.
• Create a project on Android Studio with one blank Activity. Add the Google Play
services dependency to it:
• Our main and only Activity file is MainActivity.java and layout xml file is
activity_main.xml. activity_main.xml: We have one Surface View to show the camera
view and one Text View to show the detected text.
• In the MainActivity, check whether the camera permission has been granted. If not,
request it.
• Set a processor on the Text Recognizer to detect whether any text is visible on the
camera screen. We receive a callback and update the Text View shown over the
camera preview, and then start the camera source.
Dependency name      Dependency source
Layout               'androidx.constraintlayout:constraintlayout:2.1.4'

Main Activity: This is the file where the main activity code is written.
Android Manifest: This file provides the permissions for the Camera and External Storage.
The Choose Model screen (Main Activity) opens when the user launches the application. The choose
model activity loads the application layout and handles user interactions. On choosing the
camera or gallery option, the corresponding image intent is started.
6.2 Image Intent: Captured Image
6.3 Model Conversion of Image into Text: Converted into Text
Figure 20 shows the text converted from the image.
After successfully capturing the image, you get to the converted text. In this class, the
recognised text returned by the model is handled and shown to the user.
Data Passing to the Model and Conversion:
Now we pass the data to the ML Kit model and display the predictions on the screen.
CHAPTER – 7
CONCLUSION & FUTURE SCOPE
In this thesis we have presented the approach of building a machine learning application for
mobile. We have created an OCR text recognition feature with a CNN model that is built into the
Google machine learning architecture. Utilising multiple base models and OCR text
recognition, we successfully trained and implemented a variety of models that could be used
for recognising text on a mobile device.
This Bachelor's thesis aimed to improve basic knowledge of machine learning related
processes and techniques. All goals set for this thesis were achieved. The theoretical part of
this thesis answered the question of how to run a machine learning model in a low-resource
mobile environment.
Using the Google ML Kit API really helped us in achieving the goal of
building a machine learning model. We used a CNN as our machine learning model and used
the dataset provided with Google's ML Kit to train and test it.
After successfully training our model, we reached an accuracy of 99.1% through multiple rounds
of trial and error.
In the future we would like to take the accuracy from 99.1% towards 99.9%. Currently we capture
an image and convert it into text in a basic-looking app, and we have also added the functionality
to take images from the gallery and convert them into text. We will keep working to make the
application user-friendly, with a powerful UI and more functionality.
We cannot have a large model inside an Android application, as it would affect the performance
of the app. To tackle this problem, we will shift the model to the cloud so that there will be no
impact on the application's size or performance.
REFERENCES
1. Hardt, Moritz, Price, Eric & Srebro, Nati. “Equality of opportunity in supervised learning.” Advances
in neural information processing systems. 2016.
2. Stuart J. Russell, Peter Norvig (2010) Artificial Intelligence: A Modern Approach, Third Edition,
Prentice Hall ISBN 9780136042594.
3. Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012) Foundations of Machine Learning, The
MIT Press ISBN 9780262018258.
4. S. Geman, E. Bienenstock, and R. Doursat (1992). Neural networks and the bias/variance dilemma.
Neural Computation 4, 1–58.
6. Assefi, Mehdi (December 2016). "OCR as a Service: An Experimental Evaluation of Google Docs
OCR, Tesseract, ABBYY FineReader, and Transym". ResearchGate.
7. Ashok Popat (Sep 4, 2015). "IEEE SPS: Optical Character Recognition for Most of the World's
Languages" Retrieved 2021-12-20.
8. Ian Goodfellow and Yoshua Bengio and Aaron Courville (2016). Deep Learning. MIT Press. p. 326.
9. Collobert, Ronan; Weston, Jason (2008-01-01). A Unified Architecture for Natural Language
Processing: Deep Neural Networks with Multitask Learning. Proceedings of the 25th International
Conference on Machine Learning. ICML '08. New York, NY, USA: ACM. pp. 160–167.
doi:10.1145/1390156.1390177. ISBN 978-1-60558-205-4. S2CID 2617020.
10. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with
deep convolutional neural networks" (PDF). Communications of the ACM. 60 (6): 84–90.
doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774.
11. Burnette, Ed (July 13, 2010). Hello, Android: Introducing Google's Mobile Development Platform. Pragmatic Bookshelf.
12. "SDK Tools | Android Developers". Developer.android.com. Retrieved April 25, 2018.
13. Loukas, S. 2020. What is Machine Learning: Supervised, Unsupervised, Semi Supervised and
Reinforcement learning methods. Digital article. Published 10.6.2020. Read 24.8.2021.
https://towardsdatascience.com/what-is-machine-learning-a-short-noteonsupervisedunsupervised-semi-supervised-and-aed1573ae9bb
14. Brahim Elbouchikhi. A Basic Introduction to Google Machine Learning Kit. Digital article. Published
May 09, 2018 Read 02.02.2023.
15. Stephen Perkins. On-device machine learning can make smartphones even better. Published 26.04.2023
https://www.androidpolice.com/google-ml-kit-explainer/
16. Brahim Elbouchikhi. Machine Learning Kit Main page Published 09-05-2018 Read 02.02.2023.
https://developers.google.com/ml-kit
18. Sorana. Basic Overview of Convolutional Neural Network. Published November 24th, 2020.
https://www.analyticsvidhya.com/blog/2020/11/a-short-intuitive-explanation-of-convolutionalrecurrent-neural-networks/
APPENDIX
The complete source code for the Android mobile application is publicly available online.