Hello friends,
Azure OCR (Optical Character Recognition) is a powerful AI as a Service offering that makes it easy for you to detect text from images.
In this tutorial, we will start getting our hands dirty. That is, we will begin developing real AI software that solves a genuine business problem, so that you feel you are both learning and building something with a real value proposition.
Let me start with a frequent problem I face: as a Muslim, I have dietary constraints regarding food components and E numbers. I spend considerable time in the supermarket trying to translate words and look them up. It would be convenient for me (and for anyone with dietary constraints due to religious, medical, or ethical reasons) to have a mobile app that you use to scan a product and that immediately tells you whether any of your dietary constraints apply. Thanks to Azure OCR, which will help us implement the solution.
Scoping the Business Problem
To set expectations clearly, I am not planning to implement a fully working end-to-end solution. The main focus here will be implementing the cognitive service that enables the core functionality. I will share the source code on GitHub, and anyone is more than welcome to expand it and develop it further.
Therefore, our solution will rely on the following cognitive API:
OCR detection API: This API will be responsible for returning the text extracted from an image. We will use it to extract the food components and match the extracted text against an existing restriction database.
Vendor Choice
As I discussed in my last blog post (Democratized AI – AI as a Service for all!), many platforms provide AIaaS, such as GCP, AWS, and Azure. For this tutorial, I am choosing the Microsoft Azure offering. Feel free to pick Amazon or Google instead; just make sure to compare prices, supported languages, text orientation support, and any other parameter that is significant to you.
Preparing Your Azure Account
First and foremost, make sure you have a valid Azure account. Follow this tutorial from Microsoft to get your Azure account ready and your cognitive services resource created.
Preparing Your Project
I will be using Visual Studio 2017 Enterprise edition as the IDE to develop the code sample. The code samples are written in C# and ASP.NET Core 2.2. Note that you are entirely free to use your own preferred programming language/framework; the APIs are platform agnostic, as discussed previously. Our implementation will rely on the .NET SDK; other SDKs can be found here. I will provide the URL to the GitHub project containing the code sample. The functionality's core logic relies on a code sample from Microsoft MSDN, with adaptations to meet our requirements.
Developing the Solution
- Preparing the restriction list:
First, I created a simple static class that has a property containing all restricted food components:
public static class Helpers
{
    // Restricted food components to match against (the list name here is illustrative)
    public static List<string> RestrictedFoodComponents = new List<string>(new string[] { "tiamin" });
}
I have chosen "tiamin" (thiamine) as a test sample. Feel free to add more components if you wish.
- Defining the OCR Service:
Even though this blog post's focus is developing a dietary constraint detection solution in the fastest way possible, I am trying to keep things a little bit cleaner.
- OCR Service interface:
public interface IOCRService
{
    Task<string[]> DetectTextInImageAsync(Stream image);
}
The interface is pretty simple. It takes an image stream (an image of the ingredients section of a product) and returns a task that yields a string array containing all detected text lines.
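To make the contract concrete, here is a minimal, hypothetical helper showing how a caller might use the interface (the method and parameter names are illustrative and not part of the sample project):

// Hypothetical usage: detect text in an ingredients photo and print every line.
public static async Task PrintDetectedLinesAsync(IOCRService ocrService, string imagePath)
{
    using (var stream = File.OpenRead(imagePath))
    {
        string[] lines = await ocrService.DetectTextInImageAsync(stream);
        foreach (var line in lines)
        {
            Console.WriteLine(line);
        }
    }
}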
- The following NuGet package has to be added to the project (Microsoft.Azure.CognitiveServices.Vision.ComputerVision)
This is the computer vision SDK from Microsoft encapsulating HTTP calls to the cognitive services.
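Once the package is installed, the implementation below assumes the following namespaces are in scope; the exact namespaces may vary slightly depending on the SDK version you install:

using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.CognitiveServices.Vision.ComputerVision;
using Microsoft.Azure.CognitiveServices.Vision.ComputerVision.Models;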
- Azure OCR Service implementation:
public class AzureOCRService : IOCRService
{
    private const int numberOfCharsInOperationId = 36;

    string subscriptionKey = "YOUR SUBSCRIPTION KEY"; // Change to your key
    string cognitiveServiceEndPoint = "https://francecentral.api.cognitive.microsoft.com/"; // Change to your resource's endpoint

    // For handwritten text, change to TextRecognitionMode.Handwritten
    TextRecognitionMode textRecognitionMode = TextRecognitionMode.Printed;

    ComputerVisionClient computerVision;

    public AzureOCRService()
    {
        computerVision = new ComputerVisionClient(
            new ApiKeyServiceClientCredentials(subscriptionKey),
            new System.Net.Http.DelegatingHandler[] { });
        computerVision.Endpoint = cognitiveServiceEndPoint;
    }

    public async Task<string[]> DetectTextInImageAsync(Stream image)
    {
        return await ExtractLocalTextAsync(image);
    }

    // Recognize text from a local image
    private async Task<string[]> ExtractLocalTextAsync(Stream imageStream)
    {
        // Start the async process to recognize the text
        RecognizeTextInStreamHeaders textHeaders =
            await computerVision.RecognizeTextInStreamAsync(
                imageStream, textRecognitionMode);

        return await GetTextAsync(computerVision, textHeaders.OperationLocation);
    }

    private async Task<string[]> GetTextAsync(
        ComputerVisionClient computerVision, string operationLocation)
    {
        // Retrieve the operation ID (the last 36 characters) from the
        // Operation-Location header returned by the service
        string operationId = operationLocation.Substring(
            operationLocation.Length - numberOfCharsInOperationId);

        TextOperationResult result =
            await computerVision.GetTextOperationResultAsync(operationId);

        // Wait for the operation to complete, polling once per second
        int i = 0;
        int maxRetries = 10;
        while ((result.Status == TextOperationStatusCodes.Running ||
                result.Status == TextOperationStatusCodes.NotStarted) && i++ < maxRetries)
        {
            await Task.Delay(1000);
            result = await computerVision.GetTextOperationResultAsync(operationId);
        }

        var lines = result.RecognitionResult.Lines;
        return lines.Select(l => l.Text).ToArray();
    }
}
The AzureOCRService class is the concrete implementation of IOCRService. The class defines the following fields:
numberOfCharsInOperationId: A fixed value of 36. It is simply the length of the operation ID that the cognitive services resource returns as part of the Operation-Location header.
subscriptionKey: The API key used to authenticate requests to your cognitive services resource. You can find it in the Keys section of your cognitive services resource in the Azure portal.
cognitiveServiceEndPoint: The URL of the endpoint where your cognitive service is hosted. You can find it in the Overview section of your cognitive services resource in the Azure portal.
textRecognitionMode: An enum that accepts Handwritten or Printed; it tells the service whether we want to detect printed or handwritten text.
computerVision: The computer vision client used to perform the API calls. It conveniently wraps all the HTTP requests needed to perform the computer vision cognitive operations.
After defining the major class fields, let’s examine the class body and discuss the functions it uses.
public AzureOCRService()
Our constructor, which prepares the computer vision client subscription and endpoint details.
public async Task<string[]> DetectTextInImageAsync(Stream image)
The interface implementation from IOCRService; it simply wraps a call to ExtractLocalTextAsync. The function accepts a stream containing the image in which text should be detected.
private async Task<string[]> ExtractLocalTextAsync(Stream imageStream)
The ExtractLocalTextAsync function performs the following:
- Calls RecognizeTextInStreamAsync on computerVision. RecognizeTextInStreamAsync is an async function that accepts the image stream and a TextRecognitionMode. It returns a RecognizeTextInStreamHeaders object whose OperationLocation property wraps a unique identifier for the operation, valid for 48 hours, which can be used to query the operation status. The operation ID itself is 36 characters long.
Sample format: the Operation-Location value is a URL whose trailing 36 characters are the operation ID (a GUID).
- Calls GetTextAsync passing computerVision and operation location.
private async Task<string[]> GetTextAsync(ComputerVisionClient computerVision, string operationLocation)
- Extracts operation Id from operation location.
- Uses the computer vision client to query the status of the extracted operation ID by calling GetTextOperationResultAsync, which returns a TextOperationResult whose Status property is one of 'NotStarted', 'Running', 'Failed', or 'Succeeded'.
- Waits for the operation to finish with a limited number of retries; when it succeeds, it returns all detected lines from the TextOperationResult. The TextOperationResult also exposes word and bounding box properties that can be used to extract individual words, as sketched below.
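As a quick illustration, here is a minimal sketch of walking the same result down to the individual words and their bounding boxes (the property names follow the ComputerVision SDK models used above and may differ slightly between SDK versions):

// Sketch: list every detected word together with its bounding box coordinates.
foreach (var line in result.RecognitionResult.Lines)
{
    foreach (var word in line.Words)
    {
        Console.WriteLine($"{word.Text}: [{string.Join(", ", word.BoundingBox)}]");
    }
}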
Finally, our controller in ASP.NET Core 2.2 injects the OCR service, calls the DetectTextInImageAsync operation, and matches the result against the restriction list we defined in order to render the restricted words detected in the image. I am not including the code details of the ASP.NET Core part, as it is not the focus here. However, you can find everything in the code sample at the following URL: https://github.com/cognitiveosman/DietrayRestrictionsDetector
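For the curious, here is a minimal, hypothetical sketch of what such a controller could look like; the class, action, and parameter names are illustrative, and the actual code in the repository may differ:

public class ScanController : Controller
{
    private readonly IOCRService ocrService;

    // IOCRService is resolved by ASP.NET Core dependency injection,
    // e.g. services.AddScoped<IOCRService, AzureOCRService>(); in Startup.ConfigureServices.
    public ScanController(IOCRService ocrService)
    {
        this.ocrService = ocrService;
    }

    [HttpPost]
    public async Task<IActionResult> Scan(IFormFile productImage)
    {
        string[] detectedLines;
        using (var stream = productImage.OpenReadStream())
        {
            detectedLines = await ocrService.DetectTextInImageAsync(stream);
        }

        // Match every detected line against the restriction list (case-insensitive).
        var matches = Helpers.RestrictedFoodComponents
            .Where(r => detectedLines.Any(l => l.IndexOf(r, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();

        return View(matches);
    }
}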
Here is a screenshot of what our application looks like:
Feel free to expand upon it and innovate. Submit a GitHub PR if you wish to. Here are some ideas I have:
- Optimize the application UI and highlight the restricted words directly in the image.
- Expose the application functionality as an API.
- Develop a mobile front-end.
- Redesign the restrictions list to support multiple languages and explain the cause of each restriction.
Discussion Questions:
- Do you face a similar recurring problem?
- Is there something that feels too tedious to do manually every time?
- Can you think of other applications where text extraction services could be utilized?
And thanks for reading. Please feel free to ask questions or share feedback.