Apache Tika

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Blitline supports information retrieval from documents such as PDF and XLS. Not only can Blitline rasterize documents into images, it can also retrieve the data stored within those documents. This lets you retrieve the text of various documents (such as PDF, Word, or EPUB) along with their thumbnails.

A common use case for this would be to get the text from PDF documents while thumbnailing them, and then push that text and metadata into an Elasticsearch system for indexing.
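The indexing half of that use case might look roughly like the sketch below. It only formats a request body for the Elasticsearch bulk API (an action line followed by a source line, newline-delimited). The function name `to_bulk_body`, the index name `documents`, and the `text` / `metadata` field names are all hypothetical; the shape in which Blitline returns the Tika output is not described here, so the sketch assumes you have already pulled out the extracted text and a metadata dictionary.

```python
import json


def to_bulk_body(doc_id, text, metadata, index_name="documents"):
    """Format one document as a newline-delimited Elasticsearch bulk body.

    Assumption: `text` and `metadata` were already taken from the Tika
    output; the field and index names below are illustrative only.
    """
    # Action line: tells Elasticsearch which index and id to write to.
    action = {"index": {"_index": index_name, "_id": doc_id}}
    # Source line: the document that actually gets indexed.
    source = {"text": text, "metadata": metadata}
    # The bulk API expects every line, including the last, to end with "\n".
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"


body = to_bulk_body(
    "contoso-report",
    "Quarterly results ...",
    {"Content-Type": "application/pdf"},
)
print(body)
```

Keeping the formatting separate from the HTTP call makes it easy to batch several documents by concatenating their bodies before sending them.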

HOW TO USE IT

Just add the "get_tika" : "true" option to the root of your job JSON.

{
    "application_id":"YOUR_APP_ID",
    "src":"https://s3.amazonaws.com/blitdoc/docx/Contoso.xlsx",
    "get_tika" : "true",
    "v" : 1.22,
    "functions":[
        {
            "name":"crop",
            "params":{
                "gravity": "NorthGravity",
                "width":100
            },
            "save":{
                "image_identifier":"MY_CLIENT_ID"
            }
        }
    ]
}

You can see a working example here: https://www.blitline.com/v3/home/gist?gist_id=ef8c2d03e21d18d2aa4649bb9b254da5
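If you build jobs programmatically, the same JSON can be assembled in code. The following is a minimal sketch, assuming the job structure shown above; the helper name `build_tika_job` is hypothetical, and the sketch stops at serialization rather than submitting the job.

```python
import json


def build_tika_job(application_id, src, functions):
    """Return a job dictionary with Tika extraction switched on.

    Hypothetical helper: it mirrors the JSON example in this section
    and adds nothing beyond the keys shown there.
    """
    return {
        "application_id": application_id,
        "src": src,
        # Root-level option that requests Tika text and metadata.
        "get_tika": "true",
        "v": 1.22,
        "functions": functions,
    }


# Same crop function as in the JSON example.
crop = {
    "name": "crop",
    "params": {"gravity": "NorthGravity", "width": 100},
    "save": {"image_identifier": "MY_CLIENT_ID"},
}

job = build_tika_job(
    "YOUR_APP_ID",
    "https://s3.amazonaws.com/blitdoc/docx/Contoso.xlsx",
    [crop],
)
print(json.dumps(job, indent=4))
```

Note that "get_tika" sits at the root of the job, next to "src" and "functions", not inside an individual function's "params".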