Apache Hadoop can store large volumes of data across many nodes through its distributed file system (HDFS). It also includes the MapReduce framework, whose APIs let developers analyze the stored data. That data may be structured or unstructured, and it may arrive in any file format. With this in mind, we have released the first version of the Aspose for Hadoop project, which enables developers to work with a number of file formats. The initial version supports the following formats:
  • Microsoft Word (DOC)
  • WordprocessingML (DOCX, XML)
  • Rich Text Format (RTF)
  • HTML, XHTML and MHTML
  • OpenDocument (ODT)
  • Microsoft Excel (XLS)
  • SpreadsheetML (XLSX, XML)
  • OpenDocument Spreadsheet (ODS)
  • PresentationML (PPTX, XML)
  • Outlook Emails (MSG)

Using the Aspose for Hadoop project, Hadoop developers can extract text from any of the above formats. The text can then be fed into MapReduce analysis algorithms or used for any other purpose, depending on the use case. The project comes with two packages:
  • com.aspose.hadoop.core – Provides Aspose for Java wrapper classes that extract text from the above formats. The package also includes a couple of classes that override Hadoop input formats so that binary sequence files can be created.
  • com.aspose.hadoop.examples – Provides example mappers that create binary sequence files for all the supported formats and convert them into text sequence files.
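
The flow described above (binary records in, text records out, then analysis) can be illustrated with a minimal sketch in plain Java. It does not use the project's classes or the Hadoop API: the TextExtractor interface, the SequenceFileSketch class and its method names are all hypothetical, sequence files are modeled as in-memory maps from file name to content, and the extractor simply decodes UTF-8 bytes where a real wrapper would parse a document format.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a text-extraction wrapper.
// This is NOT the actual com.aspose.hadoop.core API.
interface TextExtractor {
    String extract(byte[] fileContents);
}

class SequenceFileSketch {

    // Models converting a "binary sequence file" (file name -> raw bytes)
    // into a "text sequence file" (file name -> extracted text).
    static Map<String, String> toTextRecords(Map<String, byte[]> binaryRecords,
                                             TextExtractor extractor) {
        Map<String, String> textRecords = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> record : binaryRecords.entrySet()) {
            textRecords.put(record.getKey(), extractor.extract(record.getValue()));
        }
        return textRecords;
    }

    // A word count over the extracted text, with the map and reduce steps
    // collapsed into one in-memory loop for illustration.
    static Map<String, Integer> countWords(Map<String, String> textRecords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : textRecords.values()) {
            for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder extractor: treats every file as UTF-8 plain text.
        TextExtractor plainText = bytes -> new String(bytes, StandardCharsets.UTF_8);

        Map<String, byte[]> binaryRecords = new LinkedHashMap<>();
        binaryRecords.put("a.txt", "big data on Hadoop".getBytes(StandardCharsets.UTF_8));
        binaryRecords.put("b.txt", "Hadoop stores big files".getBytes(StandardCharsets.UTF_8));

        Map<String, String> textRecords = toTextRecords(binaryRecords, plainText);
        System.out.println(countWords(textRecords));
    }
}
```

In an actual job the two steps would run as Hadoop mappers and reducers over sequence files on HDFS. Keeping extraction behind a small interface, as assumed here, means the analysis code does not depend on the source file format.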

Last edited Nov 11, 2013 at 7:47 AM by asposemarketplace, version 2