Data processing libraries
The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution.
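As a small illustration of what the standard library alone can do, the following sketch counts words with the Stream API and a parallel stream. The class name StreamSketch and the method wordCounts are made up for this example; only java.util and java.util.stream are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StreamSketch {

    // Counts how often each word occurs, ignoring case and empty tokens.
    // The result is a TreeMap, so the keys come back in alphabetical order.
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.parallelStream()                                // lines may be processed in parallel
                .flatMap(line -> Arrays.stream(line.split("\\s+")))  // split each line into words
                .filter(word -> !word.isEmpty())                     // drop empty tokens from leading spaces
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Java streams are lazy",
                "streams can run in parallel");
        // prints {are=1, can=1, in=1, java=1, lazy=1, parallel=1, run=1, streams=2}
        System.out.println(wordCounts(lines));
    }
}
```

Switching between sequential and parallel execution only requires replacing parallelStream() with stream(); the rest of the pipeline stays the same.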
There are very powerful extensions to the standard library such as:
- Google Guava (https://github.com/google/guava) and Apache Commons Collections (https://commons.apache.org/collections/) for richer collections
- Apache Commons IO (https://commons.apache.org/io/) for simplified I/O
- AOL Cyclops-React (https://github.com/aol/cyclops-react) for richer functional-style parallel streams
We will cover both the standard API for data processing and its extensions in Chapter 2, Data Processing Toolbox. In this book, we will use Maven to include external libraries such as Google Guava or Apache Commons IO. Maven is a dependency management tool that lets us declare external dependencies with a few lines of XML. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>
Once the dependency is declared, Maven downloads the specified version from the Maven Central repository. The easiest way to find dependency snippets for pom.xml (such as the previous one) is to search at https://mvnrepository.com or with your favorite search engine.
Java offers an easy way to access databases through Java Database Connectivity (JDBC), a unified database access API. JDBC makes it possible to connect to virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This allows us to move data manipulation from Java to the database side.
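A minimal sketch of this idea follows: the filtering and sorting are expressed in SQL and executed by the database, and Java only collects the result. The table people with columns name and age, the class name JdbcSketch, and the connection parameters are all assumptions made for illustration; a JDBC driver for the chosen database must be on the classpath for the connection to succeed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class JdbcSketch {

    // Runs the query on an open connection. The table "people(name, age)" is hypothetical.
    static List<String> namesOlderThan(Connection conn, int minAge) throws SQLException {
        String sql = "SELECT name FROM people WHERE age > ? ORDER BY name";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, minAge);                  // bind the parameter instead of concatenating strings
            try (ResultSet rs = stmt.executeQuery()) {
                List<String> names = new ArrayList<>();
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
                return names;
            }
        }
    }

    // Opens a connection for the given JDBC URL, runs the query, and closes the connection.
    static List<String> namesOlderThan(String url, String user, String password, int minAge)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password)) {
            return namesOlderThan(conn, minAge);
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.out.println("usage: JdbcSketch <jdbc-url> <user> <password>");
            return;
        }
        System.out.println(namesOlderThan(args[0], args[1], args[2], 30));
    }
}
```

Only the JDBC URL and the driver on the classpath change from one database to another; the query code itself stays the same as long as the SQL is portable.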
When it is not possible to use a database for handling tabular data, we can use DataFrame libraries to do it directly in Java. The DataFrame is a data structure familiar from R that makes it easy to manipulate tabular data within the program, without resorting to an external database.
For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition, or join with another DataFrame. Additionally, some DataFrame libraries make it easy to convert tabular data to matrix form so that the data can be used by machine learning algorithms.
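To make these operations concrete, here is a sketch that performs them on a list of rows using only standard Java streams. It does not use the API of any DataFrame library; the Person class, its fields, and all method names are hypothetical, and a real DataFrame library would typically express the same steps more compactly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TabularSketch {

    // One row of a hypothetical "people" table.
    static final class Person {
        final String name;
        final String city;
        final double age;

        Person(String name, String city, double age) {
            this.name = name;
            this.city = city;
            this.age = age;
        }
    }

    // Filter rows based on a condition.
    static List<Person> olderThan(List<Person> rows, double minAge) {
        return rows.stream().filter(p -> p.age > minAge).collect(Collectors.toList());
    }

    // Apply the same operation to each element of a column.
    static List<Double> agesInMonths(List<Person> rows) {
        return rows.stream().map(p -> p.age * 12).collect(Collectors.toList());
    }

    // Group by a column and aggregate: mean age per city, sorted by city.
    static Map<String, Double> meanAgeByCity(List<Person> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(p -> p.city, TreeMap::new, Collectors.averagingDouble(p -> p.age)));
    }

    // Inner join with a second table, represented here as a city -> country lookup.
    static List<String> joinWithCountry(List<Person> rows, Map<String, String> countryByCity) {
        return rows.stream()
                .filter(p -> countryByCity.containsKey(p.city))
                .map(p -> p.name + "," + countryByCity.get(p.city))
                .collect(Collectors.toList());
    }

    // Convert the numeric columns to a matrix, one row per person.
    static double[][] toMatrix(List<Person> rows) {
        return rows.stream().map(p -> new double[] {p.age}).toArray(double[][]::new);
    }

    public static void main(String[] args) {
        List<Person> rows = Arrays.asList(
                new Person("Alice", "Berlin", 30),
                new Person("Bob", "Berlin", 40),
                new Person("Carol", "Paris", 25));
        System.out.println(meanAgeByCity(rows));  // prints {Berlin=35.0, Paris=25.0}
        System.out.println(joinWithCountry(rows, Collections.singletonMap("Berlin", "Germany")));
    }
}
```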
There are a few data frame libraries available in Java. Some of them are as follows:
- Joinery (https://cardillo.github.io/joinery/)
- Tablesaw (https://github.com/lwhite1/tablesaw)
- Saddle (https://saddle.github.io/), a data frame library for Scala
- Apache Spark DataFrames (http://spark.apache.org/)
We will also cover databases and data frames in Chapter 2, Data Processing Toolbox, and we will use DataFrames throughout the book.
There are also more complex data processing libraries such as Spring Batch (http://projects.spring.io/spring-batch/). They allow us to create complex data pipelines (called ETL pipelines, short for Extract-Transform-Load) and to manage their execution.
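The three ETL stages can be sketched in plain Java as follows. This is not Spring Batch code and shows none of the scheduling, restart, or monitoring that such a library provides; the class name EtlSketch, the "product,amount" input format, and the "product=total" output format are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class EtlSketch {

    // Extract: read the raw lines from any character source.
    static List<String> extract(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines().collect(Collectors.toList());
        }
    }

    // Transform: parse "product,amount" lines and sum the amounts per product.
    // Lines that do not match the expected format are skipped.
    static Map<String, Double> transform(List<String> lines) {
        Map<String, Double> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length != 2) {
                continue;
            }
            try {
                double amount = Double.parseDouble(parts[1].trim());
                totals.merge(parts[0].trim(), amount, Double::sum);
            } catch (NumberFormatException e) {
                // malformed amount: skip this line
            }
        }
        return totals;
    }

    // Load: write one "product=total" line per product to the target.
    static void load(Map<String, Double> totals, Writer target) throws IOException {
        for (Map.Entry<String, Double> entry : totals.entrySet()) {
            target.write(entry.getKey() + "=" + entry.getValue() + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        String input = "apples,2.5\npears,1.0\napples,1.5\nbad line\n";
        StringWriter out = new StringWriter();
        load(transform(extract(new StringReader(input))), out);
        System.out.print(out);  // prints apples=4.0 and pears=1.0 on separate lines
    }
}
```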
Additionally, there are libraries for distributed data processing such as:
- Apache Hadoop (http://hadoop.apache.org/)
- Apache Spark (http://spark.apache.org/)
- Apache Flink (https://flink.apache.org/)
We will talk about distributed data processing in Chapter 9, Scaling Data Science.