Bigdata Tech Blogs: October 2016

Scala is a hybrid programming language that provides both object-oriented and functional programming features. It runs on the JVM, and some people call it a 'better Java'. As in Java, Scala code is compiled to .class files (bytecode) and runs on the JVM. You can also use Java classes inside a Scala program.

Scala combines some of the good features of Java and Python. Java offers strong object orientation and efficient memory management (with garbage collection). Python offers functional-style programming, map/reduce APIs and a concise, expressive coding style. Here are some of the features of Scala:

Everything is an object:

All types are objects in Scala. Even a function is treated as an object and can be passed as an argument (as in JavaScript). For example:

   object FunctionAsArgument {
     def main(args: Array[String]): Unit = {
       val list1: List[Int] = List(87, 45, 56)
       println("list1: " + list1)

       // passing a named function as a parameter
       val list2 = list1.map(incr)
       println("list2: " + list2)

       // passing an anonymous function as a parameter
       val list3 = list1.map(x => x + 1)
       println("list3: " + list3)
     }

     def incr(x: Int): Int = {
       x + 1
     }
   }

Results:

   list1: List(87, 45, 56)
   list2: List(88, 46, 57)
   list3: List(88, 46, 57)
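Because functions are objects, they can also be stored in a val and handed to your own methods. Here is a minimal sketch of that idea; the names FunctionValues, incr and applyTwice are made up for this illustration.

```scala
object FunctionValues {
  // A function literal stored in a val; its type is Int => Int.
  val incr: Int => Int = x => x + 1

  // A higher-order method: it takes a function as an argument and applies it twice.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    println(applyTwice(incr, 5))          // 7
    println(List(87, 45, 56).map(incr))   // List(88, 46, 57)
  }
}
```

The same value incr is passed both to applyTwice and to the library method map, which is what "a function is an object" means in practice.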

Interactive style of coding: Scala has an interpreter console (the REPL) where you can type in Scala statements and execute them immediately. This is one of the features adopted from Python. It is useful for a quick try-out instead of building and running a project in Eclipse or a similar IDE.

   Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101).
   Type in expressions for evaluation. Or try :help.

   scala> val str = "Scala interpreter"
   str: String = Scala interpreter

   scala> val str1 = str + " is working"
   str1: String = Scala interpreter is working

   scala>

Scala objects: These are singleton objects, so you don't instantiate them with the new keyword. Unlike Java, Scala has no class-level static members. Instead, you define a singleton object with the 'object' keyword and call its members directly on the object.

   object Arithematics {
     def sum(a: Int, b: Int): Int = {
       a + b
     }

     def subtract(a: Int, b: Int): Int = {
       a - b
     }

     def main(args: Array[String]): Unit = {
       // members are called on the object itself, no instance is needed
       print("sum : " + Arithematics.sum(10, 45))
     }
   }
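A singleton object can also share its name with a class, in which case it is called the companion object of that class. That is the usual place for what Java would make static. The sketch below is only an illustration; Circle, Pi and unit are hypothetical names.

```scala
// Hypothetical example: Circle and its members are illustrative names.
class Circle(val radius: Double) {
  // Instance method that uses a "static-like" constant from the companion object.
  def area: Double = Circle.Pi * radius * radius
}

// A singleton object with the same name as the class is its companion object.
object Circle {
  val Pi: Double = 3.14159

  // Factory method, called as Circle.unit without creating an instance first.
  def unit: Circle = new Circle(1.0)
}
```

Note that a class and its companion object have to be defined in the same source file.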

Traits: Traits are like interfaces in Java, but they can also contain concrete methods (a common implementation), so a trait can be partially implemented. In Scala 2, traits cannot have constructor parameters.

   trait Equal {
     def isEqual(x: Any): Boolean                    // abstract
     def isNotEqual(x: Any): Boolean = !isEqual(x)   // concrete
   }

   class Point(xc: Int, yc: Int) extends Equal {
     // constructor body
     var x: Int = xc
     var y: Int = yc

     // two points are treated as equal when their x values match
     def isEqual(obj: Any) = obj.isInstanceOf[Point] && obj.asInstanceOf[Point].x == x
   }
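One practical difference from a Java class hierarchy is that a class can mix in several traits at once. The following is a small sketch under that assumption; Named, Aged and Member are invented names, not part of any library.

```scala
trait Named {
  def name: String                       // abstract: implementors must provide it
  def greet: String = "Hello, " + name   // concrete: shared by all implementors
}

trait Aged {
  def age: Int                           // abstract
  def isAdult: Boolean = age >= 18       // concrete
}

// A class can mix in several traits with `extends ... with ...`.
// The two vals implement the abstract members of the traits.
class Member(val name: String, val age: Int) extends Named with Aged
```

With this in place, new Member("Alice", 25).greet returns "Hello, Alice" and isAdult returns true, even though Member itself defines no methods.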

Scala is a concise language: Scala code is short and concise (much like Python), and it provides shortcuts like the ones below. Variable types are also inferred from the assigned value, which avoids redundant type declarations.

   object ShortcutDemo {
     def main(args: Array[String]): Unit = {
       val list1: List[Int] = List(87, 45, 56)
       println("list1: " + list1)
       // val list3 = list1.map(x => x + 1)
       val list3 = list1.map(_ + 1) // shortcut
       // val reduce = list3.reduce((x, y) => x + y)
       val reduce = list3.reduce(_ + _) // shortcut
       println("list3: " + list3)
       println("reduce: " + reduce)
     }
   }

Results:

   list1: List(87, 45, 56)
   list3: List(88, 46, 57)
   reduce: 191

   def incr(x: Int): Int = {
     x + 1
   }
   val x = incr(1)

Here the type of 'x' is automatically inferred as Int from the assignment. Once it is typed as Int, it cannot be given a value of another type, even if it is declared as a var. Unlike in Java, this removes redundant type declarations.
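The rule about a var keeping its inferred type can be tried out with a few lines. This is only a sketch; InferenceDemo and run are illustrative names.

```scala
object InferenceDemo {
  def incr(x: Int): Int = x + 1

  def run(): Int = {
    val x = incr(1)        // inferred as Int
    var count = 10         // inferred as Int; a var can be reassigned...
    count = count + x      // ...but only to another Int
    // count = "twelve"    // does not compile: a String is not an Int
    count
  }

  def main(args: Array[String]): Unit = {
    println(run())   // 12
  }
}
```

Uncommenting the String assignment makes the compiler reject the program, which shows that inference happens once, at the declaration.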

Closure functions: A closure is a function whose return value depends on one or more variables declared outside the function. The outside reference doesn't have to be a variable; it can also be a function or an anonymous function.

   object ClosureEg {
     val factor = 5

     def multiplier(a: Int): Int = {
       a * factor
     }
   }
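A closure refers to the outside variable itself rather than to a copy of its value, so a later change to that variable is visible inside the function. Here is a minimal sketch of that behaviour, with factor declared as a var for the purpose; ClosureCapture is an illustrative name.

```scala
object ClosureCapture {
  var factor = 5   // free variable, declared outside the function

  // The function literal closes over `factor`.
  val multiplier: Int => Int = a => a * factor

  def main(args: Array[String]): Unit = {
    println(multiplier(2))   // 10
    factor = 10              // change the captured variable
    println(multiplier(2))   // 20
  }
}
```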

This helps you write cleaner code by separating the redundant boilerplate code from the trivial code. A closure can also be used as a callback function for an asynchronous process. Here is an example of a closure.

   /* With a closure. makeTag() is a normal method, and the closure is the
    * anonymous function it returns, which captures openTag and closeTag.
    */
   def makeTag(openTag: String, closeTag: String) = {
     (content: String) => openTag + content + closeTag
   }

   /* Cleaner way of coding: it separates the non-trivial boilerplate code
    * (creating the structure of the XML construct) from the trivial code
    * (injecting the content). */
   val table = makeTag("<table>", "</table>")
   val tr = makeTag("<tr>", "</tr>")
   val td = makeTag("<td>", "</td>")
   table(tr(td("Hi")))   // <table><tr><td>Hi</td></tr></table>

   /* Without a closure. makeTag1() is a plain method that accepts all three
    * parameters. The innermost content has to be created first and then
    * injected in order. Not as clean. */
   def makeTag1(openTag: String, closeTag: String, content: String) = {
     openTag + content + closeTag
   }

   val td1 = makeTag1("<td>", "</td>", "Hi")
   val tr1 = makeTag1("<tr>", "</tr>", td1)
   val table1 = makeTag1("<table>", "</table>", tr1)
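The callback use of closures mentioned earlier can be sketched with scala.concurrent.Future from the standard library. This is an illustration only: ClosureCallback and describeLength are hypothetical names, and the two-second timeout is an arbitrary choice.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ClosureCallback {
  // Hypothetical helper: computes the length of `text` asynchronously and
  // formats the result in a callback that closes over `label`.
  def describeLength(text: String, label: String): Future[String] =
    Future(text.length).map(n => label + n)   // the callback captures `label`

  def main(args: Array[String]): Unit = {
    // Blocking here is only for the demo, so that the program waits for the result.
    val result = Await.result(describeLength("Scala", "length = "), 2.seconds)
    println(result)   // length = 5
  }
}
```

The function passed to map runs later, when the asynchronous computation completes, yet it can still use label because the closure carries it along.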

Pattern matching: A pattern match consists of a sequence of alternatives, each starting with the keyword case. Each alternative has a pattern and one or more expressions, which are evaluated if the pattern matches.

   for (value <- 1 to 5) {
     value match {
       case 1 => println("value is " + value)
       case 2 => println("value is " + value)
       case 3 => println("value is " + value)
       case 4 => println("value is " + value)
       // default case: without it, the value 5 would throw a MatchError
       case _ => println("no match for " + value)
     }
   }

Pattern matching can also be done with a 'case class', which is a special kind of class in Scala. For a case class, the constructor parameters become read-only fields with accessor (getter) methods, and the hashCode(), equals() and toString() methods are generated automatically.

   object Demo {   // the enclosing object name is illustrative
     def main(args: Array[String]): Unit = {
       val alice = new Person("Alice", 25)
       val bob = new Person("Bob", 32)
       val charlie = new Person("Charlie", 32)

       for (person <- List(alice, bob, charlie)) {
         person match {
           case Person("Alice", 25) => println("Hi Alice!")
           case Person("Bob", 32) => println("Hi Bob!")
           case Person(name, age) => println(
             "Age: " + age + " year, name: " + name + "?")
         }
       }
     }

     case class Person(name: String, age: Int)
   }
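The generated methods of a case class are easy to check directly. The sketch below redefines Person so that it stands on its own; CaseClassDemo and describe are illustrative names, and the guard (the if after a pattern) is an extra pattern-matching feature shown here as an assumption about what you might want next.

```scala
case class Person(name: String, age: Int)

object CaseClassDemo {
  // Patterns can bind fields, ignore them with _, and carry an `if` guard.
  def describe(p: Person): String = p match {
    case Person("Alice", _)             => "Hi Alice!"
    case Person(name, age) if age >= 30 => name + " is 30 or older"
    case Person(name, _)                => name + " is under 30"
  }

  def main(args: Array[String]): Unit = {
    val bob = Person("Bob", 32)            // `new` is optional for case classes
    println(bob)                           // Person(Bob,32)   (generated toString)
    println(bob == Person("Bob", 32))      // true             (generated equals)
    println(describe(bob))                 // Bob is 30 or older
  }
}
```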

Scala constructors: how are they created and invoked?
In Scala, constructors work differently than in Java. There are two kinds of constructors: primary and auxiliary. The primary constructor is defined by the class parameters together with the class body: everything in the body except method definitions is treated as constructor code.

   class Employee(val id: Int, val firstName: String, val lastName: String) {
     /* constructor begins */
     // id, firstName and lastName are already fields because of 'val' above
     var fullName = lastName + ", " + firstName
     var location = ""
     var age = 0
     /* constructor ends */

     def method1() = {
     }
   }
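To see that the class body really runs as the primary constructor, you can record what happens during construction. This is a small sketch; Greeter and its log field are hypothetical names used only for the demonstration.

```scala
import scala.collection.mutable.ListBuffer

class Greeter(val name: String) {
  // Everything below, except the method definition, runs on `new Greeter(...)`.
  val log = ListBuffer[String]()
  log += "constructing " + name
  val greeting: String = "Hello, " + name
  log += "constructed"

  // A method body is not constructor code; it runs only when called.
  def greet(): String = {
    log += "greet called"
    greeting
  }
}
```

After val g = new Greeter("Bob"), the log holds "constructing Bob" and "constructed". The entry "greet called" appears only once g.greet() is invoked.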

Auxiliary constructors are custom constructors that can take additional (optional) constructor parameters. An auxiliary constructor is defined as a method named this. Its first statement must call this(...), that is, the primary constructor or another auxiliary constructor defined before it.

   class Employee(val id: Int, val firstName: String, val lastName: String) {
     /* constructor begins */
     var fullName = lastName + ", " + firstName
     var location = ""
     var age = 0

     // auxiliary constructor
     def this(id: Int, firstName: String, lastName: String,
         age: Int, location: String) = {
       this(id, firstName, lastName) // must be the first statement
       this.age = age
       this.location = location
     }
     /* constructor ends */

     def method1() = {
     }
   }
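Calling the two kinds of constructor looks the same from the outside; the argument list decides which one runs. Here is a compact sketch with a made-up Account class (the class and its fields are assumptions for illustration).

```scala
class Account(val id: Int, val owner: String) {
  var balance: Double = 0.0

  // Auxiliary constructor: its first statement calls the primary constructor.
  def this(id: Int, owner: String, openingBalance: Double) = {
    this(id, owner)
    balance = openingBalance
  }
}

object AccountDemo {
  def main(args: Array[String]): Unit = {
    val a = new Account(1, "Alice")          // primary constructor
    val b = new Account(2, "Bob", 50.0)      // auxiliary constructor
    println(a.balance)   // 0.0
    println(b.balance)   // 50.0
  }
}
```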

For many years, relational databases have played a pivotal role in the software industry, and an RDBMS was probably one of the most popular solutions for both large-scale and small-scale data storage.

Things have now changed dramatically, and we have a newer set of databases called NoSQL.

What is NoSQL? What are its key features? What are its advantages over relational databases?

To understand this, we need to revisit the CAP theorem.

The CAP (Consistency, Availability, Partition Tolerance) theorem states that a distributed system can guarantee only two of these three characteristics at the same time.

C - Consistency means that the data is the same across the cluster, so you can read from or write to any node and get the same data.

A - Availability means that the cluster can still be accessed even if a node in the cluster goes down.

P - Partition Tolerance means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).

Relational databases tend to exhibit ACID properties and offer two major characteristics: strong consistency and moderate availability. Enforcing ACID makes the database do a lot of work, which does not suit data that is growing at a large scale. What if we relax the ACID rules and focus only on the characteristics that our current business requirements need, as in e-commerce? Then the database becomes more manageable and scales better. NoSQL databases are BASE-compliant databases, which relax ACID compliance.

BASE stands for: BA - Basically Available, S - Soft state, E - Eventually consistent.

Eventually consistent means that when you update a record in a database cluster, not all replicas of the record are updated immediately. The update might be applied to a replica on one node, and the replicas on the other nodes are updated asynchronously. This approach is therefore not suitable for sensitive operations such as bank transactions. But there are use cases that it suits. Suppose you have uploaded a picture to Facebook and you would like to know how many likes it is getting. It doesn't matter to you whether the number of likes is 99, 100 or 101. You just need an approximate count.
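The behaviour described above can be imitated with a toy model in plain Scala. This is not how any real database is implemented; the three in-memory replicas, the pending queue and the names write, read and replicate are all assumptions made for the illustration.

```scala
import scala.collection.mutable

object EventualConsistencyDemo {
  // Three replicas of a "likes" counter, all starting at 99 (made-up numbers).
  val replicas: Array[Int] = Array(99, 99, 99)

  // Updates that were accepted but not yet applied everywhere: (replica index, new value).
  private val pending = mutable.Queue[(Int, Int)]()

  // A write is applied to one replica immediately; the others are queued.
  def write(value: Int): Unit = {
    replicas(0) = value
    for (i <- 1 until replicas.length) pending.enqueue((i, value))
  }

  def read(replica: Int): Int = replicas(replica)

  // Stands in for the asynchronous replication that runs in the background.
  def replicate(): Unit =
    while (pending.nonEmpty) {
      val (i, v) = pending.dequeue()
      replicas(i) = v
    }
}
```

After write(100), read(0) returns 100 while read(1) still returns 99. Once replicate() has run, every replica returns 100, which is the "eventually" part.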

NoSQL databases provide eventual consistency by default. But a NoSQL database like Cassandra provides tunable consistency through its consistency settings. Similarly, these databases provide basic availability by default, which you can strengthen with replication and other configurable attributes.
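A common rule of thumb for tunable consistency in replicated stores is that reads see the latest write when the number of replicas read (R) plus the number of replicas written (W) is greater than the replication factor (N). The helper below is only a sketch of that arithmetic; QuorumCheck and its method names are hypothetical and not part of any database API.

```scala
object QuorumCheck {
  // n = replication factor, w = replicas that must acknowledge a write,
  // r = replicas consulted on a read.
  def isStronglyConsistent(n: Int, w: Int, r: Int): Boolean = r + w > n

  // A quorum is a strict majority of the replicas.
  def quorum(n: Int): Int = n / 2 + 1

  def main(args: Array[String]): Unit = {
    println(quorum(3))                        // 2
    println(isStronglyConsistent(3, 2, 2))    // true:  quorum writes and quorum reads overlap
    println(isStronglyConsistent(3, 1, 1))    // false: a read may miss the latest write
  }
}
```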

Let's see some differences between RDBMS and NoSQL.

RDBMS

Not easily scalable.
Highly structured and schema bound.
Data is highly normalized.
High latency when the volume of data is high.
ACID compliant.

NoSQL

Scalable.
Not schema bound.
Data is de-normalized.
Compression and fast random access.
BASE compliant.

Basically, an RDBMS is not flexible and underperforms when the data is at a large scale.

Here are the four categories of NoSQL databases.

Key Value

Data is maintained as <Key, Value> pairs. The hash of the key is used for storing and retrieving the data.
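The idea of locating data by the hash of its key can be sketched in a few lines. This toy version uses three in-memory maps as stand-ins for storage nodes; HashPartitionDemo, nodeFor, put and get are illustrative names, and real stores typically use more elaborate schemes such as consistent hashing.

```scala
import scala.collection.mutable

object HashPartitionDemo {
  val nodeCount = 3

  // One in-memory map per node (a stand-in for real storage nodes).
  val nodes: Vector[mutable.Map[String, String]] =
    Vector.fill(nodeCount)(mutable.Map.empty[String, String])

  // The hash of the key picks the node; the double modulo keeps the index non-negative.
  def nodeFor(key: String): Int =
    ((key.hashCode % nodeCount) + nodeCount) % nodeCount

  def put(key: String, value: String): Unit =
    nodes(nodeFor(key))(key) = value

  def get(key: String): Option[String] =
    nodes(nodeFor(key)).get(key)
}
```

Both put and get compute the same node from the key, so a lookup goes straight to the node that holds the value without searching the others.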

e.g.: Accumulo, Dynamo, Riak

Column Oriented

In an RDBMS, all the values of a row sit together. Here, all the values of a column sit together. If you need an aggregate function over the values of some column, an RDBMS needs a lot of disk seeks, whereas in this type of database the aggregate can be computed over values that are stored together. This kind of database is also suitable for fast random access to a particular cell (a column value).

e.g.: HBase => built on HDFS (Hadoop Distributed File System); provides consistency and scalability.
Cassandra => uses its own storage engine on local disks; provides high availability and scalability.

Here is an example of how the data is stored in an RDBMS and in a column-oriented database.

+----+--------------+----------------------+----------+-------------+------------+-----------+
| 1  | Benny Smith  | 23 Workhaven Lane    | 52683    | 14033335568 | Lethbridge | Canada    |
| 2  | Keith Page   | 1411 Lillydale Drive | 18529    | 16172235589 | Woodridge  | Australia |
| 3  | John Doe     | 1936 Paper Blvd.     | 92512    | 14082384788 | Santa Clara| USA       |
+----+--------------+----------------------+----------+-------------+------------+-----------+

In RDBMS, the row data is stored together: 1, Benny Smith, 23 Workhaven Lane, 52683, 14033335568, Lethbridge, Canada

In Column Oriented db, the column data is stored together: Benny Smith, Keith Page , John Doe

Each column value is associated with a row key (like a primary key), which uniquely identifies the column value in the table. The row key is also indexed, so random access is faster.
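The two layouts can be compared with ordinary Scala collections. This is a sketch of the idea, not of any storage engine; the column names "name" and "country" are assumptions, and only three of the fields from the sample rows are used.

```scala
object LayoutDemo {
  // Row-oriented: the fields of each record sit together.
  val rows: List[(Int, String, String)] = List(
    (1, "Benny Smith", "Canada"),
    (2, "Keith Page", "Australia"),
    (3, "John Doe", "USA"))

  // Column-oriented: all values of one column sit together, keyed by row key.
  val columns: Map[String, Map[Int, String]] = Map(
    "name"    -> rows.map(r => r._1 -> r._2).toMap,
    "country" -> rows.map(r => r._1 -> r._3).toMap)

  // Reading one column touches only that column's values.
  def allNames: List[String] =
    columns("name").toList.sortBy(_._1).map(_._2)

  // Random access to a single cell by (column, row key).
  def cell(column: String, rowKey: Int): Option[String] =
    columns.get(column).flatMap(_.get(rowKey))
}
```

Here allNames yields List(Benny Smith, Keith Page, John Doe) without reading any country value, and cell("country", 2) returns Some(Australia).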

Let's take an HBase table as an example to demonstrate some data model design concepts.

HBase table: The table name has to be defined up front. A table contains the list of all its column families.

Column family: A logical collection of similar column data. This also has to be created up front. A column family contains the list of all its columns.

Column name: Columns can be added dynamically whenever needed, which is why this is not a schema-bound database.

Cell: HBase supports multiple versions of data, and the number of versions is configurable. A cell contains the list of versioned data.

It can be visualized as a multidimensional sorted map where a specific value is mapped to a key. It is just another flavor of key-value database, where the key can be a set of one or more attributes (row key, column family, column name, timestamp version) and the value can be a single piece of data or a list of data.

(Table, RowKey, Family:Column, TimeStamp) => Value

(Table, RowKey, Family:Column) => List of all versioned value

(Table, RowKey, Family) => List of all Columns data

(Table, RowKey) => List of all Column family data
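These mappings can be modelled with nested sorted maps. The sketch below is a simplification for illustration only: it does not use the HBase client API, the sample rows and the names SortedMapModel, versions, latest and columnsIn are made up, and versions are kept in ascending timestamp order for simplicity.

```scala
import scala.collection.immutable.TreeMap

object SortedMapModel {
  type Versions = TreeMap[Long, String]       // timestamp -> value
  type Columns  = TreeMap[String, Versions]   // "family:column" -> versions
  type Table    = TreeMap[String, Columns]    // row key -> columns

  val table: Table = TreeMap(
    "row1" -> TreeMap(
      "info:name" -> TreeMap(1L -> "Benny", 2L -> "Benny Smith"),
      "info:city" -> TreeMap(1L -> "Lethbridge")),
    "row2" -> TreeMap(
      "info:name" -> TreeMap(1L -> "Keith Page")))

  // (RowKey, Family:Column) => list of all versioned values
  def versions(rowKey: String, column: String): List[String] =
    table.get(rowKey).flatMap(_.get(column)).map(_.values.toList).getOrElse(Nil)

  // (RowKey, Family:Column) => value with the highest timestamp
  def latest(rowKey: String, column: String): Option[String] =
    table.get(rowKey).flatMap(_.get(column)).flatMap(_.lastOption.map(_._2))

  // (RowKey, Family) => names of all columns in that family
  def columnsIn(rowKey: String, family: String): List[String] =
    table.get(rowKey).map(_.keys.filter(_.startsWith(family + ":")).toList).getOrElse(Nil)
}
```

Each additional part of the key narrows the result, from a whole row down to one versioned value, which mirrors the four mappings listed above.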

Unlike in an RDBMS, the data model in a column-oriented database has to be designed differently.

Please note that we can't efficiently query the data by just filtering on a column value as in an RDBMS. The key to accessing the data is the row key. The row key and the table structure therefore have to be chosen to suit your read/write access pattern.
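As an example of designing the row key around an access pattern, suppose the main query is "all orders of one customer, in date order". A composite row key of customer id plus date turns that query into a range scan over sorted keys. The sketch uses a plain TreeMap as a stand-in for the table; the key format, the sample data and the names RowKeyDesign, rowKey and ordersOf are assumptions for illustration.

```scala
import scala.collection.immutable.TreeMap

object RowKeyDesign {
  // Composite row key: customer id first, then the date, separated by '#'.
  def rowKey(customerId: String, date: String): String = customerId + "#" + date

  // Rows are kept sorted by row key, as in a column-oriented store.
  val orders: TreeMap[String, String] = TreeMap(
    rowKey("cust1", "2016-10-01") -> "book",
    rowKey("cust1", "2016-10-19") -> "phone",
    rowKey("cust2", "2016-10-08") -> "laptop")

  // Prefix scan: every key from "custN#" (inclusive) up to "custN$" (exclusive),
  // where '$' is the character that follows '#' in sort order.
  def ordersOf(customerId: String): List[String] =
    orders.range(customerId + "#", customerId + "$").values.toList
}
```

With this key, ordersOf("cust1") returns List(book, phone) by reading one contiguous slice of the sorted keys. A query by product name, in contrast, would have to scan every row, which is why the key is chosen from the access pattern.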

Document Oriented

Here data is maintained in documents such as XML or JSON.

e.g.: MongoDB, CouchDB

Graph

This is a database that uses graph structures with nodes, edges and properties to represent and store data, and supports semantic queries over them.

e.g.: Neo4j, FlockDB


Wednesday, 19 October 2016

Scala: An Overview and some key features

Saturday, 8 October 2016

RDMS and noSql: Some key aspects
