Tag Archives: Big-Data

Data Science for Business – by Foster Provost and Tom Fawcett

In a nutshell: If you are looking for a simple (but not simplistic) introduction to nearly all of the underlying data science fundamentals then look no further, because this is the book for you! You’ll learn about data-analytic thinking, correlation, segmentation, model fitting/overfitting, similarity (k-NN etc.), clustering (k-means etc.), probability (Bayes et al), mining text, result evaluation etc., and more!

I often work as a software developer, and so like to think that I have a good working knowledge of data science (‘data analytics’, ‘big data’ etc.), which has been achieved through College/Uni education and also through the modern media and blog posts etc. However, I often struggled at times to fully understand, and perhaps more importantly knit together and apply, the core fundamentals of the topic. This book has provided exactly the explanations and ‘glue’ that I required, in that it delivers a very well structured (and paced) introduction and overview of data science, and also how to think in a ‘data-analytics’ manner.

If you preview the book on Amazon with the ‘look inside’ feature then what you see in the table of contents is exactly what you get. Every chapter delivers upon its title (and promised ‘fundamental concepts’), and frequently builds superbly upon topics introduced in early chapters. You’ll move seamlessly from understanding how to frame data science questions, to learning about correlation and segmentation, to model fitting and overfitting, and on to similarity and clustering. With a brief pause to discuss exactly ‘what is a good model’ you’ll then be thrust back into learning about visualising model performance, evidence and probabilities and then how to explore mining text.

The concluding chapters draw upon and summarise how to practically choose and apply the techniques you’ve learnt, and provide great discussion on how to solve business problems through ‘analytical engineering’. There is also some bonus discussion on other tools and techniques that build upon earlier concepts which you might find useful, data science and business strategy, and some general thinking points around topics such as the need to human intervention in data analysis and privacy and ethics.

The book is superbly written and reads very easily, which for the potentially dry topic of data science is worthy of praise alone. The majority of chapters took me each approximately an hour to read, and then another couple of hours to re-read and ponder upon (and sometimes looking at other provided references) to fully understand some of the more complex topics and how everything related together. Each chapter also provided plenty of pointers and experimentation ideas if I wanted to go away and practically explore the topic further (say, with the Mahout framework, or R, or scikit-learn/Pandas etc.). The book could probably be read by dipping in and out of chapters, but I think you’ll get a whole lot more from a cover-to-cover reading.

In summary, this is a superb book for those looking for a solid and comprehensive introduction to data science and data analytics for business, and I’m sure will that even the more experienced practitioners of the art will find something useful here. The book introduces topics in a perfect order, superbly builds your knowledge chapter after chapter, and constantly relates and reinforces the various techniques and tools your learning as it progresses. I wish more text/learning books were written this well!

Click here to buy Data Science for Business: What you need to know about data mining and data-analytic thinking on Amazon UK (This is a sponsored link. Please click through and help a fellow developer buy some more books!)

JAX London Review – Part One

November 4, 2013

Conferences

Leave a comment

This was the first year I’ve attended the JAX London conference, and I must say I was impressed. The conference struck a nice balance between discussing the latest technologies in the Java space and also the ‘soft skills’ side of what we do as developers, which includes use cases and lessons learned from everyday life as a developer

Day One
Unfortunately the combination of the UK ‘Super’ Storm and South West Trains approach to business meant that I couldn’t make the first day of the conference, which was included a series of tutorials. This was a real shame, as I had been eagerly anticipating listening to Thoughtwork’s Neal Ford talking about Continuous Delivery. Thanks to Neal’s website I managed to get hold of the slides, but I’m still gutted that I couldn’t attend.

Day Two
With the Storm well and truly behind us the conference got into full swing. @monkchips provided the opening keynote “How Java got its mojo back”, which I thought was a great start to the conference. Key takeaways:

Java is still very much alive a kicking, even with all of the competition of other languages
Java may not be ‘cool’, but thanks to the JVM, the execution scales very well
Twitter et al have moved from cool technologies (Ruby on Rails) to not-so-cool JVM-based technologies (Scala + Java) in order to meet the massive scalability demands
Java may be missing a killer web-app framework (and who can deny the JSF approach is a little clumsy), and this may hinder the hipsters from adopting the Java ecosystem
The Java development ecosystem shows the full range of governance available within our industry, from dictatorship (OSGI) to direct democracy (Apache). I thought this was a fun fact!
NoSQL is rapidly becoming the de facto datastore of choice – is HDFS the new relational database?
It’s now more difficult to sell software, and instead the focus is moving towards selling services and data
Java is the architects choice, the engineers choice, but is it the developers choice?

As @monkchips himself noted, it was a slightly tough audience, and I’m not sure why. Maybe the combination of the previous day’s storm and an early start had set everyone up badly?

2013-10-29 09.50.11

Next up with @mashooq talking about “TDD at Scale”. This was an excellent talk, and there were plenty of great points made:

Doing TDD does not guarantee that good design will emerge from the codebase
…this is further exaggerated by the fact that good a TDD lifecycle should include “Red, Green, Refactor”, but it’s all too tempting to just go “Red, Green, Red, Green…” Unfortunately I’ve been there and done that…
It may be easy to split an application into components, but unless a component maps to a real business process, then it will be difficult to create real meaningful acceptance tests around this. This can lead to overly fragile tests, which simply test technical functionality.
Avoid the test fixture ‘God’ Class anti-patten at all costs. This is created with good intentions to manage test fixtures, but soon grows out of control…
…instead you could/should invest in creating a DSL to create test fixtures
Classify tests into Unit, Integration and Component/Acceptance, as this allow some flexibility to choose when expensive/slowest tests are run
Regularly email the team with the 10 slowest tests. Often the slowness is due to a simple design flaw in the test. Name and shame the perpetrators until they fix the tests 🙂
Parallelise test execution (where possible)
System tests are very valuable, and so don’t forget them. If implemented correctly then this truly can become ‘living documentation’
Lisa Crispin’s ‘Agile Testing Quadrants’ provide an invaluable insight to the range and value of testing that is available in any software development project. This was a great nugget of wisdom!
The foundations to successful software development are design, code quality, TDD and constant critique

For the third session I attended one of the Big Data Con talks about the ‘Lambda Architecture’. If you haven’t read about this architectural pattern then it’s worth doing some research, as I’m increasingly seeing this adopted by many of the Big Data shops. Although interesting, I didn’t particularly take away lots to share from this, and so I’ll skip notes here.

The fourth session I chose was an explanation (and demo) of the Apache Drill project, which is an open source clone of the much discussed Google Dremel project. The talk was more of a introduction to the project, and so I don’t have lots of notes to share here either. I would recommend having a quick read about Dremel, and also watching a few of the related videos floating around the interweb, as when completed I expect the Apache Drill project to create quite a stir in how we query Big Data sources.

For the fifth and final session of day one I attended @sandromacusso‘s talk on ‘Why other people don’t get it’, focussing on how to create a great development team, how to handle situations where ‘they’ (management) just don’t get it, and also how we as developers should take responsibility for ourselves and our careers.

I always enjoy Sandro’s talks, and this one was a personal highlight of the conference for me – the scene was perfectly set when Sandro opened the session with a question “Does anyone have complete freedom to do what they want in their day job?”. Three of us put up our hands, to which Sandro replied that we “should get the f*ck out, as we shouldn’t be in this session”. Fair comment! There was a lot to take on board here, and I was was quite engrossed to the actual talk, and so my notes are a little sketchy. I’ll try my best to summaries the key takeaways for me:

In order to create a great software development team and company you have to:
- Define the culture you want to have in your company
- Don’t make your problem bigger. Hire allies
- Help people to help you
Hiring great developers is not an easy process, but this is critical for creating a great team (and company)
Good Developers want a good company. If you’re hiring developers then please remember this obvious but often forgotten fact!
Don’t just rely on boilerplate job advertisements for recruitment (Sandro’s slides contained some great examples)
If you wouldn’t promote someone on a criteria, then don’t use it for a job advertisement i.e. please don’t use the much maligned ‘X years experience of Y’
Hiring developers should not be entrusted to the HR department. There doesn’t exist a simple process or test to find the best developers
Look for passionate developers. Passion could/should be used as a CV filter
If you have a problem developer at your company, ask yourself
- How was he hired?
- How was he nurtured when he got here?
If you are looking for a job then publicly display examples of code you have created (OSS etc), personal projects and other involvement with the larger community. Show that you’re more than a 9-to-5er – show your passion for software development! Amen to that…
Pairing is a vital skill to learn – this allows prospective colleagues to see how you think, how you code, and also how you will be able to transfer knowledge
When looking to do something beneficial within your company, such as creating ‘brown bag’ technical sessions, or running software craftsman discussion groups, it is best to “ask for forgiveness, rather than beg for permission”
If you are in a position of authority then you should lead by example
If you are responsible, then you must also be accountable
Managers of great teams must provide autonomy, mastery and purpose
You are ultimately responsible for your career, therefore you must take responsibility. If you don’t like something, then take action to change it

That’s all for this first part of the JAX London review. Stay tuned for the second part later this week (hopefully… 🙂 )

[Book Review] Spring Data

June 30, 2013

Book Review

Spring Data – by Mark Pollack et al

TLDR: If you are working with Spring Data on a daily basis and want a complete and thorough overview of the framework then this book is all you will need. It covers all aspects of Spring Data without being overly verbose, and even if you have used Spring Data quite a lot already (like me), then I still believe you’ll discover something useful from this book. You will also find bonus chapters in context with Spring Data on Spring Roo, the REST repo exporter (very cool!), ‘Big Data’ via Hadoop, Pig, Hive and Spring Batch/Integration, and also coverage of GemFire.

I’ve been working professionally with Spring Data for quite some time now, both for ‘old skool’ RDBMS and also a lot of NoSQL (primarily MongoDB and Redis). The company I was working for at the time the Spring Data projects were approaching release were somewhat early-adopters, and in combination with the fact that their applications were firmly rooted in Spring made the decision to use this framework an easy choice. After some initial problems, which should be expected with a new technology (such as config issues and incompatibly between libraries bundled in JARs etc), Spring Data has provided a massive boost to productivity, and it is now my de facto choice when implementing persistence within Spring.

About the book itself: The first few chapters provide a great introduction to Spring Data, and describe the key motivations and techniques behind the framework. If you are simply modifying an already configured Spring Data app then this is all you need (but please do keep reading to learn more!). The next few chapters cover integration with an RDBMS, and also the popular NoSQL implementations – MongoDB, Neo4j and Redis. If you are working in one specific technology then reading the corresponding chapter will get you up and running quickly. Although Spring Data provides a common abstraction layer, it allows datastore-specific functionality to bleed through the interfaces (which is a good thing in my opinion, as it allows you to leverage specific features and strengths of the underlying technology), and this book will provide an excellent grounding and explanation of key concepts within each underlying datastore technology so that you can become productive quickly. Of course, you can also head over to the Spring Source website to learn the really advanced stuff (if you want to).

Part 4 of the book covers several interesting advanced features of the framework, such as using Spring Roo to auto-generate repository code, and also a brief guide on how to use the REST Repository Exporter. Metaprogamming and RAD tools like Spring Roo (and web-frameworks such as Grails and Play) are becoming increasingly popular in the industry, and so this chapter is a nice addition to the book. The REST exporter is also a very cool feature, and essentially allows you to expose CRUD functionality on your repositories via a REST interface. For anyone building a SOA-based app (or using micro-services etc) then encapsulating datastores and exposing simply functionality via a well defined HTTP-based API is very cool.

The final two parts of the book provide detailed coverage of using Spring Data to work with ‘Big Data’ through the use of Apache Hadoop, Spring Batch, Spring Integration and GemFire. Although this content wasn’t relevant to my initial decision to buy the book the chapters are a complete bonus in my opinion, and upon reading them I was even more happy with my purchase. The content provided is obviously quite high-level (as Big Data is a huge topic, no pun intended :)), but has enough detail to get you up and running with some Hadoop Jobs and Hive and Pig etc, which is a great skill to add to your CV.

I chose this book over the only other real competition for Spring Data coverage, Petri Kainulainen’s Spring Data, purely because this book offered more content. Obviously the book under review has more pages, ~280 vs ~160, but more importantly it covers a greater amount of topics, and Petri’s book focuses primarily on Redis (for which I was already familiar with). My main motivation for buying a Spring Data book was to learn about the ‘tips and tricks’, and I think either book would have met this need, but the coverage of other NoSQL technologies in the book under review, and the bonus chapters on Big Data technologies swayed my final decision. Now that I’ve read the book I am very happy with the decision.

In summary: This book will be all you need to master Spring Data, from key concepts to advanced usage. You’ll learn all of the ‘tips and tricks’ along the way, and also become familiar with Spring Roo, the REST repo exporter and fundamental techniques within Spring Data’s ‘Big Data’ processing (Hadoop, Spring Batch/Integration etc). I would recommend the book to any Spring developer, even one like myself who is happy learning about Spring from the excellent Spring Source website This book is a little more ‘polished’ than the Spring Source docs, and also provides concepts in well-structured and bit-sized chunks of information.

Click Here to buy ‘Spring Data‘ on Amazon (This is a sponsored link. Please click through and help a fellow developer to buy some more books! )

—The Tai-Dev Blog

One software development consultant, a thousand opinions…

Archive