Monday, August 19, 2013

Introduction to Big Data

What is Big Data?

According to Wikipedia:

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

According to Paul Dix via Udemy:

Big Data is the combination of infrastructure, algorithms and visualizations used to make sense of user and machine generated data.
Big Data is not limited to more data than you can effectively work with on a single computer.
Big Data is about gaining insight from data regardless of the size of the data set. 

What are the possible data sources for Big Data?
  1. Users
  2. Applications
  3. Systems
  4. Sensors
What kinds of questions can we use Big Data to answer?
  1. What are my users doing in my application?
  2. Is something spam?
  3. What items or users are like each other?
  4. What items might a user like?
  5. What went wrong in my application or on my servers?
  6. What effect did a change in my application have on user behavior? 
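As a rough illustration of question 3 ("What items or users are like each other?"), here is a minimal sketch that compares users by the overlap of the items they interacted with, using Jaccard similarity. This is only one of many possible similarity measures, and the user names, items, and the helper names `jaccard` and `most_similar` are made up for this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |a & b| / |a | b|.

    Returns 0.0 when both collections are empty.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical data: the items each user has interacted with.
likes = {
    "alice": {"book", "lamp", "mug"},
    "bob": {"book", "mug", "pen"},
    "carol": {"tent", "stove"},
}


def most_similar(user, likes):
    """Return (other_user, score) for the user whose items overlap most."""
    others = [
        (other, jaccard(likes[user], items))
        for other, items in likes.items()
        if other != user
    ]
    return max(others, key=lambda pair: pair[1])


print(most_similar("alice", likes))
```

With real data the pairwise comparison would not fit in a simple loop on one machine, which is where the infrastructure side of Big Data comes in.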
What are the types of data in Big Data?
  1. User generated
  2. Machine generated
  3. Structured (like the type of data you would find in a database...)
  4. Unstructured (which is usually free form text...)
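To make the structured vs. unstructured distinction concrete, the sketch below turns a piece of free-form text into word counts, a structured form that could be stored in a table. The sample sentence and the helper name `word_counts` are invented for illustration, and real text processing usually needs more careful tokenization than this.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, extract runs of letters/apostrophes, count them."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical unstructured input: a free-form product review.
review = "Great phone. Great battery, great price!"
counts = word_counts(review)

# Each (word, count) pair is now a row that would fit in a database table.
for word, count in counts.most_common():
    print(word, count)
```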
What are the topics within Big Data?
  1. Infrastructure
  2. Algorithms
  3. Visualizations
What are the goals of a Big Data infrastructure?
  1. Scalability (the infrastructure should be able to scale with the amount of data that you have & the number of things that you need to do with it...)
  2. Experimentation (the infrastructure should facilitate experimentation so that you can try many things and do many experiments with your data sets and your users...)
  3. Accessible across the organization (the infrastructure should be open enough so that everyone can mine insight from the data...)
  4. Search and discovery (the infrastructure should aid this as when there is a problem, it should help you determine what the cause of the problem was...)
What are the tools for Big Data infrastructure?
  1. Batch Processing and Storage (e.g. Hadoop)
  2. Structured Storage
    1. e.g. Cassandra
    2. e.g. HBase
    3. e.g. Riak
    4. e.g. SQL?
  3. Messaging
    1. Kafka
    2. RabbitMQ
    3. ZeroMQ
    4. NSQ
What are the goals for Big Data Algorithms?
  1. Mine business intelligence (actionable insight that will help you improve business...)
  2. Detect spam or aberrant user activity
  3. Make recommendations to users
  4. Automatically categorize things
What are the tools for Big Data Algorithms?
  1. Math
    1. Probability
    2. Statistics
    3. Linear algebra
  2. A/B testing or split testing
  3. Regression for predicting variables
  4. Supervised learning or classification (e.g. spam detection or categorizing things...)
  5. Unsupervised learning or clustering (e.g. automatically group things based on their properties...)
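As one example of how the math and A/B testing items above fit together, here is a minimal sketch of a two-proportion z-test, a common (but not the only) way to judge whether two variants convert at different rates. The function name `two_proportion_z_test` and the visitor and conversion numbers are assumptions made for this example, and the normal approximation it relies on is only reasonable for fairly large samples.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    conv_a, conv_b: number of conversions in each variant
    n_a, n_b: number of visitors in each variant
    Returns (z, p_value) for a two-sided test using the pooled rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("test is undefined when nobody or everybody converts")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 1000 visitors per variant.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in conversion rate is unlikely to be due to chance alone, but it says nothing about whether the difference matters for the business.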
What are the goals for Big Data Visualizations?
  1. Convey meaning (gain insight from your data...)
  2. Discovery (find abnormal or interesting things that you can act on...)
What are the tools for Big Data Visualizations?
  1. Charts and graphs
  2. Geographical
  3. Interaction
  4. Animation
  5. Libraries
    1. D3 (a JavaScript library)
    2. Gnuplot
    3. Matplotlib (a Python-based library)
In short, Big Data is... 
  • Infrastructure for storing and working with data
  • Algorithms for discovery and making predictions
  • Visualizations for conveying meaning and aiding in discovery
