IPN CIC

    Welcome
    IPN-Dharma AI Lab

    This is an IPN CIC - DHARMA initiative to provide an Artificial Intelligence Laboratory that motivates researchers, professors, and students to take advantage of the courses, resources, and tools of the industry's main technology platforms in the areas of Machine Learning, Data Science, Cloud Computing, Artificial Intelligence, and the Internet of Things, with the purpose of generating practical experience through a peer-to-peer, objective-driven learning model.

    Level 3: Building Solutions

    Spark Fundamentals

    Have you ever waited overnight for a report to run, only to come back to your computer in the morning and find it still running? When the heat is on and you have a deadline, that simply does not work. With larger and larger data sets, you need to be fluent in the right tools to meet your commitments. This learning path is your opportunity to learn about Spark from industry leaders, with hands-on opportunities and projects to build your confidence with this tool set.

    Solid understanding of, and experience with, core tools in any field promotes excellence and innovation. Apache Spark, as a general engine for large-scale data processing, is such a tool within the big data realm. This learning path addresses the fundamentals of Spark's design and its everyday application.

    Courses in this program

    1) Spark Fundamentals I

    Ignite your interest in Spark with an introduction to the core concepts that make this general processor an essential tool set for working with Big Data.

    Learn the fundamentals of Spark, the technology that is revolutionizing the analytics and big data world! Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require the low-latency processing a typical MapReduce program cannot provide, Spark is the way to go.

    • Learn how it performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining.
    • Learn how it provides in-memory cluster computing for lightning fast speed and supports Java, Python, R, and Scala APIs for ease of development.
    • Learn how it can handle a wide range of data processing scenarios by combining SQL, streaming and complex analytics together seamlessly in the same application.
    • Learn how it runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.
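
    The points above boil down to a small programming model: build a distributed dataset, chain transformations over it, and optionally cache results in memory. Here is a minimal sketch of that model, assuming a local Spark installation; the local[*] master, the object name, and the sample lines are illustrative assumptions, not course material.

```scala
import org.apache.spark.sql.SparkSession

// Minimal word-count sketch illustrating Spark's core RDD model (assumed local setup).
object SparkIntroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("spark-fundamentals-sketch")
      .master("local[*]")              // run locally on all cores; on a cluster this comes from spark-submit
      .getOrCreate()
    val sc = spark.sparkContext

    // Build an RDD from an in-memory collection, then transform and aggregate it.
    val lines = sc.parallelize(Seq("spark is fast", "spark is easy", "big data with spark"))
    val counts = lines
      .flatMap(_.split("\\s+"))        // split each line into words
      .map(word => (word, 1))          // pair each word with a count of 1
      .reduceByKey(_ + _)              // sum the counts per word across partitions
      .cache()                         // keep the result in memory for reuse

    counts.collect().foreach(println)  // e.g. (spark,3), (is,2), ...
    spark.stop()
  }
}
```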

    Estimated effort: 6 hours

    Language: English

    Link: Cognitive Class

    2) Spark MLlib

    Spark provides a machine learning library known as MLlib. Spark MLlib provides various machine learning algorithms such as classification, regression, clustering, and collaborative filtering. It also provides tools for featurization, pipelines, and persistence, as well as utilities for linear algebra, statistics, and data handling.

    This course will start you off on your journey and walk you through some of the machine learning libraries and how to use them.
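
    As a hedged sketch of how those pieces fit together in Spark's ML API, the example below assembles raw numeric columns into a feature vector and feeds them to a logistic regression classifier through a single Pipeline. The tiny two-feature dataset, column names, and object name are invented for illustration; the course's own labs may use different data and algorithms.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Sketch of a featurization + classification pipeline (illustrative data only).
object MLlibPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("mllib-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical training data: a binary label and two numeric features.
    val training = Seq(
      (0.0, 1.1, 0.2),
      (1.0, 8.4, 6.7),
      (0.0, 0.9, 0.4),
      (1.0, 7.9, 7.2)
    ).toDF("label", "f1", "f2")

    // Featurization: assemble the raw columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")

    // Classification: logistic regression on the assembled features.
    val lr = new LogisticRegression().setMaxIter(10)

    // Chain featurization and the estimator into one reusable pipeline.
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(training)

    model.transform(training).select("label", "prediction").show()
    spark.stop()
  }
}
```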

    Estimated effort: 3 hours

    Language: English

    Link: Cognitive Class

    3) Exploring Spark's GraphX

    Spark provides a graph-parallel computation library in GraphX. Graph-parallel is a paradigm that allows representation of your data as vertices and edges. Spark's GraphX provides a set of fundamental operators in addition to a growing collection of algorithms and builders to simplify graph analytics tasks.

    In this course, you will learn about Spark GraphX components and the background of graph-parallel operations. You will see how Spark implements this with RDDs and how it compares to data-parallel operations. You will also explore how to visualize your data using various graph operators.
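
    To give a feel for the vertex-and-edge representation and the fundamental operators mentioned above, here is a minimal sketch assuming a local Spark installation with the GraphX module available; the toy follower graph, vertex names, and tolerance value are illustrative assumptions.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Sketch of a tiny property graph and two GraphX operations (illustrative data only).
object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("graphx-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Vertices are (id, property) pairs; edges carry a relationship label.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

    val graph = Graph(users, follows)

    // A fundamental operator: the number of incoming edges per vertex.
    graph.inDegrees.collect().foreach { case (id, deg) =>
      println(s"vertex $id has in-degree $deg")
    }

    // One of the built-in algorithms: PageRank over the follower graph.
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s rank = $rank%.3f")
    }

    spark.stop()
  }
}
```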

    Estimated effort: 3 hours

    Language: English

    Link: Cognitive Class

    4) Analyzing Big Data in R using Apache Spark

    Apache Spark is a popular cluster computing framework used for performing large-scale data analysis. SparkR provides a distributed data frame API that enables structured data processing with a syntax familiar to R users.

    • Learn why R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks.
    • Learn how SparkR, an R package that provides a lightweight frontend, lets you use Apache Spark from R.
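
    SparkR's data frame functions mirror Spark's own DataFrame API. To stay consistent with the other sketches in this path, the example below shows the analogous structured operations (filter, group, aggregate) in Spark's Scala API; in SparkR the same steps are written with R syntax over the same distributed engine. The sensor data and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Sketch of the distributed data frame operations that SparkR exposes to R users,
// shown here in Spark's Scala DataFrame API (illustrative data only).
object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("sparkr-analogue-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical structured data: one row per temperature measurement.
    val df = Seq(
      ("sensor-a", 21.5), ("sensor-a", 22.1),
      ("sensor-b", 19.8), ("sensor-b", 20.3)
    ).toDF("sensor", "temperature")

    // Structured processing: filter, then group and aggregate.
    // SparkR expresses the same steps with filter(), groupBy(), and summarize().
    df.filter($"temperature" > 20.0)
      .groupBy("sensor")
      .agg(avg("temperature").as("avg_temp"))
      .show()

    spark.stop()
  }
}
```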

    Estimated effort: 3 hours

    Language: English

    Link: Cognitive Class

    5) Spark Fundamentals II

    Building on your foundational knowledge of Spark, take this opportunity to move your skills to the next level. With a focus on Spark Resilient Distributed Dataset (RDD) operations, this course exposes you to concepts that are critical to your success in this field.

    Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.

    • Gain in-depth knowledge of Spark's architecture and how data is distributed and tasks are parallelized.
    • Learn how to optimize your data for joins using Spark’s memory caching.
    • Learn how to use the more advanced operations available in the API.
    • The lab exercises for this course are performed exclusively in the cloud, using a notebook interface.
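
    As a hedged sketch of the join optimization described above, the example below hash-partitions one pair RDD and caches it in memory so that a join against it avoids repeated shuffling and recomputation. The sample data, partition count, and object name are illustrative assumptions.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Sketch of caching a partitioned RDD to optimize a join (illustrative data only).
object JoinCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("fundamentals-ii-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two hypothetical pair RDDs keyed by user id.
    val profiles = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val events   = sc.parallelize(Seq((1, "login"), (2, "click"), (1, "logout")))

    // Hash-partition the reused side and cache it in memory, so repeated joins
    // against it neither reshuffle nor recompute the profiles RDD.
    val partitioned = profiles
      .partitionBy(new HashPartitioner(4))
      .cache()

    // The join reuses the cached, already-partitioned profiles.
    partitioned.join(events).collect().foreach { case (id, (name, event)) =>
      println(s"$id $name $event")
    }

    spark.stop()
  }
}
```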

    Estimated effort: 5 hours

    Language: English

    Link: Cognitive Class

    © 2015 | Laboratorio de Microtecnología y Sistemas Embebidos | Centro de Investigación en Computación | Instituto Politécnico Nacional