
Posts

Databricks Lakehouse & Well-Architected Notion

Let's quickly learn about Databricks, Lakehouse architecture and their integration with cloud service providers: What is Databricks? Databricks is a cloud-based data engineering platform that provides a unified analytics platform for data engineering, data science and data analytics. It's built on top of Apache Spark and supports various data sources, processing engines and data science frameworks. What is Lakehouse Architecture? Lakehouse architecture is a modern data architecture that combines the benefits of data lakes and data warehouses. It provides a centralized repository for storing and managing data in its raw, unprocessed form, while also supporting ACID transactions, schema enforcement and data governance. Key components of Lakehouse architecture: Data Lake: Stores raw, unprocessed data. Data Warehouse: Supports processed and curated data for analytics. Metadata Management: Tracks data lineage, schema and permissions. Data Governance: Ensures data quality, security ...
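
A minimal PySpark sketch of the lakehouse pattern described above, assuming a Spark session with Delta Lake available (for example, a Databricks cluster); the table name and sample data are illustrative, not from the post.

```python
# Lakehouse sketch: land raw data as a Delta table (ACID transactions,
# schema enforcement), then query it like a warehouse table on the same storage.
# Assumes Delta Lake is available in the Spark environment (e.g. Databricks).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Raw events land in the "lake" layer as a Delta table (illustrative data).
raw = spark.createDataFrame(
    [(1, "click", "2024-01-01"), (2, "view", "2024-01-01")],
    ["user_id", "event", "event_date"],
)
raw.write.format("delta").mode("overwrite").saveAsTable("bronze_events")

# Curated, warehouse-style aggregation on top of the same table.
spark.sql("""
    SELECT event, COUNT(*) AS n
    FROM bronze_events
    GROUP BY event
""").show()
```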

LIDAR Substitute in Robotics

LIDAR (Light Detection and Ranging) sensors are popular for robotic mapping, navigation and obstacle detection due to their high accuracy. However, they can be expensive and power-hungry. Here are some alternatives to LIDAR for your small robot: 1. Stereovision Stereovision uses two cameras to calculate depth information from the disparity between images. This method is less accurate than LIDAR but cheaper. 2. Structured Light Sensors Structured light sensors project a pattern onto the environment and measure distortions to calculate depth. Examples include Microsoft Kinect and Intel RealSense. 3. Time-of-Flight (ToF) Cameras ToF cameras measure depth by calculating the time difference between emitted and reflected light pulses. 4. Ultrasonic Sensors Ultrasonic sensors use sound waves to detect obstacles. They're inexpensive but less accurate and limited in range. 5. Infrared (IR) Sensors IR sensors detect obstacles by measuring reflected infrared radiation. They're simple and...
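
A small OpenCV sketch of the stereovision idea above: depth is recovered from the disparity between a rectified left/right image pair. It assumes two already-rectified grayscale images on disk; the file names and the focal length/baseline calibration values are placeholders.

```python
# Stereo block matching: larger disparity means a closer object.
# Assumes "left.png" and "right.png" are rectified grayscale images (placeholders).
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# depth = focal_length * baseline / disparity (placeholder calibration values)
focal_px, baseline_m = 700.0, 0.06
depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = focal_px * baseline_m / disparity[valid]
print(depth.shape, float(depth.max()))
```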

The Evolution of Software Engineering

The Evolution of Software Engineering: Embracing AI-Driven Innovation Software engineering has undergone significant transformations since its inception. From manually writing code for every process to leveraging libraries, SDKs and Large Language Models (LLMs), each advancement has revolutionized the field. Rather than replacing software engineers, these innovations have consistently expanded the scope and complexity of applications, necessitating skilled professionals to develop and integrate cutting-edge technologies. A Brief Retrospective Manual Coding: The early days of software development involved writing custom code for every application, a time-consuming and labor-intensive process. Libraries: The introduction of reusable code libraries streamlined development, enabling engineers to focus on higher-level logic. Software Development Kits (SDKs): SDKs facilitated the creation of complex applications by providing pre-built components and tools. Large Language Models (L...

KNN and ANN with Vector Database

  Here are the details for both Approximate Nearest Neighbors (ANN) and K-Nearest Neighbors (KNN) algorithms, including their usage in vector databases: Approximate Nearest Neighbors (ANN) Overview Approximate Nearest Neighbors (ANN) is an algorithm used for efficient similarity search in high-dimensional vector spaces. It quickly finds the closest points (nearest neighbors) to a query vector. How ANN Works Indexing: The ANN algorithm builds an index of the vector database, which enables efficient querying. Querying: When a query vector is provided, the algorithm searches the index for the closest vectors. Approximation: ANN sacrifices some accuracy to achieve efficiency, hence "approximate" nearest neighbors. Advantages Speed: ANN is significantly faster than exact nearest neighbor searches, especially in high-dimensional spaces. Scalability: Suitable for large vector databases. Disadvantages Accuracy: May not always find the exact nearest neighbors due to approximations. Us...
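
A short sketch contrasting exact KNN (brute force) with an ANN index, using FAISS as one possible vector-database-style library; the random dataset and the index parameters (nlist, nprobe) are illustrative, not tuned.

```python
# Exact KNN vs. approximate nearest neighbors with FAISS (assumes faiss-cpu installed).
import numpy as np
import faiss

d, n_db, n_query, k = 64, 10_000, 5, 3
xb = np.random.rand(n_db, d).astype("float32")    # database vectors
xq = np.random.rand(n_query, d).astype("float32")  # query vectors

# Exact KNN: brute-force L2 search over every vector (accurate, but slow at scale).
flat = faiss.IndexFlatL2(d)
flat.add(xb)
exact_dist, exact_idx = flat.search(xq, k)

# ANN: an inverted-file index clusters the vectors first, then searches only
# a few clusters (nprobe) per query -- much faster, possibly approximate.
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8
ann_dist, ann_idx = ivf.search(xq, k)

print("exact:", exact_idx[0], "approx:", ann_idx[0])
```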

Learning Apache Parquet

Apache Parquet is a columnar storage format commonly used in cloud-based data processing and analytics. It allows for efficient data compression and encoding, making it suitable for big data applications. Here's an overview of Parquet and its benefits, along with an example of its usage in a cloud environment: What is Parquet? Parquet is an open-source, columnar storage format developed by Twitter and Cloudera. It's designed for efficient data storage and retrieval in big data analytics. Benefits Columnar Storage: Stores data in columns instead of rows, reducing I/O and improving query performance. Compression: Supports various compression algorithms, minimizing storage space. Encoding: Uses efficient encoding schemes, further reducing storage needs. Query Efficiency: Optimized for fast query execution. Cloud Example: Using Parquet in AWS Here's a simplified example using AWS Glue, S3 and Athena: Step 1: Data Preparation Create an AWS Glue crawler to identify your data sche...
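
As a local stand-in for the S3 + Glue + Athena flow sketched above, here is a minimal example of writing and reading Parquet with pandas (assuming pyarrow is installed); the file path and column names are illustrative, and with s3fs installed an s3:// path would work the same way.

```python
# Columnar storage + compression: only the columns a query touches are read back.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["us-east", "eu-west", "us-east"],
    "amount": [19.99, 5.50, 42.00],
})

# Write with snappy compression (requires pyarrow or fastparquet).
df.to_parquet("orders.parquet", compression="snappy")

# Read back only the columns needed for the query.
subset = pd.read_parquet("orders.parquet", columns=["region", "amount"])
print(subset.groupby("region")["amount"].sum())
```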

Entropy and Information Gain in Natural Language Processing

This is a beautiful and insightful explanation of why Java has 2x higher entropy than Python when analyzed with natural language processing techniques. To understand this, we first need to know what the entropy of a programming language means. In the context of programming languages, entropy refers to the measure of randomness or unpredictability in the code. A language with higher entropy often requires more characters to express the same logic, making it less concise. Why Java Has Higher Entropy Than Python: Java's higher entropy compared to Python can be attributed to several factors: Verbosity: Java often demands more explicit syntax, such as declaring variable types and using semicolons to terminate statements. Python, on the other hand, relies on indentation and fewer keywords, reducing the overall character count. Object-Oriented Paradigm: Java is strongly object-oriented, which often leads to more verbose code as objects, classes, and methods need to be defined and instan...
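
A tiny worked example of the entropy idea above: Shannon entropy computed over the character distribution of two short, roughly equivalent snippets. The snippets are hypothetical and the numbers will vary with the corpus; the point is only to show how such a measurement is made.

```python
# Shannon entropy over character frequencies: H = -sum(p * log2(p)).
import math
from collections import Counter

def char_entropy(text: str) -> float:
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

java_snippet = 'public class Hello { public static void main(String[] args) { System.out.println("hi"); } }'
python_snippet = 'print("hi")'

# Verbosity matters twice: per-character entropy and the sheer number of characters.
print(f"Java:   {char_entropy(java_snippet):.2f} bits/char over {len(java_snippet)} chars")
print(f"Python: {char_entropy(python_snippet):.2f} bits/char over {len(python_snippet)} chars")
```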

Data Ingestion for Retrieval-Augmented Generation (RAG)

Data Ingestion is a critical initial step in building a robust Retrieval-Augmented Generation (RAG) system. It involves the process of collecting, cleaning, structuring, and storing diverse data sources into a format suitable for efficient retrieval and generation. Key Considerations for Data Ingestion in RAG: Data Source Identification: Internal Data: Company documents, reports, knowledge bases, customer support tickets, etc. Proprietary databases, spreadsheets, and other structured data. External Data: Publicly available datasets (e.g., Wikipedia, Arxiv) News articles, blog posts, research papers from various sources Social media data (with appropriate ethical considerations) Data Extraction and Cleaning: Text Extraction: Extracting relevant text from various formats (PDF, DOCX, HTML, etc.) Data Cleaning: Removing noise, inconsistencies, and irrelevant information Normalization: Standardizing text (e....
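
A plain-Python sketch of the cleaning and chunking portion of RAG ingestion described above: normalize whitespace, then split the text into overlapping chunks ready for embedding and indexing. The chunk size, overlap, and sample document are illustrative assumptions, not values from the post.

```python
# Minimal clean + chunk step of a RAG ingestion pipeline (illustrative parameters).
import re

def clean(text: str) -> str:
    # Collapse whitespace and strip edges; real pipelines remove far more noise.
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Fixed-size chunks with overlap so context is not cut off at boundaries.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "  Company handbook...\n\nSection 1: Onboarding steps ... " * 50  # placeholder document
pieces = chunk(clean(doc))
print(len(pieces), "chunks; first chunk:", pieces[0][:60])
```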