Sat 23 February 2019
By Haven
In Technology.
tags: Kafka Avro Spark Hive
This article combines high-level architectural design and hands-on scripting experience from a Kafka POC project.
Driver
The first step is to create the driver. In the driver we use SparkConf to set parameters so the Spark cluster manager can allocate resources on the cluster. We also need a SparkContext: when the configuration is passed to it, the SparkContext asks the cluster manager to allocate resources for executors. The resource manager coordinates the master node and worker nodes (a master-slave model) to launch executors; for example, when we use Spark Streaming to publish or consume data, RDD partitions are distributed across different worker nodes.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkDriverName")
val sc = new SparkContext(conf)
val sqlc = new HiveContext(sc)
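As a minimal sketch of the resource parameters mentioned above (the executor settings are hypothetical placeholders, not values from the POC; in practice the master URL is usually supplied by spark-submit rather than hard-coded), the configuration could look like this:

// Hypothetical resource settings the cluster manager uses when launching executors.
val tunedConf = new SparkConf()
  .setAppName("SparkDriverName")
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")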
DAO
The second step is to create the DAO. In Spark's model we use HiveContext to run SQL queries. If the data comes from a local file, we read it from the local path and register it as a temp table; otherwise we read the Hive table DBName.TableName and register that.
Local CSV: sqlc.read.format("com.databricks.spark.csv").option("header", "true").load(LocalCSVPath).registerTempTable(tableName)
or
Hive table: sqlc.table(DBName + "." + TableName).registerTempTable(tableName)
Then,
sqlc.table(tableName)
Hive has a metastore service that stores metadata, including table relationships and partition information, on the master node, while the data itself is stored in HDFS. Mostly we create internal (managed) tables, where access is controlled by Hive: dropping the table deletes both the metadata and the data. With an external table, dropping the table deletes only the metadata.
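A short sketch of the difference (the database, table, and HDFS path names here are hypothetical):

// Managed (internal) table: Hive owns metadata and data; DROP TABLE removes both.
sqlc.sql("CREATE TABLE db.events_internal (id INT, payload STRING)")

// External table: Hive tracks only metadata; DROP TABLE leaves the HDFS files in place.
sqlc.sql("CREATE EXTERNAL TABLE db.events_external (id INT, payload STRING) LOCATION 'hdfs:///data/events'")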
Dataframe
With the data access object we can use Spark SQL to create dataframes; the business logic is implemented in this step. We use expr to pass a SQL query as a string variable and transform the input dataframe into the output dataframe. This part needs research and testing, since Spark SQL is not exactly the same as Oracle SQL.
import org.apache.spark.sql.functions.{col, expr}

val SQLExpression = "SQLQuery"
InputDataframe.as("DfName").join(InputDataframe2, Seq("ColumnName"), "left_outer").select(col("ColumnName"), expr(SQLExpression))
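As a concrete, purely illustrative stand-in for SQLQuery (the column names below are hypothetical, not from the POC), an expression might look like this; anything ported from Oracle should be re-tested, since built-in functions and syntax differ in places:

// Hypothetical expression string evaluated by expr against the joined dataframe.
val SQLExpression = "CASE WHEN amount > 100 THEN 'HIGH' ELSE 'LOW' END AS amount_band"
val OutputDataframe = InputDataframe.select(col("ColumnName"), expr(SQLExpression))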
Java Object
Avro has a JSON-like data model and can represent complex data structures that carry the business logic. Its advantages include a direct mapping to JSON and good bindings for a wide variety of programming languages. We can download the open-source avro-tools jar to generate a Java class automatically from an Avro schema. With that Java class, a Jackson mapper can then parse the JSON content of the dataframe into Java objects.
java -jar avro-tools-1.8.x-cdhxxx.jar compile schema AvroSchemaName.avsc FolderName
val JavaObj = mapper.readValue(OutputDataframe, classOf[JavaClassName])
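A minimal sketch of the mapping step, assuming Jackson's ObjectMapper and that each dataframe row is first converted to a JSON string with toJSON (variable names follow the snippet above):

import com.fasterxml.jackson.databind.ObjectMapper

val mapper = new ObjectMapper()

// readValue expects JSON text, so serialize each row before binding it
// to the Avro-generated class.
OutputDataframe.toJSON.collect().foreach { json =>
  val JavaObj = mapper.readValue(json, classOf[JavaClassName])
}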
Serialize
After we register the Avro schema on the Confluent platform, we use Avro's data serialization system to convert the Java object into a compact binary format, which is faster and easier to publish. The Kafka Avro serializer is passed to the producer as a property, as shown below.
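A sketch of the producer configuration, assuming Confluent's KafkaAvroSerializer; the broker and schema registry URLs are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://schema-registry:8081")

val Producer = new KafkaProducer[String, JavaClassName](props)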
Publish
The last step is to publish. Spark Streaming accumulates messages over a short interval into a DStream rather than processing them one by one, and then publishes them. Inside each DStream there are RDDs, and each RDD is composed of partitions. Messages are divided across partitions; each message within a partition is assigned an offset, which maintains ordering and tells a consumer where to resume reading.
Since RDD partitions are distributed across multiple worker nodes, transformations such as filter/groupBy run lazily on the workers, while actions such as collect/count/first/top trigger execution and bring results back to the driver.
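As a small illustrative example (the column name is hypothetical, and col comes from org.apache.spark.sql.functions as above):

// Transformation: lazy, runs on the workers once an action is triggered.
val filtered = Dataframe.filter(col("ColumnName").isNotNull)
// Action: triggers execution and returns a result to the driver.
val rowCount = filtered.count()

The publish loop itself iterates over each partition's records on the workers: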
import org.apache.kafka.clients.producer.ProducerRecord

Dataframe.foreachPartition { records =>
  records.foreach { record =>
    // Wrap the Avro-generated object in a ProducerRecord for the topic and send synchronously.
    val SerializedData = new ProducerRecord[String, JavaClassName](TopicName, JavaObj)
    Producer.send(SerializedData).get
  }
}
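For the streaming case described above, here is a minimal sketch with a queue-backed DStream standing in for the real source; the 10-second batch interval and the sample messages are assumptions:

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val queue = mutable.Queue[RDD[String]](sc.parallelize(Seq("msg1", "msg2")))
val stream = ssc.queueStream(queue)

// Each micro-batch arrives as an RDD; here the records are just printed,
// but in the POC they would be published as in the loop above.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach(record => println(record))
  }
}

ssc.start()
ssc.awaitTermination()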