Eclipse + Scala + Spark でjarファイル実行

一般的な開発では，

EclipseでScalaを書く
Eclipseで実行したりデバッグしたりする
最後にjar化してclusterで実行

というステップをとるのが自然だと思う．

sbtでbuildでなく，mavenを使ってbuildしたいというのが普通のjava プログラマーだと思うが， mavenでjar化してspark clusterで実行可能になるのにかなりハマったので，そのワークアラウンドをメモ．

SparkはCloudera Manger経由でいれた．

new maven project でarchetypeの選択

artifact-id = scala-archetype-simpleを選択

(以下は，scalaで行ったがjavaも同じだと思う(未検証))

spark Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'

shade-pluginに変更
- http://maven.apache.org/plugins/maven-shade-plugin/usage.html

akkaのエラー

参考URLs

pom.xmlに下記を追加

...
          	<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  				<resource>reference.conf</resource>                 
			</transformer>
...

Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes

shade pluginでよくあるらしい

pom.xmlに下記を追記

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
...

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

core-site.xmlに下記を追加

<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

SimpleApp.scala

package edu.kzk.spark_sample.basic

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
    def main(args: Array[String]) {
        // from local
        val logFile = "file:///home/kzk/datasets/wordcount/word_count_input.txt";
        
        // 最後の2つはoption
        //val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
        //List("target/scala-2.10/simple-project_2.10-1.0.jar"))
        
        val sc = new SparkContext("local", "Simple App", 
            "/opt/cloudera/parcels/SPARK/");
        val logData = sc.textFile(logFile, 2).cache();
        val numAs = logData.filter(line => line.contains("a")).count();
        val numBs = logData.filter(line => line.contains("b")).count();
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs));
        
        // from hdfs
        val textFile = sc.textFile("localhost:8020/user/kzk/sample/input/word_count_input.txt");
        val wordCounts = textFile.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b);
        wordCounts.take(100).foreach(e => println(e));
            
    }
}

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.kzk</groupId>
  <artifactId>spark_sample</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>${project.artifactId}</name>
  <description>My wonderfull scala app</description>
  <inceptionYear>2010</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
  
  <repositories>
   	<repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  	<repository>
      <id>cloudera</id>
      <name>Cloudera Maven Repository</name>
      <url>https://repository.cloudera.com/content/repositories/releases/</url>
    </repository>
  </repositories>

  <properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.tools.version>2.10</scala.tools.version>
    <scala.version>2.10.0</scala.version>
  </properties>

  <!-- spark and hadoop -->
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.0.0-mr1-cdh4.6.0</version>
    </dependency>
    
    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2_${scala.tools.version}</artifactId>
      <version>1.13</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.tools.version}</artifactId>
      <version>2.0.M6-SNAP8</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.1.3</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
	  <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
          <transformers>
          	<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  				<resource>reference.conf</resource>                 
			</transformer>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
              <manifestEntries>
              	<!-- 
                <mainClass>edu.kzk.spark_sample.basic.SimpleApp</mainClass>
                 -->
                <!-- 
                     <classPath>your/class/path/here</classPath>
                -->
              </manifestEntries>
            </transformer>
          </transformers>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.13</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <!-- If you have classpath issue like NoDefClassError,... -->
          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <!--This plugin's configuration is used to store Eclipse m2e settings only. It has no influence on the Maven build itself.-->
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>
                      net.alchim31.maven
                    </groupId>
                    <artifactId>
                      scala-maven-plugin
                    </artifactId>
                    <versionRange>
                      [3.1.3,)
                    </versionRange>
                    <goals>
                      <goal>testCompile</goal>
                      <goal>compile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <ignore></ignore>
                  </action>
                </pluginExecution>
              </pluginExecutions>
            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

hadoopのconfとjarファイルを指定して実行

ここが一番ハマった．

core-site.xmlを書き換えてhdfsの再起動ではだめ．
実行時に，書き換えたcore-site.xmlをclasspathに指定．

$ java -cp /etc/hadoop/conf/:/${path_to}/spark_sample-0.0.1-SNAPSHOT.jar edu.kzk.spark_sample.basic.SimpleApp

この記事は，Spark0.9でStandaloneモードで起動，かつmaven-basedなので，
下記記事を参考にすること．

Execute Spark Application on Eclipse + Spark (Scala) + Gradle - KZKY's memo
<a href="http://kzky.hatenablog.com/entry/2014/12/24/Execute_Spark_Application_on_Eclipse_%2B_Spark_%28Scala%29_%2B_Gradle_">Execute Spark Application on Eclipse + Spark (Scala) + Gradle - KZKY's memo</a>

KZKY memo

自分用メモ．