Running a jar file with Eclipse + Scala + Spark
In typical development, I think it is natural to take these steps:
- develop in Eclipse
- build the code into a jar
- run the jar on a Spark cluster
Most Java programmers would rather build with Maven than with sbt, but I got badly stuck getting a Maven-built jar to run on a Spark cluster, so here is a memo of the workarounds.
Spark was installed via Cloudera Manager.
In "new maven project", at the archetype selection step,
choose artifact-id = scala-archetype-simple.
(The following was done in Scala, but it should be the same for Java (unverified).)
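For reference, roughly the same project can be generated from the command line; this is a sketch, assuming the archetype's usual coordinates (net.alchim31.maven:scala-archetype-simple) and the groupId/artifactId used later in this article:

```sh
# Hypothetical CLI equivalent of the Eclipse wizard; adjust IDs to taste
$ mvn archetype:generate \
    -DarchetypeGroupId=net.alchim31.maven \
    -DarchetypeArtifactId=scala-archetype-simple \
    -DgroupId=edu.kzk \
    -DartifactId=spark_sample
```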
spark Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
This is an error from Akka (which Spark uses internally): when building a fat jar, the reference.conf files shipped by different jars overwrite each other, so Akka's configuration keys go missing.
- Reference URLs
- http://akka.io/
- http://ameblo.jp/principia-ca/entry-11568834319.html
- http://comments.gmane.org/gmane.comp.lang.scala.spark.user/2724
- Add the following to pom.xml (it goes in the shade plugin's <transformers> section; the full pom.xml is shown later):
```xml
...
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
    <resource>reference.conf</resource>
</transformer>
...
```
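The AppendingTransformer concatenates the reference.conf files from all jars instead of keeping only one, which is what makes 'akka.version' resolvable again. To verify the merged jar, something like the following should work (the jar name comes from the pom.xml shown later):

```sh
# The Akka section should appear in the merged reference.conf
$ unzip -p target/spark_sample-0.0.1-SNAPSHOT.jar reference.conf | grep -n akka
```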
Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
Apparently this is a common problem with the shade plugin: the fat jar pulls in signed dependencies, and their META-INF signature files no longer match the repackaged contents.
- Add the following to pom.xml:
```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.2</version>
    <configuration>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        ...
```
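To confirm the signature files really were excluded from the shaded jar, a quick check (same assumed jar name as above); no output means they are gone:

```sh
# Lists any leftover signature files in the fat jar
$ jar tf target/spark_sample-0.0.1-SNAPSHOT.jar | grep -E 'META-INF/.*\.(SF|DSA|RSA)'
```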
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
- Add the following to core-site.xml:
```xml
<property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
    <description>The FileSystem for file: uris.</description>
</property>
<property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
</property>
```
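As an untested alternative sketch, the same two settings can also be applied programmatically through the SparkContext's Hadoop configuration (assuming sc.hadoopConfiguration is available in this Spark version), instead of editing core-site.xml:

```scala
// Hypothetical alternative to editing core-site.xml:
// set the FileSystem implementations on the Hadoop Configuration Spark uses.
sc.hadoopConfiguration.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
sc.hadoopConfiguration.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")
```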
SimpleApp.scala
```scala
package edu.kzk.spark_sample.basic

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    // Word count on a local file
    val logFile = "file:///home/kzk/datasets/wordcount/word_count_input.txt"

    // The last two constructor arguments are optional:
    // val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
    //   List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val sc = new SparkContext("local", "Simple App", "/opt/cloudera/parcels/SPARK/")

    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

    // Word count on a file in HDFS (note: the hdfs:// scheme is required)
    val textFile = sc.textFile("hdfs://localhost:8020/user/kzk/sample/input/word_count_input.txt")
    val wordCounts = textFile.flatMap(l => l.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
    wordCounts.take(100).foreach(e => println(e))
  }
}
```
pom.xml
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>edu.kzk</groupId>
    <artifactId>spark_sample</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>${project.artifactId}</name>
    <description>My wonderful scala app</description>
    <inceptionYear>2010</inceptionYear>
    <licenses>
        <license>
            <name>My License</name>
            <url>http://....</url>
            <distribution>repo</distribution>
        </license>
    </licenses>
    <repositories>
        <repository>
            <id>Akka repository</id>
            <url>http://repo.akka.io/releases</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <name>Cloudera Maven Repository</name>
            <url>https://repository.cloudera.com/content/repositories/releases/</url>
        </repository>
    </repositories>
    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.tools.version>2.10</scala.tools.version>
        <scala.version>2.10.0</scala.version>
    </properties>
    <!-- spark and hadoop -->
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>0.9.0-incubating</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.0.0-mr1-cdh4.6.0</version>
        </dependency>
        <!-- Test -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.specs2</groupId>
            <artifactId>specs2_${scala.tools.version}</artifactId>
            <version>1.13</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_${scala.tools.version}</artifactId>
            <version>2.0.M6-SNAP8</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <!-- see http://davidb.github.com/scala-maven-plugin -->
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.3</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.2</version>
                <configuration>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                            <resource>reference.conf</resource>
                        </transformer>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <manifestEntries>
                                <!-- <mainClass>edu.kzk.spark_sample.basic.SimpleApp</mainClass> -->
                                <!-- <classPath>your/class/path/here</classPath> -->
                            </manifestEntries>
                        </transformer>
                    </transformers>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.13</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <!-- If you have classpath issues like NoDefClassError,... -->
                    <!-- <useManifestOnlyJar>false</useManifestOnlyJar> -->
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
        </plugins>
        <pluginManagement>
            <plugins>
                <!-- This plugin's configuration is used to store Eclipse m2e settings only.
                     It has no influence on the Maven build itself. -->
                <plugin>
                    <groupId>org.eclipse.m2e</groupId>
                    <artifactId>lifecycle-mapping</artifactId>
                    <version>1.0.0</version>
                    <configuration>
                        <lifecycleMappingMetadata>
                            <pluginExecutions>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>net.alchim31.maven</groupId>
                                        <artifactId>scala-maven-plugin</artifactId>
                                        <versionRange>[3.1.3,)</versionRange>
                                        <goals>
                                            <goal>testCompile</goal>
                                            <goal>compile</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore></ignore>
                                    </action>
                                </pluginExecution>
                            </pluginExecutions>
                        </lifecycleMappingMetadata>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
```
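With this pom.xml, building the shaded (fat) jar is just the standard package phase, since the shade plugin is bound to it above:

```sh
# Produces target/spark_sample-0.0.1-SNAPSHOT.jar with dependencies shaded in
$ mvn clean package
```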
Run with the Hadoop conf directory and the jar file specified
This was the part I got stuck on the most. The trick is to put /etc/hadoop/conf/ on the classpath so that core-site.xml (with the settings above) is picked up:
```sh
$ java -cp /etc/hadoop/conf/:/${path_to}/spark_sample-0.0.1-SNAPSHOT.jar edu.kzk.spark_sample.basic.SimpleApp
```
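Incidentally, to run against the standalone cluster rather than in local mode, the SparkContext in SimpleApp would take the master URL and the fat jar, using the same constructor that is commented out in the code above. A minimal sketch, where the master URL and jar path are assumptions to adjust for your environment:

```scala
// Sketch: point at the standalone master and ship the fat jar to the workers.
// "spark://localhost:7077" and the jar path are placeholders; adjust them.
val sc = new SparkContext(
  "spark://localhost:7077",
  "Simple App",
  "/opt/cloudera/parcels/SPARK/",
  List("/path/to/spark_sample-0.0.1-SNAPSHOT.jar"))
```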
Note: this article assumes Spark 0.9 started in standalone mode with a Maven-based build;
for a Gradle-based setup, refer to the article below.
Execute Spark Application on Eclipse + Spark (Scala) + Gradle - KZKY's memo