Spark Reader

2017-07-20 | 阅读：次

spark读取csv文件的可以附加的所有option选项

// Loads CSV files and returns the result as a DataFrame.
def csv(paths: String*): DataFrame

This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.

You can set the following CSV-specific options to deal with CSV files:

sep (default ,): sets the single character as a separator for each field and value.

encoding (default UTF-8): decodes the CSV files by the given encoding type.

quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.

escape (default \): sets the single character used for escaping quotes inside an already quoted value.

comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.

header (default false): uses the first line as names of columns.

inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.

ignoreLeadingWhiteSpace (default false): a flag indicating whether or not leading whitespaces from values being read should be skipped.

ignoreTrailingWhiteSpace (default false): a flag indicating whether or not trailing whitespaces from values being read should be skipped.

nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.

nanValue (default NaN): sets the string representation of a non-number" value.

positiveInf (default Inf): sets the string representation of a positive infinity value.

negativeInf (default -Inf): sets the string representation of a negative infinity value.

dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.

timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.

maxColumns (default 20480): defines a hard limit of how many columns a record can have.

maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length

mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
* PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets null for extra fields.
* DROPMALFORMED : ignores the whole corrupted records.
* FAILFAST : throws an exception when it meets corrupted records.

columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.

multiLine (default false): parse one record, which may span multiple lines.

实践一下

val sparkSession = SparkSession.builder.getOrCreate
val data = sparkSession
            .read
            // csv文件的第一行为列名，默认值是false
            .option("header", true)
            // csv文件的分割符，默认是英文的逗号，只能是一个字符
            .option("delimiter","|")
            // 指定要过滤掉的引号，默认是英文的双引号
            .option("quote","'")
            // 解析的模式。默认值是PERMISSIVE，支持的选项有
            //（1）PERMISSIVE：尝试解析所有的行，nulls are inserted for missing tokens and extra tokens are ignored.
            //（2）DROPMALFORMED：drops lines which have fewer or more tokens than expected
            //（3）FAILFAST: aborts with a RuntimeException if encounters any malformed line
            .option("mode","DROPMALFORMED")
            .csv("""hdfs:///data-zhangqiang/homeVisit.csv""")